Database Structure

As was mentioned in the introduction, the process is split into four phases:

Extracting the information about each crate.
Merging the information into a single database.
Running queries over the database.
Analysing the query results.

This design was motivated by the fact that the extraction phase is time consuming (can take up to a week for the entire crates.io) and depends on internal compiler APIs that are unstable and change often: we decided to make the extractor as minimal as possible and handle the main workload in the later phases. This is also reflected in the Datalog database schema by splitting into two parts:

database/src/schema.dl – core data, which is generated by the extractor. Modifications made to this file require rerunning the extractor.
database/src/derived.dl – derived data, or, in other words, data computed by the queries and stored in the database so that it can be reused by other queries.

The following subsections present how the data is stored in the database and present some of the most important derived queries.

Database Format

The database is inspired by Nicholas D. Matsakis blog post in which he suggested to make the Rust compiler to output Datalog which then could be used to write interesting queries. schema.dl defines three kinds of elements:

Types – we decided to use a strongly typed variant of Datalog; therefore, most elements have unique types.
Interning tables – we use them for two main reasons:
1. To reduce the memory requirements by replacing complex objects such as strings with unique integers and using them instead.
2. To map a specific type such as Package to a generic type such as InternedString.
Datalog relations – that similarly to SQL tables capture the relational information between program elements.

derived.dl can define additional relations.

From schema.dl and derived.dl, a procedural macro generates the code that manages the database. Most importantly, it generates the Tables object that is used by the extractor to store the extracted data and the Loader object that is used by the queries to load the data.

Fundamental Derived Queries

While, in theory, the core schema should be self-contained, the fact that changes to it require rerunning the extractor made us to put some of the fundamental relations to derived.dl. The most important of such relations is selected_builds and many other selected_* relations that are derived from it. To understand the purpose of the selected_builds relation, we need first to explain the differences between packages, crates, and builds:

Packages are archives stored on crates.io. Packages have versions and their names are unique within a registry. A single package can define one or more crate. Packages can also define feature flags that allow customizing compilation by, for example, including some functions only if the specific feature is enabled.
A crate is a unit of compilation. A crate needs to have a unique name within a package, but its name is not guaranteed to be unique within a package registry (for example, we found 50 crates named main in our dataset).
A build is a crate from a specific version of a package compiled with the specific configuration flags.

When compiling all packages from crates.io, we often have multiple builds coming from the same crate. For example, this could happen when two packages A and B depends on different versions (or with different features) of the third package C. If we used all these builds in our analysis, we would get skewed results because the packages that have many versions and more feature flags would be included more times. Therefore, we defined a relation selected_builds that contains only a single build for each crate. The build is chosen by picking a build from the package version that is specified in CrateList.json. If there are more than one such build (for example, because of different feature flags), we choose the first one in the iteration order (basically at random). We also remove from selected_builds all builds from crates whose names start with build_script_ because they are results of compiling the build.rs files.