Database Structure
As mentioned in the introduction, the process is split into four phases:
- Extracting the information about each crate.
- Merging the information into a single database.
- Running queries over the database.
- Analysing the query results.
This design was motivated by the fact that the extraction phase is time-consuming (it can take up to a week for the whole of crates.io) and depends on internal compiler APIs that are unstable and change often. We therefore decided to keep the extractor as minimal as possible and to handle the main workload in the later phases. The Datalog database schema reflects this decision and is split into two parts:
- database/src/schema.dl – core data, which is generated by the extractor. Modifications made to this file require rerunning the extractor.
- database/src/derived.dl – derived data, or, in other words, data computed by the queries and stored in the database so that it can be reused by other queries.
The following subsections describe how the data is stored in the database and present some of the most important derived queries.
Database Format
The database is inspired by a blog post by Nicholas D. Matsakis, in which he suggested making the Rust compiler output Datalog that could then be used to write interesting queries. schema.dl defines three kinds of elements:
- Types – we decided to use a strongly typed variant of Datalog; therefore, most elements have unique types.
- Interning tables – we use them for two main reasons:
  - To reduce the memory requirements by replacing complex objects such as strings with unique integers and using them instead.
  - To map a specific type such as Package to a generic type such as InternedString.
- Datalog relations – similarly to SQL tables, they capture the relational information between program elements.
derived.dl can define additional relations.
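The two roles of interning tables described above can be illustrated with a minimal Rust sketch. It is not the code generated from schema.dl: the StringInterner type and its methods are hypothetical, and wrapping InternedString in a Package newtype is only one assumed way of relating a specific type to a generic one.

```rust
use std::collections::HashMap;

/// Generic interned handle: a small integer that stands in for a string.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct InternedString(pub u32);

/// Specific type built on the generic handle (assumed representation), so
/// that a package name cannot be confused with other interned strings.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Package(pub InternedString);

/// Hypothetical interning table: every distinct string is stored only once.
#[derive(Default)]
pub struct StringInterner {
    ids: HashMap<String, InternedString>,
    values: Vec<String>,
}

impl StringInterner {
    /// Returns the existing id for `value`, or allocates the next free one.
    pub fn intern(&mut self, value: &str) -> InternedString {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = InternedString(self.values.len() as u32);
        self.values.push(value.to_owned());
        self.ids.insert(value.to_owned(), id);
        id
    }

    /// Maps an id back to the string it stands for, if the id is known.
    pub fn resolve(&self, id: InternedString) -> Option<&str> {
        self.values.get(id.0 as usize).map(String::as_str)
    }
}
```

Interning the same string twice yields the same integer, so relations can store and compare small ids instead of full strings, while the Package wrapper keeps the type distinction visible to the compiler.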
From schema.dl and derived.dl, a procedural macro generates the code that manages the database. Most importantly, it generates the Tables object that is used by the extractor to store the extracted data and the Loader object that is used by the queries to load the data.
Fundamental Derived Queries
While, in theory, the core schema should be self-contained, the fact that changes to it require rerunning the extractor led us to put some of the fundamental relations into derived.dl. The most important of these relations is selected_builds, together with the many other selected_* relations that are derived from it. To understand the purpose of the selected_builds relation, we first need to explain the differences between packages, crates, and builds:
- Packages are archives stored on crates.io. Packages have versions, and their names are unique within a registry. A single package can define one or more crates. Packages can also define feature flags that allow customizing compilation by, for example, including some functions only if a specific feature is enabled.
- A crate is a unit of compilation. A crate needs to have a unique name within a package, but its name is not guaranteed to be unique within a package registry (for example, we found 50 crates named main in our dataset).
- A build is a crate from a specific version of a package compiled with a specific set of configuration flags.
When compiling all packages from crates.io, we often have multiple builds coming from the same crate. For example, this can happen when two packages A and B depend on different versions of a third package C, or on the same version with different features. If we used all these builds in our analysis, we would get skewed results because packages that have many versions and more feature flags would be counted more often. Therefore, we defined a relation selected_builds that contains only a single build for each crate. The build is chosen by picking a build from the package version that is specified in CrateList.json. If there is more than one such build (for example, because of different feature flags), we choose the first one in the iteration order, which is effectively an arbitrary choice. We also remove from selected_builds all builds of crates whose names start with build_script_ because they are the results of compiling the build.rs files.
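The selection rules above can be sketched as follows. This is an illustration in Rust rather than the actual Datalog definition in derived.dl: the Build struct, the select_builds function, and the crate_list map (standing in for the contents of CrateList.json) are hypothetical, and the sketch assumes that a crate is identified by its package name together with its crate name.

```rust
use std::collections::{BTreeMap, HashMap};

/// Hypothetical, simplified build record; the real relations store interned ids.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct Build {
    pub package: String,
    pub version: String,
    pub krate: String,
    pub features: Vec<String>,
}

/// Picks at most one build per (package, crate) pair.
///
/// `crate_list` maps a package name to the version that should be analysed.
pub fn select_builds(builds: &[Build], crate_list: &HashMap<String, String>) -> Vec<Build> {
    let mut selected: BTreeMap<(String, String), Build> = BTreeMap::new();
    for build in builds {
        // Skip the results of compiling build.rs files.
        if build.krate.starts_with("build_script_") {
            continue;
        }
        // Keep only builds from the package version named in the crate list.
        if crate_list.get(&build.package) != Some(&build.version) {
            continue;
        }
        // The first build seen for a crate wins; later ones are ignored.
        selected
            .entry((build.package.clone(), build.krate.clone()))
            .or_insert_with(|| build.clone());
    }
    selected.into_values().collect()
}
```

Because the first matching build wins, the result depends on the order of the input, which mirrors the arbitrary choice between builds that differ only in their feature flags.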