Qrates: Analysing Rust Code Corpus
Qrates is a tool for running large-scale analyses of Rust code. To be scalable, the process is split into four phases (see the command sketch after this list):
- Data extraction. extractor is a modified version of the Rust compiler that saves the information about the compiled crate to a file so that it can be easily accessed later.
- Database creation. To be able to run queries that span multiple crates, the information from multiple files needs to be merged into a single database.
- Queries. The content of the database can be queried by using Rust programs. The procedural macros from Datapond can be used to write fix-point computations in Datalog.
- Query results analysis. Typically, the query results are saved as CSV files so that they can be easily visualized by using data analysis tools such as Pandas.
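For orientation, the following commands show how the four phases map onto the qrates command line. They are taken from the sections below, which describe the prerequisites and options in detail; the paths assume the default ../workspace layout used throughout this guide.
# Phase 1: extract data by compiling the selected packages.
cargo run --release -- compile
# Phase 2: merge the extracted files into a single database.
cargo run --release -- update-database
# Phase 3: run all queries; results are written as CSV files to ../workspace/reports.
cargo run --release -- query all
# Phase 4: analyse the CSV reports, for example with Pandas in a Jupyter notebook.
make run-jupyter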
Compiling Qrates
Note: These instructions were tested on Ubuntu 18.04 and 20.04.
Install dependencies:
sudo apt install build-essential curl git libssl-dev
Install Rust:
curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env
Clone the repository and all its dependencies:
git clone https://github.com/rust-corpus/qrates.git
cd qrates
git submodule update --init
Build the project in release mode:
cargo build --all --release
Extracting Data from Crates Published on crates.io
This section shows how to extract data from crates published on crates.io.
Warning: The extraction involves compiling the crates and may, therefore, result in running potentially malicious code on your system. Make sure to compile the code on a virtual machine or in some other throw-away environment.
This section assumes that you have already compiled Qrates. You can find the instructions for how to do that here.
Obtaining the List of Packages
The second step is to select the packages from which we want to extract the data. (Note that a package uploaded on crates.io can contain several crates; for example, it is common for executables to be split into an executable crate and a library crate.) The list of packages should be stored in the file CrateList.json. For example, if you want to analyse the package rand, create CrateList.json with the following contents:
{
    "creation_date": {
        "secs_since_epoch": 1579271281,
        "nanos_since_epoch": 847419904
    },
    "crates": [
        {
            "Package": {
                "name": "rand",
                "version": "0.7.3"
            }
        }
    ]
}
If you want to analyse all packages on crates.io, you can rename CrateList-all-2020-01-14.json to CrateList.json. That file contains the latest versions of all packages that were published on crates.io as of 2020-01-14. Please note that compiling all packages requires at least 1 TB of hard drive space, and running queries on such a large dataset may require up to 200 GB of RAM. The dataset CrateList-top-200-2020-01-17.json contains the 200 most downloaded packages on crates.io; this dataset should be analysable on a laptop with 8 GB of RAM.
You can also create the list with the latest versions of all packages by running the following command:
cargo run --release -- init-all
Compiling the Packages
Note: Instead of compiling the packages yourself, you can also download the extracted data from here.
Qrates uses the Rustwide library for compiling packages. Please see the Rustwide documentation for the system requirements; most importantly, you need to have Docker installed (you can find the installation instructions here).
You can start the compilation as follows:
mkdir ../workspace
cargo run --release -- compile
This command may fail with a permission error if the user does not have the necessary permissions to communicate with the Docker daemon. In that case, use sudo:
sudo env "PATH=$PATH" $(which cargo) run --release -- compile
Attempting to compile all packages from crates.io on a Lenovo ThinkPad T470p takes about a week. You can check how many packages have already compiled successfully by running the following command:
ls ../workspace/rust-corpus/*/success | wc -l
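If you also want to know how many builds were attempted in total so far (and hence how many failed), you can count the per-package directories; this assumes the same workspace layout as the command above:
ls -d ../workspace/rust-corpus/*/ | wc -l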
Note: It is likely that the number of successfully compiled packages will be smaller than the one we reported in the paper because some of the packages were removed from crates.io in the meantime.
Checking Compilation Results
The overview of compilation errors can be seen by running the following command:
cargo run --release -- check-compilation
It will print statistics on how many crates failed to compile for common reasons, how many likely failed due to a bug in the extractor, and how many failed for a yet unrecognised reason. It will also print five paths from each of the latter groups.
Note: The classification is implemented in manager/src/compilation_checks.rs.
If you are using an older CrateList, it is very likely that many crates will fail to compile because their dependencies were removed from the registry (they were “yanked”). One workaround for this problem is to fork the crates.io registry and then restore the removed crates by setting their yanked flag to false. It is also recommended to remove all package versions that are newer than the ones in the crate list. Both of these operations can be done by executing the following script from the root directory of the registry repository:
#!/usr/bin/python3

import json
import os

CRATE_LIST = '<path-to>/CrateList-all-2020-01-14.json'


def rewrite(path, package, versions):
    """Rewrite the cargo registry entry for `package` to contain only
    the entries older than the one mentioned in `versions` and restore
    all yanked versions.
    """
    if package not in versions:
        # This package probably was published after we created the
        # crates list and, therefore, will not appear among
        # dependencies.
        return
    newest_version = versions[package]
    with open(os.path.join(path, package)) as fp:
        try:
            lines = fp.read().splitlines()
        except:
            print(path, package)
            raise
    with open(os.path.join(path, package), 'w') as fp:
        for line in lines:
            data = json.loads(line)
            if 'yanked' in data and data['yanked']:
                data['yanked'] = False
                json.dump(data, fp, separators=(',', ':'))
            else:
                fp.write(line)
            fp.write('\n')
            if data['vers'] == newest_version:
                break


def main():
    with open(CRATE_LIST) as fp:
        crates = json.load(fp)['crates']
    versions = dict(
        (crate['Package']['name'], crate['Package']['version'])
        for crate in crates
    )
    for root, dirs, files in os.walk('.'):
        if root != '.':
            for package in files:
                rewrite(root, package, versions)
        else:
            dirs.remove('.git')


if __name__ == '__main__':
    main()
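For example, assuming the script is saved as restore_yanked.py next to the registry clone and CRATE_LIST has been adjusted, it could be run roughly as follows (both the script name and the clone directory are illustrative):
# Run from the root of the (forked) registry clone.
cd crates.io-index
python3 ../restore_yanked.py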
The registry that matches CrateList-all-2020-01-14.json can be found here. To try recompiling all failed packages with this registry, execute the following commands:
cargo run --release -- check-compilation --delete-failures
cargo run --release -- compile --purge-build-dir --custom-cargo-registry https://github.com/vakaras/crates.io-index
Moving Extracted Data
Since the extraction phase has quite different technical requirements from the later phases, it is common to execute these phases on different machines. The following command can be used to move deduplicated extracted data to a new directory for easy transfer:
cargo run --release -- move-extracted <target-dir>
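For example, to move the data into a directory that you will later copy to the analysis machine (the target path below is purely illustrative):
cargo run --release -- move-extracted /mnt/transfer/rust-corpus   # hypothetical target directory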
The command sleeps for 20 seconds between collecting the list of files to move and performing the actual move, to reduce the risk of moving half-written files.
It will also generate files.json files that are then used by one of the queries to select the builds for analysis. Please note that some packages (for example, sheesy-cli-4.0.7 and actix-net-0.3.0) are empty when compiled with default features, which results in files.json being empty.
Extracting Data from a Private Rust Project
This section shows how to extract data from a private Rust project.
Building Qrates
The first step is to check out and compile Qrates:
git clone https://github.com/rust-corpus/qrates.git
cd qrates
git submodule update --init
cargo build --all --release
After a successful build, the target/release directory should contain an executable file called rustc. We will extract the information by using this special rustc to compile the project. To do so, we need to set the environment variable RUSTC to contain its path:
export RUSTC="$(pwd)/target/release/rustc"
We also need to set the environment variable SYSROOT to contain the sysroot of the Rust version we used to compile Qrates, and LD_LIBRARY_PATH to contain the lib directory in SYSROOT:
export SYSROOT="$(rustc --print sysroot)"
export LD_LIBRARY_PATH="$SYSROOT/lib"
We also need to create a directory to store the extracted data and set the environment variable CORPUS_RESULTS_DIR to point to it:
mkdir -p ../workspace/rust-corpus/
export CORPUS_RESULTS_DIR="$(pwd)/../workspace/rust-corpus/"
Compiling a Project
As an example, let's try to extract information from the master branch of the rand crate.
Clone the rand crate repository:
cd /tmp
git clone https://github.com/rust-random/rand.git
cd rand
Check that the environment variables RUSTC, SYSROOT, LD_LIBRARY_PATH, and CORPUS_RESULTS_DIR are set correctly.
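A quick way to verify this is to print them:
echo "$RUSTC"
echo "$SYSROOT"
echo "$LD_LIBRARY_PATH"
echo "$CORPUS_RESULTS_DIR"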
Compile the project:
cargo build
If the compilation was successful, the CORPUS_RESULTS_DIR directory should contain many bincode files:
$ ls $CORPUS_RESULTS_DIR
build_script_build_641a6913d88f2b1b.bincode ppv_lite86_89695c0a0a962fc8.bincode
build_script_build_679051cf1df6d8f8.bincode rand_0330e33c1ee64866.bincode
cfg_if_f903336a35b88a26.bincode rand_chacha_23c71b977e463cb8.bincode
getrandom_9a46159fdf341523.bincode rand_core_272caddfb637ce01.bincode
Creating the Database
To be able to run queries, the extracted information must be merged into a single database. Assuming you followed one of the previous sections for extracting files, you can create the database by running the following command from the directory in which you cloned Qrates:
cargo run --release -- update-database
This command expects to find the extracted files in the directory ../workspace/rust-corpus/. If you stored them somewhere else, you can specify the path to the workspace by using the --workspace argument.
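For example, if the extracted files were placed in /data/qrates-workspace/rust-corpus (a hypothetical location), the invocation would look roughly like this, assuming --workspace takes the workspace directory as its value:
cargo run --release -- update-database --workspace /data/qrates-workspace   # hypothetical workspace path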
Database Structure
As was mentioned in the introduction, the process is split into four phases:
- Extracting the information about each crate.
- Merging the information into a single database.
- Running queries over the database.
- Analysing the query results.
This design was motivated by the fact that the extraction phase is time-consuming (it can take up to a week for the entire crates.io) and depends on internal compiler APIs that are unstable and change often: we decided to make the extractor as minimal as possible and handle the main workload in the later phases. This is also reflected in the Datalog database schema, which is split into two parts:
- database/src/schema.dl – core data, which is generated by the extractor. Modifications made to this file require rerunning the extractor.
- database/src/derived.dl – derived data, or, in other words, data computed by the queries and stored in the database so that it can be reused by other queries.
The following subsections describe how the data is stored in the database and present some of the most important derived queries.
Database Format
The database is inspired by Nicholas D. Matsakis' blog post in which he suggested making the Rust compiler output Datalog, which could then be used to write interesting queries. schema.dl defines three kinds of elements:
- Types – we decided to use a strongly typed variant of Datalog; therefore, most elements have unique types.
- Interning tables – we use them for two main reasons:
  - To reduce the memory requirements by replacing complex objects such as strings with unique integers and using those instead.
  - To map a specific type such as Package to a generic type such as InternedString.
- Datalog relations – which, similarly to SQL tables, capture the relational information between program elements.
derived.dl can define additional relations.
From schema.dl and derived.dl, a procedural macro generates the code that manages the database. Most importantly, it generates the Tables object that is used by the extractor to store the extracted data and the Loader object that is used by the queries to load the data.
Fundamental Derived Queries
While, in theory, the core schema should be self-contained, the fact that changes to it require rerunning the extractor made us put some of the fundamental relations into derived.dl. The most important of these is the selected_builds relation, from which many other selected_* relations are derived. To understand the purpose of the selected_builds relation, we first need to explain the differences between packages, crates, and builds:
- Packages are archives stored on crates.io. Packages have versions and their names are unique within a registry. A single package can define one or more crates. Packages can also define feature flags that allow customizing the compilation by, for example, including some functions only if a specific feature is enabled.
- A crate is a unit of compilation. A crate needs to have a unique name within a package, but its name is not guaranteed to be unique within a package registry (for example, we found 50 crates named main in our dataset).
- A build is a crate from a specific version of a package compiled with specific configuration flags.
When compiling all packages from crates.io, we often have multiple builds coming from the same crate. For example, this can happen when two packages A and B depend on different versions (or different feature configurations) of a third package C. If we used all these builds in our analysis, we would get skewed results because the packages that have many versions and more feature flags would be included more often. Therefore, we defined a relation selected_builds that contains only a single build for each crate. The build is chosen by picking a build from the package version that is specified in CrateList.json. If there is more than one such build (for example, because of different feature flags), we choose the first one in the iteration order (basically at random). We also remove from selected_builds all builds from crates whose names start with build_script_ because they are the results of compiling build.rs files.
Running Existing Queries
To run all existing queries, execute:
cargo run --release -- query all
This will invoke the query all, which is a meta-query that runs all other queries. The queries are defined in manager/src/queries. You can find the documentation of what exactly each of them does in their doc-comments.
Most queries store results in CSV files that can be found in the ../workspace/reports directory.
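It should also be possible to run a single query by passing its name instead of all; for example, the prepare-builds query mentioned in the next chapter (the exact query names are documented in manager/src/queries):
cargo run --release -- query prepare-builds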
Add a New Query
This chapter shows how to define your own query. Adding a new query typically involves the following steps:
- Checking the database and determining what information needs to be loaded.
- Implementing the function that takes the information and computes the desired result.
- Writing the result into a CSV file.
- Registering the query in the queries list.
The following sections discuss each step in more detail. As an example, we use a query that finds the definitions of types that have raw pointers as fields and are not annotated with #[repr(C)] (manager/src/queries/non_tree_types.rs).
Database Structure
Before we can define a new query, we need to understand the database structure. The database schema is defined in two files:
- database/src/schema.dl – core data, which is generated by the extractor. Modifications made to this file require rerunning the extractor.
- database/src/derived.dl – derived data, or, in other words, data computed by the queries and stored in the database so that it can be reused by other queries.
From these schemas, the procedural macros generate various data structures and functions. For writing queries, the most important data structure is Loader, which allows loading the relations stored in the database as Rust vectors. A &Loader is passed as an argument to each query.
One very important derived relation is selected_builds, which is created from CrateList.json by the query prepare-builds. Since we can have more than one build of the same crate (for example, if the dependencies included different versions of a crate or the same crate with different configuration flags), the selected_builds relation stores which builds should be analysed by queries, so that duplicates are avoided in the analysis.
For our query, we are interested in three relations:
- types_adt_field – the relation between fields and their types.
- types_raw_ptr – the relation that contains all types that are raw pointers.
- selected_adts – the derived relation that contains the abstract data types (such as enum or struct) defined in selected_builds.
Computing the Relation
For your query, declare a new module in manager/src/queries/mod.rs. For example:
mod non_tree_types;
The module should contain the function query:
pub fn query(loader: &Loader, report_path: &Path) {
    // Query implementation.
}
Here, loader is the Loader object mentioned in the previous section and report_path is the folder in which we should store the CSV files.
Before we write the result to a CSV file, we will obtain a vector of types that contain raw pointer fields. We can do this via a simple Datalog query (we are using the Datapond library):
// Declare the output variable.
let non_tree_types;
datapond_query! {
    // Load the relations by using “loader”.
    load loader {
        relations(types_adt_field, types_raw_ptr),
    }
    // Specify that “non_tree_types” is the output variable.
    output non_tree_types(typ: Type)
    // Define the relation by using a Datalog rule:
    non_tree_types(adt) :-
        types_adt_field(.adt=adt, .typ=typ),
        types_raw_ptr(.typ=typ).
}
To generate a readable CSV file with this information, we need to traverse the list of all relevant ADTs, check for each of them whether it is one of the types from non_tree_types, and, if so, convert it to a human-readable format. To make the check more efficient, we can convert non_tree_types from a vector to a hash set. The code would be:
let non_tree_types: HashSet<_> = non_tree_types.elements.iter().map(|&(typ,)| typ).collect();
let non_tree_adts = selected_adts.iter().flat_map(
    |&(
        build,
        item,
        typ,
        def_path,
        resolved_def_path,
        name,
        visibility,
        type_kind,
        def_kind,
        kind,
        c_repr,
        is_phantom,
    )| {
        if non_tree_types.contains(&typ) {
            Some((
                build,
                build_resolver.resolve(build),
                item,
                typ,
                def_path_resolver.resolve(def_path),
                def_path_resolver.resolve(resolved_def_path),
                &strings[name],
                visibility.to_string(),
                &strings[type_kinds[type_kind]],
                def_kind.to_string(),
                kind.to_string(),
                c_repr,
                is_phantom,
            ))
        } else {
            None
        }
    },
);
Finally, we can write the results to the CSV file:
write_csv!(report_path, non_tree_adts);
The results will be written to the file ../workspace/reports/<query-name>/<iterator-variable>.csv.
Analysing Query Results with Jupyter
Jupyter Notebook is a web application commonly used by data scientists to analyse and visualise their data. If you have Docker installed (you can find the installation instructions here), you can start a local Jupyter instance as follows:
make run-jupyter
Note: Since using Docker requires root permissions, this command will ask for the sudo password.
The command will print to the terminal a message like this:
[C 10:33:17.685 NotebookApp]
To access the notebook, open this file in a browser:
file:///home/jovyan/.local/share/jupyter/runtime/nbserver-27-open.html
Or copy and paste one of these URLs:
https://4ad49d6251da:8888/?token=202176e7bd7283e90ba6321c58472d193f41e27ba0da2b41
or https://127.0.0.1:8888/?token=202176e7bd7283e90ba6321c58472d193f41e27ba0da2b41
Click on one of the links to open the notebook in your default browser. The notebook uses a self-signed certificate and, as a result, your browser will show an SSL error. Just ignore it.
If everything started successfully, you should see three folders listed: data, reports, and work. Click on reports. It should contain six files with the .ipynb extension; these are Python notebooks used to analyse the data presented in the paper.
After you open a notebook (for example, by clicking on Builds.ipynb), you can re-execute it by choosing Kernel → Restart & Run All. (Note that some assert statements in the notebooks assume the full dataset; feel free to comment them out.)