Qrates: Analysing Rust Code Corpus

Qrates is a tool for running large-scale analyses of Rust code. To be scalable, the process is split into four phases:

  1. Data extraction. The extractor is a modified version of the Rust compiler that saves information about the compiled crate to a file so that it can be easily accessed later.
  2. Database creation. To be able to run queries that span multiple crates, the information from multiple files needs to be merged into a single database.
  3. Queries. The content of the database can be queried by using Rust programs. The procedural macros from Datapond can be used to write fixed-point computations in Datalog.
  4. Query results analysis. Typically, the query results are saved as CSV files so that they can be easily visualized by using data analysis tools such as Pandas.

Compiling Qrates

Note: These instructions were tested on Ubuntu 18.04 and 20.04.

Install dependencies:

sudo apt install build-essential curl git libssl-dev

Install Rust:

curl https://sh.rustup.rs -sSf | sh
source $HOME/.cargo/env

Clone the repository and all its dependencies:

git clone https://github.com/rust-corpus/qrates.git
cd qrates
git submodule update --init

Build the project in release mode:

cargo build --all --release

Extracting Data from Crates Published on crates.io

This section shows how to extract data from crates published on crates.io.

Warning: The extraction involves compiling the crates and may therefore result in running potentially malicious code on your system. Make sure to compile the code on a virtual machine or in some other throw-away environment.

This section assumes that you have already compiled Qrates. You can find the instructions on how to do that here.

Obtaining the List of Packages

The second step is to select the packages from which we want to extract the data. (Note that a package uploaded on crates.io can contain several crates; for example, it is common for executables to be split into an executable crate and a library crate.) This list of packages should be stored in the file CrateList.json. For example, if you want to analyse the package rand, create CrateList.json with the following contents:

{
  "creation_date": {
    "secs_since_epoch": 1579271281,
    "nanos_since_epoch": 847419904
  },
  "crates": [
    {
      "Package": {
        "name": "rand",
        "version": "0.7.3"
      }
    }
  ]
}

If you want to analyse all packages on crates.io, you can rename CrateList-all-2020-01-14.json to CrateList.json. That file contains the latest versions of all packages that were published on crates.io as of 2020-01-14. Please note that compiling all packages requires at least 1 TB of hard drive space, and running queries on such a large dataset may require up to 200 GB of RAM. The dataset CrateList-top-200-2020-01-17.json contains the 200 most downloaded packages on crates.io; this dataset should be analysable on a laptop with 8 GB of RAM.

You can also create the list with the latest versions of all packages by running the following command:

cargo run --release -- init-all

Compiling the Packages

Note: Instead of compiling the packages yourself, you can also download the extracted data from here.

Qrates uses the Rustwide library for compiling packages. Please see the Rustwide documentation for the system requirements; most importantly you need to have Docker installed (you can find the installation instructions here).

You can start the compilation as follows:

mkdir ../workspace
cargo run --release -- compile

This command may fail with a permission error if the user does not have the necessary permissions to communicate with the Docker daemon. In that case, use sudo:

sudo env "PATH=$PATH" $(which cargo) run --release -- compile

Attempting to compile all packages from crates.io on a Lenovo ThinkPad T470p takes about a week. You can check how many packages have already been successfully compiled by running the following command:

ls ../workspace/rust-corpus/*/success | wc -l

Note: it is likely that the number of successfully compiled packages will be smaller than the one we reported in the paper because some of the packages were removed from crates.io in the meantime.

Checking Compilation Results

The overview of compilation errors can be seen by running the following command:

cargo run --release -- check-compilation

It will print statistics on how many crates failed to compile for some common reason, how many most likely failed due to a bug in the extractor, and how many failed for a yet unrecognised reason. It will also print five paths for each of the latter groups.

Note: The classification is implemented in manager/src/compilation_checks.rs.

If you are using an older CrateList, it is very likely that many crates will fail to compile because their dependencies were removed from the registry (they were “yanked”). One workaround for this problem is to fork the crates.io registry and restore the removed crates by setting their yanked flag back to false. It is also recommended to remove all package versions that are newer than the ones in the crate list. Both of these operations can be done by executing the following script from the root directory of the registry repository:

#!/usr/bin/python3

import json
import os

CRATE_LIST = '<path-to>/CrateList-all-2020-01-14.json'

def rewrite(path, package, versions):
    """Rewrite the cargo registry entry for `package` to contain only
    the entries older than the one mentioned in `versions` and restore
    all yanked version.
    """
    if package not in versions:
        # This package probably was published after we created the
        # crates list and, therefore, will not appear among
        # dependencies.
        return
    newest_version = versions[package]
    with open(os.path.join(path, package)) as fp:
        try:
            lines = fp.read().splitlines()
        except:
            print(path, package)
            raise
    with open(os.path.join(path, package), 'w') as fp:
        for line in lines:
            data = json.loads(line)
            if 'yanked' in data and data['yanked']:
                data['yanked'] = False
                json.dump(data, fp, separators=(',', ':'))
            else:
                fp.write(line)
            fp.write('\n')
            if data['vers'] == newest_version:
                break

def main():
    with open(CRATE_LIST) as fp:
        crates = json.load(fp)['crates']
        versions = dict(
            (crate['Package']['name'], crate['Package']['version'])
            for crate in crates
        )

    for root, dirs, files in os.walk('.'):
        if root != '.':
            for package in files:
                rewrite(root, package, versions)
        else:
            dirs.remove('.git')

if __name__ == '__main__':
    main()

The registry that matches CrateList-all-2020-01-14.json can be found here. To try recompiling all failed packages with this registry, execute the following commands:

cargo run --release -- check-compilation --delete-failures
cargo run --release -- compile --purge-build-dir --custom-cargo-registry https://github.com/vakaras/crates.io-index

Moving Extracted Data

Since the extraction phase has quite different technical requirements from the later phases, it is common to execute these phases on different machines. The following command can be used to move deduplicated extracted data to a new directory for easy transfer:

cargo run --release -- move-extracted <target-dir>

The command sleeps for 20 seconds between collecting the list of files to move and performing the actual move, to reduce the risk of moving half-written files.

It will also generate files.json files that are then used by one of the queries to select the builds for analysis. Please note that some packages (for example, sheesy-cli-4.0.7 and actix-net-0.3.0) are empty when compiled with default features, which results in files.json being empty.

Extracting Data from a Private Rust Project

This section shows how to extract data from a private Rust project.

Building Qrates

The first step is to check out and compile Qrates:

git clone https://github.com/rust-corpus/qrates.git
cd qrates
git submodule update --init
cargo build --all --release

After a successful build, the target/release directory should contain an executable file called rustc. We will extract the information by using this special rustc to compile the project. To do so, we need to set the environment variable RUSTC to contain its path:

export RUSTC="$(pwd)/target/release/rustc"

We also need to set the environment variable SYSROOT to contain the sysroot of the Rust version we used to compile Qrates, and LD_LIBRARY_PATH to contain the lib directory inside SYSROOT:

export SYSROOT="$(rustc --print sysroot)"
export LD_LIBRARY_PATH="$SYSROOT/lib"

We also need to create a directory to store the extracted data and set the environment variable CORPUS_RESULTS_DIR to point to it:

mkdir -p ../workspace/rust-corpus/
export CORPUS_RESULTS_DIR="$(pwd)/../workspace/rust-corpus/"

Compiling a Project

As an example, let's try to extract information from the master branch of the rand crate.

Clone the rand crate repository:

cd /tmp
git clone https://github.com/rust-random/rand.git
cd rand

Check that the environment variables RUSTC, SYSROOT, LD_LIBRARY_PATH, and CORPUS_RESULTS_DIR are set correctly.

Compile the project:

cargo build

If the compilation was successful, the CORPUS_RESULTS_DIR directory should contain many bincode files:

$ ls $CORPUS_RESULTS_DIR
build_script_build_641a6913d88f2b1b.bincode  ppv_lite86_89695c0a0a962fc8.bincode
build_script_build_679051cf1df6d8f8.bincode  rand_0330e33c1ee64866.bincode
cfg_if_f903336a35b88a26.bincode              rand_chacha_23c71b977e463cb8.bincode
getrandom_9a46159fdf341523.bincode           rand_core_272caddfb637ce01.bincode

Creating the Database

To be able to run queries, the extracted information must be merged into a single database. Assuming you followed one of the previous sections for extracting files, you can create the database by running the following command from the directory in which you cloned Qrates:

cargo run --release -- update-database

This command expects to find the extracted files in the directory ../workspace/rust-corpus/. If you stored them somewhere else, you can specify the path to the workspace by using the --workspace argument.

Database Structure

As was mentioned in the introduction, the process is split into four phases:

  1. Extracting the information about each crate.
  2. Merging the information into a single database.
  3. Running queries over the database.
  4. Analysing the query results.

This design was motivated by the fact that the extraction phase is time consuming (it can take up to a week for the entire crates.io) and depends on internal compiler APIs that are unstable and change often: we decided to make the extractor as minimal as possible and handle the main workload in the later phases. This is also reflected in the Datalog database schema, which is split into two parts:

  1. database/src/schema.dl – core data, which is generated by the extractor. Modifications made to this file require rerunning the extractor.
  2. database/src/derived.dl – derived data, or, in other words, data computed by the queries and stored in the database so that it can be reused by other queries.

The following subsections describe how the data is stored in the database and present some of the most important derived queries.

Database Format

The database format is inspired by a blog post by Nicholas D. Matsakis in which he suggested making the Rust compiler output Datalog, which could then be used to write interesting queries. schema.dl defines three kinds of elements:

  1. Types – we decided to use a strongly typed variant of Datalog; therefore, most elements have unique types.
  2. Interning tables – we use them for two main reasons:
    1. To reduce the memory requirements by replacing complex objects such as strings with unique integers and using them instead.
    2. To map a specific type such as Package to a generic type such as InternedString.
  3. Datalog relations – which, similarly to SQL tables, capture the relational information between program elements.

derived.dl can define additional relations.

From schema.dl and derived.dl, a procedural macro generates the code that manages the database. Most importantly, it generates the Tables object that is used by the extractor to store the extracted data and the Loader object that is used by the queries to load the data.
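
To make these three kinds of elements more concrete, here is a small hand-written Rust sketch of what the generated code conceptually provides. It is only an illustration: the actual types, names, and methods are generated from schema.dl and derived.dl and may look different.

// Conceptual sketch only; the real definitions are generated by the
// procedural macro from schema.dl and derived.dl.
use std::collections::HashMap;

// A strongly typed element: a thin wrapper around an integer id.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Package(u64);

// An interned string id, as handed out by an interning table.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct InternedString(u64);

// An interning table replaces complex objects (here, strings) with
// compact ids that are cheap to store and compare.
#[derive(Default)]
pub struct StringInterner {
    ids: HashMap<String, InternedString>,
    values: Vec<String>,
}

impl StringInterner {
    pub fn intern(&mut self, value: &str) -> InternedString {
        if let Some(&id) = self.ids.get(value) {
            return id;
        }
        let id = InternedString(self.values.len() as u64);
        self.values.push(value.to_string());
        self.ids.insert(value.to_string(), id);
        id
    }
}

// A Datalog relation is stored as a vector of typed tuples, much like the
// rows of an SQL table; here, a relation mapping packages to their
// (interned) names.
pub struct PackageNames {
    pub rows: Vec<(Package, InternedString)>,
}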

Fundamental Derived Queries

While, in theory, the core schema should be self-contained, the fact that changes to it require rerunning the extractor made us put some of the fundamental relations into derived.dl. The most important such relation is selected_builds, from which many other selected_* relations are derived. To understand the purpose of the selected_builds relation, we first need to explain the differences between packages, crates, and builds:

  • Packages are archives stored on crates.io. Packages have versions, and their names are unique within a registry. A single package can define one or more crates. Packages can also define feature flags that allow customizing compilation by, for example, including some functions only if a specific feature is enabled.
  • A crate is a unit of compilation. A crate needs to have a unique name within a package, but its name is not guaranteed to be unique within a package registry (for example, we found 50 crates named main in our dataset).
  • A build is a crate from a specific version of a package compiled with specific configuration flags.

When compiling all packages from crates.io, we often obtain multiple builds of the same crate. For example, this can happen when two packages A and B depend on different versions (or different feature configurations) of a third package C. If we used all these builds in our analysis, we would get skewed results because packages that have many versions and more feature flags would be counted multiple times. Therefore, we defined a relation selected_builds that contains only a single build for each crate. The build is chosen by picking a build from the package version that is specified in CrateList.json. If there is more than one such build (for example, because of different feature flags), we choose the first one in iteration order (essentially at random). We also remove from selected_builds all builds of crates whose names start with build_script_ because they are the results of compiling build.rs files.
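
The selection rule can be illustrated with a small, self-contained Rust sketch. This is not the actual implementation (the real rule is a Datalog query in derived.dl and distinguishes packages from crates properly); it merely restates the rule on plain data, with package and crate names conflated for brevity:

use std::collections::HashMap;

// Sketch of the build-selection rule: keep at most one build per crate,
// prefer builds coming from the package version listed in CrateList.json,
// and drop builds of build.rs files (crates named build_script_*).
fn select_builds<'a>(
    // (build id, crate name, package version) triples.
    builds: &[(u64, &'a str, &'a str)],
    // Crate name -> version from CrateList.json (simplified: keyed by crate).
    crate_list_versions: &HashMap<&str, &str>,
) -> Vec<u64> {
    let mut selected: HashMap<&'a str, u64> = HashMap::new();
    for &(build, crate_name, version) in builds {
        // Builds of build.rs files show up as crates named build_script_*.
        if crate_name.starts_with("build_script_") {
            continue;
        }
        // Keep only builds from the version specified in CrateList.json.
        if crate_list_versions.get(crate_name) != Some(&version) {
            continue;
        }
        // If several builds remain (for example, different feature flags),
        // keep the first one encountered in iteration order.
        selected.entry(crate_name).or_insert(build);
    }
    selected.into_values().collect()
}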

Running Existing Queries

To run all existing queries, execute:

cargo run --release -- query all

This will invoke the query all, which is a meta-query that runs all other queries. The queries are defined in manager/src/queries. You can find the documentation of what exactly each query does in its doc-comment.

Most queries store results in CSV files that can be found in the ../workspace/reports directory.

Add a New Query

This chapter shows how to define your own query. Adding a new query typically involves the following steps:

  1. Checking the database and determining what information needs to be loaded.
  2. Implementing the function that takes the information and computes the desired result.
  3. Writing the result into a CSV file.
  4. Registering the query in the queries list.

The following sections discuss each step in more detail. As an example, we use a query that finds the definitions of types that have raw pointers as fields and are not annotated with #[repr(C)] (manager/src/queries/non_tree_types.rs).

Database Structure

Before we can define a new query, we need to understand the database structure. The database schema is defined in two files:

  1. database/src/schema.dl – core data, which is generated by the extractor. Modifications made to this file require rerunning the extractor.
  2. database/src/derived.dl – derived data, or, in other words, data computed by the queries and stored in the database so that it can be reused by other queries.

From these schemas, the procedural macros generate various data structures and functions. For writing queries, the most important data structure is Loader, which allows loading the relations stored in the database as Rust vectors. A &Loader is passed as an argument to each query.

One very important derived relation is selected_builds, which is created from CrateList.json by the query prepare-builds. Since we can have more than one build of the same crate (for example, if the dependencies include different versions of a crate, or the same crate with different configuration flags), the selected_builds relation stores which builds should be analysed by the queries so that the analysis avoids duplicates.
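
For instance, a query that only needs the selected builds could look roughly like the sketch below. The accessor name load_selected_builds follows the load_<relation> pattern of the generated Loader, but we have not verified the exact name or return type here, so treat it as an assumption and check an existing query (or the generated code) before relying on it.

pub fn query(loader: &Loader, _report_path: &Path) {
    // Assumption: the generated Loader exposes one load_<relation>
    // accessor per relation defined in schema.dl and derived.dl.
    let _selected_builds = loader.load_selected_builds();
    // The returned rows can then be iterated and filtered like in the
    // examples shown in the following sections.
}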

For our query, we are interested in three relations:

  1. types_adt_field – the relation between fields and their types.
  2. types_raw_ptr – the relation that contains all types that are raw pointers.
  3. selected_adts – the derived relation that contains the abstract data types such as enum or struct defined in selected_builds.

Computing the Relation

For your query, declare a new module in manager/src/queries/mod.rs. For example:

mod non_tree_types;

The module should contain the function query:

pub fn query(loader: &Loader, report_path: &Path) {
    // Query implementation.
}

Here, loader is the Loader object mentioned in the previous section and report_path is the folder in which we should store the CSV files.

Before we write the result to a CSV file, we will obtain a vector of types that contain raw pointer fields. We can do this via a simple Datalog query (we are using the Datapond library):

// Declare the output variable.
let non_tree_types;
datapond_query! {
    // Load the relations by using “loader”.
    load loader {
        relations(types_adt_field, types_raw_ptr),
    }
    // Specify that “non_tree_types” is the output variable.
    output non_tree_types(typ: Type)
    // Define the relation by using a Datalog rule:
    non_tree_types(adt) :-
        types_adt_field(.adt=adt, .typ=typ),
        types_raw_ptr(.typ=typ).
}

To generate a readable CSV file with the information, we need to traverse the list of all relevant ADTs, check for each of them whether its type is one of the types from non_tree_types, and, if so, resolve it into a human-readable form. To make the check more efficient, we convert non_tree_types from a vector into a hash set (the tuples computed by the Datalog query are available through its elements field). The code would be:

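// Note: `selected_adts`, `build_resolver`, `def_path_resolver`, `strings`,
// and `type_kinds` are assumed to have been loaded from `loader` earlier in
// the query; see manager/src/queries/non_tree_types.rs for the complete code.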
let non_tree_types: HashSet<_> = non_tree_types.elements.iter().map(|&(typ,)| typ).collect();
let non_tree_adts = selected_adts.iter().flat_map(
    |&(
        build,
        item,
        typ,
        def_path,
        resolved_def_path,
        name,
        visibility,
        type_kind,
        def_kind,
        kind,
        c_repr,
        is_phantom,
    )| {
        if non_tree_types.contains(&typ) {
            Some((
                build,
                build_resolver.resolve(build),
                item,
                typ,
                def_path_resolver.resolve(def_path),
                def_path_resolver.resolve(resolved_def_path),
                &strings[name],
                visibility.to_string(),
                &strings[type_kinds[type_kind]],
                def_kind.to_string(),
                kind.to_string(),
                c_repr,
                is_phantom,
            ))
        } else {
            None
        }
    },
);

Finally, we can write the results to the CSV file:

write_csv!(report_path, non_tree_adts);

The results will be written to a file ../workspace/reports/<query-name>/<iterator-variable>.csv.

Analysing Query Results with Jupyter

Jupyter Notebook is a web application commonly used by data scientists to analyse and visualise their data. If you have Docker installed (you can find the installation instructions here), you can start a local Jupyter instance as follows:

make run-jupyter

Note: Since using Docker requires root permissions, this command will ask for the sudo password.

The command will print to the terminal a message like this:

[C 10:33:17.685 NotebookApp]

    To access the notebook, open this file in a browser:
        file:///home/jovyan/.local/share/jupyter/runtime/nbserver-27-open.html
    Or copy and paste one of these URLs:
        https://4ad49d6251da:8888/?token=202176e7bd7283e90ba6321c58472d193f41e27ba0da2b41
     or https://127.0.0.1:8888/?token=202176e7bd7283e90ba6321c58472d193f41e27ba0da2b41

Click on one of the links to open the notebook in your default browser. The notebook uses a self-signed certificate and, as a result, your browser will show an SSL warning. You can safely ignore it.

If everything started successfully, you should see three folders listed: data, reports, and work. Click on reports. It should contain six files with the .ipynb extension; these are Python notebooks used to analyse the data presented in the paper.

After you open a notebook (for example, by clicking on Builds.ipynb), you can re-execute it by choosing Kernel → Restart & Run All. (Note that some assert statements in the notebooks assume the full dataset; feel free to comment them out.)