Speeding up Rust semver-checking by over 2000x

February 07, 2023 semver rust optimization query everything

This post describes work in progress: how cargo-semver-checks will benefit from the upcoming query optimization API in the Trustfall query engine. Read on to learn how a modern linter works under the hood, and how ideas from the world of databases can improve its performance.

Today, cargo-semver-checks is good enough to prevent real-world semver violations, and fast enough to earn a spot in the release pipeline of most crates. Releases only happen every so often, and semver violations are so painful, that spending an extra few minutes on that preemptive semver check is worth it.

This is good, but isn't good enough:

Checking large crates can be slow. The very crates where automated semver-checking is the most valuable [Sidenote: The larger the API surface area, the easier it is for a semver violation to sneak by — many semver breaks aren't obvious.] are the least able to use it.
Every new check added to the tool makes it even slower. We don't want to choose between making the tool more capable and keeping it fast enough to be usable.
In the long term, cargo-semver-checks should run on every pull request (not every release), analyzing the API changes to advise code reviewers before major breaking changes are merged. That must not drag down iteration speed.

In this post, I'll show you how upcoming cargo-semver-checks versions will become ludicrously fast on even the biggest crates — without changing any of the lint rules.

From five hours to eight seconds — 2272x faster

Meet the gigantic crate called aws-sdk-ec2. It is home to ~240,000 items (types, functions, constants, etc.). For comparison, the syn crate is commonly said to take a long time to compile, but only contains ~14,200 items [Sidenote: All numbers and measurements include all crate features.] — aws-sdk-ec2 is 17 times bigger!

While semver-checking most crates finishes in a minute or so, for aws-sdk-ec2 that process took over 5 hours. [Sidenote: Throughout this post, we're ignoring rustdoc generation time, JSON parsing time, etc.] [Sidenote: You may have noticed that the slowdown is more than linear. Hold that thought 😉]

     Checking aws-sdk-ec2 v0.24.0 -> v0.24.0 (no change)
     Starting 40 checks, 0 unnecessary
    Completed [18178.109s] 40 checks; 40 passed, 0 unnecessary

Faster tools get used more often, so linters in the "overnight" performance tier are more or less useless. The optimizations in this post improve performance by several tiers:

    Checking aws-sdk-ec2 v0.24.0 -> v0.24.0 (no change)
    Starting 40 checks, 0 unnecessary
   Completed [   8.003s] 40 checks; 40 passed, 0 unnecessary

The slower a check previously ran, the more it benefited from the optimizations. [Sidenote: The verbose logs showing timings for each semver lint individually in each run are available here.] The most expensive lint in the tool is method_parameter_count_changed, and its runtime went from 4887.539s (~81.5min) to 1.086s — a speedup of 4500x. [Sidenote: This is entirely serial (single-core) speedup. Adding rayon here would make everything even faster, ~proportionally to the number of cores on our system. We'll save that for a future post 😉]

The best part: we don't have to change anything about how we develop cargo-semver-checks. In fact, the only meaningful change to cargo-semver-checks is a version bump in its dependencies! Don't take my word for it — here's the diff that applies the optimization to cargo-semver-checks.

To understand how that's possible, we need to dig into how cargo-semver-checks is designed.

TL;DR + section index

cargo-semver-checks queries today are slow because many items need to be matched across API versions to their counterparts with the same name. Currently, that's an O(n^2) process (the sweet spot of poor scaling), but can optimized down to O(n).

In a traditional database, the query engine and the data storage system are closely tied together. That makes optimizations easier, but limits the kinds of data that can be used with the system — and wouldn't be feasible here.

Instead, cargo-semver-checks is built on top of Trustfall, a query engine that supports querying any kind of data source by plugging it in via a special API. This post previews an upcoming addition to that Trustfall API: one that enables new kinds of optimizations within data providers, including the ones that would dramatically speed up cargo-semver-checks.

If you're already familiar with the cargo-semver-checks query-based architecture, Trustfall's "everything can be a database" approach, or how database indexes work, here's the list of sections so you can skip ahead:

Semver-checking is just a special kind of querying

In cargo-semver-checks, we treat finding semver violations as a query problem.

Traditional database or API queries look like "find Predrag's latest posts and how many likes they have." Our queries look like "find functions that existed in the previous version but don't exist in the new version," because that's a major breaking change. The dataset isn't the same, but the query structure isn't so different, no?

The query branches into two arms: one for the old version and one for the new version of the crate. Each version node is connected to a "function" node, which the query specifies must both have the same import path. The query also specifies that it expects a total of zero such matching function nodes in the new version of the crate — i.e. that no such functions exist.

When we run a query like that, most often it doesn't find anything. That's good: no semver violations of that flavor! If the query found anything, then either the new version is a new major release, [Sidenote: We actually check ahead of time what kind of version bump happened, and then we don't bother running queries that look for changes allowed by that version bump.] or that's an instance of a semver violation 🚨

The query-based approach to linting has several advantages.

One is that lints are spectacularly easy to write. Each cargo-semver-checks lint is not much more than:

a query (+ any arguments) to be run over the target crate's API information;
a setting for the minimum kind of version bump where the change is allowed, and
a few error message templates for displaying the findings to the user.

Another advantage is that the "what" and the "how" are separate. This is also called declarative programming: if you've written SQL, recall that you specify what data you'd like retrieved without specifying how to go about finding it. You don't write loops or implement join algorithms — you declare that a JOIN shall happen, and lo, it does! Meanwhile, the "how" is up to the query engine and is allowed to change, both to accomodate new data formats [Sidenote: The rustdoc JSON format is not stable. Most Rust versions ship with a new major version of the format, incompatible with prior ones. This separation allows cargo-semver-checks lints to be agnostic to the underlying rustdoc JSON format version, and thus support multiple format versions simultaneously.] and to apply optimizations.

With the right tools, anything can be a database

We cannot query semver and crate APIs with SQL. [Sidenote: Technically, we maybe could ... but would we really want to?]

Instead, cargo-semver-checks uses a query engine called Trustfall whose superpower is that it can query anything as if it were a database. [Sidenote: And without needing to ETL the data into a database, since that frequently isn't practical or desirable.] "Anything" is not really an exaggeration: APIs, databases, raw files like CSV or JSON, even ML models — or any combination of those — is all fair game, and something Trustfall has already powered in the real world. Here's a talk about querying databases, source code, and deployment configuration, and here's a playground for querying HackerNews' API directly from your browser — for example, "Which Twitter or GitHub users comment on HN stories about OpenAI?"

How do you plug in a data source into Trustfall? By implementing an Adapter! This is an interface that on one side understands how to access some dataset, and on the other side understands how to talk to Trustfall. [Sidenote: This is similar to ideas like Postgres Foreign Data Wrappers (FDW). However, as this is the only way to plug data into Trustfall, we must give it first-class treatment and cannot afford to treat it as a "nice to have" feature.] For cargo-semver-checks, the dataset being queried is rustdoc JSON, and the trustfall-rustdoc-adapter crate provides an adapter over that format.

The Adapter allows the "business logic" of linting to stay independent from the details of the rustdoc JSON format. You won't find any format-related logic inside cargo-semver-checks lints. Lints are queries over the rustdoc dataset, adapters make the dataset available for querying, and Trustfall handles the plumbing in between.

A diagram with three layers of blocks stacked on top of each other. The top layer is cargo-semver-checks, handling the linting logic. It sits on top of Trustfall, which handles the query logic. Trustfall sits on top of four side-by-side adapter blocks, each handling the data format logic for different Rust version's rustdoc format: Rust 1.65, 1.66, 1.67, and nightly.

When cargo-semver-checks asks Trustfall to execute a query, Trustfall compiles the query into a series of data-fetching or data-manipulation (filtering, aggregating, etc.) operations [Sidenote: A dataflow program, the same general idea as in Apache Spark and TensorFlow.] that can be executed on-demand ("lazily") to produce results. You've definitely seen and used this kind of interface before — it's built into Rust under the name Iterator. The data-fetching operations are executed by calling into the Adapter, whereas the data manipulation operations are not dependent on the data source so they are built into Trustfall itself.

Query optimization — as an API

Have you ever had a slow SQL query get faster because of a well-crafted CREATE INDEX command? We want to do something similar here: keep the semver queries the same, but speed them up by making it easier to look up the right records — O(1) time per record instead of the current O(n).

Our case is much trickier than SQL's, though. SQL databases both get to see the full query and also get to control how the data is stored and represented. In our case, those are separate components: Trustfall doesn't know if the data is coming from a file on local disk, a remote API or database, or from an engineer flipping a coin on-demand. In fact, the local disk might not even be accessible, since Trustfall may be running as WASM in a heavily restricted browser environment as in Trustfall Playground.

Some query optimizations aren't tied to the particular data shape or format, so Trustfall can apply them automatically on its own. For example, say an aggregation has a filter clause like "has no more than 3 elements," and Trustfall finds a fourth element inside the aggregation — it knows it can stop there and doesn't need to see if a fifth one exists too, since the predicate will never match again.

For all other kinds of optimizations, Trustfall can only offer information (via the adapter API) about what the query is doing. It's then up to the adapter to fetch the required data in the best way the underlying data source allows — for example, using a cache or index if one is available, or whenever possible choosing a faster, more targeted API endpoint instead of a slower, more general one.

Aside: Adapter API design goals

The API faces two competing objectives:

Low barrier to entry: We'll want adapters for all sorts of data sources. The API must be as simple to implement as possible, because we can't afford to write half a query engine for every new data source.
High ceiling: We can't afford to invest into adapters that will cease to scale with our needs. When a query gets too slow, the API must support adding whatever optimization is needed to make it fast again. [Sidenote: This is the origin of the name Trustfall: words are just words, so the only way to know for sure if it'll work is to try it. At that point, it had better work! The purpose of posts like these is to show that it does indeed work, even in extreme cases.]

To repurpose a famous phrase, the perfect API is not one where there is nothing left to add, but when there's nothing left to take away. We want the minimal possible API that can still support everything we might ever want.

That is why I was the right person to build Trustfall: because I've spent 8 years (and counting!) querying and optimizing all kinds of weird and wonderful data sources, and my background prior to that is in compilers and performance engineering.

Index lookup in `O(1)` vs full table scan in `O(n)`

We mentioned earlier that our queries look like "find functions that existed in the previous version but don't exist in the new version." Here's the simplest way to accomplish that, in pseudo-Rust:

let mut missing_functions = vec![];
for previous_func in previous_crate.functions.iter() {
    let import_path = &previous_func.import_path;

    // Loop over the current crate's functions
    // until we find the one we're looking for.
    let found_it = current_crate.functions
        .iter()
        .filter(|current_fn| {
            &current_fn.import_path == import_path
        })
        .any();

    if !found_it {
        missing_functions.push(previous_func);
    }
}

If the crate has n functions in each version, the total runtime here is O(n^2).

This is roughly what happens in a SQL database when no index is available — it's called a "full table scan." The same happens in cargo-semver-checks v0.17, and it's still fast enough for many crates.

But imagine we had a hashtable — an "index" — over the import_path property of functions. Then our pseudo-Rust would look like:

let mut missing_functions = vec![];
for previous_func in previous_crate.functions.iter() {
    let import_path = &previous_func.import_path;

    // Our hashtable of import paths, mapping
    // import paths -> the functions themselves.
    let index = &current_crate.import_path_index;

    // O(1) lookup instead of O(n)
    let found_it = index.contains_key(import_path);

    if !found_it {
        missing_functions.push(previous_func);
    }
}

In database terminology, we just switched to an "index scan." To check all n functions, the total cost here is only O(n) instead of O(n^2) — though there's also a small additional cost since we have to maintain that index. Definitely an improvement!

All our semver lints compare how some piece of the API worked before versus how it works now. An index that speeds up finding those "corresponding items" would make all lint queries run faster — hence the huge speedup!

Some lint queries even match multiple items at a time, for example: "for each type and for each method implemented on that type, ensure that method still exists on that type." For that kind of query, an index that maps (type, method_name) tuples to the method's implementation details would help as well.

The entire 2272x speedup is due to making it possible to create and use these two kinds of indexes. Of course, the magic is in applying the index optimization without any of the queries noticing!

How Trustfall makes this optimization possible

Quick recap:

cargo-semver-checks queries are slow because many items need to be matched across API versions to their counterparts with the same name. This is normally an O(n^2) process, but can be made O(n) by the use of indexes.
Unlike in a traditional database, our data provider (trustfall-rustdoc-adapter) and query engine (Trustfall) are two separate libraries that communicate over an API. This API must be easy to implement, yet powerful enough to apply necessary optimizations.

We now need two things to happen.

First, trustfall-rustdoc-adapter, the component that understands rustdoc JSON, needs to be updated to build our two indexes: one to look up items by name, and another to look up method implementations by type and method name. This is straightforward: after loading the rustdoc JSON, iterate over the items and construct the hashtables we need. [Sidenote: We're glossing over details here: multiple items may share the same import path, the same type can have multiple methods with the same name, etc. These are "be careful" details, not deal-breakers.] It all finishes in a fraction of a second, and we'll be able to reuse the indexes across multiple queries so their construction cost is further amortized.

Now, whenever Trustfall requests data over the Adapter API, trustfall-rustdoc-adapter needs to know whether any of the indexes can be used to help. The API allows a "fast path" optimization:

When Trustfall calls the adapter requesting an iterator over the items in a crate, the minimum valid implementation is to return an iterator over all the items.
Within the context of that request, the adapter can use the new optimization functionality to ask Trustfall: "Did the query have a specific item name it's looking for?" If yes — fast path — it applies the index and returns an iterator over only the matching items. [Sidenote: This "run the filter before resolving the edge" optimization is also called "predicate pushdown" or "filter pushdown." This refers to how the transformation looks when presented in the traditional tree diagram common in relational algebra, where leaf nodes get executed first and the optimization "pushes" the filter node down below the join node instead of keeping it above.] If not — slow path — it returns an iterator over all items as before.

In the current (prototype!) API, the adapter's fast path looks like this: [Sidenote: Here's the PR with the most recent iteration of the design.]

// Inside the code that resolves `item` edges, we ask:
// Will the query take the produced items,
// look up their `importable_path` edge, and then
// do an equality check on the `path` property there?
//
// If so, we'd like to look up that path value
// *before* resolving the item edge.
if let Some(value_resolver) = query_info
    .destination()                  // -> item vertex info
    .first_edge("importable_path")  // edge's own info
    .destination()                  // -> path vertex info
    .dynamic_field_value("path")    // implied property value
{
    // Fast path: we're looking up items by path.
    // The returned `value_resolver` can tell us
    // the specific path needed for each result.
    resolve_items_using_name_index(value_resolver, ...)
} else {
    // Slow path: the query is doing something else.
    // Resolve all items normally.
    resolve_all_items(...)
}

This fulfills our "low barrier to entry, but high ceiling" API design goal from the earlier section.

To start, adapters can completely ignore the query_info value given to them. The initial adapter implementation would only do the equivalent of resolve_all_items() here.

Over time, if some queries become too slow, the adapter can evolve to meet their needs: for example, by adding batching, caching, or indexes appropriate for the specific dataset and query. That improves query performance (perhaps by over 2000x like in this case!) while remaining otherwise invisible.

Conclusion

Performance is a feature — especially for a linter. Unfortunately, high performance can often come at the cost of inflexible and complex code that is unfriendly to newcomers and existing team members alike: it's often full of "sharp edges" that can cause regressions, and can be tedious to maintain because it's so intricate and non-obvious.

Trustfall lets cargo-semver-checks specify lints using a structured, declarative, type-checked query language, all without worrying about changes in the underlying rustdoc data format nor paying a steep performance cost. It's safe to add lints first and optimize as needed later — so much so, that the most common "new contributor" task in cargo-semver-checks is implementing a new lint.

It's refreshing to know that any necessary optimizations can be added later and without any query changes. As a maintainer of cargo-semver-checks, it gives me peace of mind.

Performance and convenience can go hand in hand after all.