In 2022, I gave a talk at a virtual conference with an unforgettable name: HYTRADBOI, which stands for "Have You Tried Rubbing a Database On It?" The conference's goal was to discuss unconventional uses of database-like technology, and it featured many excellent talks.

My talk "How to Query (Almost) Everything" received copious praise. It describes the Trustfall query engine's architecture, and includes real-world examples of how my (now-former) employer relies on it to statically catch and prevent cross-domain bugs across a monorepo with hundreds of services and shared libraries. For example:

The talk's contents are still relevant today, with Trustfall now powering tools like cargo-semver-checks. And as far as I know, my former coworkers are still happily using everything described in the talk! As I am no longer employed at that company, everything in this post is written in a personal capacity, and may not represent the past or present opinions of my prior employer. Also, don't read too much into my departure — I was there for 7+ years, left on good terms, and many of my friends still work there. I hope you enjoy!

You can watch the talk video on the conference site, on YouTube, or embedded below. Read on for an annotated version of the talk, covering the same ideas in written form. (I believe Simon Willison coined the term "annotated talk" and described it on his blog. I like this idea, and I'm broadly aiming to follow the same approach.)


Problem: Querying non-database data is hard

Jump to this chapter in the video.

We're in a funny situation these days: queries that are trivial in a SQL database are painfully difficult over any other data source.

Let me give you an example.

What GitHub Actions are used in projects popular on HackerNews?

Diagram showing the data sources and connections needed to answer the query: "What GitHub Actions are used in projects popular on HackerNews?" Querying begins in the HackerNews API by loading the HackerNews front page, and moves to looking up its stories. Then, we switch to querying the GitHub API, resolving repositories and their Actions workflows. Finally, we have to parse workflow YAML files to extract their jobs, and from within, the GitHub Actions invocations we wanted to look up.

In principle, this should be simple! If we had this data in a SQL database, this query would just be a handful of JOIN and WHERE clauses — we should be able to write it in just a couple of minutes.

But we don't have this data in SQL. Instead, we have to cross-link the HackerNews API, the GitHub API, and the custom GitHub Actions workflow YAML format.

We have to consider: Is the API REST, GraphQL, protobuf, or some other format? Which client library is best? Are datetimes strings, or numbers counting seconds since the epoch? Are numbers sent as strings because they don't fit in JSON? And many other problems...

We then have to worry about a whole host of operational concerns on top of that.

And we haven't even gotten to actually running the query yet!

The end result is that this query becomes so annoyingly difficult that it might just not be worth doing at all.

Agenda: Querying everything is valuable, and easier than it looks

Jump to this chapter in the video.

In this talk, I want to convince you of two things.

First, why query (almost) everything? We'll show a case study about never debugging the same problem twice from my (now-ex) employer, Kensho. Second, how to query (almost) everything? We'll perceive everything as a graph, then pattern-match on it. This will include running our "GitHub Actions in HackerNews submissions" query.

First, that querying almost everything is super valuable. For this, I'll show you a case study with my former employer, Kensho. (I left Kensho at the end of 2022, after joining it as a tiny startup and seeing it grow over 7 years. But I gave this talk in April 2022, at which point Kensho was still my employer.)

Second, that querying almost everything is really easy! For that, I'll show you a demo of the query we were just talking about.

Never debug the same problem twice

Jump to this chapter in the video.

At Kensho, we found that debugging the same problem over and over again is a huge waste of time. We decided to solve that.

Never debug the same problem twice, and how "query everything" makes it easy.

Debugging the same problem more than once is the byproduct of many engineering and social challenges we had to overcome.

Engineering challenges: many systems, many failure modes, cross-system interactions. Social challenges: many teams, people forget things, people switch teams and companies.

Systems fail in complicated ways, people forget things, etc. It isn't obvious how to use "team A debugged something" to prevent team B from ever having to debug the same thing in the future.

This requires having perfect institutional memory across the entire company, forever.

To solve the problem, we need perfect institutional memory across the entire company, forever. Solution: query for reoccurrences of problems, and prevent them.

Our solution is to query for reoccurrences of known problems, then prevent them.

We have queries looking for about 40 known problems, and we keep adding one to two new queries per week. (Forty was the correct number at the time I gave this talk; by the time I left Kensho around 6 months later, it had grown substantially.)

Each time someone opens a pull request, we scan their code with all of these 40 queries. This takes about 0.3 seconds, even though we haven't done much optimization work to bring that number down.

Our experience with querying: we deployed querying for known issues approximately 7 months ago (as of 2022 when this talk was originally given). About 40 queries have been added since then: around 1-2 queries per week. Each pull request gets scanned with all queries, which takes about a third of a second. Querying is easy for people of all backgrounds: product managers and analysts have written queries too!

That third of a second in extra CI time has saved countless hours from being wasted on debugging.

Even more importantly, detecting problems earlier frees up infra and SRE resources! This is a bit subtle, so bear with me: say you have a dozen teams deploying services to production. Then one of those services suffers a prod incident — who bears the cost of resolving it? The team that owns the service, but also the infra and SRE teams!

Now extend this over the course of months: different services suffer incidents, but infra and SRE bear a cost each time while the rest of the burden is spread across different teams.

This tool reduces the number of prod incidents by catching buggy code earlier, and preventing it from being merged and deployed at all. While each service-owning team is a bit faster and happier, infra and SRE teams experience a massive positive effect: they get paged and interrupted less often, they have more cycles to ship further improvements, and they are happier too! This sets up a glorious positive feedback loop: infra and SRE have more cycles, leading to more infra upgrades and more checks, which in turn decreases incidents further and leads to even more cycles for everyone!

We also found that everyone can be successful at writing queries — our engineers, product managers, and analysts all had success with it as well.

The net result: we ship bigger things, faster. Since nobody likes debugging, everyone is happier too!

Another observation: this system changed our collective viewpoint from, loosely speaking, "incidents happen" to "wait, how could we have prevented that?" If preventing problems feels expensive, one doesn't try to prevent them and instead learns to live with them. But when prevention suddenly became orders of magnitude cheaper, we found ourselves writing new checks after every incident or near-miss.

Example: Python version mismatch in pyproject.toml vs Dockerfile

Jump to this chapter in the video.

Let me give you an example of what these queries look like.

Diagram of a query. The query starts by considering Python services, and looking up the configured Python version in their pyproject.toml config file. Then, it looks up the Dockerfile with which that service deploys, and checks the Python version used by the base image of that Dockerfile. Finally, it looks for mismatches between the Python version in pyproject.toml versus the Dockerfile's installed Python.

At one point we had to debug a situation where a Python service was deployed with an unexpected version of Python, which didn't match the version its pyproject.toml file specified.

To prevent such incidents going forward, we wrote the following query: "Look for Python services whose pyproject.toml config file specifies one Python version, but the Dockerfile used to deploy that service specifies a different Python version." Any results of running this query were clear bugs, which were promptly fixed and prevented from ever coming back.
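Here's a sketch of what that query could look like in Trustfall syntax. The vertex and edge names below (PythonService, pyproject, dockerfile, and so on) are hypothetical, since Kensho's internal schema isn't public, but the shape mirrors the diagram above: we tag the version from pyproject.toml, then filter the Dockerfile's version against it.

```graphql
{
  PythonService {
    name @output

    # The Python version requested in the service's pyproject.toml.
    pyproject {
      pythonVersion @tag(name: "configured_version") @output
    }

    # The Python version installed by the Dockerfile's base image.
    # Keep only services where the two versions disagree.
    dockerfile {
      baseImagePythonVersion
        @filter(op: "!=", value: ["%configured_version"])
        @output
    }
  }
}
```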

This is a query across three data sources:

The same diagram from the previous image, annotated with where each data point came from. The Python service information originates from an internal JSON-based service definitions file. The pyproject.toml and its configured Python version come from the TOML-based project configuration file for that service. The Python version used by the deployment image is read from the service's Dockerfile.

It's impossible to catch this issue without looking at all three data sources at once! By themselves, they all seem fine — a valid service definition, a normal pyproject.toml file, and a reasonable Dockerfile.

The issue is not with any of them individually.

The issue is that they won't work together. Only a cross-domain query can spot that.

Example: Python type hints that are never checked

Jump to this chapter in the video.

Here's another example:

Query diagram showing a query over Python packages. It looks for packages whose CI test configuration shows that the type-checking step is skipped. It also looks up the package's source code, checking directories recursively, and looking at each included Python source code file to see if any function definitions include type hints.

At one point we had to debug an internal Python package that had incorrect type hints, because the package's CI configuration skipped mypy type-checking, meaning that the type hints were never checked at all.

We then wrote the following query: "Look for Python packages whose test configuration says to not run type-checking, but whose source code directory (or any of its subdirectories) includes at least one Python file containing any code that specifies a type hint." Since those type hints are specified in the source code but are never checked, they might be wrong — so such code should not be allowed.
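As before, here's a hypothetical sketch of this query in Trustfall syntax, with illustrative vertex and edge names, using the @recurse directive to walk the directory tree:

```graphql
{
  PythonPackage {
    name @output

    # The CI config says the type-checking step is skipped.
    # Trustfall filters compare against query variables, so we
    # bind the $skipped variable to `true` when running the query.
    testConfiguration {
      typeCheckingSkipped @filter(op: "=", value: ["$skipped"])
    }

    # ...yet somewhere in the source tree, walked recursively,
    # there's a Python file with a type-hinted function.
    sourceDirectory {
      subdirectory @recurse(depth: 10) {
        pythonFile {
          functionDefinition {
            typeHint @filter(op: "is_not_null") @output
          }
        }
      }
    }
  }
}
```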

Any results produced by this query were clearly concerning, and we worked with the affected code owners to resolve them while also preventing this bug category from ever reoccurring.

Once again, this is a query across multiple data sources:

The same diagram from the previous image, annotated with where each data point came from. The Python package and CI test configuration again come from an internal package definitions file. The source code directories and Python files come from git source control and local files on the filesystem. The function definitions and type hints require parsing Python source into an abstract syntax tree (AST).

Viewed individually, each of those components seemed fine.

The issue was only detectable when considering them as a whole.

Real-world queries across many data sources are easy and practical

Jump to this chapter in the video.

As of the release of this talk in 2022, more than a dozen different data sources were available for querying in Kensho's system. Once again, note that this is a talk I gave in 2022 while still employed there. I am no longer a Kensho employee, and cannot speak for my former employer.

More than a dozen data sources available at Kensho: git + filesystem, GitHub, Kubernetes, Dockerfiles, Python ASTs, pyproject.toml and setup.py files, dependencies and lockfiles, test configuration, etc. It's straightforward to plug in a new data source: great onboarding project for new engineers and interns.

We found that adding a new data source is super straightforward, and makes for a great onboarding project for new engineers and interns on the team.

It also had some additional unexpected benefits!

Unexpected additional benefits: the system is handy for ad-hoc queries too! For example: "which services haven't had any commits in at least one year?" or "which internal Python packages aren't part of any lockfiles?" When these questions became easy to ask, we deleted nearly 300,000 lines of unused code across our monorepo.

As soon as it became easy to write queries like "which internal Python packages are never used in any service," we discovered and deleted nearly 300,000 lines of unused code across our monorepo! Most dead code elimination tools can only find dead code within a single unit, like a library or a service. Queries allowed us to write inter-component analyses instead.

I got quite good at the "dead code elimination via queries" game! By the time I left Kensho, about 6 months after giving this talk, I had the second-highest net-negative lines contributed metric: I had deleted more than 200k lines more than I had added. Not bad for 7 years of work! Now, you might be wondering: if I was second on the net-negative metric, who took first place? The answer: my manager, who once found and deleted a 600k line JSON file that someone had committed to the monorepo. I had no such silver bullets, so the gap was insurmountable :)

Basics of querying everything

Jump to this chapter in the video.

Having covered why querying almost everything is useful, let's see how to do it.

First, we need to perceive everything as a graph.

Perceive everything as a graph. A schema describes the dataset: vertices, properties, edges. Vertices have a type hierarchy: "GitHubRepository" is a flavor of "Webpage". This is equivalent to the relational model. A code example shows the definition of the "GitHubRepository" vertex type, defining it as implementing the "Webpage" interface, with properties "url", "owner", and "name", all of type non-nullable String, and an edge "workflows" that points to zero, one, or more "GitHubWorkflow" neighboring vertices. The "url" property comes from the "Webpage" interface, while "owner" and "name" are the GitHubRepository vertex type's own properties.

We'll have a schema that describes our dataset in terms of vertices, typed properties on those vertices, and edges between them. We'll also have a type hierarchy between the different vertex types, so we can say that a GitHubRepository vertex is a kind of Webpage.
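Written out, the schema snippet from the slide looks like this. Trustfall schemas use GraphQL SDL syntax; the GitHubWorkflow type's own definition is elided here.

```graphql
interface Webpage {
  url: String!
}

type GitHubRepository implements Webpage {
  # Inherited from the Webpage interface.
  url: String!

  # GitHubRepository's own properties.
  owner: String!
  name: String!

  # Edge pointing to zero, one, or more GitHubWorkflow vertices.
  workflows: [GitHubWorkflow!]
}
```

The interface relationship is what lets a query say "... on GitHubRepository" when traversing an edge whose type is the more general Webpage.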

This is equivalent to the relational (SQL) model. If you're comfortable with SQL, you'll feel right at home here: vertices are tables, properties are columns, edges are foreign keys, etc.

Here's what our "HackerNews meets GitHub" graph looks like:

Graph diagram showing a series of vertices having certain property values, and connected to each other by edges. There are two vertices of type HackerNewsStory, each with different scores and titles. One of them is connected to a Webpage vertex with some URL, while the other is connected to a GitHubRepository (a subtype of Webpage) with a different URL, and values for the "name" and "owner" properties as required by the schema. The latter HackerNewsStory vertex also has some comments, depicted as a tree of Comment vertices where the root comments are connected to the HackerNewsStory, and replying comments are connected to the comment to which they are replying.

We have some HackerNews stories that have titles and scores. Some of them have comments. The stories point to web pages, some of which are GitHub repositories, and so on.

In order to run a query against this graph, we need to do two things: define a pattern to look for, then select a set of properties to return from the vertices that match our pattern.

For example, let's look for trending HackerNews stories with a score of at least 50 points, which also point to a GitHub repository. For each match, say we want to know the GitHub repository URL and its story's HackerNews score.

The same graph diagram as before, updated to include a query structure we are attempting to pattern-match against the graph. The query looks for HackerNewsStory vertices with a score of more than 50 points, and connected to a GitHubRepository vertex. Both HackerNewsStory vertices have more than 50 points, but only one of them is connected to a GitHubRepository whereas the other points to a generic Webpage. The matching HackerNewsStory and GitHubRepository vertices are the result of the query, and their selected properties are returned as the query's outputs.

Taking a look at our data graph, we discover some vertices that match our pattern. We then output the values of the properties we were interested in.

Here's what this query looks like in Trustfall's query syntax:

The query is the following:

```graphql
{
  HackerNewsTop(max: 30) {
    ... on HackerNewsStory {
      score @filter(op: ">", value: ["$min_score"]) @output
      link {
        ... on GitHubRepository {
          url @output
        }
      }
    }
  }
}
```

We provide the query with the value 50 for its "min_score" variable. The "... on HackerNewsStory" clause selects only stories, discarding other kinds of HackerNews items such as job postings. The "@filter" clause applies the "greater than 50 points" predicate on the story's score. The "link" edge on HackerNews stories points to a vertex of type "Webpage", but we are only interested in the Webpage subtype "GitHubRepository" — we select only those by using the "... on GitHubRepository" clause. Since the Trustfall engine is pluggable, its front end can be swapped out to support any alternative query syntax as well.

First, we start with the current top submissions on HackerNews. Then:

- we keep only the story submissions, discarding other kinds of items such as job postings,
- we keep only stories whose score is above our minimum threshold,
- we follow each story's link edge, selecting only links that point to a GitHub repository, and
- we output each matching repository's URL together with its story's score.

Demo: running the "GitHub Actions in HackerNews submissions" query

Jump to this chapter in the video.

Let's demo Trustfall end-to-end by running the query that seemed so complicated at the start of the talk!

Call back to the query diagram in an earlier slide. It shows the data sources and connections needed to answer the query: "What GitHub Actions are used in projects popular on HackerNews?" Querying begins in the HackerNews API by loading the HackerNews front page, and moves to looking up its stories. Then, we switch to querying the GitHub API, resolving repositories and their Actions workflows. Finally, we have to parse workflow YAML files to extract their jobs, and from within, the GitHub Actions invocations we wanted to look up.

Written in Trustfall syntax, it's a bit more complex than our earlier example — but it's just more of the same operations we already looked at.

We again start with the top submissions on HackerNews, discarding everything that isn't a story submission. We filter the score just as before. Then we navigate across all the edges we need until we get to the GitHub Actions invoked by that repository's workflows.

The Trustfall query syntax used to run the aforementioned query. The query is quite complex and can be found at the following link: https://github.com/obi1kenobi/trustfall/blob/main/demo-hytradboi/example_queries/actions_in_repos_with_min_10_hn_pts.ron#L3-L31
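Its overall shape is roughly the following (edge and field names here are paraphrased for illustration; consult the linked file for the exact query):

```graphql
{
  HackerNewsTop(max: 30) {
    ... on HackerNewsStory {
      score @filter(op: ">=", value: ["$min_score"]) @output
      link {
        ... on GitHubRepository {
          url @output
          workflows {
            path @output
            jobs {
              name @output
              step {
                ... on GitHubActionsImportedStep {
                  uses @output
                }
              }
            }
          }
        }
      }
    }
  }
}
```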

Enough talk — let's see it run! (You can try it yourself by grabbing the code from GitHub!)

Demo animation of running that query. It queries both HackerNews and GitHub's APIs, and parses the discovered workflow YAML files. It returns the first 20 results in 6.36 seconds.


In just a few seconds, our query successfully executed dozens of requests against the HackerNews and GitHub APIs on our behalf, and returned all the data we requested. Our demo was hardcoded to return only 20 results, but Trustfall queries are lazily evaluated, so it isn't necessary to specify a particular number of desired results up front. Running a query returns an iterator that lazily yields query results. If the iterator is exhausted, all possible results were retrieved. If a partial result is sufficient, one may stop advancing the iterator at any time.
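Here's a sketch of what consuming that lazy iterator looks like in Rust. It assumes the trustfall crate's execute_query entry point and a hypothetical MyAdapter type standing in for an adapter over the HackerNews and GitHub APIs; exact signatures have shifted across versions, so treat this as illustrative rather than exact.

```rust
use std::{collections::BTreeMap, sync::Arc};

use trustfall::{execute_query, Schema};

// Illustrative sketch: `MyAdapter` is a hypothetical adapter over the
// HackerNews + GitHub APIs, and signatures may differ across versions.
fn first_twenty_results(
    schema: &Schema,
    adapter: Arc<MyAdapter>,
    query: &str,
) -> Result<(), Box<dyn std::error::Error>> {
    let mut variables = BTreeMap::new();
    variables.insert("min_score", 50i64);

    // Running the query returns a lazy iterator over result rows;
    // no HackerNews or GitHub requests happen until it is advanced.
    let results = execute_query(schema, adapter, query, variables)?;

    // Taking only the first 20 results means any remaining
    // API calls simply never happen.
    for row in results.take(20) {
        println!("{row:?}");
    }

    Ok(())
}
```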

Let's take a deeper look at the data we got back.

First, we see that we got data for multiple different GitHub repositories, each of which got a different number of points on HackerNews.

A closer look at a subset of the returned results, showing that multiple HackerNews submissions pointing to different GitHub repositories were returned. This proves that the query can successfully resolve multiple HackerNewsStory vertices from the HackerNews front page.

Some of the repositories have multiple jobs within the same GitHub Actions workflow file.

A closer look at a subset of the returned results, showing that multiple jobs within one of the produced YAML workflow files matched our query. This proves that the query can successfully resolve multiple jobs vertices for a given workflow.

Some repositories also have multiple different workflow files.

A closer look at a subset of the returned results, showing that multiple workflow files were discovered in one of the matching GitHub repositories. This proves that the query can successfully resolve multiple workflow vertices for a given GitHub repository.

The diversity in the returned data shows how our query successfully captured the semantics of the domain we were querying.

How Trustfall executes queries

Jump to this chapter in the video.

The best way to think about Trustfall is that it's like "LLVM for databases." It's structured like a traditional compiler:

How queries execute: a query is accepted by Trustfall's current front-end, which transforms it into intermediate representation (IR). The IR is then lazily evaluated by an interpreter. The interpreter is generic — it can operate over any data source — by virtue of having a swappable adapter plugin component that connects it to the underlying API or data format.

Incoming queries are lexed, parsed, and translated into intermediate representation (IR) in the front end of the compiler.

The IR represents the query without tying it to the original syntax, to any specific data source, or to any particular way of executing the query.

Currently, the IR is executed by a lazily-evaluated interpreter. Traditionally, this is called a compiler "backend," since it consumes IR and speaks to an underlying system.

Trustfall is a second-generation system and a Rust rewrite of its Python-based predecessor, graphql-compiler, which saw widespread use at Kensho for many years. This post from Kensho's blog shows off some of graphql-compiler's capabilities: it could compile directly to several different database query languages, such as OrientDB MATCH and the SQL dialects for Postgres, Oracle, and Microsoft SQL Server. Its backends also supported estimating query costs, automatic query parallelization and pagination, and many other cool tricks.

Unfortunately, being a fairly old Python project (originally started in Python 2!) meant that the graphql-compiler codebase had accumulated a fair bit of technical debt. Rewriting it in Rust under the name Trustfall offered an opportunity to revisit old design decisions with almost 10 years of hindsight, and substantially improve maintainability and performance in the process.

The same diagram as in the earlier slide, augmented with some possible future extensions: a future front-end that accepts alternative query languages, and a future back-end that executes IR by, for example, compiling it directly into SQL.

While Trustfall doesn't currently implement all the bells and whistles of graphql-compiler, its design is capable of supporting them in the future. We could add alternative front-ends to accept different query languages, such as GraphQL or SQL. We could also add alternative backends, such as ones that execute queries by compiling them to SQL first.

Hence, like an LLVM for databases.

Why plugging in a new dataset is easy

Jump to this chapter in the video.

All components we've described so far are completely agnostic of the specific data source providing the data for our queries. They are sizable — tens of thousands of lines of code — but they are "write once, use anywhere!"

The same diagram as in the earlier slide, with additional annotations on top. The front-end, IR, and backends such as the interpreter are all agnostic of the dataset being queried. Comprising on the order of 50,000 lines of code, they represent the bulk of the code in the system — but that code is "write once, use anywhere." The adapter component is the only part specific to the given dataset, representing the "plumbing" for the underlying API client or data format. This component is usually on the order of 500 lines of code, and can often be partially or mostly autogenerated from machine-readable format descriptions such as OpenAPI specifications.

The only component that isn't dataset-agnostic is the adapter module, which acts as plumbing between the interpreter and the underlying API or data file format that contains the data being queried.

The adapter is a massively smaller and simpler piece of code than Trustfall itself — usually only a few hundred lines. It provides a schema — the dataset's vertex types, properties, and edges — which can often be autogenerated, e.g. from a database schema or an OpenAPI specification.

For example, the rustdoc JSON adapter that powers cargo-semver-checks can be found in this repository. Its schema is here, and it needs to implement the Trustfall Adapter API defined here. Its Adapter implementation is here. This is one of the more complex adapter implementations, since it needs to precisely model most of the Rust language's semantics for purposes of semver-checking! It was not built monolithically all at once, and instead grew in scope organically over several years.

Then, the adapter implements four functions that cover operations over that schema:

Why plugging in a new dataset is easy: we need to write or auto-generate a schema, and implement four functions that make up the Adapter interface. One of the four functions is used to resolve the entrypoints into the query and produce an initial set of vertices from where querying begins — for example, getting the HackerNews items on the front page. Another function resolves the property values of vertices, such as the score of a given HackerNews story. A third resolves edges, such as the "link" edge of HackerNews stories, and returns the neighboring vertices along that edge. The final, fourth function, performs an "isinstance()" vertex subtyping check, such as "is this Webpage vertex actually a GitHubRepository?"
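In Rust terms, a simplified version of that interface might look like the sketch below. This is not the exact Adapter trait from the trustfall crate (the real one threads lazily-evaluated query contexts through every call, plumbing that's elided here), but the four operations are the same:

```rust
/// A simplified, illustrative take on the four adapter operations.
/// The real `trustfall` Adapter trait is more elaborate: it passes
/// query contexts through each call to keep evaluation lazy.
trait SimplifiedAdapter {
    /// The adapter's vertex representation, e.g. an enum with
    /// Story, Webpage, and GitHubRepository variants.
    type Vertex: Clone;

    /// Produce the starting vertices for a query entrypoint,
    /// e.g. the items currently on the HackerNews front page.
    fn resolve_starting_vertices(
        &self,
        edge_name: &str,
    ) -> Box<dyn Iterator<Item = Self::Vertex>>;

    /// Resolve a property of a vertex, e.g. a story's score.
    fn resolve_property(
        &self,
        vertex: &Self::Vertex,
        property_name: &str,
    ) -> PropertyValue;

    /// Resolve an edge to its neighboring vertices, e.g. the "link"
    /// edge from a story to the web page it points to.
    fn resolve_neighbors(
        &self,
        vertex: &Self::Vertex,
        edge_name: &str,
    ) -> Box<dyn Iterator<Item = Self::Vertex>>;

    /// The "isinstance()" check from the slide:
    /// is this Webpage vertex actually a GitHubRepository?
    fn resolve_coercion(&self, vertex: &Self::Vertex, subtype_name: &str) -> bool;
}

/// Placeholder property value type for this sketch.
enum PropertyValue {
    Null,
    Int(i64),
    Text(String),
    // ...
}
```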

Having written or auto-generated a schema, and having implemented these four functions, we're ready to query any dataset with a fast, modern query language!

We've only scratched the surface!

Jump to this chapter in the video.

Having all our data available in a single query system gives us a tremendous amount of leverage to build interesting tools. In this talk, we've only just scratched the surface of what is possible with this technology. Today, the cargo-semver-checks linter for semantic versioning in Rust is another example of the possibilities Trustfall unlocks.

I'll give a couple of brief examples of some of the more unexpected use cases.

Our examples so far have only scratched the surface of what Trustfall can do! To give you a small taste, it can also be used to query schemas and answer questions like "what vertex types have string-typed properties whose names end in 'date'?" It can also query queries, or query plans, such as asking "which of our queries have filters on property X of type Y?" It also offers record-and-replay query tracing for easy query debugging and auditability, and much more!

One of them is our ability to query schemas, queries, and query plans.

Just like we can query the data behind an API, we can also query the shape of the data itself. We can ask questions like:

- Which vertex types have string-typed properties whose names end in "date"?
- Which of our queries have filters on property X of type Y?
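For a taste, here's a hypothetical Trustfall query over a schema-introspection dataset; the VertexType vertex and its property edge are illustrative names, not a published schema:

```graphql
{
  VertexType {
    name @output

    # Find string-typed properties whose names end in "date":
    # bind $suffix to "date" and $string_type to "String".
    property {
      name @filter(op: "has_suffix", value: ["$suffix"]) @output
      type @filter(op: "=", value: ["$string_type"])
    }
  }
}
```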

The other example is that we can use record-and-replay query tracing to help us understand, audit, and debug our queries.

In fact, this is a big part of how Trustfall's own test suite works! We record traces that capture every operation performed as part of executing all our test queries, then replay the traces against our interpreter as a snapshot test to ensure its behavior hasn't changed. The traces are a linearized, human-readable record that shows which data was produced when, and are perfect for ensuring the correctness of Trustfall and its adapters.

If you'd like to dig deeper, check out the Trustfall repository on GitHub, which includes the code for the demo shown in this talk.

Trustfall gives us an unprecedented amount of leverage, unlike anything open-source tools could call upon before. It's a foundational piece of infrastructure for powerful tools, and so far, we've only just scratched the surface!

If you liked this post, consider subscribing to my blog or following me on Mastodon, Bluesky, or Twitter/X. I'm giving a related talk at the UA Rust Conference 2024, so be sure to check it out!