A beginner-friendly introduction to compilers: follow along as we build a compiler from scratch, or fork the code on GitHub and add your own optimizations too! In this episode: value numbering helps track how values are used in the program.

Last time on Compiler Adventures, we implemented constant propagation: if x = 2 and y = 3 then add x y produces 5. While it did help detect a few no-ops, we saw the optimization quickly become ineffective as it got overwhelmed by too many Unknown values.

In this episode, we'll equip our compiler with the key tool for reasoning about those Unknown values. The starting code for this episode is on GitHub, on branch part3.

Dark wispy clouds over a row of massive radio telescope dishes stretching toward the horizon, their white paint standing out against the background in the fading light. A cloudy twilight over the giant radio telescopes of the Very Large Array, which has been helping humanity peer into the depths of the unknown for more than 40 years. In this episode, we'll also do some peering into Unknown. Source: NRAO/AUI/NSF, CC BY 3.0

Let's begin by solving the challenge from the last episode. When examining the following program, our compiler claimed that the program ends with an Unknown value in the w register: I'm introducing a small change in notation to avoid confusion: instead of Input(i), from now on I'll write Input_i. This will help us avoid confusing "i-th program input" with "input to the program with value i." When we know a program value represents the number i, we'll continue to use the notation Exact(i).

Let's simulate running this program:

Try simulating again with a different number as an input for inp w. Does the final value of w change? No — it's always 1! What's going on here?

Let's zoom in on the second-to-last instruction: add x w. The values that instruction operates on are w = Input_0 i.e. "the first program input value," and x = Exact(0) i.e. the number zero. After performing its operation x = x + w, our code from the previous episode decides the result is x = Unknown. This is technically true! We can't know what value x has exactly. But we can do better than Unknown!

We know x was zero before w got added to it: x = 0 + w, so it must be the case that x = w = Input_0 after the instruction. Imagine our compiler code was able to determine this as well, and thus computed x = Input_0 as the outcome of the addition instead of x = Unknown. Continuing to the next instruction: eql w x then compares w = Input_0 and x = Input_0. Those are clearly the same value! So the result of eql w x would be w = Exact(1), just as our hand-simulation of the program showed!

The key observation: Input_0 represents a specific fixed value. We can't know what number it represents until we run the program, but for a given execution of the program, it's always going to be the same fixed number: the first number in the input. This is why we can know that w = Input_0 and x = Input_0 are equal to each other, and therefore eql w x results in an output of w = Exact(1).

We managed to evaluate the eql instruction over values we don't know exactly! Neat, right?

Could we do the same thing with Unknown values as well?

What is known about Unknown?

Let’s set up an equivalent example to the last episode’s challenge, but with an eql w x instruction that evaluates two Unknown values instead of two inputs.

Say we have some long program that’s been running for a while, and all the registers currently have Unknown values. The next instructions in the program do the following:

Using the same reasoning as earlier in the post, we see that x and w must again have the same value at the point where they get compared, so the result of the equality check instruction should be w = Exact(1). And yet, looking at the last row of the printout, our compiler claims that comparing w = Unknown to x = Unknown produces an output of Unknown. What’s up with that?

Right now, Unknown in our compiler means “we know absolutely nothing about this value.” It does not refer to a specific-yet-unknown value, so we can’t say for sure if two Unknown values are equal or not equal to each other. This is unlike input values, where Input_0 refers specifically to the first program input, Input_1 refers to the second program input, etc.

The registers w = Input_0 and x = Input_0 compare equal to each other because they both refer to the same fixed but still-unknown value. When our compiler evaluated add x w for x = Exact(0) and w = Unknown, it found the result was x = Unknown — but it had no way to represent that this is the same Unknown that is also currently in w.

Just like we currently number program input values as Input_0, Input_1 and so on, what if we tried numbering every value we encounter in the program, including Unknown ones? Then, we could express that two registers hold the same Unknown value: e.g. “the fifth value in the program, which is Unknown.” For convenience, we’ll use the notation 5: Unknown to describe that “fifth program value, which is Unknown.”

We’ll apply this numbering scheme to all three kinds of values our compiler may encounter in the program: Exact, Input, Unknown. We’ll use a simple counter to ensure each new value gets its own unique number — let’s call this identifier a Vid. Whenever we know that an operation preserves the value of one of its operands (like adding a number to zero), we’ll keep the same value with the same Vid instead of making a new value with a new Vid.

Implementing our idea

To help us skip straight to the interesting parts, I've added about 40 lines of boilerplate and helper code behind the scenes. We now have a Program type that keeps track of which Vid numbers we've already used, and has three helper methods on it called new_exact_value(), new_unknown_value(), new_input_value() which ensure we correctly assign unique Vid numbers to newly-created values.

First, let's implement the Vid type that represents the unique value IDs, and insert it into our existing Value enum's variants.

In the previous episode, we defined the Value enum like this:

Excerpt from optimization.rs. See this code in context in the GitHub repo.

Let's change it to the following:

Excerpt from values.rs. See this code in context in the GitHub repo.

When considering two Value entities in a program, when can our compiler know for sure that they must represent the same number? It's in one of these cases:

Importantly, if two Value entities have different Vid fields, this does not mean they must always hold different numbers. Different Vid fields means that we don't know one way or the other whether the values are equivalent. For example, 5: Input_0 and 6: Input_1 have different Vid fields: 5 and 6. Depending on the inputs given to any run of the program, they may sometimes but not always happen to represent the same number.

Let's implement this logic in the == operator for Value:

Excerpt from values.rs. See this code in context in the GitHub repo.

We can now simplify our no-op detection code to take advantage of our new ability to check values for equality. Instead of looking for a hardcoded set of instructions and values, we can directly compare the register values before and after each instruction. If an instruction did not change the value of its target register and had no other side-effects, that instruction is a no-op by definition. We have to consider instruction side-effects here. For example, reading program input via an inp instruction affects state that is outside the w, x, y, z registers: the next number that will be read by the next inp instruction. Eliminating an inp instruction would cause subsequent inp to read different program inputs. Fortunately, our implementation already guarantees that inp instructions can never be considered no-ops: they are assigned a brand-new Vid which could not possibly have been shared with the value previously in the destination register. If a future extension to the MONAD language adds more instructions, we may have to update our no-op detection code. For example, consider a hypothetical print x instruction that outputs the value of the selected register x: it doesn't change the value of any register, and yet it isn't a no-op because of the side-effect of printing out a number.

So we can completely remove the is_instruction_no_op() function we previously wrote, and replace its use inside constant_propagation() with a simple equality check between the old and new values of the destination register of the current instruction:

Excerpt from optimization.rs. See this code in context in the GitHub repo.

constant_propagation() will also require a small refactor to generate Vid data for each Value. This is straightforward and uses the convenience functions I earlier mentioned I added "off screen." Here's what our final constant_propagation() implementation looks like:

Excerpt from optimization.rs. See this code in context in the GitHub repo.

Finally, we'll need to update evaluate_instruction() to help it avoid generating new Vid entries whenever it's possible to reuse one of the values on which the instruction operates. Recall the situation from the challenge program we're solving this week:

We want evaluate_instruction() applied to add x w to realize that the result in x is going to be that same Input_0 value that is also in w, and not some new value for which nothing is known. Each of the possible instructions has different special cases, so we'll split evaluate_instruction() into per-instruction cases:

Excerpt from optimization.rs. See this code in context in the GitHub repo.

Then, we'll implement the per-instruction functions, each with their own special cases. For example, here's evaluate_add():

Excerpt from optimization.rs. See this code in context in the GitHub repo.

Here's evaluate_mul(), which has a few more cases to consider:

Excerpt from optimization.rs. See this code in context in the GitHub repo.

Most of the other instructions follow the same pattern, so I won't include all of them in this post — you can find them all in the GitHub repo. The exception is evaluate_equal() which has just a touch more nuance:

Excerpt from optimization.rs. See this code in context in the GitHub repo.

And like that, we've implemented value numbering. Specifically, we've implemented global value numbering. The difference between global and local value numbering is whether the numbered values apply to the entire program, or are valid only within a given function, if-statement, loop, or similar program component (called a "basic block"). The MONAD language does not have any functions, if-statements, or loops — making the entire program a single basic block — so our value numbering is global because it applies to the entire program. This simplicity is part of why this series chooses MONAD as the language to compile and optimize: its simplicity makes it easier to explain compiler ideas one at a time without requiring a lot of prerequisite knowledge. Case in point: basic blocks are a fundamental idea in compilers and a prerequisite for building a compiler for any language like Python or Rust or C/C++, yet three episodes into the blog series, the first mention of the term "basic block" is in this entirely-optional footnote. As expected, our compiler now figures out that our challenge program ends with the number 1 in the x register instead of an Unknown value! 🎉

We're again interested in the "real-world" impact of our optimization, as measured on the standard Advent of Code MONAD input program.

First, what kind of impact do we expect to find?

Without value numbering, in our previous Compiler Adventure we saw our compiler was completely unable to reason about any value that is not Exact. Operations on Input values would produce Unknown values as we saw in the challenge program, and Unknown represented a total absence of any information at all.

In fact, the lack of value numbering meant our compiler was incapable of even simple reasoning like "registers don't change their value unless they are written to." For example, say the value in register w was Unknown and the compiler was considering an instruction that did not use w, such as mul x y. The value of w after that instruction would remain Unknown — but since Unknown values had no way to be compared for equality, even the "before" and "after" Unknown values of w could not be considered equal.

This severely limited our compiler's ability to optimize MONAD code in the previous episode. In this episode, we've helped our compiler gain more insight about how Unknown values are used, but there aren't many cases where value numbering by itself allows the compiler to eliminate instructions and optimize a program. We haven't yet implemented any of the powerful optimizations that value numbering makes possible, but we've laid all the groundwork we'll rely on in the next episodes! So we don't expect to see much if any speedup just yet, and we should find a better way to measure our work.

A good metric for our work would capture how much difference adding the compiler equivalent of object permanence makes. In other words, when our compiler sees a non-Exact value, how often does it know that value to be identical to another value used elsewhere in the program?

Without value numbering, the answer would be "never" i.e. 0% of the time. Let's update our compiler to measure these new stats:

$ cargo run registers sample_programs/aoc_challenge.txt
< ... snip ... >
Total non-input instructions: 238
- with 1+ non-exact value:    218 (91.6%)
- without any exact values:    53 (22.3%)

Total non-exact values uses: 462
- number of unique values: 191 (41.3%)
- non-unique uses: 271 (58.7%)

In the Advent of Code MONAD program, instructions reference or produce an Unknown or Input value 462 times. Value numbering determined that there are no more than 191 (41.3%) unique Unknown or Input values in the program; the remaining 271 uses (58.7%) are guaranteed to be reuses of a value already in existence elsewhere. Each of these repeated uses is an opportunity for future optimizations! For now, that's a job well done!

Wrap up + challenge question

The completed code from this post is on branch part3_finished on GitHub.

In the meantime, consider the eql_2_on_eql_result program below, which is also available on GitHub in the sample_programs directory:

inp w
inp x
eql w x
eql w 2

What is the value of w at the end of the program? Our compiler says it's 9: Unknown — can you do better?

All this and more in the next Compiler Adventure — see you next time!

If you liked this post, please share it with friends — compilers are for everyone! Reach out to me on Twitter or join the discussion on HackerNews.

Thanks to Paul Hemberger, James Logan, Russell Cohen, and Vincent Tjeng for their feedback on drafts of this post. Any mistakes are mine alone.