A beginner-friendly introduction to compilers: follow along as we build a compiler from scratch, or fork the code on GitHub and add your own optimizations too! In this episode: value numbering helps track how values are used in the program.

Last time on Compiler Adventures, we implemented constant propagation: if `x = 2` and `y = 3` then `add x y` produces `5`. While it did help detect a few no-ops, we saw the optimization quickly become ineffective as it got overwhelmed by too many `Unknown` values.

In this episode, we'll equip our compiler with the key tool for reasoning about those `Unknown` values. The starting code for this episode is on GitHub, on branch `part3`.

Let's begin by solving the challenge from the last episode. When examining the following program, our compiler claimed that the program ends with an `Unknown` value in the `w` register: I'm introducing a small change in notation to avoid confusion: instead of `Input(i)`, from now on I'll write `Input_i`. This will help us avoid confusing "i-th program input" with "input to the program with value `i`." When we know a program value represents the number `i`, we'll continue to use the notation `Exact(i)`.

Let's simulate running this program:

• When the program starts, all registers are set to zero: `w = x = y = z = 0`.
• For `inp w`, say the input number is 5, so now `w = 5`.
• `add x w` computes `x = x + w` with `x = 0, w = 5`, so we get `x = 5`.
• `eql w x` checks `w` and `x` for equality. They are both set to 5, so we get `w = 1`.

Try simulating again with a different number as an input for `inp w`. Does the final value of `w` change? No — it's always 1! What's going on here?

Let's zoom in on the second-to-last instruction: `add x w`. The values that instruction operates on are `w = Input_0` i.e. "the first program input value," and `x = Exact(0)` i.e. the number zero. After performing its operation `x = x + w`, our code from the previous episode decides the result is `x = Unknown`. This is technically true! We can't know what value `x` has exactly. But we can do better than `Unknown`!

We know `x` was zero before `w` got added to it: `x = 0 + w`, so it must be the case that `x = w = Input_0` after the instruction. Imagine our compiler code was able to determine this as well, and thus computed `x = Input_0` as the outcome of the addition instead of `x = Unknown`. Continuing to the next instruction: `eql w x` then compares `w = Input_0` and `x = Input_0`. Those are clearly the same value! So the result of `eql w x` would be `w = Exact(1)`, just as our hand-simulation of the program showed!

The key observation: `Input_0` represents a specific fixed value. We can't know what number it represents until we run the program, but for a given execution of the program, it's always going to be the same fixed number: the first number in the input. This is why we can know that `w = Input_0` and `x = Input_0` are equal to each other, and therefore `eql w x` results in an output of `w = Exact(1)`.

We managed to evaluate the `eql` instruction over values we don't know exactly! Neat, right?

Could we do the same thing with `Unknown` values as well?

## What is known about Unknown?

Let’s set up an equivalent example to the last episode’s challenge, but with an `eql w x` instruction that evaluates two `Unknown` values instead of two inputs.

Say we have some long program that’s been running for a while, and all the registers currently have `Unknown` values. The next instructions in the program do the following:

• zero out register `x` by multiplying its value by 0, then
• make `x` also hold whatever value was in `w` by running `add x w`, and finally
• compare `w` and `x` for equality using `eql w x` as before.

Using the same reasoning as earlier in the post, we see that `x` and `w` must again have the same value at the point where they get compared, so the result of the equality check instruction should be `w = Exact(1)`. And yet, looking at the last row of the printout, our compiler claims that comparing `w = Unknown` to `x = Unknown` produces an output of `Unknown`. What’s up with that?

Right now, `Unknown` in our compiler means “we know absolutely nothing about this value.” It does not refer to a specific-yet-unknown value, so we can’t say for sure if two `Unknown` values are equal or not equal to each other. This is unlike input values, where `Input_0` refers specifically to the first program input, `Input_1` refers to the second program input, etc.

The registers `w = Input_0` and `x = Input_0` compare equal to each other because they both refer to the same fixed but still-unknown value. When our compiler evaluated `add x w` for `x = Exact(0)` and `w = Unknown`, it found the result was `x = Unknown` — but it had no way to represent that this is the same `Unknown` that is also currently in `w`.

Just like we currently number program input values as `Input_0, Input_1` and so on, what if we tried numbering every value we encounter in the program, including `Unknown` ones? Then, we could express that two registers hold the same `Unknown` value: e.g. “the fifth value in the program, which is `Unknown`.” For convenience, we’ll use the notation `5: Unknown` to describe that “fifth program value, which is `Unknown`.”

We’ll apply this numbering scheme to all three kinds of values our compiler may encounter in the program: `Exact, Input, Unknown`. We’ll use a simple counter to ensure each new value gets its own unique number — let’s call this identifier a `Vid`. Whenever we know that an operation preserves the value of one of its operands (like adding a number to zero), we’ll keep the same value with the same `Vid` instead of making a new value with a new `Vid`.

## Implementing our idea

To help us skip straight to the interesting parts, I've added about 40 lines of boilerplate and helper code behind the scenes. We now have a `Program` type that keeps track of which `Vid` numbers we've already used, and has three helper methods on it called `new_exact_value(), new_unknown_value(), new_input_value()` which ensure we correctly assign unique `Vid` numbers to newly-created values.

First, let's implement the `Vid` type that represents the unique value IDs, and insert it into our existing `Value` enum's variants.

In the previous episode, we defined the `Value` enum like this:

Excerpt from `optimization.rs`. See this code in context in the GitHub repo.

Let's change it to the following:

Excerpt from `values.rs`. See this code in context in the GitHub repo.

When considering two `Value` entities in a program, when can our compiler know for sure that they must represent the same number? It's in one of these cases:

• if both are `Value::Exact` variants representing the same known constant number, or
• if both of them have the same `Vid`.

Importantly, if two `Value` entities have different `Vid` fields, this does not mean they must always hold different numbers. Different `Vid` fields means that we don't know one way or the other whether the values are equivalent. For example, `5: Input_0` and `6: Input_1` have different `Vid` fields: `5` and `6`. Depending on the inputs given to any run of the program, they may sometimes but not always happen to represent the same number.

Let's implement this logic in the `==` operator for `Value`:

Excerpt from `values.rs`. See this code in context in the GitHub repo.

We can now simplify our no-op detection code to take advantage of our new ability to check values for equality. Instead of looking for a hardcoded set of instructions and values, we can directly compare the register values before and after each instruction. If an instruction did not change the value of its target register and had no other side-effects, that instruction is a no-op by definition. We have to consider instruction side-effects here. For example, reading program input via an `inp` instruction affects state that is outside the `w, x, y, z` registers: the next number that will be read by the next `inp` instruction. Eliminating an `inp` instruction would cause subsequent `inp` to read different program inputs. Fortunately, our implementation already guarantees that `inp` instructions can never be considered no-ops: they are assigned a brand-new `Vid` which could not possibly have been shared with the value previously in the destination register. If a future extension to the MONAD language adds more instructions, we may have to update our no-op detection code. For example, consider a hypothetical `print x` instruction that outputs the value of the selected register `x`: it doesn't change the value of any register, and yet it isn't a no-op because of the side-effect of printing out a number.

So we can completely remove the `is_instruction_no_op()` function we previously wrote, and replace its use inside `constant_propagation()` with a simple equality check between the old and new values of the destination register of the current instruction:

Excerpt from `optimization.rs`. See this code in context in the GitHub repo.

`constant_propagation()` will also require a small refactor to generate `Vid` data for each `Value`. This is straightforward and uses the convenience functions I earlier mentioned I added "off screen." Here's what our final `constant_propagation()` implementation looks like:

Excerpt from `optimization.rs`. See this code in context in the GitHub repo.

Finally, we'll need to update `evaluate_instruction()` to help it avoid generating new `Vid` entries whenever it's possible to reuse one of the values on which the instruction operates. Recall the situation from the challenge program we're solving this week:

We want `evaluate_instruction()` applied to `add x w` to realize that the result in `x` is going to be that same `Input_0` value that is also in `w`, and not some new value for which nothing is known. Each of the possible instructions has different special cases, so we'll split `evaluate_instruction()` into per-instruction cases:

Excerpt from `optimization.rs`. See this code in context in the GitHub repo.

Then, we'll implement the per-instruction functions, each with their own special cases. For example, here's `evaluate_add()`:

Excerpt from `optimization.rs`. See this code in context in the GitHub repo.

Here's `evaluate_mul()`, which has a few more cases to consider:

Excerpt from `optimization.rs`. See this code in context in the GitHub repo.

Most of the other instructions follow the same pattern, so I won't include all of them in this post — you can find them all in the GitHub repo. The exception is `evaluate_equal()` which has just a touch more nuance:

Excerpt from `optimization.rs`. See this code in context in the GitHub repo.

And like that, we've implemented value numbering. Specifically, we've implemented global value numbering. The difference between global and local value numbering is whether the numbered values apply to the entire program, or are valid only within a given function, if-statement, loop, or similar program component (called a "basic block"). The MONAD language does not have any functions, if-statements, or loops — making the entire program a single basic block — so our value numbering is global because it applies to the entire program. This simplicity is part of why this series chooses MONAD as the language to compile and optimize: its simplicity makes it easier to explain compiler ideas one at a time without requiring a lot of prerequisite knowledge. Case in point: basic blocks are a fundamental idea in compilers and a prerequisite for building a compiler for any language like Python or Rust or C/C++, yet three episodes into the blog series, the first mention of the term "basic block" is in this entirely-optional footnote. As expected, our compiler now figures out that our challenge program ends with the number `1` in the `x` register instead of an `Unknown` value! 🎉

We're again interested in the "real-world" impact of our optimization, as measured on the standard Advent of Code MONAD input program.

First, what kind of impact do we expect to find?

Without value numbering, in our previous Compiler Adventure we saw our compiler was completely unable to reason about any value that is not `Exact`. Operations on `Input` values would produce `Unknown` values as we saw in the challenge program, and `Unknown` represented a total absence of any information at all.

In fact, the lack of value numbering meant our compiler was incapable of even simple reasoning like "registers don't change their value unless they are written to." For example, say the value in register `w` was `Unknown` and the compiler was considering an instruction that did not use `w`, such as `mul x y`. The value of `w` after that instruction would remain `Unknown` — but since `Unknown` values had no way to be compared for equality, even the "before" and "after" `Unknown` values of `w` could not be considered equal.

This severely limited our compiler's ability to optimize MONAD code in the previous episode. In this episode, we've helped our compiler gain more insight about how `Unknown` values are used, but there aren't many cases where value numbering by itself allows the compiler to eliminate instructions and optimize a program. We haven't yet implemented any of the powerful optimizations that value numbering makes possible, but we've laid all the groundwork we'll rely on in the next episodes! So we don't expect to see much if any speedup just yet, and we should find a better way to measure our work.

A good metric for our work would capture how much difference adding the compiler equivalent of object permanence makes. In other words, when our compiler sees a non-`Exact` value, how often does it know that value to be identical to another value used elsewhere in the program?

Without value numbering, the answer would be "never" i.e. 0% of the time. Let's update our compiler to measure these new stats:

``````\$ cargo run registers sample_programs/aoc_challenge.txt
< ... snip ... >
Total non-input instructions: 238
- with 1+ non-exact value:    218 (91.6%)
- without any exact values:    53 (22.3%)

Total non-exact values uses: 462
- number of unique values: 191 (41.3%)
- non-unique uses: 271 (58.7%)
``````

In the Advent of Code MONAD program, instructions reference or produce an `Unknown` or `Input` value 462 times. Value numbering determined that there are no more than 191 (41.3%) unique `Unknown` or `Input` values in the program; the remaining 271 uses (58.7%) are guaranteed to be reuses of a value already in existence elsewhere. Each of these repeated uses is an opportunity for future optimizations! For now, that's a job well done!

## Wrap up + challenge question

The completed code from this post is on branch `part3_finished` on GitHub.

In the meantime, consider the `eql_2_on_eql_result` program below, which is also available on GitHub in the `sample_programs` directory:

``````inp w
inp x
eql w x
eql w 2
``````

What is the value of `w` at the end of the program? Our compiler says it's `9: Unknown` — can you do better?

All this and more in the next Compiler Adventure — see you next time!

If you liked this post, please share it with friends — compilers are for everyone! Reach out to me on Twitter or join the discussion on HackerNews.

Thanks to Paul Hemberger, James Logan, Russell Cohen, and Vincent Tjeng for their feedback on drafts of this post. Any mistakes are mine alone.