A beginner-friendly introduction to compilers: follow along as we build a compiler from scratch, or fork the code on GitHub and add your own optimizations too! In this episode: value numbering helps track how values are used in the program.
Last time on Compiler Adventures, we implemented constant propagation: if x = 2
and y = 3
then add x y
produces 5
. While it did help detect a few no-ops, we saw the optimization quickly become ineffective as it got overwhelmed by too many Unknown
values.
In this episode, we'll equip our compiler with the key tool for reasoning about those Unknown
values. The starting code for this episode is on GitHub, on branch part3
.
Let's begin by solving the challenge from the last episode. When examining the following program, our compiler claimed that the program ends with an Unknown
value in the w
register:
I'm introducing a small change in notation to avoid confusion: instead of Input(i)
, from now on I'll write Input_i
. This will help us avoid confusing "i-th program input" with "input to the program with value i
." When we know a program value represents the number i
, we'll continue to use the notation Exact(i)
.
Let's simulate running this program:
- When the program starts, all registers are set to zero:
w = x = y = z = 0
. - For
inp w
, say the input number is 5, so noww = 5
. add x w
computesx = x + w
withx = 0, w = 5
, so we getx = 5
.eql w x
checksw
andx
for equality. They are both set to 5, so we getw = 1
.
Try simulating again with a different number as an input for inp w
. Does the final value of w
change? No — it's always 1! What's going on here?
Let's zoom in on the second-to-last instruction: add x w
. The values that instruction operates on are w = Input_0
i.e. "the first program input value," and x = Exact(0)
i.e. the number zero. After performing its operation x = x + w
, our code from the previous episode decides the result is x = Unknown
. This is technically true! We can't know what value x
has exactly. But we can do better than Unknown
!
We know x
was zero before w
got added to it: x = 0 + w
, so it must be the case that x = w = Input_0
after the instruction. Imagine our compiler code was able to determine this as well, and thus computed x = Input_0
as the outcome of the addition instead of x = Unknown
. Continuing to the next instruction: eql w x
then compares w = Input_0
and x = Input_0
. Those are clearly the same value! So the result of eql w x
would be w = Exact(1)
, just as our hand-simulation of the program showed!
The key observation: Input_0
represents a specific fixed value. We can't know what number it represents until we run the program, but for a given execution of the program, it's always going to be the same fixed number: the first number in the input. This is why we can know that w = Input_0
and x = Input_0
are equal to each other, and therefore eql w x
results in an output of w = Exact(1)
.
We managed to evaluate the eql
instruction over values we don't know exactly! Neat, right?
Could we do the same thing with Unknown
values as well?
What is known about Unknown?
Let’s set up an equivalent example to the last episode’s challenge, but with an eql w x
instruction that evaluates two Unknown
values instead of two inputs.
Say we have some long program that’s been running for a while, and all the registers currently have Unknown
values. The next instructions in the program do the following:
- zero out register
x
by multiplying its value by 0, then - make
x
also hold whatever value was inw
by runningadd x w
, and finally - compare
w
andx
for equality usingeql w x
as before.
Using the same reasoning as earlier in the post, we see that x
and w
must again have the same value at the point where they get compared, so the result of the equality check instruction should be w = Exact(1)
. And yet, looking at the last row of the printout, our compiler claims that comparing w = Unknown
to x = Unknown
produces an output of Unknown
. What’s up with that?
Right now, Unknown
in our compiler means “we know absolutely nothing about this value.” It does not refer to a specific-yet-unknown value, so we can’t say for sure if two Unknown
values are equal or not equal to each other. This is unlike input values, where Input_0
refers specifically to the first program input, Input_1
refers to the second program input, etc.
The registers w = Input_0
and x = Input_0
compare equal to each other because they both refer to the same fixed but still-unknown value. When our compiler evaluated add x w
for x = Exact(0)
and w = Unknown
, it found the result was x = Unknown
— but it had no way to represent that this is the same Unknown
that is also currently in w
.
Just like we currently number program input values as Input_0, Input_1
and so on, what if we tried numbering every value we encounter in the program, including Unknown
ones? Then, we could express that two registers hold the same Unknown
value: e.g. “the fifth value in the program, which is Unknown
.” For convenience, we’ll use the notation 5: Unknown
to describe that “fifth program value, which is Unknown
.”
We’ll apply this numbering scheme to all three kinds of values our compiler may encounter in the program: Exact, Input, Unknown
. We’ll use a simple counter to ensure each new value gets its own unique number — let’s call this identifier a Vid
. Whenever we know that an operation preserves the value of one of its operands (like adding a number to zero), we’ll keep the same value with the same Vid
instead of making a new value with a new Vid
.
Implementing our idea
To help us skip straight to the interesting parts, I've added about 40 lines of boilerplate and helper code behind the scenes. We now have a Program
type that keeps track of which Vid
numbers we've already used, and has three helper methods on it called new_exact_value(), new_unknown_value(), new_input_value()
which ensure we correctly assign unique Vid
numbers to newly-created values.
First, let's implement the Vid
type that represents the unique value IDs, and insert it into our existing Value
enum's variants.
In the previous episode, we defined the Value
enum like this:
Let's change it to the following:
When considering two Value
entities in a program, when can our compiler know for sure that they must represent the same number? It's in one of these cases:
- if both are
Value::Exact
variants representing the same known constant number, or - if both of them have the same
Vid
.
Importantly, if two Value
entities have different Vid
fields, this does not mean they must always hold different numbers. Different Vid
fields means that we don't know one way or the other whether the values are equivalent. For example, 5: Input_0
and 6: Input_1
have different Vid
fields: 5
and 6
. Depending on the inputs given to any run of the program, they may sometimes but not always happen to represent the same number.
Let's implement this logic in the ==
operator for Value
:
We can now simplify our no-op detection code to take advantage of our new ability to check values for equality. Instead of looking for a hardcoded set of instructions and values, we can directly compare the register values before and after each instruction. If an instruction did not change the value of its target register and had no other side-effects, that instruction is a no-op by definition.
We have to consider instruction side-effects here. For example, reading program input via an inp
instruction affects state that is outside the w, x, y, z
registers: the next number that will be read by the next inp
instruction. Eliminating an inp
instruction would cause subsequent inp
to read different program inputs. Fortunately, our implementation already guarantees that inp
instructions can never be considered no-ops: they are assigned a brand-new Vid
which could not possibly have been shared with the value previously in the destination register. If a future extension to the MONAD language adds more instructions, we may have to update our no-op detection code. For example, consider a hypothetical print x
instruction that outputs the value of the selected register x
: it doesn't change the value of any register, and yet it isn't a no-op because of the side-effect of printing out a number.
So we can completely remove the is_instruction_no_op()
function we previously wrote, and replace its use inside constant_propagation()
with a simple equality check between the old and new values of the destination register of the current instruction:
constant_propagation()
will also require a small refactor to generate Vid
data for each Value
. This is straightforward and uses the convenience functions I earlier mentioned I added "off screen." Here's what our final constant_propagation()
implementation looks like:
Finally, we'll need to update evaluate_instruction()
to help it avoid generating new Vid
entries whenever it's possible to reuse one of the values on which the instruction operates. Recall the situation from the challenge program we're solving this week:
We want evaluate_instruction()
applied to add x w
to realize that the result in x
is going to be that same Input_0
value that is also in w
, and not some new value for which nothing is known. Each of the possible instructions has different special cases, so we'll split evaluate_instruction()
into per-instruction cases:
Then, we'll implement the per-instruction functions, each with their own special cases. For example, here's evaluate_add()
:
Here's evaluate_mul()
, which has a few more cases to consider:
Most of the other instructions follow the same pattern, so I won't include all of them in this post — you can find them all in the GitHub repo. The exception is evaluate_equal()
which has just a touch more nuance:
And like that, we've implemented value numbering.
Specifically, we've implemented global value numbering. The difference between global and local value numbering is whether the numbered values apply to the entire program, or are valid only within a given function, if-statement, loop, or similar program component (called a "basic block"). The MONAD language does not have any functions, if-statements, or loops — making the entire program a single basic block — so our value numbering is global because it applies to the entire program. This simplicity is part of why this series chooses MONAD as the language to compile and optimize: its simplicity makes it easier to explain compiler ideas one at a time without requiring a lot of prerequisite knowledge. Case in point: basic blocks are a fundamental idea in compilers and a prerequisite for building a compiler for any language like Python or Rust or C/C++, yet three episodes into the blog series, the first mention of the term "basic block" is in this entirely-optional footnote.
As expected, our compiler now figures out that our challenge program ends with the number 1
in the x
register instead of an Unknown
value! 🎉
We're again interested in the "real-world" impact of our optimization, as measured on the standard Advent of Code MONAD input program.
First, what kind of impact do we expect to find?
Without value numbering, in our previous Compiler Adventure we saw our compiler was completely unable to reason about any value that is not Exact
. Operations on Input
values would produce Unknown
values as we saw in the challenge program, and Unknown
represented a total absence of any information at all.
In fact, the lack of value numbering meant our compiler was incapable of even simple reasoning like "registers don't change their value unless they are written to." For example, say the value in register w
was Unknown
and the compiler was considering an instruction that did not use w
, such as mul x y
. The value of w
after that instruction would remain Unknown
— but since Unknown
values had no way to be compared for equality, even the "before" and "after" Unknown
values of w
could not be considered equal.
This severely limited our compiler's ability to optimize MONAD code in the previous episode. In this episode, we've helped our compiler gain more insight about how Unknown
values are used, but there aren't many cases where value numbering by itself allows the compiler to eliminate instructions and optimize a program. We haven't yet implemented any of the powerful optimizations that value numbering makes possible, but we've laid all the groundwork we'll rely on in the next episodes! So we don't expect to see much if any speedup just yet, and we should find a better way to measure our work.
A good metric for our work would capture how much difference adding the compiler equivalent of object permanence makes. In other words, when our compiler sees a non-Exact
value, how often does it know that value to be identical to another value used elsewhere in the program?
Without value numbering, the answer would be "never" i.e. 0% of the time. Let's update our compiler to measure these new stats:
$ cargo run registers sample_programs/aoc_challenge.txt
< ... snip ... >
Total non-input instructions: 238
- with 1+ non-exact value: 218 (91.6%)
- without any exact values: 53 (22.3%)
Total non-exact values uses: 462
- number of unique values: 191 (41.3%)
- non-unique uses: 271 (58.7%)
In the Advent of Code MONAD program, instructions reference or produce an Unknown
or Input
value 462 times. Value numbering determined that there are no more than 191 (41.3%) unique Unknown
or Input
values in the program; the remaining 271 uses (58.7%) are guaranteed to be reuses of a value already in existence elsewhere. Each of these repeated uses is an opportunity for future optimizations! For now, that's a job well done!
Wrap up + challenge question
The completed code from this post is on branch part3_finished
on GitHub.
In the meantime, consider the eql_2_on_eql_result
program below, which is also available on GitHub in the sample_programs
directory:
inp w
inp x
eql w x
eql w 2
What is the value of w
at the end of the program? Our compiler says it's 9: Unknown
— can you do better?
All this and more in the next Compiler Adventure — see you next time!
If you liked this post, please share it with friends — compilers are for everyone! Reach out to me on Twitter or join the discussion on HackerNews.
Thanks to Paul Hemberger, James Logan, Russell Cohen, and Vincent Tjeng for their feedback on drafts of this post. Any mistakes are mine alone.