Assignment 1: Compiler passes

This section describes the code you have to write.

The compiler passes are described in chapter 2 of the textbook, but they are here again for completeness. We will only include passes that you have to implement. For example, even though the parser can be thought of as a "pass", you don't have to implement it, so we don't include it here. Similarly, the "print assembly" pass is provided for you, as are all the intermediate languages and interpreters.¹

Note that the only files you should modify are the files corresponding to these compiler passes. Also, only modify the .ml files; the .mli files constitute the interface to these modules and must not be changed.

In addition, when we provide a function stub in a .ml file that you need to complete, that means that we expect that you will implement that function (with those arguments and types (if supplied)) as written (filling in the TODO parts, of course). In particular, you're not allowed to change the number of arguments to the function, or their types (if supplied). If a function is completely implemented (no TODOs), you should leave it as-is. On the other hand, you can write as many extra functions as you like. (If we don't like your choices, we'll let you know during code reviews!)

Note

We expect that you will be using the functions we require you to implement! Don't implement a required function and then not use it later. (This would generate a warning anyway.)

Uniquify

[File: uniquify.ml.]

In this pass, all names bound in let expressions have to be renamed to be unique. This enables a number of further transformations in later passes. Both the input and output languages for this pass are Lvar, defined in the files lvar.ml[i].

This is a relatively simple pass to implement. There are four cases:

Forms with no subexpressions and no variables. These pass through unchanged.
Forms with subexpressions but no variables. You have to recurse on the subexpressions and build up a new expression.
Var expressions. You need to change the variable to its new name, if there is one. For this, you will have to have a data structure which maps the old names to the new names. We recommend that you use the VarMap module (defined in the Types module i.e. types.ml and types.mli), which is an instance of the Map functor specialized for the var type. (Note that var is just an alias for string.) A value of type VarMap.t is a map from variable names to some other type; in this case we map variable names to (new) variable names!

Note that the uniquify_exp function doesn't have a VarMap.t argument, so you will need to write a helper function that does. When a Var expression is encountered, check to see if the map has a new name for the variable. If it does, substitute the new name for the old one. Otherwise, leave the name unchanged. (This can happen with some global names.)
Let expressions. When these are encountered, you have to generate a new name from the binding name of the let expression. In the expression (let (x 10) (+ x x)), for instance, x is the binding name.

Note

In compiler-speak, generated names that are different from all other names are called fresh names.

To generate the new name, call the fresh function (defined in the file) with the current name and a separator as the arguments. Use . as the separator, so x might become e.g. x.1. Note that the fresh function has an internal counter, so the next time it's called with the name x it will return x.2, etc.

Note

This is imperative programming, and the fact that we don't have to jump through any hoops to do it (as we would, say, in Haskell) is one reason why OCaml is a convenient language for writing a compiler. (Despite this, Haskell is also a great language for writing compilers!)

The fresh function is just an alias for the function Utils.gensym from the Utils module in the support library. Feel free to look at the code for this function if you're interested. This function also uses OCaml's labelled arguments feature, which we didn't use in CS 4. A good reference for labelled arguments is here.

When you generate a fresh name for a binding name, you also have to add it to the map for the body expression of the let, but not for the binding expression of the let (since the new name is not in effect for that expression).

Note

Do not use hash tables for the map datatype. Hash tables are imperative, and it's much too easy to add something to a hash table that will persist longer than you want it to. Instead, use an immutable map datatype (here, VarMap.t). When you add something to an immutable map, you get a new map, and you can use the new or old maps as you see fit.

You should be able to implement this pass in about 50 lines of code (or less).

Warning: evaluation order

When we convert code in our compilers, we always convert from left-to-right if there is a choice. This sometimes conflicts with OCaml's evaluation rule, which evaluates from right to left. To avoid problems with this, use let expressions in your OCaml code to convert subexpressions, and then assemble the full expression from the converted parts. If you don't do this, your converted code may still be correct, but it will not match the instructor's version and will thus fail the "compare" tests, which expect the converted code to be identical to the reference version.

Remove complex operands

[File: remove_complex.ml]

The purpose of this pass is simply to make sure that the operands of arithmetic operations are "simple" i.e. are not subexpressions but "atomic" expressions like integers or variables. The input language to this pass is Lvar and the output language is Lvar_mon, which is defined in the files lvar_mon.ml[i].

Terminology

Integer literal and variable expressions are simple or atomic expressions. Any other expression is complex.

The name Lvar_mon refers to the fact that the language is in monadic normal form, the details of which do not concern us here.

We define an "atom" datatype in lvar_mon.mli as follows:

type atm = Int of int | Var of var

and then use it in the expression datatype:

type exp =
  | Atm    of atm
  | Read
  | Negate of atm
  | Add    of atm * atm
  | Sub    of atm * atm
  | Let    of var * exp * exp

The basic thing that has to happen in this pass is that non-atomic subexpressions of arithmetic forms (Add, Sub, and Negate) have to be transformed into let expressions which bind variable(s) to the complex subexpression(s). So it's something like this:

Negate <complex exp>
--> Let ("$tmp", <complex exp>, Negate (Var "$tmp")

Clearly, you will need to introduce let expressions wherever there are arithmetic expressions with complex subexpressions.

On the other hand, expressions that are already atomic should stay the way they are:

Negate (Int 10) --> Negate (Int 10)
Negate (Var "x") --> Negate (Var "x")

Don't do this:

Negate (Int 10)
--> Let ("$tmp", Int 10, Negate (Var "$tmp"))
Negate (Var "x")
--> Let ("$tmp", Var "x", Negate (Var "$tmp"))

This is called "generating unnecessary temporaries". Although this doesn't yield incorrect code, it does yield inefficient code. (Why do you think that is?)

The key function to write here is rco_atom. This takes an expression which needs to become atomic (like a subexpression of Add, say), and returns both the atomic expression and a list of (name, expression) bindings. In many cases you'll need to generate temporary names; the gen_temp_name function is provided to you for this purpose.

rco_atom is called from the rco_exp function, and the rco_exp function has to take the (name, expression) pairs and convert them into let expressions. (We wrote a helper function to do this; you will probably want to do the same.)

Depending on how you write this pass, rco_atom may need to call rco_exp. If so, the two functions should be mutually recursive. (Mutually-recursive functions are common in compilers.)

let expressions

let expressions are processed differently in the rco_atom and rco_exp functions.

In rco_exp, note that the immediate subexpressions of a let expression (both of them) do not need to be atomic, so don't try to make them atomic! That means that you can have arbitrarily nested let expressions. If this bothers you, don't worry: we'll fix it in the next pass!

However, in rco_atom, the whole point of the function is to convert the expression into an atom (plus some bindings). If you think about it, that means that you need to convert the body of the let expression into an atom. The binding part of the let (suitably converted) is just included as part of the bindings that rco_atom has to return.

Explicate control

[File: explicate_control.ml]

The purpose of this pass is to "flatten" the code representation into a sequence of assignment statements followed by a "return" statement. The let statements are gone, replaced by assignments. There is no more nesting of expressions, unless you consider the sequence of assignments "nesting".

The input language to the pass is Lvar_mon and the output language is Cvar, defined in the files cvar.ml[i]. The name Cvar was chosen because the straight-line nature of the code is very similar to the way code is represented in a C language program.

The source code consists of five functions:

convert_atom
convert_exp
explicate_assign
explicate_tail
explicate_control

explicate_control is supplied for you. convert_atom and convert_exp are straightforward. explicate_tail converts an expression in tail position, while explicate_assign converts an assignment statement followed by an already-converted tail expression.

The transformation is described in lecture 4 and chapter 2 of the textbook (section 2.6). (There is also a skeleton implementation in Racket, which may be helpful.) Note that in the textbook, the phrase "the right-hand side of a let" means "the binding expression of a let"; schematically, a let expression looks like this:

(let <binding variable> <binding expression> <body>)

In both explicate_tail and explicate_assign, the only (slightly) tricky case is the let case. Here are some hints for this case:

explicate_tail has to call explicate_assign.
In explicate_assign, the transformation is basically this:
```
(let (v (let (v' e1) e2)) tl)
-->
(let (v' e1) (let (v e2) tl))
```
The code for this is quite simple.

The total code you need to write is considerably less than 100 lines.

Select instructions

[File: select_instructions.ml]

This is the first of several passes collectively called the "back end". Their job is to convert the code into x86-64 assembly language. There are three intermediate languages specific to these passes:

x86var (x86var.ml[i])

This language mixes assembly language instructions with variables.
x86int (x86int.ml[i])

This language gets rid of variables. Variables are represented by stack locations (in this assignment) and by a combination of stack locations and registers (in later assignments).
x86asm (x86asm.ml[i])

This language adds extra code (the "prelude" and "conclusion") necessary to make the entire program into a single unit that can be compiled into a full executable program.

The x86asm language

The x86asm intermediate language is not described in the book; the book simply outputs assembly language as a string after the "prelude and conclusion" pass. We think that having another intermediate language is more elegant, and also could eventually be used by an interpreter which interprets the final assembly code directly.

After x86asm, actual assembly code is trivially generated by the "print assembly" pass (print_asm.ml[i]). This code is supplied to you.

In the "select instructions" pass, code is converted from the Cvar language to the x86var language.

The select_instructions.ml file is mostly empty. You have to implement the convert_lt function, which converts a (label, tail) pair (Cvar language) into a (label, block) pair (x86var language). The labels don't change, so essentially you are changing a Cvar tails into x86var blocks (lists of instructions). You will want to write helper functions to do this. (We put ours outside of the convert_lt function; you can do it any way you like.)

Quoting from the textbook:

We recommend implementing the select_instructions [pass] with three auxiliary functions, one for each of the nonterminals of Cvar: atm, stmt, and tail.

The textbook has a good description of how to convert Cvar tails/statements/expressions/atoms into x86var instructions. Note that a single Cvar expression can yield more than one x86var instruction.

The Return instruction requires special treatment. You don't actually generate a Retq instruction since this doesn't actually represent returning from a function (we'll see why in the "prelude and conclusion" pass below). Instead, you have to do the following:

Convert the expression which is the argument of the Return and move that result into the Rax register (this may take one or two instructions, depending on the expression). Note that a function call (which can only be Read here) will automatically return its result into the Rax register.
Emit a Jmp instruction to a label called conclusion.

There is one other peculiarity of this pass that isn't in the book. We've provided a function called fix_label which should be used whenever calling an external function. In this case, the only external function is read_int, which is what a Read instruction in Cvar should compile to. So Read becomes Callq (Label "read_int", 0). (The 0 in Callq (Label "read_int", 0) is the function arity i.e. the number of arguments the function takes.)

However, different operating systems have different conventions for labels. In particular, MacOS requires that function labels (but not other labels!) have an initial underscore, so you should compile this into Callq (Label "_read_int", 0) if you're on a Mac. To make the compiler more independent of the OS it's being run on, we've added the fix_label function, which checks to see if the OS is "MacOS", and if so, prepends the underscore to the label. So instead of converting Read to Callq (Label "read_int", 0), convert it to Callq (Label (fix_label "read_int"), 0).²

Also, chapter 2 of the textbook (section 2.7) describes ways of optimizing certain instructions. For example, an Add expression/assignment of the form v = (+ a1 a2) can be compiled into two instructions, but if a2 is the same as v (i.e. v = (+ a1 v)), it could be compiled into a single addq instruction. This is all well and good, but because of the "uniquify" pass earlier in the compiler, this will never happen, because v = (+ a1 v) would actually become v1 = (+ a1 v2). Therefore, you should not write the code to optimize instructions in this way. (This will also reduce the amount of code you need to write.) There is no reason to add code for situations that can't occur and can't be tested. However, you should definitely use as few instructions as possible (in this case, two instructions).

Note

In later compilers, we will eventually add these optimizations back, because once the language has been sufficiently expanded, these situations can occur.

Assign homes

[File: assign_homes.ml]

The purpose of this pass is to convert variable names into stack locations. Note that in this compiler, we are not storing variables in registers. (In all the later compilers, we will be.) Therefore, this pass will only exist in this compiler; in later compilers, it will be replaced by several register allocation passes. Fortunately, assigning variables to stack slots is straightforward.

Although this pass converts x86var programs to x86var programs, the resulting programs will not use variables. (We will get rid of variables altogether in the x86int language.) However, the info field of the programs will change; after the "select instructions" pass the info field is info1, which stores type information for all variables; after "assign homes" it's info2, which only stores the stack space used.

Variables placed on the stack are identified by their position in memory relative to the "base pointer". This is a memory location which is stored in a special register called rbp (often written %rbp, though in the OCaml code we refer to it as Rbp).³ The base pointer represents the start of the stack used for a particular function. Variables stored on the stack are identified relative to the base pointer. Stack variables are placed in memory below the base pointer (we say that the stack "grows downwards"). Variables are 8 bytes in size (64 bits), so the first variable will be stored 8 bytes below the base pointer. In assembly language, this is written as -8(%rbp); in the OCaml code we write it as Deref (Rbp, -8). The next stack location will be -16(%rbp), and so on. The space used for all the stack variables is known as the "stack frame". One curious requirement of x86-64 assembly language is that the total amount of space used for the stack frame must be a multiple of 16 bytes, so even if you only need one stack variable (8 bytes), you have to reserve 16 bytes.

Here's what you need to do in this pass:

Use the info1 field to assign stack locations to all variables used in the program. info1 contains a list of all variables in the program as well as their types. (The types are all Integer, so that isn't relevant except that you need to know that all integers are 8 bytes long.) You should store the (variable name, stack location) pairs as an int VarMap.t (mapping names to their stack locations). You will also need to compute the total amount of stack space required. Since this has to be aligned on a 16 byte boundary, we've provided a utility function called align_16 in the Utils module of the support library.
Once this has been done, you need to go through all the instructions in the blocks and replace variables with their stack locations. If you encounter a variable which doesn't have an assigned stack location, signal an error (this shouldn't happen, but it's good to check).
Compute an info2 field using the computed stack space, and reconstitute the program.

Patch instructions

[File: patch_instructions.ml]

In a perfect world, we would not need to do this pass. However, the x86-64 instruction set is far from elegant, and we have to deal with its restrictions. One such restriction is that no instruction can have two arguments where both arguments are memory references (as opposed to (say) immediate values or registers). If such instructions exist, they need to be "patched" (fixed) so that the restriction is adhered to.

There are different ways of handling this, but the textbook suggests a very simple strategy. When you have an instruction that uses two stack locations, convert it into two instructions using the %rax register (OCaml Rax) as an intermediary. There are only three instructions in this compiler where this can happen: addq, subq, and movq. The movq case is the simplest. Change:

movq -8(%rbp), -16(%rbp)

to two instructions:

movq -8(%rbp), %rax
movq %rax, -16(%rbp)

This doesn't violate the restriction. The addq and subq cases are similar but slightly trickier.

What you need to do is to go through all instructions, determine if an instruction needs to be patched, and patch it in those cases. Don't patch instructions that don't need to be patched! For instance, don't change:

movq $42, -8(%rbp)

into this:

movq $42, %rax
movq %rax, -8(%rbp)

because the instruction didn't need to be patched (since there was only one memory reference).

Note

The $42 assembly language syntax means an immediate value; in OCaml we write this as Imm 42.

Note also that this pass converts from the x86var language to the x86int language, which doesn't have variables. This means that there will be some trivial "boilerplate" conversions, like converting an Imm constructor in one language to the exact same constructor in the other language. This is trivial but a bit tedious. (It's the price we pay for having precise types.)

Prelude and conclusion

[File: prelude_conclusion.ml]

This is the last pass you need to write. The textbook also includes a partial evaluator pass as an optional "challenge" pass; we won't be doing that.

This pass converts from the x86int language to the x86asm language, which is basically x86-64 assembly language, but in an S-expression representation. The actual assembly language is generated by the "print assembly" pass, which you don't have to write.

The purpose of this pass is to add extra code to make a complete assembly language program. The code that we have been generating so far (whether we were aware of it or not) is the body of the main function. However, in order for this to be a proper function, some code has to be added both before and after the code we've written. The code that is executed before the code we've written is called the prelude, and the code that comes after is called the conclusion. These don't have to come physically before or after our code, since we have jmp instructions to transfer control. Therefore, we put them after our code.

The code we've been working with is all in a single block labelled start. The code that comes before this is going to have to jump to this label, and our code is going to have to eventually jump to the label called conclusion. The "prelude" is everything that comes before the jump to start and the "conclusion" is everything that comes after the jump to conclusion.

The actual beginning of execution is the label called main (or _main if you're running on a Mac). We also have to declare main to be a "global" name (as all top-level functions have to be) by using the Global instruction. (This becomes .globl in the actual assembly code.) This declaration has to come before the main label, and it's the beginning of the prelude.

The prelude does the following:

Declares main as a global name and defines the main label.
Saves the previous base pointer (in the %rbp register) to the stack (using a pushq instruction).
Saves the current stack pointer (in the %rsp register) into the base pointer register %rbp.
Reserves space for all the stack variables (a multiple of 16) by modifying the stack pointer %rsp.
Jumps to the start label.

The conclusion does the following:

Reclaims the stack space used in the code.
Restores the old base pointer using a popq instruction.
Returns from the main function with a retq instruction.

This code is completely generic except for the stack space, which is going to depend on how many variables are placed on the stack. You can look at the examples in the reference directory to see what the prelude and conclusion look like. The code is easy to write, although there are a number of trivial "boilerplate" conversions because we are changing the datatype.

The interpreters are not part of the compiler per se, but they are used for testing. ↩
Aren't compilers fun? You have to deal with all of this arbitrary nonsense, because otherwise you can't generate working code. That's the price you pay for doing something cool. ↩
Fun fact: There is no absolute requirement to use the %rbp register to store the base pointer. Many C compilers (e.g. gcc) have an option to omit the base pointer entirely, which frees up this register to be used to store regular variables. This is a big deal for processors with very few registers (like 32 bit x86 processors, which only have 8 registers) but less important for us, since x86-64 processors have 16 registers. ↩