Assignment 1: Testing the compiler
Compiling the compile program
To compile the compiler, cd into the src/ch2 directory and type make.
This will compile the compiler (which is an executable file called compile).
You should see a number of warnings when you compile the compiler;
that's expected.
(As you fill in the code for the compiler passes,
these warnings will go away.)
Running the compile program
To run the compiler, type ./compile in the ch2 directory, along
with the path of a file to compile and (optionally) some command-line options.
Typing ./compile --help (or just ./compile with no arguments)
will display this usage message:
usage: compile <filename> [-pass <pass>] [-only] [-eval] [-no-fix-label]
[-sexp-width n] [-sexp-indent n] [-help]
-pass Last compiler pass (one of: lvar un rc ec si ah pi pc pa)
-only Only do one compiler pass
-eval Evaluate after compiling
-no-fix-label disable `fix_label`
-sexp-width S-expression maximum line width
-sexp-indent S-expression indent
-help Display this list of options
--help Display this list of options
Note that the filename argument is required.
(Don't type the < > characters;
they just indicate that the name filename
is a placeholder for the actual filename.)
We will describe the command line options below.
If there are no command-line options,
the file will be compiled all the way to assembly language,
and the output will be printed to the terminal.
If you want to save the output to a file,
you can redirect it with the Unix shell operator > as follows:
$ ./compile tests/var_test_1.src > var_test_1.s
This will save the printed output to a file named var_test_1.s.
(This is an appropriate name, since assembly language files
conventionally end in .s.)
From there, it can be compiled to an executable and run
as described below.
Here are the command-line arguments for the compile program:
-help (or --help, or no arguments)
This prints out a usage message.
-pass <pass>
This tells the compiler to stop and print the output after the pass <pass>.
The allowed passes for this compiler are:
lvar -- the "Lvar" language AST (not technically a pass)
un -- the "uniquify" pass
rc -- the "remove complex operands" pass
ec -- the "explicate control" pass
si -- the "select instructions" pass
ah -- the "assign homes" pass
pi -- the "patch instructions" pass
pc -- the "prelude and conclusion" pass
pa -- the "print assembly" pass
The lvar "pass" just runs the parser. This converts the source code
to the Lvar AST form.
-only
Adding this option causes the compiler to run only the pass specified
by the -pass option. For this to work, the input file must be
in the correct format, representing an S-expression version
of the datatype that is the input to the pass.
We normally will use the file extensions .un, .rc, etc.
to indicate that a file has been compiled up to that pass.
This option is normally used with the reference outputs; these provide
outputs of the compiler for all passes and for all test programs.
Each file in the reference/ directory has a file extension
which indicates the last pass that was used to compile it.
Therefore, you can use such a file as input to the next pass.
For instance:
$ ./compile reference/var_test_11.rc -pass ec -only
will compile the file reference/var_test_11.rc
(which has already been compiled up to the "remove complex operands" pass)
using only the "explicate control" pass (which is the next pass).
If you attempt to compile a file this way using the wrong input, you will get an error message which may be hard to understand.
Running a single pass with -only is a good way to test the code for that pass, and you can do this even if the previous passes haven't been written.
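The link between a file's extension and the pass to run on it can be sketched in Python. This is only an illustration and not part of the course scripts: the helper name next_pass_command is hypothetical, and it assumes the pass order is the one listed in the usage message.

```python
import os

# Pass names in the order listed in the usage message (lvar is the parser).
PASSES = ["lvar", "un", "rc", "ec", "si", "ah", "pi", "pc", "pa"]


def next_pass_command(path):
    """Build the ./compile command that runs only the pass that follows
    the one named by the file extension of `path` (hypothetical helper)."""
    ext = os.path.splitext(path)[1].lstrip(".")
    if ext not in PASSES:
        raise ValueError(f"unrecognized pass extension: {ext!r}")
    index = PASSES.index(ext)
    if index == len(PASSES) - 1:
        raise ValueError(f"{ext!r} is the last pass; nothing runs after it")
    return ["./compile", path, "-pass", PASSES[index + 1], "-only"]


print(" ".join(next_pass_command("reference/var_test_11.rc")))
# prints: ./compile reference/var_test_11.rc -pass ec -only
```

A file whose extension is not a pass name (such as .src) is rejected here, since this sketch only covers files that are the output of some pass.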
-eval
Adding this option causes the compiler to run an evaluator after compiling.
Note that not all passes have evaluators; only the ones up to ec do.
Also, note that -eval and -only are mutually exclusive.
-sexp-width n
If the S-expressions printed are too narrow or too wide for your taste,
you can adjust them with this option, which sets the width of the
S-expression display in columns. By default, n is 40.
Note that if n is too small, the S-expressions may be spread out over
more lines than you care to view. Conversely, if it's too large,
too many S-expression forms may be crammed into a single line.
-sexp-indent n
This allows you to set the degree of indentation for S-expressions. By default, it's 2. It's unlikely that you'll want/need to change this.
Compiling and running assembly language code
Manual compilation
Once you've compiled the source programs all the way down to assembly language, you will probably be wondering how to turn the assembly language into a working executable program. If you have a computer running a 64-bit Intel or AMD processor (both of which use the x86-64 instruction set), or a Mac with an M-series processor (e.g. M1, M2) and Rosetta 2 installed, you can compile the assembly language code that the compiler generates.
Note
So, basically, almost any computer you are likely to have will be able to compile the assembly code you generate.
Let's use the var_test_5.src file as an example.
You will also need the C code files runtime.c and runtime.h,
which should be in your ch2 directory.
Here is the sequence of commands. Note that assembly language files
normally end in .s, so we redirect the compiler output to the
filename var_test_5.s and compile it with the gcc C compiler
(which needs to be installed if it isn't already).
$ ./compile tests/var_test_5.src > var_test_5.s
$ gcc -c var_test_5.s
$ gcc -c runtime.c
$ gcc var_test_5.o runtime.o -o var_test_5
$ ./var_test_5
$ echo $?
42
Note
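On an M-series Mac with Rosetta 2 installed, change these commands to:
$ ./compile tests/var_test_5.src > var_test_5.s
$ clang -c var_test_5.s -arch x86_64
$ clang -c runtime.c -arch x86_64
$ clang var_test_5.o runtime.o -o var_test_5 -arch x86_64
$ ./var_test_5
$ echo $?
42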
This compiles the assembly language file var_test_5.s
to the binary executable program var_test_5.
When this program is run, it doesn't appear to do anything.
However, the program returns an integer return code to the operating system,
which in this case is the number 42. The line echo $? prints this number.
Note
These return codes can only be in the range 0 to 255, so if you return an integer outside this range, it will be coerced into that range, leading sometimes to peculiar results.
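As a rough illustration of this coercion, the following Python sketch assumes a typical Unix-like system, where the shell sees only the low 8 bits of the value a program returns (that is, the value modulo 256). The function name shell_exit_status is hypothetical.

```python
def shell_exit_status(value):
    """Return the exit status a Unix shell would report for `value`,
    assuming only the low 8 bits of the return code are kept."""
    return value % 256


for value in (42, 255, 256, 300, -1):
    print(value, "->", shell_exit_status(value))
# prints:
# 42 -> 42
# 255 -> 255
# 256 -> 0
# 300 -> 44
# -1 -> 255
```

So a test program that is expected to produce 300 or -1 would show up as 44 or 255 when checked with echo $?.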
If the program you are compiling has calls to the read function,
you will have to input the integers to be read when the program runs.
The run_eval_tests.py test script handles all of this for you,
so there is no need to actually go through these steps,
but you should know how to do them anyway.
run_asm.py
You can also use the script run_asm.py,
which is in the scripts/ subdirectory of the ch2 directory.
For example, running it on the file var_test_5.s will output:
COMMAND: gcc -c var_test_5.s
COMMAND: gcc -c runtime.c
COMMAND: gcc var_test_5.o runtime.o -o var_test_5
COMMAND: ./var_test_5
OUTPUT (return code): 42
If you are running the code on an M-series Mac,
add the -arm64 command-line argument to run_asm.py.
This will output:
COMMAND: clang -c var_test_5.s -arch x86_64
COMMAND: clang -c runtime.c -arch x86_64
COMMAND: clang var_test_5.o runtime.o -o var_test_5 -arch x86_64
COMMAND: ./var_test_5
OUTPUT (return code): 42
Note that the C compiler name has been switched to clang
(which, for our purposes, works the same as gcc),
and some extra arguments are present.
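The difference between the two command sequences can be summarized with a small Python sketch. It only builds the command lines shown in the COMMAND output above and does not run them; the function build_commands is a hypothetical helper, not the actual code of run_asm.py.

```python
def build_commands(stem, arm64=False):
    """Return the commands (as argument lists) that assemble, link and run
    `stem`.s, mirroring the COMMAND lines shown above (hypothetical helper)."""
    cc = "clang" if arm64 else "gcc"
    # Assumption taken from the output above: M-series Macs add -arch x86_64.
    extra = ["-arch", "x86_64"] if arm64 else []
    return [
        [cc, "-c", f"{stem}.s"] + extra,
        [cc, "-c", "runtime.c"] + extra,
        [cc, f"{stem}.o", "runtime.o", "-o", stem] + extra,
        [f"./{stem}"],
    ]


for command in build_commands("var_test_5", arm64=True):
    print("COMMAND:", " ".join(command))
```

Each list could be handed to Python's subprocess.run; the returncode attribute of the result for the last command would then hold the program's return code.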
The run_asm.py script also removes all generated files,
so you don't have to.
Again, though, you don't need to use this script.
The run_eval_tests.py script will do all of this for you.
However, run_asm.py is convenient
if you have an assembly language file
and you want to compile and run it separately from the tests.
Testing your compiler: the test scripts
There are three scripts in the ch2/scripts subdirectory.
One is the run_asm.py script described above;
the other two are described here.
All of these are Python scripts.[1]
run_eval_tests.py
This script can be used to test that a particular .src file
generates the correct output when given particular inputs.
It uses metadata stored in comments in the .src files
in the tests/ subdirectory.
For instance, consider the file var_test_10.src:
The metadata is in the first two lines,
with the INPUT: and OUTPUT: tags.
These indicate that the program should be run twice:
the first time with (terminal) inputs 45 and 3,
producing the output 42,
and the second time with (terminal) inputs 21 and 20,
producing the output 1.
You invoke the test script this way:
Note
On an M-series Mac, add the command-line argument -arm64 to the above command.
It will output:
----
input file: var_test_10.src
* input/output data #1:
Running test file (tests/var_test_10.src) up to pass (lvar).
Running test file (tests/var_test_10.src) up to pass (un).
Running test file (tests/var_test_10.src) up to pass (rc).
Running test file (tests/var_test_10.src) up to pass (ec).
Compiling to assembly language and compiling/running the program.
* input/output data #2:
Running test file (tests/var_test_10.src) up to pass (lvar).
Running test file (tests/var_test_10.src) up to pass (un).
Running test file (tests/var_test_10.src) up to pass (rc).
Running test file (tests/var_test_10.src) up to pass (ec).
Compiling to assembly language and compiling/running the program.
This can be done for many files, or even for all test files at once:
(This will produce a lot of output!)
If one of the evaluators produces incorrect output, an error message will be printed and the test script will halt.
By default, the test script will not just test the program outputs using
the evaluators of the intermediate languages (lvar, lvar_mon, etc.),
but will also compile the code all the way to assembly language, run it,
and test the output against the expected output.
compare.py
The idea of this test script is to simplify the process of
comparing the output of compiling a file using only a single pass
with the corresponding output file in the reference/ subdirectory.
Of course, you can do this yourself manually;
for instance, you can do this:
$ ./compile reference/var_test_11.rc -pass ec -only > var_test_11.ec
and compare the file var_test_11.ec that was generated by your compiler
to the file var_test_11.ec in the reference/ subdirectory.
If they are the same, all is well.
However, repeating this for a lot of files and a lot of passes
is very tedious!
Also, you might miss something if you just compare them visually
("eyeballing it"). Instead, you can use the diff program to test
if the files are different:
$ diff var_test_11.ec reference/var_test_11.ec
If there is no output, the files are identical. If there is any difference, the lines that are different will be printed in a format which shows what's different. But again, doing this for a lot of files is going to be tedious.
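The same kind of check can be sketched in Python with the standard difflib module. This is only an illustration of the idea of an empty diff meaning "identical"; it is not how compare.py is implemented, and the helper name diff_text and the sample strings are made up for the example.

```python
import difflib


def diff_text(student, reference):
    """Return unified-diff lines for two strings; an empty list means
    the two texts are identical (hypothetical helper)."""
    return list(difflib.unified_diff(
        student.splitlines(), reference.splitlines(),
        fromfile="var_test_11.ec", tofile="reference/var_test_11.ec",
        lineterm=""))


same = "(Return (Atm (Var y.1)))"
print(diff_text(same, same))  # prints: []

# A one-character difference produces '-' and '+' lines for the changed line.
for line in diff_text("(Assign x.1 (Atm (Int 20)))",
                      "(Assign x.1 (Atm (Int 21)))"):
    print(line)
```

To compare real files, read each one with open(...).read() and pass the two strings to the function.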
Note
It's important to realize that your output files can have some differences from the reference output files and still be acceptable. For instance, the "uniquify" pass is not required to give the exact same names to variables whose names are changed, as long as the names are changed consistently. Nevertheless, if you can make the outputs identical, it will greatly simplify testing. Therefore, we want you to do your best to make the test outputs identical to the reference outputs. If you are having trouble doing this, ask questions during office hours or during code reviews.
The purpose of the compare.py script is to simplify this process.
If you run it with no arguments, you get a usage message:
$ python scripts/compare.py
usage: python compare.py [-pause] [-diff] [-random n] filename [filename ...]
The required arguments are one or more filenames.
These files should be in the reference/ subdirectory.
The files in that directory have file extensions corresponding to
the last compiler pass that was used to generate them, so (for instance)
var_test_11.ec is what the compiler outputs when compiling the file
var_test_11.src up to the ec (explicate control) pass.
If you want to test your var_test_11.ec against the reference version,
you should start with the output of the previous pass,
which in this case is var_test_11.rc
(the rc or "remove complex operands" pass).
Since this file is also in the reference/ subdirectory,
you can use it as the compiler input. So if you type:
$ python scripts/compare.py reference/var_test_11.rc
the script will:
- run ./compile reference/var_test_11.rc -pass ec -only and redirect the output to a file called var_test_11.ec in the ch2 directory;
- display the files var_test_11.ec (your compiler's output) and reference/var_test_11.ec (the reference output) side-by-side so you can compare them visually.
Note
This behavior is extremely counterintuitive for a lot of people.
It's natural to assume that if you want to compare the result
of the "explicate control" (ec) pass, for instance,
you should give the compare.py script
a file with the .ec extension as its input.
However, this isn't the way it works.
Files with the .ec extension have already been processed
by the "explicate control" pass.
Put differently, the file extension refers to the
last pass the code went through before the file was output.
The compare.py script needs (in this case)
the output of the pass before "explicate control",
so it can run just that specific pass on the file.
The pass before "explicate control" is "remove complex operands",
with the file extension .rc.
Therefore, if you want to compare the results of the
"explicate control" pass,
you need to invoke the compare.py script
on a file with the file extension .rc, not .ec.
Put simply, the files you need for the compare.py script
are the input files for the pass you are testing,
so their file extension is the extension for the previous pass,
not the file extension for the pass that you are testing.
The output will look like this:
$ python scripts/compare.py reference/var_test_11.rc
--------------
input: reference/var_test_11.rc
output: reference/var_test_11.ec
# Student version. # Reference version.
(CProgram (CProgram
(Info (Info
(locals_types (locals_types
((x.1 Integer) ((x.1 Integer)
(x.2 Integer) (x.2 Integer)
(y.1 Integer)))) (y.1 Integer))))
(((Label start) (((Label start)
(Seq (Seq
(Assign x.1 (Atm (Int 20))) (Assign x.1 (Atm (Int 20)))
(Seq (Seq
(Assign x.2 (Atm (Int 22))) (Assign x.2 (Atm (Int 22)))
(Seq (Seq
(Assign y.1 (Add (Var x.1) (Var x.2))) (Assign y.1 (Add (Var x.1) (Var x.2)))
(Return (Atm (Var y.1))))))))) (Return (Atm (Var y.1)))))))))
(You can scroll this output horizontally to see it all,
but the two files are identical.)
If you just want to check if the files are identical,
use the -diff option:
$ python scripts/compare.py reference/var_test_11.rc -diff
reference/var_test_11.rc : OK
If there is no difference, you'll get an OK as you see here.
Note
Most people just use the -diff option,
only using the full printout if there is a difference.
In either case, any generated files are removed
before the compare.py script exits.
This can be repeated for any number of files:
$ python scripts/compare.py reference/var_test_?.rc -diff
reference/var_test_1.rc : OK
reference/var_test_2.rc : OK
reference/var_test_3.rc : OK
reference/var_test_4.rc : OK
reference/var_test_5.rc : OK
reference/var_test_6.rc : OK
reference/var_test_7.rc : OK
reference/var_test_8.rc : OK
reference/var_test_9.rc : OK
If you do this without the -diff option, though,
the output can get very large, and you'll have to scroll back
to check each file.
To make this easier, we've added a -pause option,
which will display one file at a time
(both versions: yours and the reference one)
and wait for you to hit the return key
before the next one is displayed.
The last feature is the -random option.
It's used with an argument, which should be a positive integer.
With -random N, up to N randomly-selected files will be chosen
from the list of files on the command line and compared.
This is useful to quickly check if a pass is working well;
you can type something like this:
$ python scripts/compare.py reference/var_test_*.rc -diff -random 10
and get comparisons of (in this case) 10 random files selected from the files specified on the command line:
reference/var_test_4.rc : OK
reference/var_test_6.rc : OK
reference/var_test_15.rc : OK
reference/var_test_27.rc : OK
reference/var_test_38.rc : OK
reference/var_test_40.rc : OK
reference/var_test_57.rc : OK
reference/var_test_58.rc : OK
reference/var_test_62.rc : OK
reference/var_test_67.rc : OK
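The "up to N" selection can be sketched in Python as follows. This is a guess at the general idea, not the actual code of compare.py; the helper name pick_random and its seed parameter are hypothetical.

```python
import random


def pick_random(filenames, n, seed=None):
    """Choose up to n distinct filenames at random (hypothetical helper).
    If fewer than n names are given, all of them are returned."""
    rng = random.Random(seed)
    return rng.sample(filenames, min(n, len(filenames)))


files = [f"reference/var_test_{i}.rc" for i in range(1, 21)]
print(len(pick_random(files, 10)))  # prints: 10
print(len(pick_random(files, 50)))  # prints: 20
```

Capping the sample size at the number of files is what makes the count "up to" N rather than exactly N.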
Once everything works
Congratulations! You have written your first compiler!
The workflow for subsequent compilers will basically be the same as for this one. There will be many more passes, and occasionally some other things that need to be tested.
[1] Since the compiler is written in OCaml, you might wonder why the testing scripts are written in Python. We don't think it's a good idea to get too obsessed with any one programming language. OCaml is a fine language for writing a compiler, but Python is more convenient when working with large numbers of files and calling programs to act on those files. Traditionally, this kind of thing is done with shell scripts, but Python is vastly more powerful and flexible, as well as cross-platform, and all of you already know it.