An informal guide to reading and working on the rustboot compiler.
==================================================================
First off, my sincerest apologies for the lightly-commented nature of the
compiler, as well as the general immaturity of the codebase; rustboot is
intended to be discarded in the near future as we transition off it, to a
rust-based, LLVM-backed compiler. It has taken longer than expected for "the
near future" to arrive, and here we are published and attracting contributors
without a good place for them to start. It will be a priority for the next
little while to make new contributors feel welcome and oriented within the
project; that is the best I can do at this point. We were in a tremendous rush
even to get everything organized to this minimal state.
If you wish to expand on this document, or have one of the
slightly-more-familiar authors add anything else to it, please get in touch or
file a bug. Your concerns are probably the same as someone else's.
High-level concepts, invariants, 30,000-ft view
===============================================
Rustboot has 3 main subdirectories: fe, me, and be (front, mid, back
end). Helper modules and ubiquitous types are found in util/.
The entry-point for the compiler is driver/main.ml, and this file sequences
the various parts together.
The 4 central data structures:
------------------------------
#1: fe/ast.ml defines the AST. The AST is treated as immutable after parsing
despite containing some mutable types (hashtbl and such). Many -- though
not all -- nodes within this data structure are wrapped in the type 'a
identified. This is important. An "identified" AST node is one that the
parser has marked with a unique node_id value. This node_id is used both
to denote a source location and, more importantly, to key into a large
number of tables later in the compiler. Most additional calculated
properties of a program that the compiler derives are keyed to the node_id
of an identified node.
The types 'a identified, node_id and such are in util/common.ml (there is a
rough sketch of them just after this list).
#2: me/semant.ml defines the Semant.ctxt structure. This is a record of
tables, almost all of which are keyed by node_id. See previous comment
regarding node_id. The Semant module is open in most of the modules within
the me/ directory, and they all refer liberally to the ctxt tables, either
directly or via helper functions in semant. Semant also defines the
mid-end pass-management logic, lookup routines, type folds, and a variety
of other miscellaneous semantic-analysis helpers.
#3: be/il.ml defines the IL. This is a small, typed IL based on a type system
that is relatively LLVM-ish, and a control-flow system that is *not*
expression/SSA based like LLVM. It's much dumber than that. The root of
the interesting types in this file is the type 'emitter', which is a
growable buffer along with a few counters. An emitter is essentially a
buffer of quads. A quad, in turn, is a primitive virtual instruction
('quad' because, in the limit, it is a 3-address instruction plus an
opcode) which
we then ... tend to turn directly into x86 anyways. Sorry; it wasn't clear
during initial construction that we'd wind up stopping at x86, so the IL
is probably superfluous, but there it is.
The IL types are operand = cell | immediate, and cell = reg | mem (see the
sketch after this list). Plus a
certain quantity of special-casing and noise for constant-pointer
propagation and addressing modes and whatnot.
#4: be/asm.ml defines the Asm.frag type, which is a "chunk of binary-ish
stuff" to put in an output file. Words, bytes, lazily-resolved fixups,
constant expressions, 0-terminated strings, alignment boundaries, etc. You
will hopefully not need to produce a lot of this yourself; most of this is
already being emitted.
An important type that gets resolved here is fixup, from util/common.ml (also
sketched after this list).
Fixups are things you can wrap around a frag using an Asm.DEF frag, which
get their size and position (both in-file and in-memory) calculated at
asm-time; but you can refer to them before they're resolved. So any time
the compiler needs to refer to "the place / size this thingy will be, when
it finally gets boiled down to frags and emitted" we generate a fixup and
use that. Functions and static data structures, for example, tend to get
fixups assigned to them early on in the middle-end of the compiler.
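Since these types come up constantly, here is a rough OCaml sketch of the most
important shapes. It is illustrative only -- the real definitions live in
util/common.ml and be/il.ml, and the constructor and field names there differ
in detail:

    (* An identified AST node: node_id names both a source location and a
       key into the Semant.ctxt tables. *)
    type node_id = Node of int
    type 'a identified = { node: 'a; id: node_id }

    (* The IL, roughly: operands are cells or immediates, cells are registers
       or memory, and an emitter grows a buffer of quads. *)
    type reg = Hreg of int | Vreg of int
    type mem = Abs of int64 | Based of reg * int64
    type cell = Reg of reg | Mem of mem
    type operand = Cell of cell | Imm of int64
    type quad = { dst: cell option; op: string; lhs: operand; rhs: operand }

    (* A fixup: size and position (in-file and in-memory) are filled in at
       asm-time, but the record can be referred to long before that. *)
    type fixup =
      { fixup_name: string;
        mutable fixup_file_pos: int option;
        mutable fixup_file_sz: int option;
        mutable fixup_mem_pos: int64 option;
        mutable fixup_mem_sz: int64 option }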
Control and information flow within the compiler:
-------------------------------------------------
- driver/main.ml assumes control on startup. Options are parsed, platform is
detected, etc.
- fe/lexer.ml does the lexing in all cases; fe/parser.ml holds the fundamental
parser-state and parser-combinator functions. Parsing rules are split
between 3 files: fe/cexp.ml, fe/pexp.ml, and fe/item.ml. This split
represents the general structure of the grammar(s):
- The outermost grammar is called "cexp" (crate expression), and is an
expression language that describes the crate directives found in crate
files. It's evaluated inside the compiler.
- The next grammar is "item", which is a statement language that describes
the directives, declarations and statements found in source files. If
you compile a naked source file, you jump straight to item and then
synthesize a simple crate structure around the result.
- The innermost grammar is "pexp" (parsed expression), and is an
expression language used for the shared expression grammar within both
cexp and item. Pexps within cexp are evaluated in the compiler
(non-constant, complex cexps are errors) whereas pexps within items are
desugared to statements and primitive expressions.
- The AST is the output from the item grammar. Pexp and cexp do not escape
the front-end.
- driver/main.ml then builds a Semant.ctxt and threads it through the various
middle-end passes. Each pass defines one or more visitors, each of which is
an FRU (functional record update) copy of the empty_visitor in me/walk.ml
(see the sketch at the end of this section). Each visitor performs a particular
task, encapsulates some local state in local variables, and leaves its
results in a table. If the table it's calculating is pass-local, it will be
a local binding within the pass; if it's to be shared with later passes, it
will be a table in Semant.ctxt. Pass order is therefore somewhat important,
so I'll describe it here:
- me/resolve.ml looks up names and connects them to definitions. This
includes expanding all types (as types can occur within names, as part
of a parametric name) and performing all import/export/visibility
judgments. After resolve, we should not be doing any further name-based
lookups (with one exception: typestate does some more name
lookup. Subtle reason, will return to it).
Resolve populates several of the tables near the top of Semant.ctxt:
ctxt_all_cast_types
ctxt_all_defns
ctxt_all_item_names
ctxt_all_item_types
ctxt_all_lvals
ctxt_all_stmts
ctxt_all_type_items
ctxt_block_items
ctxt_block_slots
ctxt_frame_args
ctxt_lval_to_referent
ctxt_node_referenced
ctxt_required_items
ctxt_slot_is_arg
ctxt_slot_keys
The most obviously critical of these are lval_to_referent and all_defns,
which let subsequent visitors get from a reference node to its referent node,
and which catalogue all the possible things a referent may be.
A part of resolve that is perhaps not obvious is the task of resolving
and normalizing recursive types. This is what TY_iso is for. Recursive
types in rust have to pass through a tag type on their recursive edge;
TY_iso is an iso-recursive group of tags that refer only to one another;
within a TY_iso, the type term "TY_idx n" refers to "the nth member of
the current TY_iso". Resolve is responsible for finding such groups and
tying them into such closed-form knots.
TY_name should be completely eliminated from any of the types exiting
resolve.
- me/type.ml is a unification-based typechecker and inference engine. This
is as textbook-y as we could make it. It rewrites "auto" slots in the
ctxt_all_defns table when it completes (these are the slots with None as
their Ast.slot_ty).
This file is organized around tyspecs and tyvars. A tyspec is a constraint on
an unknown type that is implied by its use; tyspecs are generated during the
AST-walk, placed in ref cells (tyvars), and the cells are then unified with
one another (see the sketch at the end of this section). If two tyvars unify,
a new constraint is created from the tighter of the two, and both previous
tyvars are updated to point to the unified spec. Ideally all constraints
eventually run into a source of a concrete type (or a type otherwise
uniquely-determined by its tyspecs). If not, the type is underdetermined
and we get a type error. Similarly if two tyvars that are supposed to
unify clash in some way (integer unify-with string, say) then there is
also a type error.
- me/typestate.ml is a dataflow-based typestate checker. It is responsible
for ensuring all preconditions are met, including init-before-use. It
also determines slot lifecycle boundaries, and populates the context
tables:
ctxt_constr_ids
ctxt_constrs
ctxt_copy_stmt_is_init
ctxt_post_stmt_slot_drops
ctxt_postconditions
ctxt_poststates
ctxt_preconditions
ctxt_prestates
It is organized around constr_keys, a bunch of bitsets, and a CFG.
A constr_key is a normalized value representing a single constraint that
we wish to be able to refer to within a typestate. Every constr_key gets
a bit number assigned to it. A condition (and a typestate) is a
bit-vector, in which the set bits indicate the constr_keys (indexed by their
associated bit numbers) that hold in the condition/typestate.
There are 4 such bitsets generated for each node in the CFG:
precondition/postcondition and prestate/poststate. The visitors here
figure out all the constr_keys we'll need, then assign all the pre/post
conditions, generate the CFG, calculate the typestates from the CFG, and
check that every typestate satisfies its precondition.
(Due to the peculiarity that types are pure terms and are not 'a
identified in our AST, we have to do some name-lookup in here as well
when normalizing the constr_keys).
- Effect is relatively simple: it calculates the effect of each type and
item, and checks that they either match their declarations or are
authorized to be lying.
- Loop is even simpler: it calculates loop-depth information for later use
in generating foreach loops. It populates the context tables:
ctxt_block_is_loop_body
ctxt_slot_loop_depths
ctxt_stmt_loop_depths
- Alias checks slot-aliasing to ensure none of the rules are broken about
simultaneous aliases and such. It also populates the table
ctxt_slot_is_aliased.
- Layout determines the layout of frames, arguments, objects, closures and
such. This includes deciding which slot should go in a vreg and
generating fixups for all frame-spill regions. It populates the context
tables:
ctxt_block_is_loop_body
ctxt_call_sizes
ctxt_frame_blocks
ctxt_frame_sizes
ctxt_slot_is_obj_state
ctxt_slot_offsets
ctxt_slot_vregs
ctxt_spill_fixups
There is a useful chunk of ASCII-art in the leading comment of layout; if you
want to see how a frame goes together, I recommend reading it.
- Trans is the big one. This is the "translate AST to IL" pass, and it's a
bit of a dumping ground, sadly. Probably 4x the size of any other
pass. Stuff that is common to the x86 and LLVM backends is factored out
into transutil.ml, but it hardly helps. Suggestions welcome for
splitting it further.
Trans works *imperatively*. It maintains a stack of emitters, one per
function (or helper-function), and emits Il.quads into the top-of-stack
emitter while it walks the statements of each function. If at any
point it needs to pause to emit a helper function ("glue function") it
pushes a new emitter onto the stack and emits into that.
Trans populates the context tables:
ctxt_all_item_code
ctxt_block_fixups
ctxt_data
ctxt_file_code
ctxt_file_fixups
ctxt_fn_fixups
ctxt_glue_code
The entries in the tables ending in _code are of type Semant.code, which
is an abstract type covering both function and glue-function code; each
holds an executable block of quads, plus an aggregate count of vregs and
a reference to the spill fixup for that code.
- Once trans completes, driver/main.ml does the "finishing touches": it
register-allocates each emitted code value (be/ra.ml), emits dwarf for the
crate (me/dwarf.ml), selects instructions (be/x86.ml), then selects one of
the object-file backends (be/elf.ml, be/macho.ml or be/pe.ml) and emits the
selected Asm.frag to it. Hopefully little of this will require further work;
the most incomplete module here is probably dwarf.ml, but the remainder are
mostly stable and don't tend to change much, aside from picking bugs out of
them.
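Two idioms from the passes above deserve a rough OCaml sketch. Neither uses
the real field or constructor names from me/walk.ml or me/type.ml; they are
only meant to convey the shape of the idea.

The visitor idiom: a pass FRU-copies the empty visitor, overrides only the
callbacks it cares about, keeps its working state in closed-over locals, and
drops results into a table keyed by node_id:

    type node_id = Node of int

    (* Illustrative visitor record; the real one in me/walk.ml has many more
       pre/post callbacks. *)
    type visitor =
      { visit_stmt_pre: node_id -> unit;
        visit_stmt_post: node_id -> unit }

    let empty_visitor =
      { visit_stmt_pre = (fun _ -> ());
        visit_stmt_post = (fun _ -> ()) }

    (* A tiny "pass": count statements, recording the running count per node. *)
    let stmt_count_visitor (results : (node_id, int) Hashtbl.t) : visitor =
      let count = ref 0 in
      { empty_visitor with
          visit_stmt_post =
            (fun id -> incr count; Hashtbl.replace results id !count) }

The tyspec/tyvar idiom from the typechecker: constraints live in ref cells,
and unification keeps the tighter constraint and forwards the other cell to
it:

    type tyspec =
        TYSPEC_all                      (* unconstrained *)
      | TYSPEC_resolved of string       (* a concrete type (stubbed as string) *)
      | TYSPEC_equated_to of tyvar      (* forwarded to another tyvar *)
    and tyvar = tyspec ref

    (* Chase forwarding links to the representative cell. *)
    let rec resolve_tyvar (tv:tyvar) : tyvar =
      match !tv with
          TYSPEC_equated_to other -> resolve_tyvar other
        | _ -> tv

    (* Unify two tyvars: keep the tighter spec, forward the looser one. *)
    let unify (a:tyvar) (b:tyvar) : unit =
      let a = resolve_tyvar a and b = resolve_tyvar b in
      if a != b then
        match (!a, !b) with
            (TYSPEC_resolved x, TYSPEC_resolved y) ->
              if x <> y then failwith "type error" else b := TYSPEC_equated_to a
          | (TYSPEC_resolved _, _) -> b := TYSPEC_equated_to a
          | (_, _) -> a := TYSPEC_equated_to b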
Details and curiosities to note along the way:
==============================================
- Where you might expect there to be a general recursive expression type for
'expr', you'll find only a very limited non-recursive 3-way switch: binary,
unary, or atom; where atom is either a literal or an lval. This is because
all the "big" expressions (pexps) were boiled off during the desugaring
phase in the frontend.
- There are multiple ways to refer to a path. Names, lvals and cargs all
appear to have similar structure (and do). They're all subsets of the
general path grammar, so all follow the rough shape of being either a base
anchor-path or an ext (extension) path with structural recursion to the
left.
Cargs (constraint arguments) are the sort of paths that can be passed to
constraints in the typestate system, and can contain the special symbol "*"
in the grammar, meaning "thing I am attached to". This is the symbol
BASE_formal in the carg_base type.
Names are the sort of paths that refer to types or other items. Not slots.
Lvals are the sort of paths that *might* refer to slots, but we don't
generally know. So they can contain the dynamic-indexing component
COMP_atom. For example, x.(1 + 2).y is an lval.
- Only one of these forms is 'a identified: an lval. And moreover, only the
lval *base* is identified; the remainder of the path has to be projected
forward through the referent after lookup. This also means that when you
look up anything else by name, you have to be using the result immediately,
not storing it in a table for later.
- Types are not 'a identified. This means that you (generally) cannot refer to
a *particular* occurrence of a type in the AST and associate information
with it. Instead, we treat types as "pure terms" (not carrying identity) and
calculate properties of them on the fly. For this we use a general fold
defined in me/semant.ml: a family of functions held in a ty_fold structure
and passed to fold_ty (see the sketch at the end of this section).
- There is a possibly-surprising type called "size" in util/common. This is a
type representing a "size expression" that may depend on runtime
information, such as the type descriptors passed to a frame at runtime. This
exists because our type-parameterization scheme is, at the moment,
implemented by passing type descriptors around at runtime, not
code-expansion a la C++ templates. So any time we have a translated indexing
operation or such that depends on a type parameter, we wind up with a size
expression including SIZE_param_size or SIZE_param_align (sketched at the end
of this section), and have to do
size arithmetic at runtime. Upstream of trans, we generate sizes willy-nilly
and then decide in trans, x86, and dwarf whether they can be emitted
statically or via runtime calculation at the point of use.
- Trans generates position-independent code (PIC). This means that it never
refers to the exact position of a fixup in memory at load-time, always the
distance-to-a-fixup from some other fixup, and/or current PC. On x86 this
means we wind up copying the "get next pc thunk" trick used on linux
systems, and/or storing "crate relative" addresses. The runtime and compiler
"know" (unfortunately sometimes quite obscurely) that an immediate pointer
should be encoded as relative to a given displacement base, and work with
those as necessary. Similarly, they emit code to reify pointer immediates
(add the displacements to displacement-bases) before handing them off to
(say) C library functions that expect "real" pointers. This is all somewhat
messy.
- There is one central static data structure, "rust_crate", which is emitted
into the final loadable object and contains pointers to all subsequent
information the runtime may be interested in. It also serves as the
displacement base for a variety of PIC-ish displacements stored
elsewhere. When the runtime loads a crate, it dlsym()s rust_crate, and then
digs around in there. It's the entry-point for crawling the crate's
structure from outside. Importantly: it also contains pointers to the dwarf.
- Currently we drive linking off dwarf. That is: when a crate needs to 'use'
an item from another dwarf crate, we dlopen / LoadLibrary and find the
"rust_crate" value, follow its pointers to dwarf tables, and scan around the
dwarf DIE tree resolving the hierarchical name of the used item. This may
change, we decided to recycle dwarf for this purpose early in the language
evolution and may, given the number of simplifications that have occurred
along the way, be able to fall back to C "mangled name" linkage at some
point. Though that decision carries a number of serious constraints, and
should not be taken lightly.
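Two of the items above are easier to see with a rough OCaml sketch. Again,
these are illustrative only; the real definitions (fold_ty and the ty_fold
record in me/semant.ml, and the size type in util/common.ml) are larger and
differ in their details.

A fold over types, computing a property on the fly rather than storing it
per-occurrence:

    (* A toy type language and a bottom-up fold over it. *)
    type ty =
        TY_int
      | TY_str
      | TY_vec of ty
      | TY_tup of ty list

    type 'a ty_fold =
      { fold_int: unit -> 'a;
        fold_str: unit -> 'a;
        fold_vec: 'a -> 'a;
        fold_tup: 'a list -> 'a }

    let rec fold_ty (f: 'a ty_fold) (t: ty) : 'a =
      match t with
          TY_int -> f.fold_int ()
        | TY_str -> f.fold_str ()
        | TY_vec t' -> f.fold_vec (fold_ty f t')
        | TY_tup ts -> f.fold_tup (List.map (fold_ty f) ts)

    (* Example: "does this type contain a string anywhere?" *)
    let contains_str : ty -> bool =
      fold_ty { fold_int = (fun _ -> false);
                fold_str = (fun _ -> true);
                fold_vec = (fun b -> b);
                fold_tup = (fun bs -> List.exists (fun b -> b) bs) }

The shape of a size expression that may or may not be computable statically:

    (* Sizes are either fixed or depend on type descriptors passed in at
       runtime; constructor names here approximate the real ones. *)
    type size =
        SIZE_fixed of int64
      | SIZE_param_size of int            (* size of the nth type parameter *)
      | SIZE_param_align of int           (* alignment of the nth type parameter *)
      | SIZE_rt_add of size * size

    (* Hypothetical helper: can this size be emitted as a constant, or must
       it be computed at the point of use? *)
    let rec size_is_constant (s:size) : bool =
      match s with
          SIZE_fixed _ -> true
        | SIZE_param_size _ | SIZE_param_align _ -> false
        | SIZE_rt_add (a, b) -> size_is_constant a && size_is_constant b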
Probably-bad ideas we will want to do differently in the self-hosted compiler:
==============================================================================
- We desugar too early in rustboot and should preserve the pexp structure
until later. Dherman is likely to argue for movement to a more
expression-focused grammar. This may well happen.
- Multiple kinds of paths enforced by numerous nearly-isomorphic ML type
constructors is pointless once we're in rust; we can just make type
abbreviations that carry constraints like path : is_name(*) or such.
- Storing auxiliary information in semant tables is awkward, and we should
figure out a suitably rusty idiom for decorating AST nodes in-place.
Inter-pass dependencies should be managed by augmenting the AST with
ever-more constraints (is_resolved(ast), is_typechecked(ast), etc.)
- Trans should be organized as pure and value-producing code, not imperatively
emitting quads into emitters. LLVM will enforce this anyways. See what
happened in lltrans.ml if you're curious what it'll look (more) like.
- The PIC scheme will have to change, hopefully get much easier.