An informal guide to reading and working on the rustboot compiler.
==================================================================

First off, my sincerest apologies for the lightly-commented nature of the compiler, as well as the general immaturity of the codebase; rustboot is intended to be discarded in the near future as we transition off it, to a rust-based, LLVM-backed compiler. It has taken longer than expected for "the near future" to arrive, and here we are published and attracting contributors without a good place for them to start. Making new contributors feel welcome and oriented within the project will be a priority for the next little while; this document is the best I can do at this point. We were in a tremendous rush even to get everything organized to this minimal degree.

If you wish to expand on this document, or have one of the slightly-more-familiar authors add anything else to it, please get in touch or file a bug. Your concerns are probably the same as someone else's.


High-level concepts, invariants, 30,000-ft view
===============================================

Rustboot has 3 main subdirectories: fe/, me/, and be/ (front, mid, back end). Helper modules and ubiquitous types are found in util/. The entry-point for the compiler is driver/main.ml, and this file sequences the various parts together.

The 4 central data structures:
------------------------------

#1: fe/ast.ml defines the AST. The AST is treated as immutable after parsing, despite containing some mutable types (hashtbl and such). Many -- though not all -- nodes within this data structure are wrapped in the type 'a identified. This is important. An "identified" AST node is one the parser has marked with a unique node_id value. This node_id is used both to denote a source location and, more importantly, to key into a large number of tables later in the compiler: most additional properties of a program that the compiler derives are keyed to the node_id of an identified node. The types 'a identified, node_id and such are in util/common.ml.

#2: me/semant.ml defines the Semant.ctxt structure. This is a record of tables, almost all of which are keyed by node_id; see the previous comment regarding node_id. The Semant module is open in most of the modules within the me/ directory, and they all refer liberally to the ctxt tables, either directly or via helper functions in Semant. Semant also defines the mid-end pass-management logic, lookup routines, type folds, and a variety of other miscellaneous semantic-analysis helpers.

#3: be/il.ml defines the IL. This is a small, typed IL based on a type system that is relatively LLVM-ish, and a control-flow system that is *not* expression/SSA based like LLVM; it's much dumber than that. The root of the interesting types in this file is the type 'emitter', which is a growable buffer along with a few counters. An emitter is essentially a buffer of quads. A quad, in turn, is a primitive virtual instruction ('quad' because in the limit it is a 3-address machine, plus opcode) which we then ... tend to turn directly into x86 anyways. Sorry; it wasn't clear during initial construction that we'd wind up stopping at x86, so the IL is probably superfluous, but there it is. The IL types are operand = cell | immediate, and cell = reg | mem, plus a certain quantity of special-casing and noise for constant-pointer propagation, addressing modes and whatnot.
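To make that concrete, here is a rough OCaml sketch of the shapes just described. The constructor and field names are illustrative only; the real definitions in be/il.ml are considerably richer (addressing modes, bit widths, and so on).

    (* Illustrative only: cells, operands and quads, roughly as described. *)
    type reg = Hreg of int | Vreg of int              (* hardware or virtual register *)
    type mem = Abs of int64 | Based of reg * int64    (* crude addressing modes *)
    type cell = Reg of reg | Mem of mem               (* cell = reg | mem *)
    type operand = Cell of cell | Imm of int64        (* operand = cell | immediate *)
    type op = MOV | ADD | SUB                         (* tiny subset of opcodes *)

    (* A quad: one primitive 3-address virtual instruction. *)
    type quad = { quad_op: op;
                  quad_dst: cell;
                  quad_lhs: operand;
                  quad_rhs: operand }

    (* An emitter is essentially a growable buffer of quads, plus counters. *)
    type emitter = { mutable emit_quads: quad list;
                     mutable emit_next_vreg: int }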
#4: be/asm.ml defines the Asm.frag type, which is a "chunk of binary-ish stuff" to put in an output file: words, bytes, lazily-resolved fixups, constant expressions, 0-terminated strings, alignment boundaries, etc. You will hopefully not need to produce a lot of this yourself; most of it is already being emitted.

An important type that gets resolved here is fixup, from util/common.ml. Fixups are things you can wrap around a frag using an Asm.DEF frag; they get their size and position (both in-file and in-memory) calculated at asm-time, but you can refer to them before they're resolved. So any time the compiler needs to refer to "the place / size this thingy will be, when it finally gets boiled down to frags and emitted", we generate a fixup and use that. Functions and static data structures, for example, tend to get fixups assigned to them early on in the middle-end of the compiler.


Control and information flow within the compiler:
--------------------------------------------------

- driver/main.ml assumes control on startup. Options are parsed, the platform is detected, etc.

- fe/lexer.ml does lexing in any case; fe/parser.ml holds the fundamental parser-state and parser-combinator functions. Parsing rules are split between 3 files: fe/cexp.ml, fe/pexp.ml, and fe/item.ml. This split reflects the general structure of the grammar(s):

  - The outermost grammar is called "cexp" (crate expression), and is an expression language that describes the crate directives found in crate files. It's evaluated inside the compiler.

  - The next grammar is "item", which is a statement language that describes the directives, declarations and statements found in source files. If you compile a naked source file, you jump straight to item and then synthesize a simple crate structure around the result.

  - The innermost grammar is "pexp" (parsed expression), an expression language used as the shared expression grammar within both cexp and item. Pexps within cexps are evaluated in the compiler (non-constant, complex cexps are errors), whereas pexps within items are desugared into statements and primitive expressions.

  - The AST is the output of the item grammar. Pexp and cexp do not escape the front-end.

- driver/main.ml then builds a Semant.ctxt and threads it through the various middle-end passes. Each pass defines one or more visitors, each of which is an FRU (functional record update) copy of the empty_visitor in me/walk.ml. Each visitor performs a particular task, encapsulates some local state in local variables, and leaves its results in a table. If the table it calculates is pass-local, it will be a local binding within the pass; if it's to be shared with later passes, it will be a table in Semant.ctxt.
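Concretely, a pass's visitor tends to look something like the following sketch: take an inner visitor (ultimately Walk.empty_visitor), FRU-copy it, and override only the hooks the pass cares about. The pass and the hook name visit_stmt_pre are chosen here purely for illustration; see me/walk.ml for the real visitor fields.

    (* Hypothetical pass: count statements into some pass-local state. *)
    let stmt_counting_visitor
        (count:int ref)
        (inner:Walk.visitor)
        : Walk.visitor =
      { inner with
          Walk.visit_stmt_pre =
            (fun s ->
               incr count;                      (* update pass-local state *)
               inner.Walk.visit_stmt_pre s) }   (* then defer to the inner visitor *)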
Pass order is therefore somewhat important, so I'll describe it here:

- me/resolve.ml looks up names and connects them to definitions. This includes expanding all types (as types can occur within names, as part of a parametric name) and performing all import/export/visibility judgments. After resolve, we should not be doing any further name-based lookups (with one exception: typestate does some more name lookup, for a subtle reason we'll return to).

  Resolve populates several of the tables near the top of Semant.ctxt:

    ctxt_all_cast_types
    ctxt_all_defns
    ctxt_all_item_names
    ctxt_all_item_types
    ctxt_all_lvals
    ctxt_all_stmts
    ctxt_all_type_items
    ctxt_block_items
    ctxt_block_slots
    ctxt_frame_args
    ctxt_lval_to_referent
    ctxt_node_referenced
    ctxt_required_items
    ctxt_slot_is_arg
    ctxt_slot_keys

  The most obviously critical of these are lval_to_referent and all_defns, which let subsequent visitors get from a reference node to its referent node, and catalogue all the possible things a referent may be.

  A part of resolve that is perhaps not obvious is the task of resolving and normalizing recursive types. This is what TY_iso is for. Recursive types in rust have to pass through a tag type on their recursive edge; a TY_iso is an iso-recursive group of tags that refer only to one another, and within a TY_iso the type term "TY_idx n" refers to "the nth member of the current TY_iso". Resolve is responsible for finding such groups and tying them into such closed-form knots. TY_name should be completely eliminated from any of the types exiting resolve.

- me/type.ml is a unification-based typechecker and inference engine. This is as textbook-y as we could make it. It rewrites "auto" slots in the ctxt_all_defns table when it completes (these are the slots with None as their Ast.slot_ty). The file is organized around tyspecs and tyvars. A tyspec is a constraint on an unknown type that is implied by its use; tyspecs are generated during the AST walk, placed in ref cells (tyvars), and the cells are unified with one another. If two tyvars unify, a new constraint is created from the tighter of the two, and the two previous tyvars are updated to point to the unified spec (a toy illustration of this scheme follows this pass list). Ideally every constraint eventually runs into a source of a concrete type (or a type otherwise uniquely determined by its tyspecs). If not, the type is underdetermined and we get a type error. Similarly, if two tyvars that are supposed to unify clash in some way (integer unified with string, say), that is also a type error.

- me/typestate.ml is a dataflow-based typestate checker. It is responsible for ensuring that all preconditions are met, including init-before-use. It also determines slot lifecycle boundaries, and populates the context tables:

    ctxt_constr_ids
    ctxt_constrs
    ctxt_copy_stmt_is_init
    ctxt_post_stmt_slot_drops
    ctxt_postconditions
    ctxt_poststates
    ctxt_preconditions
    ctxt_prestates

  It is organized around constr_keys, a bunch of bitsets, and a CFG. A constr_key is a normalized value representing a single constraint that we wish to be able to refer to within a typestate. Every constr_key gets a bit number assigned to it. A condition (and a typestate) is a bit vector, in which the set bits indicate the constr_keys (indexed by their associated numbers) that hold in the condition/typestate. There are 4 such bitsets generated for each node in the CFG: precondition/postcondition and prestate/poststate. The visitors here figure out all the constr_keys we'll need, then assign all the pre/post conditions, generate the CFG, calculate the typestates from the CFG, and check that every typestate satisfies its precondition. (Due to the peculiarity that types are pure terms and are not 'a identified in our AST, we have to do some name lookup in here as well when normalizing the constr_keys.)

- Effect is relatively simple: it calculates the effect of each type and item, and checks that they either match their declarations or are authorized to be lying.
- Loop is even simpler: it calculates loop-depth information for later use in generating foreach loops. It populates the context tables:

    ctxt_block_is_loop_body
    ctxt_slot_loop_depths
    ctxt_stmt_loop_depths

- Alias checks slot aliasing, to ensure that none of the rules about simultaneous aliases and such are broken. It also populates the table ctxt_slot_is_aliased.

- Layout determines the layout of frames, arguments, objects, closures and such. This includes deciding which slots should go in vregs and generating fixups for all frame-spill regions. It populates the context tables:

    ctxt_block_is_loop_body
    ctxt_call_sizes
    ctxt_frame_blocks
    ctxt_frame_sizes
    ctxt_slot_is_obj_state
    ctxt_slot_offsets
    ctxt_slot_vregs
    ctxt_spill_fixups

  There is a useful chunk of ASCII art in the leading comment of layout; if you want to see how a frame goes together, I recommend reading it.

- Trans is the big one. This is the "translate AST to IL" pass, and it's a bit of a dumping ground, sadly -- probably 4x the size of any other pass. Stuff that is common to the x86 and LLVM backends is factored out into transutil.ml, but it hardly helps. Suggestions welcome for splitting it further.

  Trans works *imperatively*. It maintains a stack of emitters, one per function (or helper function), and emits Il.quads into the top-of-stack emitter while it walks the statements of each function. If at any point it needs to pause to emit a helper function ("glue function"), it pushes a new emitter onto the stack and emits into that. Trans populates the context tables:

    ctxt_all_item_code
    ctxt_block_fixups
    ctxt_data
    ctxt_file_code
    ctxt_file_fixups
    ctxt_fn_fixups
    ctxt_glue_code

  The entries in the tables ending in _code are of type Semant.code, an abstract type covering both function and glue-function code; each holds an executable block of quads, plus an aggregate count of vregs and a reference to the spill fixup for that code.

- Once trans completes, driver/main.ml does the "finishing touches": it register-allocates each emitted code value (be/ra.ml), emits dwarf for the crate (me/dwarf.ml), selects instructions (be/x86.ml), then selects one of the object-file backends (be/elf.ml, be/macho.ml or be/pe.ml) and emits the selected Asm.frag to it. Hopefully little of this will require further work; the most incomplete module here is probably dwarf.ml, but the remainder are mostly stable and don't tend to change much, aside from picking bugs out of them.
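The toy illustration promised in the me/type.ml bullet above: the following OCaml sketch shows the general ref-cell unification idea, nothing more. The names and the two-case tyspec are invented for brevity; the real tyspecs in type.ml carry much richer constraints.

    (* Illustrative only: a tyvar is a ref cell holding the best constraint
       known so far for some unknown type. *)
    type tyspec =
        TYSPEC_all                  (* no constraint yet *)
      | TYSPEC_resolved of string   (* a concrete type, named by a string here *)

    type tyvar = tyspec ref

    (* Unifying two tyvars keeps the tighter constraint and points both cells
       at it; a clash between concrete types is a type error. *)
    let unify (a:tyvar) (b:tyvar) : unit =
      match (!a, !b) with
          (TYSPEC_all, t) | (t, TYSPEC_all) -> (a := t; b := t)
        | (TYSPEC_resolved x, TYSPEC_resolved y) when x = y -> ()
        | _ -> failwith "type error: mismatched types"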
Details and curiosities to note along the way:
==============================================

- Where you might expect there to be a general recursive expression type for 'expr', you'll find only a very limited, non-recursive 3-way switch: binary, unary, or atom, where an atom is either a literal or an lval. This is because all the "big" expressions (pexps) were boiled off during the desugaring phase in the front-end. (A sketch of this shape appears at the end of this section.)

- There are multiple ways to refer to a path. Names, lvals and cargs all appear to have similar structure (and do). They're all subsets of the general path grammar, so all follow the rough shape of being either a base anchor-path or an ext (extension) path, with structural recursion to the left.

  Cargs (constraint arguments) are the sort of paths that can be passed to constraints in the typestate system, and can contain the special symbol "*" in the grammar, meaning "the thing I am attached to". This is the symbol BASE_formal in the carg_base type.

  Names are the sort of paths that refer to types or other items. Not slots.

  Lvals are the sort of paths that *might* refer to slots, but we don't generally know; so they can contain the dynamic-indexing component COMP_atom. For example, x.(1 + 2).y is an lval.

- Only one of these forms is 'a identified: an lval. Moreover, only the lval *base* is identified; the remainder of the path has to be projected forward through the referent after lookup. This also means that when you look up anything else by name, you have to use the result immediately, not store it in a table for later.

- Types are not 'a identified. This means that you (generally) cannot refer to a *particular* occurrence of a type in the AST and associate information with it. Instead, we treat types as "pure terms" (not carrying identity) and calculate properties of them on the fly. For this we use a general fold defined in me/semant.ml: the family of functions held in a ty_fold structure and passed to fold_ty.

- There is a possibly-surprising type called "size" in util/common.ml. This is a type representing a "size expression" that may depend on runtime information, such as the type descriptors passed to a frame at runtime. It exists because our type-parameterization scheme is, at the moment, implemented by passing type descriptors around at runtime, not by code expansion a la C++ templates. So any time a translated indexing operation (or the like) depends on a type parameter, we wind up with a size expression involving SIZE_param_size or SIZE_param_align, and have to do size arithmetic at runtime. Upstream of trans, we generate sizes willy-nilly and then decide in trans, x86 and dwarf whether they can be emitted statically or must be calculated at runtime at the point of use.

- Trans generates position-independent code (PIC). This means that it never refers to the exact position of a fixup in memory at load-time, only the distance-to-a-fixup from some other fixup and/or the current PC. On x86 this means we wind up copying the "get next pc thunk" trick used on linux systems, and/or storing "crate-relative" addresses. The runtime and compiler "know" (unfortunately sometimes quite obscurely) that an immediate pointer should be encoded as relative to a given displacement base, and work with such pointers as necessary. Similarly, they emit code to reify pointer immediates (add the displacements to displacement bases) before handing them off to, say, C library functions that expect "real" pointers. This is all somewhat messy.

- There is one central static data structure, "rust_crate", which is emitted into the final loadable object and contains pointers to all the subsequent information the runtime may be interested in. It also serves as the displacement base for a variety of PIC-ish displacements stored elsewhere. When the runtime loads a crate, it dlsym()s rust_crate and then digs around in there; it's the entry-point for crawling the crate's structure from outside. Importantly, it also contains pointers to the dwarf.

- Currently we drive linking off dwarf. That is: when a crate needs to 'use' an item from another crate, we dlopen / LoadLibrary that crate, find its "rust_crate" value, follow its pointers to the dwarf tables, and scan around the dwarf DIE tree resolving the hierarchical name of the used item. This may change; we decided to recycle dwarf for this purpose early in the language's evolution and may, given the number of simplifications that have occurred along the way, be able to fall back on C "mangled name" linkage at some point. That decision carries a number of serious constraints, though, and should not be taken lightly.
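As promised in the first bullet of this section, here is an OCaml sketch of that flat expression shape. The stand-in types at the top are invented so the sketch compiles on its own; the real definitions live in fe/ast.ml and are richer.

    (* Stand-ins for the real binop/unop/lit/lval types in fe/ast.ml. *)
    type binop = ADD | SUB
    type unop = NEG | NOT
    type lit = LIT_int of int64 | LIT_bool of bool
    type lval = string list

    (* Post-desugaring expressions are flat: no recursion below expr. *)
    type expr =
        EXPR_binary of binop * atom * atom
      | EXPR_unary of unop * atom
      | EXPR_atom of atom

    and atom =
        ATOM_literal of lit    (* a constant *)
      | ATOM_lval of lval      (* a path that might name a slot *)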
Probably-bad ideas we will want to do differently in the self-hosted compiler:
==============================================================================

- We desugar too early in rustboot and should preserve the pexp structure until later. Dherman is likely to argue for movement to a more expression-focused grammar. This may well happen.

- Multiple kinds of paths enforced by numerous nearly-isomorphic ML type constructors is pointless once we're in rust; we can just make type abbreviations that carry constraints, like path : is_name(*) or such.

- Storing auxiliary information in semant tables is awkward, and we should figure out a suitably rusty idiom for decorating AST nodes in place. Inter-pass dependencies should be managed by augmenting the AST with ever-more constraints (is_resolved(ast), is_typechecked(ast), etc.).

- Trans should be organized as pure, value-producing code, not as imperative emission of quads into emitters. LLVM will enforce this anyways. See what happened in lltrans.ml if you're curious what it'll look (more) like.

- The PIC scheme will have to change, and hopefully get much easier.