rework the README.md for rustc and add other readmes

This takes way longer than I thought it would. =)
Niko Matsakis 2017-08-31 14:33:19 -04:00
parent 9a00f3cc30
commit 44e45d9fea
11 changed files with 463 additions and 41 deletions

@ -13,49 +13,82 @@ https://github.com/rust-lang/rust/issues
Your concerns are probably the same as someone else's.
You may also be interested in the
[Rust Forge](https://forge.rust-lang.org/), which includes a number of
interesting bits of information.
Finally, at the end of this file is a GLOSSARY defining a number of
common (and not necessarily obvious!) names that are used in the Rust
compiler code. If you see some funky name and you'd like to know what
it stands for, check there!
The crates of rustc
===================
Rustc consists of a number of crates, including `libsyntax`,
`librustc`, `librustc_back`, `librustc_trans`, and `librustc_driver`
(the names and divisions are not set in stone and may change;
in general, a finer-grained division of crates is preferable):
Rustc consists of a number of crates, including `syntax`,
`rustc`, `rustc_back`, `rustc_trans`, `rustc_driver`, and
many more. The source for each crate can be found in a directory
like `src/libXXX`, where `XXX` is the crate name.
- [`libsyntax`][libsyntax] contains those things concerned purely with syntax
that is, the AST, parser, pretty-printer, lexer, macro expander, and
utilities for traversing ASTs are in a separate crate called
"syntax", whose files are in `./../libsyntax`, where `.` is the
current directory (that is, the parent directory of front/, middle/,
back/, and so on).
(NB. The names and divisions of these crates are not set in
stone and may change over time -- for the time being, we tend towards
a finer-grained division to help with compilation time, though as
incremental improves that may change.)
- `librustc` (the current directory) contains the high-level analysis
passes, such as the type checker, borrow checker, and so forth.
It is the heart of the compiler.
The dependency structure of these crates is roughly a diamond:
- [`librustc_back`][back] contains some very low-level details that are
specific to different LLVM targets and so forth.
- [`librustc_trans`][trans] contains the code to convert from Rust IR into LLVM
IR, and then from LLVM IR into machine code, as well as the main
driver that orchestrates all the other passes and various other bits
of miscellany. In general it contains code that runs towards the
end of the compilation process.
- [`librustc_driver`][driver] invokes the compiler from
[`libsyntax`][libsyntax], then the analysis phases from `librustc`, and
finally the lowering and codegen passes from [`librustc_trans`][trans].
Roughly speaking the "order" of the three crates is as follows:
librustc_driver
|
+-----------------+-------------------+
| |
libsyntax -> librustc -> librustc_trans
```
rustc_driver
/ | \
/ | \
/ | \
/ v \
rustc_trans rustc_borrowck ... rustc_metadata
\ | /
\ | /
\ | /
\ v /
rustc
|
v
syntax
/ \
/ \
syntax_pos syntax_ext
```
The compiler process:
=====================
The idea is that `rustc_driver`, at the top of this lattice, basically
defines the overall control-flow of the compiler. It doesn't have much
"real code", but instead ties together all of the code defined in the
other crates and defines the overall flow of execution.
At the other extreme, the `rustc` crate defines the common and
pervasive data structures that all the rest of the compiler uses
(e.g., how to represent types, traits, and the program itself). It
also contains some amount of the compiler itself, although that is
relatively limited.
Finally, all the crates in the bulge in the middle define the bulk of
the compiler -- they all depend on `rustc`, so that they can make use
of the various types defined there, and they export public routines
that `rustc_driver` will invoke as needed (more and more, what these
crates export are "query definitions", but those are covered later
on).
Below `rustc` lie various crates that make up the parser and error
reporting mechanism. For historical reasons, these crates do not have
the `rustc_` prefix, but they are really just as much an internal part
of the compiler and not intended to be stable (though they do wind up
getting used by some crates in the wild; a practice we hope to
gradually phase out).
Each crate has a `README.md` file that describes, at a high level,
what it contains, and tries to give some kind of explanation (some
better than others).
The compiler process
====================
The Rust compiler is comprised of six main compilation phases.
@ -172,3 +205,29 @@ The 3 central data structures:
[back]: https://github.com/rust-lang/rust/tree/master/src/librustc_back/
[rustc]: https://github.com/rust-lang/rust/tree/master/src/librustc/
[driver]: https://github.com/rust-lang/rust/tree/master/src/librustc_driver
Glossary
========
The compiler uses a number of...idiosyncratic abbreviations and
things. This glossary attempts to list them and give you a few
pointers for understanding them better.
- AST -- the **abstract syntax tree** produced by the `syntax` crate; reflects user syntax
very closely.
- cx -- we tend to use "cx" as an abbreviation for context. See also tcx, infcx, etc.
- HIR -- the **High-level IR**, created by lowering and desugaring the AST. See `librustc/hir`.
- `'gcx` -- the lifetime of the global arena (see `librustc/ty`).
- generics -- the set of generic type parameters defined on a type or item
- infcx -- the inference context (see `librustc/infer`)
- MIR -- the **Mid-level IR** that is created after type-checking for use by borrowck and trans.
Defined in the `src/librustc/mir/` module, but much of the code that manipulates it is
found in `src/librustc_mir`.
- obligation -- something that must be proven by the trait system.
- sess -- the **compiler session**, which stores global data used throughout compilation
- substs -- the **substitutions** for a given generic type or item
(e.g., the `i32, u32` in `HashMap<i32, u32>`)
- tcx -- the "typing context", main data structure of the compiler (see `librustc/ty`).
- trans -- the code to **translate** MIR into LLVM IR.
- trait reference -- a trait and values for its type parameters (see `librustc/ty`).
- ty -- the internal representation of a **type** (see `librustc/ty`).

src/librustc/hir/README.md Normal file
@ -0,0 +1,123 @@
# Introduction to the HIR
The HIR -- "High-level IR" -- is the primary IR used in most of
rustc. It is a desugared version of the "abstract syntax tree" (AST)
that is generated after parsing, macro expansion, and name resolution
have completed. Many parts of HIR resemble Rust surface syntax quite
closely, with the exception that some of Rust's expression forms have
been desugared away (as an example, `for` loops are converted into a
`loop` and do not appear in the HIR).
This README covers the main concepts of the HIR.
### Out-of-band storage and the `Crate` type
The top-level data-structure in the HIR is the `Crate`, which stores
the contents of the crate currently being compiled (we only ever
construct HIR for the current crate). Whereas in the AST the crate
data structure basically just contains the root module, the HIR
`Crate` structure contains a number of maps and other things that
serve to organize the content of the crate for easier access.
For example, the contents of individual items (e.g., modules,
functions, traits, impls, etc) in the HIR are not immediately
accessible in the parents. So, for example, if you had a module item `foo`
containing a function `bar()`:
```
mod foo {
fn bar() { }
}
```
Then in the HIR the representation of module `foo` (the `Mod`
struct) would have only the **`ItemId`** `I` of `bar()`. To get the
details of the function `bar()`, we would look up `I` in the
`items` map.
One nice result from this representation is that one can iterate
over all items in the crate by iterating over the key-value pairs
in these maps (without the need to trawl through the IR in total).
There are similar maps for things like trait items and impl items,
as well as "bodies" (explained below).
The other reason to set up the representation this way is for better
integration with incremental compilation. This way, if you gain access
to a `&hir::Item` (e.g. for the mod `foo`), you do not immediately
gain access to the contents of the function `bar()`. Instead, you only
gain access to the **id** for `bar()`, and you must invoke some function to
look up the contents of `bar()` given its id; this gives us a chance to
observe that you accessed the data for `bar()` and record the
dependency.
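The out-of-band layout described above can be illustrated with a small, self-contained toy model. This is only a sketch: `ToyCrate`, `ItemKind`, `recorded_reads`, and `sample_crate` are hypothetical names invented for this example, and the real `hir::Crate` and the real dependency tracking are considerably more involved.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

/// Hypothetical stand-in for the real `ItemId`.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct ItemId(pub u32);

#[derive(Debug, PartialEq)]
pub enum ItemKind {
    /// A module stores only the ids of its children, never their contents.
    Mod { item_ids: Vec<ItemId> },
    Fn { name: String },
}

/// Toy version of the top-level crate structure (an assumption, not rustc's type).
pub struct ToyCrate {
    /// Every item in the crate, stored out-of-band and keyed by id.
    items: BTreeMap<ItemId, ItemKind>,
    /// Ids that have been looked up so far: a toy "dependency log".
    reads: RefCell<Vec<ItemId>>,
}

impl ToyCrate {
    pub fn new(items: BTreeMap<ItemId, ItemKind>) -> Self {
        ToyCrate { items, reads: RefCell::new(Vec::new()) }
    }

    /// All lookups go through this function, which gives us the chance
    /// to record that the caller depended on the item.
    /// Panics if `id` is not present in the map.
    pub fn item(&self, id: ItemId) -> &ItemKind {
        self.reads.borrow_mut().push(id);
        &self.items[&id]
    }

    /// Enumerate every item without walking the module tree.
    pub fn all_item_ids(&self) -> Vec<ItemId> {
        self.items.keys().copied().collect()
    }

    /// The ids that were accessed through `item()`, in order.
    pub fn recorded_reads(&self) -> Vec<ItemId> {
        self.reads.borrow().clone()
    }
}

/// Builds the `mod foo { fn bar() { } }` example from above.
pub fn sample_crate() -> ToyCrate {
    let mut items = BTreeMap::new();
    items.insert(ItemId(0), ItemKind::Mod { item_ids: vec![ItemId(1)] });
    items.insert(ItemId(1), ItemKind::Fn { name: "bar".to_string() });
    ToyCrate::new(items)
}
```

In this sketch, holding the `Mod` for `foo` only reveals `ItemId(1)`; the contents of `bar()` become visible (and the read gets logged) only once `item(ItemId(1))` is called.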
### Identifiers in the HIR
Most of the code that has to deal with things in HIR tends not to
carry around references into the HIR, but rather to carry around
*identifier numbers* (or just "ids"). Right now, you will find four
sorts of identifiers in active use:
- `DefId`, which primarily names "definitions" or top-level items.
- You can think of a `DefId` as being shorthand for a very explicit
and complete path, like `std::collections::HashMap`. However,
these paths are able to name things that are not nameable in
normal Rust (e.g., impls), and they also include extra information
about the crate (such as its version number, as two versions of
the same crate can co-exist).
- A `DefId` really consists of two parts, a `CrateNum` (which
identifies the crate) and a `DefIndex` (which indexes into a list
of items that is maintained per crate).
- `HirId`, which combines the index of a particular item with an
offset within that item.
- the key point of a `HirId` is that it is *relative* to some item (which is named
via a `DefId`).
- `BodyId`, which is an absolute identifier that refers to a specific
body (definition of a function or constant) in the crate. It is currently
effectively a "newtype'd" `NodeId`.
- `NodeId`, which is an absolute id that identifies a single node in the HIR tree.
- While these are still in common use, **they are being slowly phased out**.
- Since they are absolute within the crate, adding a new node
anywhere in the tree causes the node-ids of all subsequent code in
the crate to change. This is terrible for incremental compilation,
as you can perhaps imagine.
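To make the difference between absolute and relative ids concrete, here is a minimal sketch. The types below are simplified, hypothetical stand-ins that only borrow the names used above, and the two `assign_*` functions are invented for illustration; they are not how rustc actually numbers nodes.

```rust
/// Simplified stand-ins for the id types described above (assumptions).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct CrateNum(pub u32);

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct DefIndex(pub u32);

/// A `DefId` is a crate plus an index into that crate's list of items.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct DefId {
    pub krate: CrateNum,
    pub index: DefIndex,
}

/// Relative id: the owning item plus an offset within that item.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct HirId {
    pub owner: DefIndex,
    pub local_id: u32,
}

/// Absolute id: a single counter that runs over the whole crate.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct NodeId(pub u32);

/// `nodes_per_item[i]` is the number of nodes inside item `i`.
/// Absolute numbering: each item continues where the previous one stopped.
pub fn assign_node_ids(nodes_per_item: &[u32]) -> Vec<Vec<NodeId>> {
    let mut next = 0u32;
    nodes_per_item
        .iter()
        .map(|&count| {
            let ids: Vec<NodeId> = (next..next + count).map(NodeId).collect();
            next += count;
            ids
        })
        .collect()
}

/// Relative numbering: every item starts counting from zero again.
pub fn assign_hir_ids(nodes_per_item: &[u32]) -> Vec<Vec<HirId>> {
    nodes_per_item
        .iter()
        .enumerate()
        .map(|(item, &count)| {
            let owner = DefIndex(item as u32);
            (0..count)
                .map(|local_id| HirId { owner, local_id })
                .collect::<Vec<_>>()
        })
        .collect()
}
```

If a node is added to the first item (say the counts change from `[2, 3]` to `[3, 3]`), every absolute id in the second item shifts by one, while the relative ids of the second item stay exactly the same. That stability is what makes the relative scheme friendlier to incremental compilation.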
### HIR Map
Most of the time when you are working with the HIR, you will do so via
the **HIR Map**, accessible in the tcx via `tcx.hir` (and defined in
the `hir::map` module). The HIR map contains a number of methods to
convert between ids of various kinds and to look up data associated
with a HIR node.
For example, if you have a `DefId`, and you would like to convert it
to a `NodeId`, you can use `tcx.hir.as_local_node_id(def_id)`. This
returns an `Option<NodeId>` -- this will be `None` if the def-id
refers to something outside of the current crate (since then it has no
HIR node), but otherwise returns `Some(n)` where `n` is the node-id of
the definition.
Similarly, you can use `tcx.hir.find(n)` to look up the node for a
`NodeId`. This returns an `Option<Node<'tcx>>`, where `Node` is an enum
defined in the map; by matching on this you can find out what sort of
node the node-id referred to and also get a pointer to the data
itself. Often, you know what sort of node `n` is -- e.g., if you know
that `n` must be some HIR expression, you can do
`tcx.hir.expect_expr(n)`, which will extract and return the
`&hir::Expr`, panicking if `n` is not in fact an expression.
Finally, you can use the HIR map to find the parents of nodes, via
calls like `tcx.hir.get_parent_node(n)`.
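The lookup pattern described in this section can be sketched with a toy map. Everything below (`ToyMap`, its `insert` method, the tiny `Item` and `Expr` structs) is a hypothetical model that merely reuses the method names mentioned above; the signatures in the real `hir::map` module may differ, for instance in how missing parents are reported.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct NodeId(pub u32);

/// Tiny stand-ins for HIR data (assumptions for this sketch).
#[derive(Debug, PartialEq)]
pub struct Item {
    pub name: String,
}

#[derive(Debug, PartialEq)]
pub struct Expr {
    pub text: String,
}

/// What a node-id can refer to; matching on it tells you the node's kind.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Node<'hir> {
    Item(&'hir Item),
    Expr(&'hir Expr),
}

/// Toy map from node-id to (parent id, node).
pub struct ToyMap<'hir> {
    entries: HashMap<NodeId, (Option<NodeId>, Node<'hir>)>,
}

impl<'hir> ToyMap<'hir> {
    pub fn new() -> Self {
        ToyMap { entries: HashMap::new() }
    }

    pub fn insert(&mut self, id: NodeId, parent: Option<NodeId>, node: Node<'hir>) {
        self.entries.insert(id, (parent, node));
    }

    /// Returns the node for `id`, or `None` if the id is unknown.
    pub fn find(&self, id: NodeId) -> Option<Node<'hir>> {
        self.entries.get(&id).map(|&(_, node)| node)
    }

    /// Extracts the expression, panicking if `id` is not an expression.
    pub fn expect_expr(&self, id: NodeId) -> &'hir Expr {
        match self.find(id) {
            Some(Node::Expr(expr)) => expr,
            other => panic!("expected an expression, found {:?}", other),
        }
    }

    /// Returns the id of the parent node, if one was recorded.
    pub fn get_parent_node(&self, id: NodeId) -> Option<NodeId> {
        self.entries.get(&id).and_then(|&(parent, _)| parent)
    }
}
```

The design point the sketch tries to capture is that callers hold cheap, copyable ids and go back to the map whenever they need the data, rather than holding long-lived references into the tree.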
### HIR Bodies
A **body** represents some kind of executable code, such as the body
of a function/closure or the definition of a constant. Bodies are
associated with an **owner**, which is typically some kind of item
(e.g., a `fn()` or `const`), but could also be a closure expression
(e.g., `|x, y| x + y`). You can use the HIR map to find the body
associated with a given def-id (`maybe_body_owned_by()`) or to find
the owner of a body (`body_owner_def_id()`).

@ -0,0 +1,4 @@
The HIR map, accessible via `tcx.hir`, allows you to quickly navigate the
HIR and convert between various forms of identifiers. See [the HIR README] for more information.
[the HIR README]: ../README.md

@ -413,6 +413,9 @@ pub struct WhereEqPredicate {
pub type CrateConfig = HirVec<P<MetaItem>>;
/// The top-level data structure that stores the entire contents of
/// the crate currently being compiled.
///
#[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Debug)]
pub struct Crate {
pub module: Mod,
@ -927,7 +930,27 @@ pub struct BodyId {
pub node_id: NodeId,
}
/// The body of a function or constant value.
/// The body of a function, closure, or constant value. In the case of
/// a function, the body contains not only the function body itself
/// (which is an expression), but also the argument patterns, since
/// those are something that the caller doesn't really care about.
///
/// Example:
///
/// ```rust
/// fn foo((x, y): (u32, u32)) -> u32 {
/// x + y
/// }
/// ```
///
/// Here, the `Body` associated with `foo()` would contain:
///
/// - an `arguments` array containing the `(x, y)` pattern
/// - a `value` containing the `x + y` expression (maybe wrapped in a block)
/// - `is_generator` would be false
///
/// All bodies have an **owner**, which can be accessed via the HIR
/// map using `body_owner_def_id()`.
#[derive(Clone, PartialEq, Eq, RustcEncodable, RustcDecodable, Hash, Debug)]
pub struct Body {
pub arguments: HirVec<Arg>,

@ -8,7 +8,28 @@
// option. This file may not be copied, modified, or distributed
// except according to those terms.
//! The Rust compiler.
//! The "main crate" of the Rust compiler. This crate contains common
//! type definitions that are used by the other crates in the rustc
//! "family". Some prominent examples (note that each of these modules
//! has their own README with further details).
//!
//! - **HIR.** The "high-level (H) intermediate representation (IR)" is
//! defined in the `hir` module.
//! - **MIR.** The "mid-level (M) intermediate representation (IR)" is
//! defined in the `mir` module. This module contains only the
//! *definition* of the MIR; the passes that transform and operate
//! on MIR are found in the `librustc_mir` crate.
//! - **Types.** The internal representation of types used in rustc is
//! defined in the `ty` module. This includes the **type context**
//! (or `tcx`), which is the central context during most of
//! compilation, containing the interners and other things.
//! - **Traits.** Trait resolution is implemented in the `traits` module.
//! - **Type inference.** The type inference code can be found in the `infer` module;
//! this code handles low-level equality and subtyping operations. The
//! type check pass in the compiler is found in the `librustc_typeck` crate.
//!
//! For a deeper explanation of how the compiler works and is
//! organized, see the README.md file in this directory.
//!
//! # Note
//!

src/librustc/ty/README.md Normal file
@ -0,0 +1,159 @@
# Types and the Type Context
The `ty` module defines how the Rust compiler represents types
internally. It also defines the *typing context* (`tcx` or `TyCtxt`),
which is the central data structure in the compiler.
## The tcx and how it uses lifetimes
The `tcx` ("typing context") is the central data structure in the
compiler. It is the context that you use to perform all manner of
queries. The struct `TyCtxt` defines a reference to this shared context:
```rust
tcx: TyCtxt<'a, 'gcx, 'tcx>
// -- ---- ----
// | | |
// | | innermost arena lifetime (if any)
// | "global arena" lifetime
// lifetime of this reference
```
As you can see, the `TyCtxt` type takes three lifetime parameters.
These lifetimes are perhaps the most complex thing to understand about
the tcx. During Rust compilation, we allocate most of our memory in
**arenas**, which are basically pools of memory that get freed all at
once. When you see a reference with a lifetime like `'tcx` or `'gcx`,
you know that it refers to arena-allocated data (or data that lives as
long as the arenas, anyhow).
We use two distinct levels of arenas. The outer level is the "global
arena". This arena lasts for the entire compilation: so anything you
allocate in there is only freed once compilation is basically over
(actually, when we shift to executing LLVM).
To reduce peak memory usage, when we do type inference, we also use an
inner level of arena. These arenas get thrown away once type inference
is over. This is done because type inference generates a lot of
"throw-away" types that are not particularly interesting after type
inference completes, so keeping around those allocations would be
wasteful.
Often, we wish to write code that explicitly asserts that it is not
taking place during inference. In that case, there is no "local"
arena, and all the types that you can access are allocated in the
global arena. To express this, the idea is to use the same lifetime
for the `'gcx` and `'tcx` parameters of `TyCtxt`. Just to be a touch
confusing, we tend to use the name `'tcx` in such contexts. Here is an
example:
```rust
fn not_in_inference<'a, 'tcx>(tcx: TyCtxt<'a, 'tcx, 'tcx>, def_id: DefId) {
// ---- ----
// Using the same lifetime here asserts
// that the innermost arena accessible through
// this reference *is* the global arena.
}
```
In contrast, if we want code that can be used during type inference, then we
need to declare distinct `'gcx` and `'tcx` lifetime parameters:
```rust
fn maybe_in_inference<'a, 'gcx, 'tcx>(tcx: TyCtxt<'a, 'gcx, 'tcx>, def_id: DefId) {
// ---- ----
// Using different lifetimes here means that
// the innermost arena *may* be distinct
// from the global arena (but doesn't have to be).
}
```
### Allocating and working with types
Rust types are represented using the `ty::Ty<'tcx>` type. This is in fact a simple type alias
for a reference with `'tcx` lifetime:
```rust
pub type Ty<'tcx> = &'tcx TyS<'tcx>;
```
The `TyS` struct defines the actual details of how a type is
represented. The most interesting part of it is the `sty` field, which
contains an enum that lets us test what sort of type this is. For
example, it is very common to see code that tests what sort of type you have
that looks roughly like so:
```rust
fn test_type<'tcx>(ty: Ty<'tcx>) {
match ty.sty {
ty::TyArray(elem_ty, len) => { ... }
...
}
}
```
(Note though that doing such low-level tests on types during inference
can be risky, as there may be inference variables and other things
to consider, or sometimes types are not yet known that will become
known later.)
To allocate a new type, you can use the various `mk_` methods defined
on the `tcx`. These have names that correspond mostly to the various kinds
of type variants. For example:
```rust
let array_ty = tcx.mk_array(elem_ty, len * 2);
```
These methods all return a `Ty<'tcx>` -- note that the lifetime you
get back is the lifetime of the innermost arena that this `tcx` has
access to. In fact, types are always canonicalized and interned (so we
never allocate exactly the same type twice) and are always allocated
in the outermost arena where they can be (so, if they do not contain
any inference variables or other "temporary" types, they will be
allocated in the global arena). However, the lifetime `'tcx` is always
a safe approximation, so that is what you get back.
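The idea of interning can be shown in isolation with a short sketch. It is deliberately not the compiler's implementation: `Interner` and this `TyKind` enum are hypothetical, there is only a single "arena", and `Box::leak` stands in for arena allocation (so nothing is ever freed, whereas a real arena frees everything at once).

```rust
use std::collections::HashSet;

/// A toy type representation (assumption: far simpler than rustc's).
#[derive(Debug, PartialEq, Eq, Hash)]
pub enum TyKind {
    Bool,
    U32,
    Array(Ty, u64),
}

/// Like `Ty<'tcx>`, a type is just a reference to interned data.
/// `'static` is used here only because the sketch leaks its allocations.
pub type Ty = &'static TyKind;

pub struct Interner {
    /// All types allocated so far, looked up by structural equality.
    set: HashSet<&'static TyKind>,
}

impl Interner {
    pub fn new() -> Self {
        Interner { set: HashSet::new() }
    }

    /// Returns the canonical reference for `kind`, allocating only if
    /// a structurally equal type has not been seen before.
    pub fn intern(&mut self, kind: TyKind) -> Ty {
        if let Some(&existing) = self.set.get(&kind) {
            return existing;
        }
        // Stand-in for arena allocation: leak the box to get a long-lived reference.
        let ty: Ty = Box::leak(Box::new(kind));
        self.set.insert(ty);
        ty
    }

    /// Analogue of a `mk_` method: build and intern an array type.
    pub fn mk_array(&mut self, elem: Ty, len: u64) -> Ty {
        self.intern(TyKind::Array(elem, len))
    }
}
```

Asking for the same array type twice yields the same pointer, which is why identity can be checked cheaply with `std::ptr::eq` in this sketch (plain `==` on Rust references would compare the pointed-to values instead).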
NB. Because types are interned, it is possible to compare them for
equality efficiently using `==` -- however, this is almost never what
you want to do unless you happen to be hashing and looking for
duplicates. This is because often in Rust there are multiple ways to
represent the same type, particularly once inference is involved. If
you are going to be testing for type equality, you probably need to
start looking into the inference code to do it right.
You can also find various common types in the tcx itself by accessing
`tcx.types.bool`, `tcx.types.char`, etc (see `CommonTypes` for more).
### Beyond types: Other kinds of arena-allocated data structures
In addition to types, there are a number of other arena-allocated data
structures that you can allocate, and which are found in this
module. Here are a few examples:
- `Substs`, allocated with `mk_substs` -- this will intern a slice of types, often used to
specify the values to be substituted for generics (e.g., `HashMap<i32, u32>`
would be represented as a slice `&'tcx [tcx.types.i32, tcx.types.u32]`).
- `TraitRef`, typically passed by value -- a **trait reference**
consists of a reference to a trait along with its various type
parameters (including `Self`), like `i32: Display` (here, the def-id
would reference the `Display` trait, and the substs would contain
`i32`).
- `Predicate` defines something the trait system has to prove (see `traits` module).
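
As a rough illustration of what "substituting for generics" means, the sketch below replaces numbered type parameters with the entries of a substitution slice. The `Ty` enum and the `subst` function are hypothetical and owned rather than arena-allocated; they are meant to show the concept, not the compiler's actual `Substs` machinery.

```rust
/// A toy, owned type representation (an assumption for this sketch).
#[derive(Clone, Debug, PartialEq)]
pub enum Ty {
    I32,
    U32,
    /// The i-th generic parameter of the item being instantiated.
    Param(usize),
    /// A named type applied to type arguments, e.g. `HashMap<K, V>`.
    Adt { name: &'static str, args: Vec<Ty> },
}

/// Replaces every `Param(i)` in `ty` with `substs[i]`.
/// Panics if a parameter index is out of range for `substs`.
pub fn subst(ty: &Ty, substs: &[Ty]) -> Ty {
    match ty {
        Ty::Param(index) => substs[*index].clone(),
        Ty::Adt { name, args } => Ty::Adt {
            name: *name,
            args: args.iter().map(|arg| subst(arg, substs)).collect(),
        },
        other => other.clone(),
    }
}
```

With `HashMap<K, V>` modelled as an `Adt` whose arguments are `Param(0)` and `Param(1)`, applying the substitutions `[I32, U32]` produces the model of `HashMap<i32, u32>`.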
### Import conventions
Although there is no hard and fast rule, the `ty` module tends to be used like so:
```rust
use ty::{self, Ty, TyCtxt};
```
In particular, since they are so common, the `Ty` and `TyCtxt` types
are imported directly. Other types are often referenced with an
explicit `ty::` prefix (e.g., `ty::TraitRef<'tcx>`). But some modules
choose to import a larger or smaller set of names explicitly.

@ -793,9 +793,11 @@ fn new(interners: &CtxtInterners<'tcx>) -> CommonTypes<'tcx> {
}
}
/// The data structure to keep track of all the information that typechecker
/// generates so that so that it can be reused and doesn't have to be redone
/// later on.
/// The central data structure of the compiler. Keeps track of all the
/// information that the typechecker generates so that it can be
/// reused and doesn't have to be redone later on.
///
/// See [the README](README.md) for more details.
#[derive(Copy, Clone)]
pub struct TyCtxt<'a, 'gcx: 'a+'tcx, 'tcx: 'a> {
gcx: &'a GlobalCtxt<'gcx>,

@ -0,0 +1,6 @@
NB: This crate is part of the Rust compiler. For an overview of the
compiler as a whole, see
[the README.md file found in `librustc`](../librustc/README.md).
`librustc_back` contains some very low-level details that are
specific to different LLVM targets and so forth.

@ -0,0 +1,12 @@
NB: This crate is part of the Rust compiler. For an overview of the
compiler as a whole, see
[the README.md file found in `librustc`](../librustc/README.md).
The `driver` crate is effectively the "main" function for the rust
compiler. It orchestrates the compilation process and "knits together"
the code from the other crates within rustc. This crate itself does
not contain any of the "main logic" of the compiler (though it does
have some code related to pretty printing or other minor compiler
options).

@ -1 +1,7 @@
See [librustc/README.md](../librustc/README.md).
NB: This crate is part of the Rust compiler. For an overview of the
compiler as a whole, see
[the README.md file found in `librustc`](../librustc/README.md).
The `trans` crate contains the code to convert from MIR into LLVM IR,
and then from LLVM IR into machine code. In general it contains code
that runs towards the end of the compilation process.

src/libsyntax/README.md Normal file
@ -0,0 +1,7 @@
NB: This crate is part of the Rust compiler. For an overview of the
compiler as a whole, see
[the README.md file found in `librustc`](../librustc/README.md).
The `syntax` crate contains those things concerned purely with syntax --
that is, the AST ("abstract syntax tree"), parser, pretty-printer,
lexer, macro expander, and utilities for traversing ASTs.