% vim: tw=100

\documentclass[twocolumn]{article}
\usepackage{blindtext}
\usepackage[hypcap]{caption}
\usepackage{fontspec}
\usepackage[colorlinks, urlcolor={blue!80!black}]{hyperref}
\usepackage[outputdir=out]{minted}
\usepackage{relsize}
\usepackage{xcolor}

\setmonofont{Source Code Pro}[
  BoldFont={* Medium},
  BoldItalicFont={* Medium Italic},
  Scale=MatchLowercase,
]

\newcommand{\rust}[1]{\mintinline{rust}{#1}}

\begin{document}

\title{Miri: \\ \smaller{An interpreter for Rust's mid-level intermediate representation}}
% \subtitle{test}
\author{Scott Olson\footnote{\href{mailto:scott@solson.me}{scott@solson.me}} \\
  \smaller{Supervised by Christopher Dutchyn}}
\date{April 8th, 2016}
\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
2016-04-08 15:37:17 -05:00
|
|
|
\section{Abstract}
|
|
|
|
|
|
|
|
The increasing need for safe low-level code in contexts like operating systems and browsers is
driving the development of Rust\footnote{\url{https://www.rust-lang.org}}, a programming language
backed by Mozilla promising blazing speed without the segfaults. To make programming more
convenient, it's often desirable to generate code or perform some computation at compile time. The
former is mostly covered by Rust's existing macro feature, but the latter is currently restricted to
a limited form of constant evaluation capable of little beyond simple math.

When the existing constant evaluator was built, it would have been difficult to make it more
powerful than it is. However, a new intermediate representation was recently
added\footnote{\href{https://github.com/rust-lang/rfcs/blob/master/text/1211-mir.md}{The MIR RFC}}
to the Rust compiler between the abstract syntax tree and the back-end LLVM IR, called mid-level
intermediate representation, or MIR for short. As it turns out, writing an interpreter for MIR is a
surprisingly effective approach for supporting a large proportion of Rust's features in compile-time
execution.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
|
|
|
|
|
2016-04-08 20:54:03 -05:00
|
|
|
\section{Background}
|
2016-04-08 15:37:17 -05:00
|
|
|
|
2016-04-09 20:36:55 -05:00
|
|
|
The Rust compiler generates an instance of \rust{Mir} for each function (\autoref{fig:mir}). Each
\rust{Mir} structure represents the control-flow graph of a function and contains a list of
``basic blocks'', each of which contains a list of statements followed by a single terminator. Each
statement is of the form \rust{lvalue = rvalue}. An \rust{Lvalue} is used for referencing variables
and calculating addresses, such as when dereferencing pointers, accessing fields, or indexing
arrays. An \rust{Rvalue} represents the core set of operations possible in MIR, including reading a
value from an lvalue, performing math operations, and creating new pointers, structs, and arrays.
Finally, a terminator decides where control will flow next, optionally based on a boolean or some
other condition.

\begin{figure}[ht]
\begin{minted}[autogobble]{rust}
struct Mir {
    basic_blocks: Vec<BasicBlockData>,
    // ...
}

struct BasicBlockData {
    statements: Vec<Statement>,
    terminator: Terminator,
    // ...
}

struct Statement {
    lvalue: Lvalue,
    rvalue: Rvalue
}

enum Terminator {
    Goto { target: BasicBlock },
    If {
        cond: Operand,
        targets: [BasicBlock; 2]
    },
    // ...
}
\end{minted}
\caption{MIR (simplified)}
\label{fig:mir}
\end{figure}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{First implementation}

\subsection{Basic operation}

Initially, I wrote a simple version of Miri\footnote{\url{https://github.com/tsion/miri}} that was
quite capable despite its flaws. The structure of the interpreter closely mirrors the structure of
MIR itself. It starts executing a function by iterating over the statement list in the starting
basic block, matching on the lvalue to produce a pointer and matching on the rvalue to decide what
to write into that pointer. Evaluating the rvalue may involve reads (such as for the two sides of a
binary operation) or construction of new values. Upon reaching the terminator, Miri performs a
similar match and selects a new basic block. Finally, Miri returns to the top of the main
interpreter loop and the entire process repeats, reading statements from the new block.
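
The loop described above can be sketched in a few lines of Rust. This is a minimal illustration
under stated assumptions, not Miri's actual code: it uses a hypothetical, much smaller MIR in which
every local is an \rust{i64} slot and lvalues are plain local indices. The \rust{Rvalue} variants,
the \rust{local_count} field, and the \rust{run} function are all invented for the example.

\begin{minted}[autogobble]{rust}
// Hypothetical, heavily simplified MIR (not rustc's real data structures).
type Local = usize;
type BasicBlock = usize;

enum Rvalue {
    Const(i64),
    Use(Local),
    Add(Local, Local),
    Lt(Local, Local), // produces 1 for true and 0 for false
}

struct Statement {
    lvalue: Local,
    rvalue: Rvalue,
}

enum Terminator {
    Goto { target: BasicBlock },
    If { cond: Local, targets: [BasicBlock; 2] },
    Return,
}

struct BasicBlockData {
    statements: Vec<Statement>,
    terminator: Terminator,
}

struct Mir {
    basic_blocks: Vec<BasicBlockData>,
    local_count: usize,
}

/// Runs `mir` starting at block 0 and returns the value left in local 0.
fn run(mir: &Mir) -> i64 {
    let mut locals = vec![0i64; mir.local_count];
    let mut block = 0;
    loop {
        let data = &mir.basic_blocks[block];
        // Evaluate each rvalue and write the result to the statement's lvalue.
        for stmt in &data.statements {
            let value = match stmt.rvalue {
                Rvalue::Const(c) => c,
                Rvalue::Use(l) => locals[l],
                Rvalue::Add(a, b) => locals[a] + locals[b],
                Rvalue::Lt(a, b) => (locals[a] < locals[b]) as i64,
            };
            locals[stmt.lvalue] = value;
        }
        // The terminator selects the next block, then the loop repeats.
        match data.terminator {
            Terminator::Goto { target } => block = target,
            Terminator::If { cond, targets } => {
                block = if locals[cond] != 0 { targets[0] } else { targets[1] };
            }
            Terminator::Return => return locals[0],
        }
    }
}
\end{minted}

Even a \rust{while} loop in the interpreted program needs nothing more than \rust{Goto} and
\rust{If} terminators that jump back to an earlier block, so the interpreter itself stays a single
flat loop.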

\subsection{Function calls}

To handle function call terminators\footnote{Calls occur only as terminators, never as rvalues.},
Miri must store some information in a virtual call stack so that it can pick up where it left off
when the callee returns. Each stack frame stores a reference to the \rust{Mir} for the function
being executed, its local variables, its return value location\footnote{Return value pointers are
passed in by callers.}, and the basic block where execution should resume. To facilitate returning,
there is a \rust{Return} terminator which causes Miri to pop a stack frame and resume the previous
function. The entire execution of a program completes when the first function that Miri called
returns, rendering the call stack empty.

Note that Miri does not itself recurse when a function is called; it merely pushes a virtual stack
frame and jumps to the top of the interpreter loop. Consequently, Miri can interpret deeply
recursive programs without overflowing its own native stack. It could also set a stack depth limit
and report an error when a program exceeds it.
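
The virtual call stack can be sketched as follows. Again, this is an illustration under
assumptions rather than Miri's implementation: the tiny MIR, the single-argument calling convention
(local 0 is the return slot, local 1 the argument), the \rust{new_frame} helper, and the
\rust{max_depth} parameter are all hypothetical. Only the contents of \rust{Frame} follow the
description above.

\begin{minted}[autogobble]{rust}
type Local = usize;
type BasicBlock = usize;
type FnId = usize;

enum Statement {
    Const { dest: Local, value: i64 },
    Add { dest: Local, a: Local, b: Local },
}

enum Terminator {
    // Call `func`; its result is written to `dest`, then `target` runs.
    Call { func: FnId, arg: Local, dest: Local, target: BasicBlock },
    If { cond: Local, targets: [BasicBlock; 2] },
    Return,
}

struct BasicBlockData {
    statements: Vec<Statement>,
    terminator: Terminator,
}

struct Mir {
    basic_blocks: Vec<BasicBlockData>,
    local_count: usize, // must be at least 2 in this sketch
}

struct Frame<'a> {
    mir: &'a Mir,
    locals: Vec<i64>,
    return_dest: Option<Local>, // slot in the caller's locals
    block: BasicBlock,          // where this frame resumes
}

fn new_frame(mir: &Mir, arg: i64, return_dest: Option<Local>) -> Frame<'_> {
    let mut locals = vec![0; mir.local_count];
    locals[1] = arg;
    Frame { mir, locals, return_dest, block: 0 }
}

fn run(fns: &[Mir], entry: FnId, arg: i64, max_depth: usize) -> Result<i64, String> {
    let mut stack = vec![new_frame(&fns[entry], arg, None)];
    loop {
        let frame = stack.last_mut().expect("the stack is never empty here");
        let mir = frame.mir;
        let data = &mir.basic_blocks[frame.block];
        for stmt in &data.statements {
            match *stmt {
                Statement::Const { dest, value } => frame.locals[dest] = value,
                Statement::Add { dest, a, b } => {
                    frame.locals[dest] = frame.locals[a] + frame.locals[b]
                }
            }
        }
        match data.terminator {
            Terminator::Call { func, arg, dest, target } => {
                // Remember where to resume, then push a frame instead of recursing.
                frame.block = target;
                let arg = frame.locals[arg];
                if stack.len() >= max_depth {
                    return Err("stack depth limit exceeded".to_string());
                }
                stack.push(new_frame(&fns[func], arg, Some(dest)));
            }
            Terminator::If { cond, targets } => {
                frame.block = if frame.locals[cond] != 0 { targets[0] } else { targets[1] };
            }
            Terminator::Return => {
                let finished = stack.pop().expect("the stack is never empty here");
                let value = finished.locals[0];
                match (stack.last_mut(), finished.return_dest) {
                    (Some(caller), Some(dest)) => caller.locals[dest] = value,
                    _ => return Ok(value), // the entry function returned
                }
            }
        }
    }
}
\end{minted}

Because frames live in an ordinary \rust{Vec}, the recursion depth of the interpreted program is
bounded only by available memory or by \rust{max_depth}, not by the interpreter's native stack.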

\subsection{Flaws}

This version of Miri was surprisingly easy to write and already supported quite a bit of the Rust
language, including booleans, integers, if-conditions, while-loops, structs, enums, arrays, tuples,
pointers, and function calls, all in about 400 lines of Rust code. However, it had a particularly
naive value representation with a number of downsides. It resembled the data layout of a dynamic
language like Ruby or Python, where every value has the same size\footnote{A Rust \rust{enum} is a
discriminated union with a tag and data the size of the largest variant, regardless of which variant
it contains.} in the interpreter:

\begin{minted}[autogobble]{rust}
enum Value {
    Uninitialized,
    Bool(bool),
    Int(i64),
    Pointer(Pointer), // index into stack
    Adt { variant: usize, data_ptr: Pointer },
    // ...
}
\end{minted}

This representation did not work well for \rust{Adt}s\footnote{Algebraic data types: structs, enums,
arrays, and tuples.} and required strange hacks to support them. Their contained values were
allocated elsewhere on the stack and pointed to by the \rust{Adt} value, which complicated copying
\rust{Adt} values from place to place.

Moreover, while the \rust{Adt} issues could be worked around, this value representation made common
\rust{unsafe} programming tricks (which make assumptions about the low-level value layout)
fundamentally impossible.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Current implementation}

Roughly halfway through my time working on Miri, Rust compiler team member Eduard
Burtescu\footnote{\href{https://www.rust-lang.org/team.html\#Compiler}{The Rust compiler team}} made
a post on the Rust internals
forum\footnote{\href{https://internals.rust-lang.org/t/mir-constant-evaluation/3143/31}{Burtescu's
``Rust Abstract Machine'' forum post}} about a ``Rust Abstract Machine'' specification which could
be used to implement more powerful compile-time function execution, similar to what is supported by
C++14's \mintinline{cpp}{constexpr} feature. After clarifying some of the details of the abstract
machine's data layout with Burtescu via IRC, I started implementing it in Miri.

\subsection{Raw value representation}

The main change in the new value representation is that values are stored in ``abstract
allocations'': arrays of raw bytes whose sizes depend on the types of the values they hold. This
closely mimics how Rust values are represented when compiled for traditional machines. In addition
to the raw bytes, allocations carry information about pointers and undefined bytes.

\begin{minted}[autogobble]{rust}
struct Memory {
    map: HashMap<AllocId, Allocation>,
    next_id: AllocId,
}

struct Allocation {
    bytes: Vec<u8>,
    relocations: BTreeMap<usize, AllocId>,
    undef_mask: UndefMask,
}
\end{minted}

\subsubsection{Relocations}

The abstract machine represents pointers through ``relocations'', which are analogous to relocations
in linkers\footnote{\href{https://en.wikipedia.org/wiki/Relocation_(computing)}{Relocation
(computing) - Wikipedia}}. Instead of storing a global memory address in the raw byte representation
as a traditional machine would, we store an offset from the start of the target allocation and add
an entry to the relocation table. The entry maps the index of the start of the offset bytes to the
\rust{AllocId} of the target allocation.

\begin{figure}[ht]
\begin{minted}[autogobble]{rust}
let a: [i16; 3] = [2, 4, 6];
let b = &a[1];
// A: 02 00 04 00 06 00 (6 bytes)
// B: 02 00 00 00       (4 bytes)
//    └───(A)───┘
\end{minted}
\caption{Example relocation on 32-bit little-endian}
\label{fig:reloc}
\end{figure}

In effect, the abstract machine treats each allocation as a separate address space and represents
pointers as \rust{(address_space, offset)} pairs. This makes it easy to detect when pointer accesses
go out of bounds.

See \autoref{fig:reloc} for an example of a relocation. Variable \rust{b} points to the second
16-bit integer in \rust{a}, so it contains a relocation with offset 2 and target allocation
\rust{A}.
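
To make the mechanism concrete, the following sketch stores and reloads a pointer through a
relocation entry and rejects out-of-bounds accesses. It is a simplified illustration under
assumptions, not Miri's code: the \rust{Pointer} struct, the method names, the fixed 4-byte
little-endian pointer size (matching the figure), and the use of \rust{String} for errors are
hypothetical, and the undefined byte mask is omitted.

\begin{minted}[autogobble]{rust}
use std::collections::{BTreeMap, HashMap};

type AllocId = u64;
const POINTER_SIZE: usize = 4; // assumed 32-bit target

#[derive(Clone, Copy, Debug, PartialEq)]
struct Pointer {
    alloc_id: AllocId,
    offset: usize,
}

struct Allocation {
    bytes: Vec<u8>,
    relocations: BTreeMap<usize, AllocId>,
}

struct Memory {
    map: HashMap<AllocId, Allocation>,
    next_id: AllocId,
}

impl Memory {
    fn new() -> Self {
        Memory { map: HashMap::new(), next_id: 0 }
    }

    /// Creates a new allocation and returns a pointer to its first byte.
    fn allocate(&mut self, bytes: Vec<u8>) -> Pointer {
        let id = self.next_id;
        self.next_id += 1;
        self.map.insert(id, Allocation { bytes, relocations: BTreeMap::new() });
        Pointer { alloc_id: id, offset: 0 }
    }

    /// Checks that `len` bytes starting at `ptr` lie inside its allocation.
    fn check_bounds(&self, ptr: Pointer, len: usize) -> Result<(), String> {
        let alloc = self.map.get(&ptr.alloc_id).ok_or("dangling pointer")?;
        match ptr.offset.checked_add(len) {
            Some(end) if end <= alloc.bytes.len() => Ok(()),
            _ => Err(format!("out-of-bounds access at offset {}", ptr.offset)),
        }
    }

    /// Stores `target` at `dest`: the offset as raw bytes plus a relocation entry.
    fn write_ptr(&mut self, dest: Pointer, target: Pointer) -> Result<(), String> {
        self.check_bounds(dest, POINTER_SIZE)?;
        let alloc = self.map.get_mut(&dest.alloc_id).ok_or("dangling pointer")?;
        let offset_bytes = (target.offset as u32).to_le_bytes();
        alloc.bytes[dest.offset..dest.offset + POINTER_SIZE].copy_from_slice(&offset_bytes);
        alloc.relocations.insert(dest.offset, target.alloc_id);
        Ok(())
    }

    /// Reads a pointer back; fails if the bytes carry no relocation.
    fn read_ptr(&self, src: Pointer) -> Result<Pointer, String> {
        self.check_bounds(src, POINTER_SIZE)?;
        let alloc = &self.map[&src.alloc_id];
        let alloc_id = *alloc.relocations.get(&src.offset).ok_or("bytes are not a pointer")?;
        let mut raw = [0u8; POINTER_SIZE];
        raw.copy_from_slice(&alloc.bytes[src.offset..src.offset + POINTER_SIZE]);
        Ok(Pointer { alloc_id, offset: u32::from_le_bytes(raw) as usize })
    }
}
\end{minted}

Writing a pointer to the second element of the 6-byte allocation into a 4-byte allocation leaves
the bytes \rust{02 00 00 00} plus one relocation entry at index 0, as in \autoref{fig:reloc}.
Keeping the relocations in a \rust{BTreeMap} keyed by byte index keeps them ordered by position
within the allocation.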

\subsubsection{Undefined byte mask}

The final piece of an abstract allocation is the undefined byte mask. Logically, we store a boolean
for the definedness of every byte in the allocation, but there are multiple ways to make the storage
more compact. I tried two implementations: one based on the endpoints of alternating ranges of
defined and undefined bytes and the other based on a simple bitmask. The former is more compact, but
I found it surprisingly difficult to update cleanly. I currently use the bitmask system, which is
comparatively trivial.

See \autoref{fig:undef} for an example of an undefined byte, represented by underscores. Note that
there would still be a value for the second byte in the byte array, but we don't care what it is.
The bitmask would be $10_2$, i.e., \rust{[true, false]}.

\begin{figure}[hb]
\begin{minted}[autogobble]{rust}
let a: [u8; 2] = unsafe {
    [1, std::mem::uninitialized()]
};
// A: 01 __ (2 bytes)
\end{minted}
\caption{Example undefined byte}
\label{fig:undef}
\end{figure}
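
A bitmask of this kind might look like the sketch below. The field layout and the method names
\rust{set_range} and \rust{is_range_defined} are assumptions made for illustration and are not
taken from Miri; the sketch packs one definedness bit per byte into \rust{u64} blocks.

\begin{minted}[autogobble]{rust}
/// One bit per byte of the allocation; a set bit means "defined".
struct UndefMask {
    blocks: Vec<u64>,
    len: usize,
}

impl UndefMask {
    /// Creates a mask covering `len` bytes, all initially undefined.
    fn new(len: usize) -> Self {
        UndefMask { blocks: vec![0; (len + 63) / 64], len }
    }

    /// Marks every byte in `start..end` as defined or undefined.
    fn set_range(&mut self, start: usize, end: usize, defined: bool) {
        assert!(start <= end && end <= self.len, "range out of bounds");
        for i in start..end {
            let (block, bit) = (i / 64, i % 64);
            if defined {
                self.blocks[block] |= 1 << bit;
            } else {
                self.blocks[block] &= !(1 << bit);
            }
        }
    }

    /// Returns true only if every byte in `start..end` is defined.
    fn is_range_defined(&self, start: usize, end: usize) -> bool {
        assert!(start <= end && end <= self.len, "range out of bounds");
        (start..end).all(|i| (self.blocks[i / 64] & (1 << (i % 64))) != 0)
    }
}
\end{minted}

For the array in \autoref{fig:undef}, a mask of length 2 with only byte 0 marked as defined reports
the range \rust{0..1} as defined and any range that includes byte 1 as undefined, so a read of the
whole array can be flagged as touching undefined memory.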

% TODO(tsion): Find a place for this text.
% Making Miri work was primarily an implementation problem. Writing an interpreter which models values
% of varying sizes, stack and heap allocation, unsafe memory operations, and more requires some
% unconventional techniques compared to many interpreters. Miri's execution remains safe even while
% simulating execution of unsafe code, which allows it to detect when unsafe code does something
% invalid.

\blindtext

\section{Data layout}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\section{Future work}

Other possible uses for Miri include:

\begin{itemize}
  \item A graphical or text-mode debugger that steps through MIR execution one statement at a time,
    for figuring out why some compile-time execution is raising an error or simply learning how Rust
    works at a low level.
  \item A read-eval-print loop (REPL) for Rust may be easier to implement on top of Miri than on
    top of the usual LLVM back-end.
  \item An extended version of Miri, developed apart from the purpose of compile-time execution,
    could run foreign functions from C/C++ and generally have full access to the operating system.
    Such a version of Miri could be used to more quickly prototype changes to the Rust language that
    would otherwise require changes to the LLVM back-end.
  \item Miri might be useful for unit-testing the compiler by comparing the results of Miri's
    execution against the results of LLVM-compiled machine code's execution. This would help to
    guarantee that compile-time execution works the same as runtime execution.
\end{itemize}

\end{document}