rust/src/doc/intro.md
2015-03-22 15:40:46 -04:00

591 lines
20 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

% A 30-minute Introduction to Rust
Rust is a modern systems programming language focusing on safety and speed. It
accomplishes these goals by being memory safe without using garbage collection.
This introduction will give you a rough idea of what Rust is like, eliding many
details. It does not require prior experience with systems programming, but you
may find the syntax easier if you've used a "curly brace" programming language
before, like C or JavaScript. The concepts are more important than the syntax,
so don't worry if you don't get every last detail: you can read [The
Rust Programming Language](book/index.html) to get a more complete explanation.
Because this is about high-level concepts, you don't need to actually install
Rust to follow along. If you'd like to anyway, check out [the
homepage](http://rust-lang.org) for explanation.
To show off Rust, let's talk about how easy it is to get started with Rust.
Then, we'll talk about Rust's most interesting feature, *ownership*, and
then discuss how it makes concurrency easier to reason about. Finally,
we'll talk about how Rust breaks down the perceived dichotomy between speed
and safety.
# Tools
Getting started on a new Rust project is incredibly easy, thanks to Rust's
package manager, [Cargo](http://crates.io).
To start a new project with Cargo, use `cargo new`:
```{bash}
$ cargo new hello_world --bin
```
We're passing `--bin` because we're making a binary program: if we
were making a library, we'd leave it off.
Let's check out what Cargo has generated for us:
```{bash}
$ cd hello_world
$ tree .
.
├── Cargo.toml
└── src
└── main.rs
1 directory, 2 files
```
This is all we need to get started. First, let's check out `Cargo.toml`:
```{toml}
[package]
name = "hello_world"
version = "0.0.1"
authors = ["Your Name <you@example.com>"]
```
This is called a *manifest*, and it contains all of the metadata that Cargo
needs to compile your project.
Here's what's in `src/main.rs`:
```{rust}
fn main() {
println!("Hello, world!");
}
```
Cargo generated a "Hello World" for us. We'll talk more about the syntax here
later, but that's what Rust code looks like! Let's compile and run it:
```{bash}
$ cargo run
Compiling hello_world v0.0.1 (file:///Users/you/src/hello_world)
Running `target/hello_world`
Hello, world!
```
Using an external dependency in Rust is incredibly easy. You add a line to
your `Cargo.toml`:
```{toml}
[package]
name = "hello_world"
version = "0.0.1"
authors = ["Your Name <someone@example.com>"]
[dependencies.semver]
git = "https://github.com/rust-lang/semver.git"
```
You added the `semver` library, which parses version numbers and compares them
according to the [SemVer specification](http://semver.org/).
Now, you can pull in that library using `extern crate` in
`main.rs`.
```{rust,ignore}
extern crate semver;
use semver::Version;
fn main() {
assert!(Version::parse("1.2.3") == Ok(Version {
major: 1u64,
minor: 2u64,
patch: 3u64,
pre: vec!(),
build: vec!(),
}));
println!("Versions compared successfully!");
}
```
Again, we'll discuss the exact details of all of this syntax soon. For now,
let's compile and run it:
```{bash}
$ cargo run
Updating git repository `https://github.com/rust-lang/semver.git`
Compiling semver v0.0.1 (https://github.com/rust-lang/semver.git#bf739419)
Compiling hello_world v0.0.1 (file:///home/you/projects/hello_world)
Running `target/hello_world`
Versions compared successfully!
```
Because we only specified a repository without a version, if someone else were
to try out our project at a later date, when `semver` was updated, they would
get a different, possibly incompatible version. To solve this problem, Cargo
produces a file, `Cargo.lock`, which records the versions of any dependencies.
This gives us repeatable builds.
There is a lot more here, and this is a whirlwind tour, but you should feel
right at home if you've used tools like [Bundler](http://bundler.io/),
[npm](https://www.npmjs.org/), or [pip](https://pip.pypa.io/en/latest/).
There's no `Makefile`s or endless `autotools` output here. (Rust's tooling does
[play nice with external libraries written in those
tools](http://doc.crates.io/build-script.html), if you need to.)
Enough about tools, let's talk code!
# Ownership
Rust's defining feature is "memory safety without garbage collection". Let's
take a moment to talk about what that means. *Memory safety* means that the
programming language eliminates certain kinds of bugs, such as [buffer
overflows](http://en.wikipedia.org/wiki/Buffer_overflow) and [dangling
pointers](http://en.wikipedia.org/wiki/Dangling_pointer). These problems occur
when you have unrestricted access to memory. As an example, here's some Ruby
code:
```{ruby}
v = []
v.push("Hello")
x = v[0]
v.push("world")
puts x
```
We make an array, `v`, and then call `push` on it. `push` is a method which
adds an element to the end of an array.
Next, we make a new variable, `x`, that's equal to the first element of
the array. Simple, but this is where the "bug" will appear.
Let's keep going. We then call `push` again, pushing "world" onto the
end of the array. `v` now is `["Hello", "world"]`.
Finally, we print `x` with the `puts` method. This prints "Hello."
All good? Let's go over a similar, but subtly different example, in C++:
```{cpp}
#include<iostream>
#include<vector>
#include<string>
int main() {
std::vector<std::string> v;
v.push_back("Hello");
std::string& x = v[0];
v.push_back("world");
std::cout << x;
}
```
It's a little more verbose due to the static typing, but it's almost the same
thing. We make a `std::vector` of `std::string`s, we call `push_back` (same as
`push`) on it, take a reference to the first element of the vector, call
`push_back` again, and then print out the reference.
There's two big differences here: one, they're not _exactly_ the same thing,
and two...
```{bash}
$ g++ hello.cpp -Wall -Werror
$ ./a.out
Segmentation fault (core dumped)
```
A crash! (Note that this is actually system-dependent. Because referring to an
invalid reference is undefined behavior, the compiler can do anything,
including the right thing!) Even though we compiled with flags to give us as
many warnings as possible, and to treat those warnings as errors, we got no
errors. When we ran the program, it crashed.
Why does this happen? When we append to an array, its length changes. Since
its length changes, we may need to allocate more memory. In Ruby, this happens
as well, we just don't think about it very often. So why does the C++ version
segfault when we allocate more memory?
The answer is that in the C++ version, `x` is a *reference* to the memory
location where the first element of the array is stored. But in Ruby, `x` is a
standalone value, not connected to the underlying array at all. Let's dig into
the details for a moment. Your program has access to memory, provided to it by
the operating system. Each location in memory has an address. So when we make
our vector, `v`, it's stored in a memory location somewhere:
| location | name | value |
|----------|------|-------|
| 0x30 | v | |
(Address numbers made up, and in hexadecimal. Those of you with deep C++
knowledge, there are some simplifications going on here, like the lack of an
allocated length for the vector. This is an introduction.)
When we push our first string onto the array, we allocate some memory,
and `v` refers to it:
| location | name | value |
|----------|------|----------|
| 0x30 | v | 0x18 |
| 0x18 | | "Hello" |
We then make a reference to that first element. A reference is a variable
that points to a memory location, so its value is the memory location of
the `"Hello"` string:
| location | name | value |
|----------|------|----------|
| 0x30 | v | 0x18 |
| 0x18 | | "Hello" |
| 0x14 | x | 0x18 |
When we push `"world"` onto the vector with `push_back`, there's no room:
we only allocated one element. So, we need to allocate two elements,
copy the `"Hello"` string over, and update the reference. Like this:
| location | name | value |
|----------|------|----------|
| 0x30 | v | 0x08 |
| 0x18 | | GARBAGE |
| 0x14 | x | 0x18 |
| 0x08 | | "Hello" |
| 0x04 | | "world" |
Note that `v` now refers to the new list, which has two elements. It's all
good. But our `x` didn't get updated! It still points at the old location,
which isn't valid anymore. In fact, [the documentation for `push_back` mentions
this](http://en.cppreference.com/w/cpp/container/vector/push_back):
> If the new `size()` is greater than `capacity()` then all iterators and
> references (including the past-the-end iterator) are invalidated.
Finding where these iterators and references are is a difficult problem, and
even in this simple case, `g++` can't help us here. While the bug is obvious in
this case, in real code, it can be difficult to track down the source of the
error.
Before we talk about this solution, why didn't our Ruby code have this problem?
The semantics are a little more complicated, and explaining Ruby's internals is
out of the scope of a guide to Rust. But in a nutshell, Ruby's garbage
collector keeps track of references, and makes sure that everything works as
you might expect. This comes at an efficiency cost, and the internals are more
complex. If you'd really like to dig into the details, [this
article](http://patshaughnessy.net/2012/1/18/seeing-double-how-ruby-shares-string-values)
can give you more information.
Garbage collection is a valid approach to memory safety, but Rust chooses a
different path. Let's examine what the Rust version of this looks like:
```{rust,ignore}
fn main() {
let mut v = vec![];
v.push("Hello");
let x = &v[0];
v.push("world");
println!("{}", x);
}
```
This looks like a bit of both: fewer type annotations, but we do create new
variables with `let`. The method name is `push`, some other stuff is different,
but it's pretty close. So what happens when we compile this code? Does Rust
print `"Hello"`, or does Rust crash?
Neither. It refuses to compile:
```bash
$ cargo run
Compiling hello_world v0.0.1 (file:///Users/you/src/hello_world)
main.rs:8:5: 8:6 error: cannot borrow `v` as mutable because it is also borrowed as immutable
main.rs:8 v.push("world");
^
main.rs:6:14: 6:15 note: previous borrow of `v` occurs here; the immutable borrow prevents subsequent moves or mutable borrows of `v` until the borrow ends
main.rs:6 let x = &v[0];
^
main.rs:11:2: 11:2 note: previous borrow ends here
main.rs:1 fn main() {
...
main.rs:11 }
^
error: aborting due to previous error
```
When we try to mutate the array by `push`ing it the second time, Rust throws
an error. It says that we "cannot borrow v as mutable because it is also
borrowed as immutable." What does it mean by "borrowed"?
In Rust, the type system encodes the notion of *ownership*. The variable `v`
is an *owner* of the vector. When we make a reference to `v`, we let that
variable (in this case, `x`) *borrow* it for a while. Just like if you own a
book, and you lend it to me, I'm borrowing the book.
So, when I try to modify the vector with the second call to `push`, I need
to be owning it. But `x` is borrowing it. You can't modify something that
you've lent to someone. And so Rust throws an error.
So how do we fix this problem? Well, we can make a copy of the element:
```{rust}
fn main() {
let mut v = vec![];
v.push("Hello");
let x = v[0].clone();
v.push("world");
println!("{}", x);
}
```
Note the addition of `clone()`. This creates a copy of the element, leaving
the original untouched. Now, we no longer have two references to the same
memory, and so the compiler is happy. Let's give that a try:
```{bash}
$ cargo run
Compiling hello_world v0.0.1 (file:///Users/you/src/hello_world)
Running `target/hello_world`
Hello
```
Same result. Now, making a copy can be inefficient, so this solution may not be
acceptable. There are other ways to get around this problem, but this is a toy
example, and because we're in an introduction, we'll leave that for later.
The point is, the Rust compiler and its notion of ownership has saved us from a
bug that would crash the program. We've achieved safety, at compile time,
without needing to rely on a garbage collector to handle our memory.
# Concurrency
Rust's ownership model can help in other ways, as well. For example, take
concurrency. Concurrency is a big topic, and an important one for any modern
programming language. Let's take a look at how ownership can help you write
safe concurrent programs.
Here's an example of a concurrent Rust program:
```{rust}
use std::thread;
fn main() {
let guards: Vec<_> = (0..10).map(|_| {
thread::scoped(|| {
println!("Hello, world!");
})
}).collect();
}
```
This program creates ten threads, which all print `Hello, world!`. The `scoped`
function takes one argument, a closure, indicated by the double bars `||`. This
closure is executed in a new thread created by `scoped`. The method is called
`scoped` because it returns a 'join guard', which will automatically join the
child thread when it goes out of scope. Because we `collect` these guards into
a `Vec<T>`, and that vector goes out of scope at the end of our program, our
program will wait for every thread to finish before finishing.
One common form of problem in concurrent programs is a *data race*.
This occurs when two different threads attempt to access the same
location in memory in a non-synchronized way, where at least one of
them is a write. If one thread is attempting to read, and one thread
is attempting to write, you cannot be sure that your data will not be
corrupted. Note the first half of that requirement: two threads that
attempt to access the same location in memory. Rust's ownership model
can track which pointers own which memory locations, which solves this
problem.
Let's see an example. This Rust code will not compile:
```{rust,ignore}
use std::thread;
fn main() {
let mut numbers = vec![1, 2, 3];
let guards: Vec<_> = (0..3).map(|i| {
thread::scoped(move || {
numbers[i] += 1;
println!("numbers[{}] is {}", i, numbers[i]);
})
}).collect();
}
```
It gives us this error:
```text
7:25: 10:6 error: cannot move out of captured outer variable in an `FnMut` closure
7 thread::scoped(move || {
8 numbers[i] += 1;
9 println!("numbers[{}] is {}", i, numbers[i]);
10 })
error: aborting due to previous error
```
This is a little confusing because there are two closures here: the one passed
to `map`, and the one passed to `thread::scoped`. In this case, the closure for
`thread::scoped` is attempting to reference `numbers`, a `Vec<i32>`. This
closure is a `FnOnce` closure, as thats what `thread::scoped` takes as an
argument. `FnOnce` closures take ownership of their environment. Thats fine,
but theres one detail: because of `map`, were going to make three of these
closures. And since all three try to take ownership of `numbers`, that would be
a problem. Thats what it means by cannot move out of captured outer
variable: our `thread::scoped` closure wants to take ownership, and it cant,
because the closure for `map` wont let it.
What to do here? Rust has two types that helps us: `Arc<T>` and `Mutex<T>`.
*Arc* stands for "atomically reference counted". In other words, an Arc will
keep track of the number of references to something, and not free the
associated resource until the count is zero. The *atomic* portion refers to an
Arc's usage of concurrency primitives to atomically update the count, making it
safe across threads. If we use an Arc, we can have our three references. But,
an Arc does not allow mutable borrows of the data it holds, and we want to
modify what we're sharing. In this case, we can use a `Mutex<T>` inside of our
Arc. A Mutex will synchronize our accesses, so that we can ensure that our
mutation doesn't cause a data race.
Here's what using an Arc with a Mutex looks like:
```{rust}
use std::thread;
use std::sync::{Arc,Mutex};
fn main() {
let numbers = Arc::new(Mutex::new(vec![1, 2, 3]));
let guards: Vec<_> = (0..3).map(|i| {
let number = numbers.clone();
thread::scoped(move || {
let mut array = number.lock().unwrap();
array[i] += 1;
println!("numbers[{}] is {}", i, array[i]);
})
}).collect();
}
```
We first have to `use` the appropriate library, and then we wrap our vector in
an Arc with the call to `Arc::new()`. Inside of the loop, we make a new
reference to the Arc with the `clone()` method. This will increment the
reference count. When each new `numbers` variable binding goes out of scope, it
will decrement the count. The `lock()` call will return us a reference to the
value inside the Mutex, and block any other calls to `lock()` until said
reference goes out of scope.
We can compile and run this program without error, and in fact, see the
non-deterministic aspect:
```{shell}
$ cargo run
Compiling hello_world v0.0.1 (file:///Users/you/src/hello_world)
Running `target/hello_world`
numbers[1] is 3
numbers[0] is 2
numbers[2] is 4
$ cargo run
Running `target/hello_world`
numbers[2] is 4
numbers[1] is 3
numbers[0] is 2
```
Each time, we can get a slightly different output because the threads are not
guaranteed to run in any set order. If you get the same order every time it is
because each of these threads are very small and complete too fast for their
indeterminate behavior to surface.
The important part here is that the Rust compiler was able to use ownership to
give us assurance _at compile time_ that we weren't doing something incorrect
with regards to concurrency. In order to share ownership, we were forced to be
explicit and use a mechanism to ensure that it would be properly handled.
# Safety _and_ Speed
Safety and speed are always presented as a continuum. At one end of the spectrum,
you have maximum speed, but no safety. On the other end, you have absolute safety
with no speed. Rust seeks to break out of this paradigm by introducing safety at
compile time, ensuring that you haven't done anything wrong, while compiling to
the same low-level code you'd expect without the safety.
As an example, Rust's ownership system is _entirely_ at compile time. The
safety check that makes this an error about moved values:
```{rust,ignore}
use std::thread;
fn main() {
let numbers = vec![1, 2, 3];
let guards: Vec<_> = (0..3).map(|i| {
thread::scoped(move || {
println!("{}", numbers[i]);
})
}).collect();
}
```
carries no runtime penalty. And while some of Rust's safety features do have
a run-time cost, there's often a way to write your code in such a way that
you can remove it. As an example, this is a poor way to iterate through
a vector:
```{rust}
let vec = vec![1, 2, 3];
for i in 0..vec.len() {
println!("{}", vec[i]);
}
```
The reason is that the access of `vec[i]` does bounds checking, to ensure
that we don't try to access an invalid index. However, we can remove this
while retaining safety. The answer is iterators:
```{rust}
let vec = vec![1, 2, 3];
for x in &vec {
println!("{}", x);
}
```
This version uses an iterator that yields each element of the vector in turn.
Because we have a reference to the element, rather than the whole vector itself,
there's no array access bounds to check.
# Learning More
I hope that this taste of Rust has given you an idea if Rust is the right
language for you. We talked about Rust's tooling, how encoding ownership into
the type system helps you find bugs, how Rust can help you write correct
concurrent code, and how you don't have to pay a speed cost for much of this
safety.
To continue your Rustic education, read [The Rust Programming
Language](book/index.html) for a more in-depth exploration of Rust's syntax and
concepts.