rust/doc/tutorial/data.md
Marijn Haverbeke 769e9b669b Write briefly about syntax extension in the syntax section
The currently existing syntax extension facilities don't really merit
their own section.
2011-11-02 13:27:29 +01:00


Datatypes

Rust datatypes are, by default, immutable. The core datatypes of Rust are structural records and 'tags' (tagged unions, algebraic data types).

type point = {x: float, y: float};
tag shape {
    circle(point, float);
    rectangle(point, point);
}
let my_shape = circle({x: 0.0, y: 0.0}, 10.0);

Records

Rust record types are written {field1: TYPE, field2: TYPE [, ...]}, and record literals are written in the same way, but with expressions instead of types. They are quite similar to C structs, and even laid out the same way in memory (so you can read from a Rust struct in C, and vice-versa).

The dot operator is used to access record fields (mypoint.x).

Fields that you want to mutate must be explicitly marked as such. For example...

type stack = {content: [int], mutable head: uint};

With such a type, you can do mystack.head += 1u. When the mutable is omitted from the type, such an assignment would result in a type error.

To 'update' an immutable record, you use functional record update syntax, by ending a record literal with the keyword with:

let oldpoint = {x: 10f, y: 20f};
let newpoint = {x: 0f with oldpoint};
assert newpoint == {x: 0f, y: 20f};

This will create a new struct, copying all the fields from oldpoint into it, except for the ones that are explicitly set in the literal.
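The record syntax shown here predates current Rust compilers. As a rough sketch of the same idea in present-day Rust (an assumption about your toolchain; the `Point` struct and `with_zero_x` helper are made up for illustration), functional update is spelled with `..` instead of `with`:

```rust
// Modern-Rust analogue of `{x: 0f with oldpoint}`; not the syntax of this tutorial.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

// Builds a new Point with `x` set explicitly and every other field copied from `old`.
fn with_zero_x(old: Point) -> Point {
    Point { x: 0.0, ..old }
}

fn main() {
    let oldpoint = Point { x: 10.0, y: 20.0 };
    let newpoint = with_zero_x(oldpoint);
    // The original value is unchanged; only the new one differs.
    assert_eq!(oldpoint, Point { x: 10.0, y: 20.0 });
    assert_eq!(newpoint, Point { x: 0.0, y: 20.0 });
    println!("{:?}", newpoint);
}
```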

Rust record types are structural. This means that {x: float, y: float} is not just a way to define a new type, but is the actual name of the type. Record types can be used without first defining them. If module A defines type point = {x: float, y: float}, and module B, without knowing anything about A, defines a function that returns an {x: float, y: float}, you can use that return value as a point in module A. (Remember that type defines an additional name for a type, not an actual new type.)

Record patterns

Records can be destructured in alt patterns. The basic syntax is {fieldname: pattern, ...}, but the pattern for a field can be omitted as a shorthand for simply binding the variable with the same name as the field.

alt mypoint {
    {x: 0f, y: y_name} { /* Provide sub-patterns for fields */ }
    {x, y}             { /* Simply bind the fields */ }
}

The field names of a record do not have to appear in a pattern in the same order they appear in the type. When you are not interested in all the fields of a record, a record pattern may end with , _ (as in {field1, _}) to indicate that you're ignoring all other fields.
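For comparison, here is a hedged sketch of the same three pattern forms in present-day Rust, where `alt` became `match` and the "ignore the rest" marker is `..` rather than `_`. The `Point` struct (with integer fields, to keep the literal patterns simple) and the `describe` function are hypothetical.

```rust
// Modern-Rust analogue of record patterns; names and field types are illustrative.
struct Point {
    x: i32,
    y: i32,
}

fn describe(p: &Point) -> String {
    match *p {
        // Sub-patterns for fields: a literal for x, a fresh name for y.
        Point { x: 0, y: y_name } => format!("on the y axis at {}", y_name),
        // Fields may appear in any order; `..` ignores the remaining ones.
        Point { y: 0, .. } => String::from("on the x axis"),
        // Shorthand: bind each field to a variable of the same name.
        Point { x, y } => format!("at ({}, {})", x, y),
    }
}

fn main() {
    println!("{}", describe(&Point { x: 0, y: 5 }));
    println!("{}", describe(&Point { x: 3, y: 0 }));
    println!("{}", describe(&Point { x: 1, y: 2 }));
}
```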

Tags

Tags [FIXME terminology] are datatypes that have several different representations. For example, the type shown earlier:

tag shape {
    circle(point, float);
    rectangle(point, point);
}

A value of this type is either a circle, in which case it contains a point record and a float, or a rectangle, in which case it contains two point records. The run-time representation of such a value includes an identifier of the actual form that it holds, much like the 'tagged union' pattern in C, but with better ergonomics.

The above declaration will define a type shape that can be used to refer to such shapes, and two functions, circle and rectangle, which can be used to construct values of the type (taking arguments of the specified types). So circle({x: 0f, y: 0f}, 10f) is the way to create a new circle.

Tag variants do not have to have parameters. This, for example, is equivalent to an enum in C:

tag direction {
    north;
    east;
    south;
    west;
};

This will define north, east, south, and west as constants, all of which have type direction.

There is a special case for tags with a single variant. These are used to define new types in such a way that the new name is not just a synonym for an existing type, but its own distinct type. If you say:

tag gizmo_id = int;

That is a shorthand for this:

tag gizmo_id { gizmo_id(int); }

Tag types like this can have their content extracted with the dereference (*) unary operator:

let my_gizmo_id = gizmo_id(10);
let id_int: int = *my_gizmo_id;
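The single-variant shorthand has a close relative in present-day Rust: the tuple struct. The sketch below assumes a modern compiler; `GizmoId` and `raw_id` are illustrative names, and the content is read with `.0` rather than with `*`.

```rust
// Modern-Rust analogue of `tag gizmo_id = int;` (an assumption, not this tutorial's syntax).
#[derive(Debug, Clone, Copy, PartialEq)]
struct GizmoId(i32);

// Extracts the wrapped integer; plays the role of `*my_gizmo_id` above.
fn raw_id(id: GizmoId) -> i32 {
    id.0
}

fn main() {
    let my_gizmo_id = GizmoId(10);
    let id_int: i32 = raw_id(my_gizmo_id);
    assert_eq!(id_int, 10);
    // GizmoId is a distinct type: passing a plain i32 to raw_id would not compile.
    println!("{:?} wraps {}", my_gizmo_id, id_int);
}
```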

Tag patterns

For tag types with multiple variants, destructuring is the only way to get at their contents. All variant constructors can be used as patterns, as in this definition of area:

fn area(sh: shape) -> float {
    alt sh {
        circle(_, size) { std::math::pi * size * size }
        rectangle({x, y}, {x: x2, y: y2}) { (x2 - x) * (y2 - y) }
    }
}

For variants without arguments, you have to write variantname. (with a dot at the end) to match them in a pattern. This is to prevent ambiguity between matching a variant name and binding a new variable.

fn point_from_direction(dir: direction) -> point {
    alt dir {
        north. { {x:  0f, y:  1f} }
        east.  { {x:  1f, y:  0f} }
        south. { {x:  0f, y: -1f} }
        west.  { {x: -1f, y:  0f} }
    }
}
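As a point of comparison, the two functions above can be sketched in present-day Rust, where tags are called enums and `alt` is `match`. This is an assumption-laden translation, not the syntax of this tutorial: the capitalised names are chosen for illustration, and because variants are written with a path (`Direction::North`), no trailing dot is needed to tell them apart from variable bindings.

```rust
use std::f64::consts::PI;

#[derive(Debug, Clone, Copy, PartialEq)]
struct Point {
    x: f64,
    y: f64,
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Shape {
    Circle(Point, f64),
    Rectangle(Point, Point),
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Direction {
    North,
    East,
    South,
    West,
}

// Variant constructors double as patterns, nested record patterns included.
fn area(sh: Shape) -> f64 {
    match sh {
        Shape::Circle(_, radius) => PI * radius * radius,
        Shape::Rectangle(Point { x, y }, Point { x: x2, y: y2 }) => (x2 - x) * (y2 - y),
    }
}

// Argument-less variants are matched by their path.
fn point_from_direction(dir: Direction) -> Point {
    match dir {
        Direction::North => Point { x: 0.0, y: 1.0 },
        Direction::East => Point { x: 1.0, y: 0.0 },
        Direction::South => Point { x: 0.0, y: -1.0 },
        Direction::West => Point { x: -1.0, y: 0.0 },
    }
}

fn main() {
    let origin = Point { x: 0.0, y: 0.0 };
    let rect = Shape::Rectangle(origin, Point { x: 2.0, y: 3.0 });
    println!("rectangle area: {}", area(rect));
    println!("circle area: {}", area(Shape::Circle(origin, 1.0)));
    for dir in [Direction::North, Direction::East, Direction::South, Direction::West] {
        println!("{:?} -> {:?}", dir, point_from_direction(dir));
    }
}
```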

Tuples

Tuples in Rust behave exactly like records, except that their fields do not have names (and thus cannot be accessed with dot notation). Tuples can have any arity except for 0 or 1 (though you may regard nil, (), as the empty tuple if you like).

let mytup: (int, int, float) = (10, 20, 30.0);
alt mytup {
  (a, b, c) { log a + b + (c as int); }
}
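The same destructuring can be sketched in present-day Rust (an assumption about your compiler; `sum_fields` is a hypothetical helper). One difference worth knowing: modern tuples do allow positional dot access such as `mytup.0`, which the tuples described here do not.

```rust
// Modern-Rust analogue of the tuple example; not the syntax of this tutorial.
fn sum_fields(t: (i32, i32, f64)) -> i32 {
    // A `let` pattern destructures the tuple in one step.
    let (a, b, c) = t;
    a + b + (c as i32)
}

fn main() {
    let mytup: (i32, i32, f64) = (10, 20, 30.0);
    assert_eq!(sum_fields(mytup), 60);
    // Positional access exists in modern Rust only.
    assert_eq!(mytup.0, 10);
    println!("{}", sum_fields(mytup));
}
```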

Pointers

In contrast to a lot of modern languages, record and tag types in Rust are not represented as pointers to allocated memory. They are, like in C and C++, represented directly. This means that if you let x = {x: 1f, y: 1f};, you are creating a record on the stack. If you then copy it into a data structure, the whole record is copied, not just a pointer.

For small records like point, this is usually still more efficient than allocating memory and going through a pointer. But for big records, or records with mutable fields, it can be useful to have a single copy on the heap, and refer to that through a pointer.

Rust supports several types of pointers. The simplest is the unsafe pointer, written *TYPE, which is a completely unchecked pointer type only used in unsafe code (and thus, in typical Rust code, very rarely). The safe pointer types are @TYPE for shared, reference-counted boxes, and ~TYPE, for uniquely-owned pointers.

All pointer types can be dereferenced with the * unary operator.

Shared boxes

Shared boxes are pointers to heap-allocated, reference-counted memory. A cycle collector ensures that circular references do not result in memory leaks.

Creating a shared box is done by simply applying the unary @ operator to an expression. The result of the expression will be boxed, resulting in a box of the right type. For example:

let x = @10; // New box, refcount of 1
let y = x; // Copy the pointer, increase refcount
// When x and y go out of scope, refcount goes to 0, box is freed

NOTE: We may in the future switch to garbage collection, rather than reference counting, for shared boxes.

Shared boxes never cross task boundaries.
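The reference counting described in the comments above can be observed directly in present-day Rust, whose closest counterpart to `@TYPE` is `std::rc::Rc`. Treat this as a sketch under that assumption: `refcounts` is an illustrative helper, and unlike the shared boxes described here, `Rc` has no cycle collector, so reference cycles do leak. Like shared boxes, an `Rc` cannot be sent to another thread.

```rust
use std::rc::Rc;

// Returns the strong count before a copy, while the copy is alive, and after it is dropped.
fn refcounts() -> (usize, usize, usize) {
    let x = Rc::new(10); // New box, refcount of 1
    let before = Rc::strong_count(&x);
    let y = Rc::clone(&x); // Copy the pointer, increase refcount
    let during = Rc::strong_count(&x);
    drop(y); // One owner gone, refcount decreases
    let after = Rc::strong_count(&x);
    (before, during, after)
    // When x goes out of scope here, the count reaches 0 and the box is freed.
}

fn main() {
    assert_eq!(refcounts(), (1, 2, 1));
    println!("{:?}", refcounts());
}
```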

Unique boxes

In contrast to shared boxes, unique boxes are not reference counted. Instead, it is statically guaranteed that only a single owner of the box exists at any time.

let x = ~10;
let y <- x;

This is where the 'move' (<-) operator comes in. It is similar to =, but it de-initializes its source. Thus, the unique box can move from x to y, without violating the constraint that it only has a single owner.

NOTE: If you do y = x instead, the box will be copied. We should emit a warning for this, or disallow it entirely, but do not currently do so.

Unique boxes, when they do not contain any shared boxes, can be sent to other tasks. The sending task will give up ownership of the box, and won't be able to access it afterwards. The receiving task will become the sole owner of the box.
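A hedged sketch of single ownership and hand-off between tasks, written for present-day Rust, where `~TYPE` corresponds roughly to `Box<T>`, plain `=` already moves, and tasks are approximated by threads. The `send_box` function is made up for illustration.

```rust
use std::thread;

// Moves a unique box twice: once between local variables, once into another thread.
fn send_box() -> i32 {
    let x = Box::new(10);
    let y = x; // Ownership moves to y; using x after this line would not compile.

    // `move` hands the box to the new thread, which becomes its sole owner.
    let handle = thread::spawn(move || *y + 1);

    // The spawning thread can no longer touch the box, only the thread's result.
    handle.join().unwrap()
}

fn main() {
    assert_eq!(send_box(), 11);
    println!("{}", send_box());
}
```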

Mutability

All pointer types have a mutable variant, written @mutable TYPE or ~mutable TYPE. Given such a pointer, you can write to its contents by combining the dereference operator with a mutating action.

fn increase_contents(pt: @mutable int) {
    *pt += 1;
}
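Present-day Rust has no `@mutable`, but a shared, mutable integer can be approximated with `Rc<Cell<i32>>`. The sketch below rests on that assumption and is not a literal translation; it shows that a write through one pointer is visible through every other pointer to the same box.

```rust
use std::cell::Cell;
use std::rc::Rc;

// Rough counterpart of increase_contents(pt: @mutable int).
fn increase_contents(pt: &Rc<Cell<i32>>) {
    pt.set(pt.get() + 1);
}

fn main() {
    let a = Rc::new(Cell::new(1));
    let b = Rc::clone(&a); // Second pointer to the same box
    increase_contents(&a);
    // The change made through `a` is observed through `b`.
    assert_eq!(b.get(), 2);
    println!("{}", b.get());
}
```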

Vectors

Rust vectors are always heap-allocated and unique. A value of type [TYPE] is represented by a pointer to a section of heap memory containing any number of TYPE values.

NOTE: This uniqueness is turning out to be quite awkward in practice, and might change.

Vector literals are enclosed in square brackets. Indexing is also done with square brackets, and is zero-based:

let myvec = [true, false, true, false];
if myvec[1] { std::io::println("boom"); }

By default, vectors are immutable: you cannot replace their elements. The type written as [mutable TYPE] is a vector with mutable elements. Mutable vector literals are written [mutable] (empty) or [mutable 1, 2, 3] (with elements).

Growing a vector in Rust is not as inefficient as it looks (the + operator means concatenation when applied to vector types):

let myvec = [], i = 0;
while i < 100 {
    myvec += [i];
    i += 1;
}

Because a vector is unique, replacing it with a longer one (which is what += [i] does) is indistinguishable from appending to it in place. Vector representations are optimized so that the number of reallocations grows only logarithmically with the vector's length, so the above code generates about the same amount of copying and reallocation as push implementations in most other languages.
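To make the reallocation claim concrete, here is a small experiment written for present-day Rust, using `Vec::push` as a stand-in for `+= [i]`. It is only a sketch: `count_reallocations` is a hypothetical helper, and the exact count depends on the standard library's growth strategy, which is not specified, so no particular number is promised.

```rust
// Counts how often the vector's capacity changes while n elements are appended.
fn count_reallocations(n: usize) -> usize {
    let mut v: Vec<usize> = Vec::new();
    let mut reallocations = 0;
    let mut capacity = v.capacity();
    for i in 0..n {
        v.push(i);
        if v.capacity() != capacity {
            // The buffer was replaced by a larger one.
            reallocations += 1;
            capacity = v.capacity();
        }
    }
    reallocations
}

fn main() {
    let r = count_reallocations(100);
    // Far fewer reallocations than appends, because capacity grows in large steps.
    assert!(r >= 1 && r < 100);
    println!("100 appends caused {} reallocations", r);
}
```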

Strings

The str type in Rust is represented exactly the same way as a vector of bytes ([u8]), except that it is guaranteed to have a trailing null byte (for interoperability with C APIs).

This sequence of bytes is interpreted as a UTF-8 encoded sequence of characters. This has the advantage that UTF-8 encoded I/O (which should really be the goal for modern systems) is very fast, and that strings have, for most intents and purposes, a nicely compact representation. It has the disadvantage that you only get constant-time access by byte, not by character.

A lot of algorithms don't need constant-time indexed access (they iterate over all characters, which std::str::chars helps with), and for those that do, many don't need actual characters, and can operate on bytes. For algorithms that do really need to index by character, there's the option to convert your string to a character vector (using std::str::to_chars).
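The byte-versus-character trade-off looks like this in present-day Rust, whose strings are also UTF-8 (an assumption about your toolchain; `nth_char` and `to_chars` are illustrative helpers, the latter standing in for std::str::to_chars):

```rust
// Linear-time lookup of the n-th character by walking the UTF-8 data.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

// Converts to a character vector, which then allows constant-time indexing.
fn to_chars(s: &str) -> Vec<char> {
    s.chars().collect()
}

fn main() {
    // U+00E9 (e with acute accent) takes two bytes in UTF-8.
    let s = "h\u{e9}llo";
    assert_eq!(s.len(), 6); // Length in bytes
    assert_eq!(s.chars().count(), 5); // Length in characters
    assert_eq!(s.as_bytes()[0], b'h'); // Constant-time access by byte
    assert_eq!(nth_char(s, 1), Some('\u{e9}'));
    let chars = to_chars(s);
    assert_eq!(chars[4], 'o'); // Constant-time access by character
    println!("{} bytes, {} chars", s.len(), chars.len());
}
```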

Like vectors, strings are always unique. You can wrap them in a shared box to share them. Unlike vectors, there is no mutable variant of strings. They are always immutable.

Resources

Resources are data types that have a destructor associated with them.

resource file_desc(fd: int) {
    close_file_desc(fd);
}

This defines a type file_desc and a constructor of the same name, which takes an integer. Values of such a type cannot be copied, and when they are destroyed (by going out of scope, or, when boxed, when their box is cleaned up), their body runs. In the example above, this would cause the given file descriptor to be closed.
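In present-day Rust, destructors are written by implementing the `Drop` trait. The following sketch assumes that setting; `FileDesc` and `closed_after_scope` are hypothetical, and instead of really closing a descriptor the destructor records the number in a shared log, so that the order of destruction can be inspected.

```rust
use std::cell::RefCell;
use std::rc::Rc;

struct FileDesc {
    fd: i32,
    log: Rc<RefCell<Vec<i32>>>,
}

impl Drop for FileDesc {
    // Runs when the value is destroyed; stands in for close_file_desc(fd).
    fn drop(&mut self) {
        self.log.borrow_mut().push(self.fd);
    }
}

// Creates two descriptors in an inner scope and reports which were "closed", in order.
fn closed_after_scope() -> Vec<i32> {
    let log = Rc::new(RefCell::new(Vec::new()));
    {
        let _a = FileDesc { fd: 3, log: Rc::clone(&log) };
        let _b = FileDesc { fd: 4, log: Rc::clone(&log) };
        // Both values go out of scope at the closing brace.
    }
    let closed = log.borrow().clone();
    closed
}

fn main() {
    // Locals are dropped in reverse order of declaration.
    assert_eq!(closed_after_scope(), vec![4, 3]);
    println!("{:?}", closed_after_scope());
}
```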

NOTE: We're considering alternative approaches for data types with destructors. Resources might go away in the future.