rust/src/doc/grammar.md
Kyle Robinson Young a6aa6456ee doc: fixing typos
2015-10-10 09:15:55 -07:00

797 lines
18 KiB
Markdown

% Grammar
# Introduction
This document is the primary reference for the Rust programming language grammar. It
provides only one kind of material:
- Chapters that formally define the language grammar.
This document does not serve as an introduction to the language. Background
familiarity with the language is assumed. A separate [guide] is available to
help acquire such background.
This document also does not serve as a reference to the [standard] library
included in the language distribution. Those libraries are documented
separately by extracting documentation attributes from their source code. Many
of the features that one might expect to be language features are library
features in Rust, so what you're looking for may be there, not here.
[guide]: guide.html
[standard]: std/index.html
# Notation
Rust's grammar is defined over Unicode codepoints, each conventionally denoted
`U+XXXX`, for 4 or more hexadecimal digits `X`. _Most_ of Rust's grammar is
confined to the ASCII range of Unicode, and is described in this document by a
dialect of Extended Backus-Naur Form (EBNF), specifically a dialect of EBNF
supported by common automated LL(k) parsing tools such as `llgen`, rather than
the dialect given in ISO 14977. The dialect can be defined self-referentially
as follows:
```antlr
grammar : rule + ;
rule : nonterminal ':' productionrule ';' ;
productionrule : production [ '|' production ] * ;
production : term * ;
term : element repeats ;
element : LITERAL | IDENTIFIER | '[' productionrule ']' ;
repeats : [ '*' | '+' ] NUMBER ? | NUMBER ? | '?' ;
```
Where:
- Whitespace in the grammar is ignored.
- Square brackets are used to group rules.
- `LITERAL` is a single printable ASCII character, or an escaped hexadecimal
ASCII code of the form `\xQQ`, in single quotes, denoting the corresponding
Unicode codepoint `U+00QQ`.
- `IDENTIFIER` is a nonempty string of ASCII letters and underscores.
- The `repeat` forms apply to the adjacent `element`, and are as follows:
- `?` means zero or one repetition
- `*` means zero or more repetitions
- `+` means one or more repetitions
- NUMBER trailing a repeat symbol gives a maximum repetition count
- NUMBER on its own gives an exact repetition count
This EBNF dialect should hopefully be familiar to many readers.
## Unicode productions
A few productions in Rust's grammar permit Unicode codepoints outside the ASCII
range. We define these productions in terms of character properties specified
in the Unicode standard, rather than in terms of ASCII-range codepoints. The
section [Special Unicode Productions](#special-unicode-productions) lists these
productions.
## String table productions
Some rules in the grammar — notably [unary
operators](#unary-operator-expressions), [binary
operators](#binary-operator-expressions), and [keywords](#keywords) — are
given in a simplified form: as a listing of a table of unquoted, printable
whitespace-separated strings. These cases form a subset of the rules regarding
the [token](#tokens) rule, and are assumed to be the result of a
lexical-analysis phase feeding the parser, driven by a DFA, operating over the
disjunction of all such string table entries.
When such a string enclosed in double-quotes (`"`) occurs inside the grammar,
it is an implicit reference to a single member of such a string table
production. See [tokens](#tokens) for more information.
# Lexical structure
## Input format
Rust input is interpreted as a sequence of Unicode codepoints encoded in UTF-8.
Most Rust grammar rules are defined in terms of printable ASCII-range
codepoints, but a small number are defined in terms of Unicode properties or
explicit codepoint lists. [^inputformat]
[^inputformat]: Substitute definitions for the special Unicode productions are
provided to the grammar verifier, restricted to ASCII range, when verifying the
grammar in this document.
## Special Unicode Productions
The following productions in the Rust grammar are defined in terms of Unicode
properties: `ident`, `non_null`, `non_eol`, `non_single_quote` and
`non_double_quote`.
### Identifiers
The `ident` production is any nonempty Unicode[^non_ascii_idents] string of
the following form:
[^non_ascii_idents]: Non-ASCII characters in identifiers are currently feature
gated. This is expected to improve soon.
- The first character has property `XID_start`
- The remaining characters have property `XID_continue`
that does _not_ occur in the set of [keywords](#keywords).
> **Note**: `XID_start` and `XID_continue` as character properties cover the
> character ranges used to form the more familiar C and Java language-family
> identifiers.
### Delimiter-restricted productions
Some productions are defined by exclusion of particular Unicode characters:
- `non_null` is any single Unicode character aside from `U+0000` (null)
- `non_eol` is `non_null` restricted to exclude `U+000A` (`'\n'`)
- `non_single_quote` is `non_null` restricted to exclude `U+0027` (`'`)
- `non_double_quote` is `non_null` restricted to exclude `U+0022` (`"`)
## Comments
```antlr
comment : block_comment | line_comment ;
block_comment : "/*" block_comment_body * "*/" ;
block_comment_body : [block_comment | character] * ;
line_comment : "//" non_eol * ;
```
**FIXME:** add doc grammar?
## Whitespace
```antlr
whitespace_char : '\x20' | '\x09' | '\x0a' | '\x0d' ;
whitespace : [ whitespace_char | comment ] + ;
```
## Tokens
```antlr
simple_token : keyword | unop | binop ;
token : simple_token | ident | literal | symbol | whitespace token ;
```
### Keywords
<p id="keyword-table-marker"></p>
| | | | | |
|----------|----------|----------|----------|---------|
| abstract | alignof | as | become | box |
| break | const | continue | crate | do |
| else | enum | extern | false | final |
| fn | for | if | impl | in |
| let | loop | macro | match | mod |
| move | mut | offsetof | override | priv |
| proc | pub | pure | ref | return |
| Self | self | sizeof | static | struct |
| super | trait | true | type | typeof |
| unsafe | unsized | use | virtual | where |
| while | yield | | | |
Each of these keywords has special meaning in its grammar, and all of them are
excluded from the `ident` rule.
### Literals
```antlr
lit_suffix : ident;
literal : [ string_lit | char_lit | byte_string_lit | byte_lit | num_lit | bool_lit ] lit_suffix ?;
```
The optional `lit_suffix` production is only used for certain numeric literals,
but is reserved for future extension. That is, the above gives the lexical
grammar, but a Rust parser will reject everything but the 12 special cases
mentioned in [Number literals](reference.html#number-literals) in the
reference.
#### Character and string literals
```antlr
char_lit : '\x27' char_body '\x27' ;
string_lit : '"' string_body * '"' | 'r' raw_string ;
char_body : non_single_quote
| '\x5c' [ '\x27' | common_escape | unicode_escape ] ;
string_body : non_double_quote
| '\x5c' [ '\x22' | common_escape | unicode_escape ] ;
raw_string : '"' raw_string_body '"' | '#' raw_string '#' ;
common_escape : '\x5c'
| 'n' | 'r' | 't' | '0'
| 'x' hex_digit 2
unicode_escape : 'u' '{' hex_digit+ 6 '}';
hex_digit : 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
| 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
| dec_digit ;
oct_digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' ;
dec_digit : '0' | nonzero_dec ;
nonzero_dec: '1' | '2' | '3' | '4'
| '5' | '6' | '7' | '8' | '9' ;
```
#### Byte and byte string literals
```antlr
byte_lit : "b\x27" byte_body '\x27' ;
byte_string_lit : "b\x22" string_body * '\x22' | "br" raw_byte_string ;
byte_body : ascii_non_single_quote
| '\x5c' [ '\x27' | common_escape ] ;
byte_string_body : ascii_non_double_quote
| '\x5c' [ '\x22' | common_escape ] ;
raw_byte_string : '"' raw_byte_string_body '"' | '#' raw_byte_string '#' ;
```
#### Number literals
```antlr
num_lit : nonzero_dec [ dec_digit | '_' ] * float_suffix ?
| '0' [ [ dec_digit | '_' ] * float_suffix ?
| 'b' [ '1' | '0' | '_' ] +
| 'o' [ oct_digit | '_' ] +
| 'x' [ hex_digit | '_' ] + ] ;
float_suffix : [ exponent | '.' dec_lit exponent ? ] ? ;
exponent : ['E' | 'e'] ['-' | '+' ] ? dec_lit ;
dec_lit : [ dec_digit | '_' ] + ;
```
#### Boolean literals
```antlr
bool_lit : [ "true" | "false" ] ;
```
The two values of the boolean type are written `true` and `false`.
### Symbols
```antlr
symbol : "::" | "->"
| '#' | '[' | ']' | '(' | ')' | '{' | '}'
| ',' | ';' ;
```
Symbols are a general class of printable [tokens](#tokens) that play structural
roles in a variety of grammar productions. They are cataloged here for
completeness as the set of remaining miscellaneous printable tokens that do not
otherwise appear as [unary operators](#unary-operator-expressions), [binary
operators](#binary-operator-expressions), or [keywords](#keywords).
## Paths
```antlr
expr_path : [ "::" ] ident [ "::" expr_path_tail ] + ;
expr_path_tail : '<' type_expr [ ',' type_expr ] + '>'
| expr_path ;
type_path : ident [ type_path_tail ] + ;
type_path_tail : '<' type_expr [ ',' type_expr ] + '>'
| "::" type_path ;
```
# Syntax extensions
## Macros
```antlr
expr_macro_rules : "macro_rules" '!' ident '(' macro_rule * ')' ';'
| "macro_rules" '!' ident '{' macro_rule * '}' ;
macro_rule : '(' matcher * ')' "=>" '(' transcriber * ')' ';' ;
matcher : '(' matcher * ')' | '[' matcher * ']'
| '{' matcher * '}' | '$' ident ':' ident
| '$' '(' matcher * ')' sep_token? [ '*' | '+' ]
| non_special_token ;
transcriber : '(' transcriber * ')' | '[' transcriber * ']'
| '{' transcriber * '}' | '$' ident
| '$' '(' transcriber * ')' sep_token? [ '*' | '+' ]
| non_special_token ;
```
# Crates and source files
**FIXME:** grammar? What production covers #![crate_id = "foo"] ?
# Items and attributes
**FIXME:** grammar?
## Items
```antlr
item : vis ? mod_item | fn_item | type_item | struct_item | enum_item
| const_item | static_item | trait_item | impl_item | extern_block_item ;
```
### Type Parameters
**FIXME:** grammar?
### Modules
```antlr
mod_item : "mod" ident ( ';' | '{' mod '}' );
mod : [ view_item | item ] * ;
```
#### View items
```antlr
view_item : extern_crate_decl | use_decl ';' ;
```
##### Extern crate declarations
```antlr
extern_crate_decl : "extern" "crate" crate_name
crate_name: ident | ( ident "as" ident )
```
##### Use declarations
```antlr
use_decl : vis ? "use" [ path "as" ident
| path_glob ] ;
path_glob : ident [ "::" [ path_glob
| '*' ] ] ?
| '{' path_item [ ',' path_item ] * '}' ;
path_item : ident | "self" ;
```
### Functions
**FIXME:** grammar?
#### Generic functions
**FIXME:** grammar?
#### Unsafety
**FIXME:** grammar?
##### Unsafe functions
**FIXME:** grammar?
##### Unsafe blocks
**FIXME:** grammar?
#### Diverging functions
**FIXME:** grammar?
### Type definitions
**FIXME:** grammar?
### Structures
**FIXME:** grammar?
### Enumerations
**FIXME:** grammar?
### Constant items
```antlr
const_item : "const" ident ':' type '=' expr ';' ;
```
### Static items
```antlr
static_item : "static" ident ':' type '=' expr ';' ;
```
#### Mutable statics
**FIXME:** grammar?
### Traits
**FIXME:** grammar?
### Implementations
**FIXME:** grammar?
### External blocks
```antlr
extern_block_item : "extern" '{' extern_block '}' ;
extern_block : [ foreign_fn ] * ;
```
## Visibility and Privacy
```antlr
vis : "pub" ;
```
### Re-exporting and Visibility
See [Use declarations](#use-declarations).
## Attributes
```antlr
attribute : '#' '!' ? '[' meta_item ']' ;
meta_item : ident [ '=' literal
| '(' meta_seq ')' ] ? ;
meta_seq : meta_item [ ',' meta_seq ] ? ;
```
# Statements and expressions
## Statements
```antlr
stmt : decl_stmt | expr_stmt ;
```
### Declaration statements
```antlr
decl_stmt : item | let_decl ;
```
#### Item declarations
See [Items](#items).
#### Variable declarations
```antlr
let_decl : "let" pat [':' type ] ? [ init ] ? ';' ;
init : [ '=' ] expr ;
```
### Expression statements
```antlr
expr_stmt : expr ';' ;
```
## Expressions
```antlr
expr : literal | path | tuple_expr | unit_expr | struct_expr
| block_expr | method_call_expr | field_expr | array_expr
| idx_expr | range_expr | unop_expr | binop_expr
| paren_expr | call_expr | lambda_expr | while_expr
| loop_expr | break_expr | continue_expr | for_expr
| if_expr | match_expr | if_let_expr | while_let_expr
| return_expr ;
```
#### Lvalues, rvalues and temporaries
**FIXME:** grammar?
#### Moved and copied types
**FIXME:** Do we want to capture this in the grammar as different productions?
### Literal expressions
See [Literals](#literals).
### Path expressions
See [Paths](#paths).
### Tuple expressions
```antlr
tuple_expr : '(' [ expr [ ',' expr ] * | expr ',' ] ? ')' ;
```
### Unit expressions
```antlr
unit_expr : "()" ;
```
### Structure expressions
```antlr
struct_expr : expr_path '{' ident ':' expr
[ ',' ident ':' expr ] *
[ ".." expr ] '}' |
expr_path '(' expr
[ ',' expr ] * ')' |
expr_path ;
```
### Block expressions
```antlr
block_expr : '{' [ stmt ';' | item ] *
[ expr ] '}' ;
```
### Method-call expressions
```antlr
method_call_expr : expr '.' ident paren_expr_list ;
```
### Field expressions
```antlr
field_expr : expr '.' ident ;
```
### Array expressions
```antlr
array_expr : '[' "mut" ? array_elems? ']' ;
array_elems : [expr [',' expr]*] | [expr ';' expr] ;
```
### Index expressions
```antlr
idx_expr : expr '[' expr ']' ;
```
### Range expressions
```antlr
range_expr : expr ".." expr |
expr ".." |
".." expr |
".." ;
```
### Unary operator expressions
```antlr
unop_expr : unop expr ;
unop : '-' | '*' | '!' ;
```
### Binary operator expressions
```antlr
binop_expr : expr binop expr | type_cast_expr
| assignment_expr | compound_assignment_expr ;
binop : arith_op | bitwise_op | lazy_bool_op | comp_op
```
#### Arithmetic operators
```antlr
arith_op : '+' | '-' | '*' | '/' | '%' ;
```
#### Bitwise operators
```antlr
bitwise_op : '&' | '|' | '^' | "<<" | ">>" ;
```
#### Lazy boolean operators
```antlr
lazy_bool_op : "&&" | "||" ;
```
#### Comparison operators
```antlr
comp_op : "==" | "!=" | '<' | '>' | "<=" | ">=" ;
```
#### Type cast expressions
```antlr
type_cast_expr : value "as" type ;
```
#### Assignment expressions
```antlr
assignment_expr : expr '=' expr ;
```
#### Compound assignment expressions
```antlr
compound_assignment_expr : expr [ arith_op | bitwise_op ] '=' expr ;
```
### Grouped expressions
```antlr
paren_expr : '(' expr ')' ;
```
### Call expressions
```antlr
expr_list : [ expr [ ',' expr ]* ] ? ;
paren_expr_list : '(' expr_list ')' ;
call_expr : expr paren_expr_list ;
```
### Lambda expressions
```antlr
ident_list : [ ident [ ',' ident ]* ] ? ;
lambda_expr : '|' ident_list '|' expr ;
```
### While loops
```antlr
while_expr : [ lifetime ':' ] ? "while" no_struct_literal_expr '{' block '}' ;
```
### Infinite loops
```antlr
loop_expr : [ lifetime ':' ] ? "loop" '{' block '}';
```
### Break expressions
```antlr
break_expr : "break" [ lifetime ] ?;
```
### Continue expressions
```antlr
continue_expr : "continue" [ lifetime ] ?;
```
### For expressions
```antlr
for_expr : [ lifetime ':' ] ? "for" pat "in" no_struct_literal_expr '{' block '}' ;
```
### If expressions
```antlr
if_expr : "if" no_struct_literal_expr '{' block '}'
else_tail ? ;
else_tail : "else" [ if_expr | if_let_expr
| '{' block '}' ] ;
```
### Match expressions
```antlr
match_expr : "match" no_struct_literal_expr '{' match_arm * '}' ;
match_arm : attribute * match_pat "=>" [ expr "," | '{' block '}' ] ;
match_pat : pat [ '|' pat ] * [ "if" expr ] ? ;
```
### If let expressions
```antlr
if_let_expr : "if" "let" pat '=' expr '{' block '}'
else_tail ? ;
```
### While let loops
```antlr
while_let_expr : [ lifetime ':' ] ? "while" "let" pat '=' expr '{' block '}' ;
```
### Return expressions
```antlr
return_expr : "return" expr ? ;
```
# Type system
**FIXME:** is this entire chapter relevant here? Or should it all have been covered by some production already?
## Types
### Primitive types
**FIXME:** grammar?
#### Machine types
**FIXME:** grammar?
#### Machine-dependent integer types
**FIXME:** grammar?
### Textual types
**FIXME:** grammar?
### Tuple types
**FIXME:** grammar?
### Array, and Slice types
**FIXME:** grammar?
### Structure types
**FIXME:** grammar?
### Enumerated types
**FIXME:** grammar?
### Pointer types
**FIXME:** grammar?
### Function types
**FIXME:** grammar?
### Closure types
```antlr
closure_type := [ 'unsafe' ] [ '<' lifetime-list '>' ] '|' arg-list '|'
[ ':' bound-list ] [ '->' type ]
lifetime-list := lifetime | lifetime ',' lifetime-list
arg-list := ident ':' type | ident ':' type ',' arg-list
bound-list := bound | bound '+' bound-list
bound := path | lifetime
```
### Object types
**FIXME:** grammar?
### Type parameters
**FIXME:** grammar?
### Self types
**FIXME:** grammar?
## Type kinds
**FIXME:** this is probably not relevant to the grammar...
# Memory and concurrency models
**FIXME:** is this entire chapter relevant here? Or should it all have been covered by some production already?
## Memory model
### Memory allocation and lifetime
### Memory ownership
### Variables
### Boxes
## Threads
### Communication between threads
### Thread lifecycle