Florian Hahn 1e62bd2754 Allow bare semicolon in grammar doc, closes #28157

2015-10-24 00:45:18 +02:00

% Grammar

Introduction

This document is the primary reference for the Rust programming language grammar. It provides only one kind of material:

Chapters that formally define the language grammar.

This document does not serve as an introduction to the language. Background familiarity with the language is assumed. A separate guide is available to help acquire such background.

This document also does not serve as a reference to the standard library included in the language distribution. Those libraries are documented separately by extracting documentation attributes from their source code. Many of the features that one might expect to be language features are library features in Rust, so what you're looking for may be there, not here.

Notation

Rust's grammar is defined over Unicode codepoints, each conventionally denoted U+XXXX, for 4 or more hexadecimal digits X. Most of Rust's grammar is confined to the ASCII range of Unicode, and is described in this document by a dialect of Extended Backus-Naur Form (EBNF), specifically a dialect of EBNF supported by common automated LL(k) parsing tools such as llgen, rather than the dialect given in ISO 14977. The dialect can be defined self-referentially as follows:

grammar : rule + ;
rule    : nonterminal ':' productionrule ';' ;
productionrule : production [ '|' production ] * ;
production : term * ;
term : element repeats ;
element : LITERAL | IDENTIFIER | '[' productionrule ']' ;
repeats : [ '*' | '+' ] NUMBER ? | NUMBER ? | '?' ;

Where:

- Whitespace in the grammar is ignored.
- Square brackets are used to group rules.
- LITERAL is a single printable ASCII character, or an escaped hexadecimal ASCII code of the form \xQQ, in single quotes, denoting the corresponding Unicode codepoint U+00QQ.
- IDENTIFIER is a nonempty string of ASCII letters and underscores.
- The repeat forms apply to the adjacent element, and are as follows:
  - ? means zero or one repetition
  - * means zero or more repetitions
  - + means one or more repetitions
  - NUMBER trailing a repeat symbol gives a maximum repetition count
  - NUMBER on its own gives an exact repetition count

This EBNF dialect should be familiar to many readers.

Unicode productions

A few productions in Rust's grammar permit Unicode codepoints outside the ASCII range. We define these productions in terms of character properties specified in the Unicode standard, rather than in terms of ASCII-range codepoints. The section Special Unicode Productions lists these productions.

String table productions

Some rules in the grammar (notably unary operators, binary operators, and keywords) are given in a simplified form: as a table of unquoted, printable, whitespace-separated strings. These rules are a subset of the alternatives of the token rule. They are assumed to result from a lexical-analysis phase that feeds the parser, driven by a DFA operating over the disjunction of all such string table entries.

When such a string enclosed in double-quotes (") occurs inside the grammar, it is an implicit reference to a single member of such a string table production. See tokens for more information.

Lexical structure

Input format

Rust input is interpreted as a sequence of Unicode codepoints encoded in UTF-8. Most Rust grammar rules are defined in terms of printable ASCII-range codepoints, but a small number are defined in terms of Unicode properties or explicit codepoint lists. ¹

Special Unicode Productions

The following productions in the Rust grammar are defined in terms of Unicode properties: ident, non_null, non_eol, non_single_quote and non_double_quote.

Identifiers

The ident production is any nonempty Unicode² string of the following form:

- The first character has property XID_start
- The remaining characters have property XID_continue

that does not occur in the set of keywords.

Note: XID_start and XID_continue as character properties cover the character ranges used to form the more familiar C and Java language-family identifiers.
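As a rough illustration of the ident rule, the sketch below checks a string against an ASCII-only stand-in for the two Unicode properties. The function name `is_ascii_ident` and the approximation itself (ASCII letters for XID_start; ASCII letters, digits, and underscore for XID_continue) are assumptions made for this example, in the spirit of the ASCII-restricted substitute definitions mentioned in the footnotes. It follows the wording of this section literally, so it does not model the leading underscore that the Rust compiler also accepts.

```rust
/// ASCII-only stand-in for the `ident` production (hypothetical helper).
/// Assumption: XID_start is approximated by ASCII letters, and
/// XID_continue by ASCII letters, digits, and '_'.
fn is_ascii_ident(s: &str, keywords: &[&str]) -> bool {
    let mut chars = s.chars();
    // The first character must have the (approximated) XID_start property.
    let first_ok = match chars.next() {
        Some(c) => c.is_ascii_alphabetic(),
        None => false, // an ident is nonempty
    };
    first_ok
        // Every remaining character must have the (approximated) XID_continue property.
        && chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        // Keywords are excluded from the ident rule.
        && !keywords.contains(&s)
}
```

A caller would pass the keyword table from the Keywords section as `keywords`; a real implementation would consult the Unicode property tables instead of the ASCII checks.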

Delimiter-restricted productions

Some productions are defined by exclusion of particular Unicode characters:

- non_null is any single Unicode character aside from U+0000 (null)
- non_eol is non_null restricted to exclude U+000A ('\n')
- non_single_quote is non_null restricted to exclude U+0027 (')
- non_double_quote is non_null restricted to exclude U+0022 (")
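The four exclusion-based productions above map directly onto character predicates. A minimal sketch, assuming each production is modelled as a function over a single `char` (the function names simply mirror the production names):

```rust
/// Any single Unicode character aside from U+0000.
fn non_null(c: char) -> bool {
    c != '\u{0000}'
}

/// non_null, additionally excluding U+000A (line feed).
fn non_eol(c: char) -> bool {
    non_null(c) && c != '\n'
}

/// non_null, additionally excluding U+0027 (single quote).
fn non_single_quote(c: char) -> bool {
    non_null(c) && c != '\''
}

/// non_null, additionally excluding U+0022 (double quote).
fn non_double_quote(c: char) -> bool {
    non_null(c) && c != '"'
}
```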

Comments

comment : block_comment | line_comment ;
block_comment : "/*" block_comment_body * "*/" ;
block_comment_body : [block_comment | character] * ;
line_comment : "//" non_eol * ;

FIXME: add doc grammar?
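Because block_comment_body may itself contain a block_comment, a scanner has to track nesting depth rather than stop at the first "*/". The sketch below is one way to do that; the function name `block_comment_len` and its return convention are assumptions for illustration, not part of any real lexer.

```rust
/// Returns the byte length of the block comment at the start of `src`,
/// or None if `src` does not start with "/*" or the comment never closes.
/// Hypothetical helper; nesting is tracked with a depth counter.
fn block_comment_len(src: &str) -> Option<usize> {
    if !src.starts_with("/*") {
        return None;
    }
    let bytes = src.as_bytes();
    let mut depth = 0usize;
    let mut i = 0;
    while i + 1 < bytes.len() {
        if bytes[i] == b'/' && bytes[i + 1] == b'*' {
            // Opening delimiter: one level deeper.
            depth += 1;
            i += 2;
        } else if bytes[i] == b'*' && bytes[i + 1] == b'/' {
            // Closing delimiter: one level up; done when the outermost closes.
            depth -= 1;
            i += 2;
            if depth == 0 {
                return Some(i);
            }
        } else {
            i += 1;
        }
    }
    None // ran out of input while still inside a comment
}
```

Scanning bytes is safe here because both delimiters are ASCII, so the returned index always falls on a character boundary.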

Whitespace

whitespace_char : '\x20' | '\x09' | '\x0a' | '\x0d' ;
whitespace : [ whitespace_char | comment ] + ;

Tokens

simple_token : keyword | unop | binop ;
token : simple_token | ident | literal | symbol | whitespace token ;

Keywords


abstract	alignof	as	become	box
break	const	continue	crate	do
else	enum	extern	false	final
fn	for	if	impl	in
let	loop	macro	match	mod
move	mut	offsetof	override	priv
proc	pub	pure	ref	return
Self	self	sizeof	static	struct
super	trait	true	type	typeof
unsafe	unsized	use	virtual	where
while	yield

Each of these keywords has special meaning in the grammar, and all of them are excluded from the ident rule.

Literals

lit_suffix : ident;
literal : [ string_lit | char_lit | byte_string_lit | byte_lit | num_lit | bool_lit ] lit_suffix ?;

The optional lit_suffix production is only used for certain numeric literals, but is reserved for future extension. That is, the above gives the lexical grammar, but a Rust parser will reject everything but the 12 special cases mentioned in Number literals in the reference.

Character and string literals

char_lit : '\x27' char_body '\x27' ;
string_lit : '"' string_body * '"' | 'r' raw_string ;

char_body : non_single_quote
          | '\x5c' [ '\x27' | common_escape | unicode_escape ] ;

string_body : non_double_quote
            | '\x5c' [ '\x22' | common_escape | unicode_escape ] ;
raw_string : '"' raw_string_body '"' | '#' raw_string '#' ;

common_escape : '\x5c'
              | 'n' | 'r' | 't' | '0'
              | 'x' hex_digit 2 ;
unicode_escape : 'u' '{' hex_digit+ 6 '}';

hex_digit : 'a' | 'b' | 'c' | 'd' | 'e' | 'f'
          | 'A' | 'B' | 'C' | 'D' | 'E' | 'F'
          | dec_digit ;
oct_digit : '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' ;
dec_digit : '0' | nonzero_dec ;
nonzero_dec: '1' | '2' | '3' | '4'
           | '5' | '6' | '7' | '8' | '9' ;
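Under the Notation rules, `hex_digit+ 6` in unicode_escape reads as one to six hexadecimal digits. The sketch below decodes the part of an escape that follows the backslash; the function name `parse_unicode_escape` is an assumption for this example. Rejecting values that are not valid Unicode scalar values (such as surrogates) goes beyond what the lexical grammar states and reflects how Rust's `char` type works.

```rust
/// Decodes the text after the backslash of a unicode_escape, e.g. "u{1F600}".
/// Hypothetical helper; returns None for anything the rule does not match.
fn parse_unicode_escape(s: &str) -> Option<char> {
    // 'u' '{' ... '}'
    let digits = s.strip_prefix("u{")?.strip_suffix('}')?;
    // hex_digit+ 6: at least one and at most six hex digits.
    let well_formed = !digits.is_empty()
        && digits.len() <= 6
        && digits.chars().all(|c| c.is_ascii_hexdigit());
    if !well_formed {
        return None;
    }
    let value = u32::from_str_radix(digits, 16).ok()?;
    // Not every 21-bit value is a valid char (surrogates are excluded).
    std::char::from_u32(value)
}
```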

Byte and byte string literals

byte_lit : "b\x27" byte_body '\x27' ;
byte_string_lit : "b\x22" byte_string_body * '\x22' | "br" raw_byte_string ;

byte_body : ascii_non_single_quote
          | '\x5c' [ '\x27' | common_escape ] ;

byte_string_body : ascii_non_double_quote
            | '\x5c' [ '\x22' | common_escape ] ;
raw_byte_string : '"' raw_byte_string_body '"' | '#' raw_byte_string '#' ;

Number literals

num_lit : nonzero_dec [ dec_digit | '_' ] * float_suffix ?
        | '0' [       [ dec_digit | '_' ] * float_suffix ?
              | 'b'   [ '1' | '0' | '_' ] +
              | 'o'   [ oct_digit | '_' ] +
              | 'x'   [ hex_digit | '_' ] +  ] ;

float_suffix : [ exponent | '.' dec_lit exponent ? ] ? ;

exponent : ['E' | 'e'] ['-' | '+' ] ? dec_lit ;
dec_lit : [ dec_digit | '_' ] + ;
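To make the integer alternatives of num_lit concrete, the sketch below evaluates decimal, binary (`0b`), octal (`0o`), and hexadecimal (`0x`) literals, ignoring underscores. The function name `parse_int_lit` and the choice of `u64` as the result type are assumptions for illustration; float_suffix and literal suffixes are deliberately not handled.

```rust
/// Evaluates the integer forms of num_lit (hypothetical helper).
/// Returns None for text that is not an integer literal or overflows u64.
fn parse_int_lit(lit: &str) -> Option<u64> {
    // Every alternative of num_lit begins with a decimal digit.
    if !lit.starts_with(|c: char| c.is_ascii_digit()) {
        return None;
    }
    // Pick the radix from the prefix that follows a leading '0'.
    let (radix, digits) = if let Some(rest) = lit.strip_prefix("0b") {
        (2, rest)
    } else if let Some(rest) = lit.strip_prefix("0o") {
        (8, rest)
    } else if let Some(rest) = lit.strip_prefix("0x") {
        (16, rest)
    } else {
        (10, lit)
    };
    // Only digits and '_' separators may appear in the body.
    if !digits.chars().all(|c| c.is_ascii_alphanumeric() || c == '_') {
        return None;
    }
    let cleaned: String = digits.chars().filter(|&c| c != '_').collect();
    // from_str_radix rejects digits outside the radix and empty input.
    u64::from_str_radix(&cleaned, radix).ok()
}
```

A body made only of underscores, such as `0x_`, matches the grammar as written but has no value, so this sketch returns None for it.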

Boolean literals

bool_lit : [ "true" | "false" ] ;

The two values of the boolean type are written true and false.

Symbols

symbol : "::" | "->"
       | '#' | '[' | ']' | '(' | ')' | '{' | '}'
       | ',' | ';' ;

Symbols are a general class of printable tokens that play structural roles in a variety of grammar productions. They are cataloged here for completeness as the set of remaining miscellaneous printable tokens that do not otherwise appear as unary operators, binary operators, or keywords.

Paths

expr_path : [ "::" ] ident [ "::" expr_path_tail ] + ;
expr_path_tail : '<' type_expr [ ',' type_expr ] + '>'
               | expr_path ;

type_path : ident [ type_path_tail ] + ;
type_path_tail : '<' type_expr [ ',' type_expr ] + '>'
               | "::" type_path ;

Syntax extensions

Macros

expr_macro_rules : "macro_rules" '!' ident '(' macro_rule * ')' ';'
                 | "macro_rules" '!' ident '{' macro_rule * '}' ;
macro_rule : '(' matcher * ')' "=>" '(' transcriber * ')' ';' ;
matcher : '(' matcher * ')' | '[' matcher * ']'
        | '{' matcher * '}' | '$' ident ':' ident
        | '$' '(' matcher * ')' sep_token? [ '*' | '+' ]
        | non_special_token ;
transcriber : '(' transcriber * ')' | '[' transcriber * ']'
            | '{' transcriber * '}' | '$' ident
            | '$' '(' transcriber * ')' sep_token? [ '*' | '+' ]
            | non_special_token ;
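A small macro may help relate these productions to source text. In the sketch below, `$($x:expr),*` is a matcher of the form `'$' '(' matcher * ')' sep_token? '*'`, and `$(+ $x)*` is the corresponding repetition in the transcriber. The macro name `sum` and the wrapper function `sum_demo` are made up for this example.

```rust
// One macro_rule: '(' matcher * ')' "=>" '(' transcriber * ')' ';'
macro_rules! sum {
    ($($x:expr),*) => (0 $(+ $x)*);
}

/// Expands the macro with zero arguments and with three arguments.
fn sum_demo() -> (i32, i32) {
    (sum!(), sum!(1, 2, 3))
}
```

With no arguments the repetition expands to nothing and only the leading `0` remains; with three arguments the expansion is `0 + 1 + 2 + 3`.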

Crates and source files

FIXME: grammar? What production covers #![crate_id = "foo"] ?

Items and attributes

FIXME: grammar?

Items

item : vis ? mod_item | fn_item | type_item | struct_item | enum_item
     | const_item | static_item | trait_item | impl_item | extern_block_item ;

Type Parameters

FIXME: grammar?

Modules

mod_item : "mod" ident [ ';' | '{' mod '}' ] ;
mod : [ view_item | item ] * ;

View items

view_item : extern_crate_decl | use_decl ';' ;

Extern crate declarations

extern_crate_decl : "extern" "crate" crate_name ;
crate_name : ident | [ ident "as" ident ] ;

Use declarations

use_decl : vis ? "use" [ path "as" ident
                        | path_glob ] ;

path_glob : ident [ "::" [ path_glob
                          | '*' ] ] ?
          | '{' path_item [ ',' path_item ] * '}' ;

path_item : ident | "self" ;
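The three shapes of use_decl can be seen side by side in ordinary Rust source. This is only an illustrative sketch: the imported standard-library items and the function name `use_demo` were chosen for the example and are not prescribed by the grammar.

```rust
use std::cmp::{max, min};             // path_glob ending in a brace list
use std::collections::HashMap as Map; // path "as" ident
use std::f64::consts::*;              // path_glob ending in '*'

/// Uses one name from each import form above.
fn use_demo() -> (i32, i32, usize, f64) {
    let mut m: Map<&str, i32> = Map::new();
    m.insert("a", 1);
    (min(3, 7), max(3, 7), m.len(), PI)
}
```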

Functions

FIXME: grammar?

Generic functions

FIXME: grammar?

Unsafety

FIXME: grammar?

Unsafe functions

FIXME: grammar?

Unsafe blocks

FIXME: grammar?

Diverging functions

FIXME: grammar?

Type definitions

FIXME: grammar?

Structures

FIXME: grammar?

Enumerations

FIXME: grammar?

Constant items

const_item : "const" ident ':' type '=' expr ';' ;

Static items

static_item : "static" ident ':' type '=' expr ';' ;

Mutable statics

FIXME: grammar?

Traits

FIXME: grammar?

Implementations

FIXME: grammar?

External blocks

extern_block_item : "extern" '{' extern_block '}' ;
extern_block : [ foreign_fn ] * ;

Visibility and Privacy

vis : "pub" ;

Re-exporting and Visibility

See Use declarations.

Attributes

attribute : '#' '!' ? '[' meta_item ']' ;
meta_item : ident [ '=' literal
                  | '(' meta_seq ')' ] ? ;
meta_seq : meta_item [ ',' meta_seq ] ? ;

Statements and expressions

Statements

stmt : decl_stmt | expr_stmt | ';' ;

Declaration statements

decl_stmt : item | let_decl ;

Item declarations

See Items.

Variable declarations

let_decl : "let" pat [':' type ] ? [ init ] ? ';' ;
init : [ '=' ] expr ;

Expression statements

expr_stmt : expr ';' ;

Expressions

expr : literal | path | tuple_expr | unit_expr | struct_expr
     | block_expr | method_call_expr | field_expr | array_expr
     | idx_expr | range_expr | unop_expr | binop_expr
     | paren_expr | call_expr | lambda_expr | while_expr
     | loop_expr | break_expr | continue_expr | for_expr
     | if_expr | match_expr | if_let_expr | while_let_expr
     | return_expr ;

Lvalues, rvalues and temporaries

FIXME: grammar?

Moved and copied types

FIXME: Do we want to capture this in the grammar as different productions?

Literal expressions

See Literals.

Path expressions

See Paths.

Tuple expressions

tuple_expr : '(' [ expr [ ',' expr ] * | expr ',' ] ? ')' ;

Unit expressions

unit_expr : "()" ;

Structure expressions

struct_expr : expr_path '{' ident ':' expr
                      [ ',' ident ':' expr ] *
                      [ ".." expr ] ? '}' |
              expr_path '(' expr
                      [ ',' expr ] * ')' |
              expr_path ;

Block expressions

block_expr : '{' [ stmt ';' | item ] *
                 [ expr ] ? '}' ;

Method-call expressions

method_call_expr : expr '.' ident paren_expr_list ;

Field expressions

field_expr : expr '.' ident ;

Array expressions

array_expr : '[' "mut" ? array_elems? ']' ;

array_elems : [expr [',' expr]*] | [expr ';' expr] ;

Index expressions

idx_expr : expr '[' expr ']' ;

Range expressions

range_expr : expr ".." expr |
             expr ".." |
             ".." expr |
             ".." ;
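The four alternatives of range_expr correspond to the four ways a range can be written in source. A minimal sketch using slice indexing, where the function name `range_demo` and the sample data are assumptions for illustration:

```rust
/// Applies each form of range_expr as a slice index.
fn range_demo() -> (Vec<i32>, Vec<i32>, Vec<i32>, Vec<i32>) {
    let v = vec![10, 20, 30, 40];
    (
        v[1..3].to_vec(), // expr ".." expr
        v[2..].to_vec(),  // expr ".."
        v[..2].to_vec(),  // ".." expr
        v[..].to_vec(),   // ".."
    )
}
```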

Unary operator expressions

unop_expr : unop expr ;
unop : '-' | '*' | '!' ;

Binary operator expressions

binop_expr : expr binop expr | type_cast_expr
           | assignment_expr | compound_assignment_expr ;
binop : arith_op | bitwise_op | lazy_bool_op | comp_op ;

Arithmetic operators

arith_op : '+' | '-' | '*' | '/' | '%' ;

Bitwise operators

bitwise_op : '&' | '|' | '^' | "<<" | ">>" ;

Lazy boolean operators

lazy_bool_op : "&&" | "||" ;

Comparison operators

comp_op : "==" | "!=" | '<' | '>' | "<=" | ">=" ;

Type cast expressions

type_cast_expr : value "as" type ;

Assignment expressions

assignment_expr : expr '=' expr ;

Compound assignment expressions

compound_assignment_expr : expr [ arith_op | bitwise_op ] '=' expr ;

Grouped expressions

paren_expr : '(' expr ')' ;

Call expressions

expr_list : [ expr [ ',' expr ]* ] ? ;
paren_expr_list : '(' expr_list ')' ;
call_expr : expr paren_expr_list ;

Lambda expressions

ident_list : [ ident [ ',' ident ]* ] ? ;
lambda_expr : '|' ident_list '|' expr ;

While loops

while_expr : [ lifetime ':' ] ? "while" no_struct_literal_expr '{' block '}' ;

Infinite loops

loop_expr : [ lifetime ':' ] ? "loop" '{' block '}';

Break expressions

break_expr : "break" [ lifetime ] ?;

Continue expressions

continue_expr : "continue" [ lifetime ] ?;

For expressions

for_expr : [ lifetime ':' ] ? "for" pat "in" no_struct_literal_expr '{' block '}' ;
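The optional `lifetime ':'` prefix on loops pairs with the optional lifetime on break_expr and continue_expr: the label names which enclosing loop is affected. A small sketch, with the function name `first_pair_with_sum` and the search problem invented for this example:

```rust
/// Finds the first pair (a, b) with a <= b < limit and a + b == target.
fn first_pair_with_sum(limit: u32, target: u32) -> Option<(u32, u32)> {
    let mut found = None;
    // [ lifetime ':' ] ? "for" pat "in" expr '{' block '}'
    'outer: for a in 0..limit {
        for b in a..limit {
            if a + b > target {
                // "continue" lifetime: larger b only overshoots further,
                // so move on to the next a.
                continue 'outer;
            }
            if a + b == target {
                found = Some((a, b));
                // "break" lifetime: leave both loops at once.
                break 'outer;
            }
        }
    }
    found
}
```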

If expressions

if_expr : "if" no_struct_literal_expr '{' block '}'
          else_tail ? ;

else_tail : "else" [ if_expr | if_let_expr
                   | '{' block '}' ] ;

Match expressions

match_expr : "match" no_struct_literal_expr '{' match_arm * '}' ;

match_arm : attribute * match_pat "=>" [ expr "," | '{' block '}' ] ;

match_pat : pat [ '|' pat ] * [ "if" expr ] ? ;
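The following sketch exercises each part of match_arm and match_pat: a plain pattern, alternatives joined with `|`, an `if` guard, and a braced block as the arm body. The function name `classify` and its categories are assumptions for illustration.

```rust
/// Labels an integer using the different match_pat forms.
fn classify(n: i32) -> &'static str {
    match n {
        0 => "zero",                     // pat "=>" expr ","
        1 | 2 | 3 => "small",            // pat [ '|' pat ] *
        x if x < 0 => { "negative" }     // pat "if" expr, with a '{' block '}' body
        _ => "large",
    }
}
```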

If let expressions

if_let_expr : "if" "let" pat '=' expr '{' block '}'
               else_tail ? ;

While let loops

while_let_expr : [ lifetime ':' ] ? "while" "let" pat '=' expr '{' block '}' ;
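Both if_let_expr and while_let_expr share the shape `"let" pat '=' expr '{' block '}'`; the first runs its block at most once, the second repeats while the pattern keeps matching. A minimal sketch, with the function name `sum_and_first` made up for this example:

```rust
/// Returns the sum of all items and the first item (0 if the list is empty).
fn sum_and_first(mut items: Vec<i32>) -> (i32, i32) {
    // if_let_expr with an else_tail
    let first = if let Some(&x) = items.first() { x } else { 0 };
    let mut total = 0;
    // while_let_expr: loops until pop() stops matching Some(..)
    while let Some(top) = items.pop() {
        total += top;
    }
    (total, first)
}
```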

Return expressions

return_expr : "return" expr ? ;

Type system

FIXME: is this entire chapter relevant here? Or should it all have been covered by some production already?

Types

Primitive types

FIXME: grammar?

Machine types

FIXME: grammar?

Machine-dependent integer types

FIXME: grammar?

Textual types

FIXME: grammar?

Tuple types

FIXME: grammar?

Array and slice types

FIXME: grammar?

Structure types

FIXME: grammar?

Enumerated types

FIXME: grammar?

Pointer types

FIXME: grammar?

Function types

FIXME: grammar?

Closure types

closure_type := [ 'unsafe' ] [ '<' lifetime-list '>' ] '|' arg-list '|'
                [ ':' bound-list ] [ '->' type ]
lifetime-list := lifetime | lifetime ',' lifetime-list
arg-list := ident ':' type | ident ':' type ',' arg-list
bound-list := bound | bound '+' bound-list
bound := path | lifetime

Object types

FIXME: grammar?

Type parameters

FIXME: grammar?

Self types

FIXME: grammar?

Type kinds

FIXME: this is probably not relevant to the grammar...

Memory and concurrency models

FIXME: is this entire chapter relevant here? Or should it all have been covered by some production already?

Memory model

Memory allocation and lifetime

Memory ownership

Variables

Boxes

Threads

Communication between threads

Thread lifecycle

¹ Substitute definitions for the special Unicode productions are provided to the grammar verifier, restricted to ASCII range, when verifying the grammar in this document.
² Non-ASCII characters in identifiers are currently feature gated. This is expected to improve soon.
