Design and open questions about libsyntax

The high-level description of the architecture is in RFC.md. You might also want to dig through https://github.com/matklad/fall/, which contains some pretty interesting stuff built using similar ideas (warning: it is completely undocumented, poorly written, and in general not something I recommend studying (yes, this is self-contradictory)).

Tree

The centerpiece of this whole endeavor is the syntax tree, in the tree module. Open questions:

how to best represent errors, to take advantage of the fact that they are rare, but to enable fully-persistent style structure sharing between tree nodes?
should we make red/green split from Roslyn more pronounced?
one can lay out nodes in a single array in such a way that the children of a node form a contiguous slice. Seems nifty, but do we need it?
should we use SoA or AoS for NodeData?
should we split leaf nodes and internal nodes into separate arrays? Can we use it to save some bits here and there? (leaves don't need first_child field, for example).
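
To make the layout questions above a bit more concrete, here is a minimal sketch of the two candidate representations. All names except NodeData and first_child (NodeIdx, AosTree, SoaTree, push, children) are made up for illustration, and the kind is a plain u16 standing in for SyntaxKind. This is not the actual tree module, just one way the options could look.

```rust
/// Index of a node in the backing array (hypothetical name).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeIdx(pub u32);

/// AoS flavor: one record per node, all fields side by side.
#[derive(Debug, Clone)]
pub struct NodeData {
    pub kind: u16, // stand-in for SyntaxKind
    pub parent: Option<NodeIdx>,
    pub first_child: Option<NodeIdx>, // always None for leaves
    pub next_sibling: Option<NodeIdx>,
}

#[derive(Debug, Default)]
pub struct AosTree {
    pub nodes: Vec<NodeData>,
}

/// SoA flavor: one array per field, all indexed by the same NodeIdx.
#[derive(Debug, Default)]
pub struct SoaTree {
    pub kinds: Vec<u16>,
    pub parents: Vec<Option<NodeIdx>>,
    pub first_children: Vec<Option<NodeIdx>>,
    pub next_siblings: Vec<Option<NodeIdx>>,
}

impl AosTree {
    /// Appends a node and links it as the last child of `parent`.
    pub fn push(&mut self, kind: u16, parent: Option<NodeIdx>) -> NodeIdx {
        let idx = NodeIdx(self.nodes.len() as u32);
        self.nodes.push(NodeData { kind, parent, first_child: None, next_sibling: None });
        if let Some(p) = parent {
            let first = self.nodes[p.0 as usize].first_child;
            match first {
                None => self.nodes[p.0 as usize].first_child = Some(idx),
                Some(mut last) => {
                    // Walk the sibling chain to find the current last child.
                    while let Some(next) = self.nodes[last.0 as usize].next_sibling {
                        last = next;
                    }
                    self.nodes[last.0 as usize].next_sibling = Some(idx);
                }
            }
        }
        idx
    }

    /// Collects the children of `node` in order.
    pub fn children(&self, node: NodeIdx) -> Vec<NodeIdx> {
        let mut out = Vec::new();
        let mut cur = self.nodes[node.0 as usize].first_child;
        while let Some(child) = cur {
            out.push(child);
            cur = self.nodes[child.0 as usize].next_sibling;
        }
        out
    }
}
```

In this sketch every leaf still pays for a first_child slot that is never used, which is the kind of waste the last question is about.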

Parser

The syntax tree is produced using a three-staged process.

First, raw text is split into tokens by a lexer (the lexer module). The lexer has a peculiar signature: it is an Fn(&str) -> Token, where a Token is a pair of a SyntaxKind (you should have read the tree module and RFC by this time! :)) and a len. That is, the lexer chomps only the first token of the input. This forces the lexer to be stateless and makes incremental relexing easy to implement.
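
A minimal sketch of a lexer with this shape, assuming a toy grammar of whitespace, identifiers and numbers. The SyntaxKind variants and the helper names (next_token, continues, tokenize) are invented for the example and are not the real lexer module.

```rust
/// Toy token kinds; the real SyntaxKind lives in the tree module.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SyntaxKind {
    Whitespace,
    Ident,
    Number,
    Error,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Token {
    pub kind: SyntaxKind,
    pub len: usize, // length in bytes
}

/// Returns true if `c` can continue a token of the given kind.
fn continues(kind: SyntaxKind, c: char) -> bool {
    match kind {
        SyntaxKind::Whitespace => c.is_whitespace(),
        SyntaxKind::Ident => c.is_alphanumeric() || c == '_',
        SyntaxKind::Number => c.is_ascii_digit(),
        SyntaxKind::Error => false,
    }
}

/// Lexes only the first token of `text`, which must be non-empty.
/// No state is carried between calls.
pub fn next_token(text: &str) -> Token {
    let first = text.chars().next().expect("next_token requires non-empty input");
    let kind = if first.is_whitespace() {
        SyntaxKind::Whitespace
    } else if first.is_alphabetic() || first == '_' {
        SyntaxKind::Ident
    } else if first.is_ascii_digit() {
        SyntaxKind::Number
    } else {
        SyntaxKind::Error
    };
    let len: usize = match kind {
        // Unknown characters become one-character error tokens.
        SyntaxKind::Error => first.len_utf8(),
        _ => text.chars().take_while(|&c| continues(kind, c)).map(char::len_utf8).sum(),
    };
    Token { kind, len }
}

/// Full tokenization is just a loop around the single-token function.
pub fn tokenize(mut text: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    while !text.is_empty() {
        let token = next_token(text);
        text = &text[token.len..];
        tokens.push(token);
    }
    tokens
}
```

Because next_token depends only on the text it is given, relexing after an edit can restart from any token boundary before the edit and stop once the new tokens line up with the old ones again.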

Then, in the bulk of the work, the parser turns a stream of tokens into a stream of events (the parser module; of particular interest are the parser/event and parser/parser modules, which contain the parsing API, and the parser/grammar module, which contains the actual parsing code for various Rust syntactic constructs). Note that the parser does not construct a tree right away. This is done for several reasons:

to decouple the actual tree data structure from the parser: you can build any data structure you want from the stream of events
to make parsing fast: you can produce a list of events without allocations
to make it easy to tweak tree structure. Consider this code:
```
#[cfg(test)]
pub fn foo() {}
```
Here, the attribute and the pub keyword must be children of the fn node. However, when parsing them, we don't yet know whether a function follows: it might just as well be a struct. If we use events, we generally don't care about this in the parser and just emit them in order.
(Is this true?) to make incremental reparsing easier: you can reuse the same rope data structure for all of the original string, the tokens and the events.
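
As an illustration of the attribute example, here is one way an event-emitting parser could postpone the decision about a node's kind. Everything in it is an assumption made for the sketch: the Kind and Event enums, the Tombstone placeholder, and the start/token/complete methods are not taken from the parser/event module, and the attribute's contents are abbreviated to a single token.

```rust
/// Toy node and token kinds for the `#[cfg(test)] pub fn foo() {}` example.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Tombstone, // placeholder for "kind not known yet"
    File,
    FnDef,
    Attr,
    Pound,
    PubKw,
    FnKw,
    Name,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Start(Kind), // open an internal node
    Token(Kind), // a leaf
    Finish,      // close the most recently opened node
}

#[derive(Debug, Default)]
pub struct Parser {
    pub events: Vec<Event>,
}

impl Parser {
    /// Opens a node whose kind is decided later; returns its position.
    pub fn start(&mut self) -> usize {
        self.events.push(Event::Start(Kind::Tombstone));
        self.events.len() - 1
    }

    pub fn token(&mut self, kind: Kind) {
        self.events.push(Event::Token(kind));
    }

    /// Closes the node opened at `marker` and patches in its real kind.
    pub fn complete(&mut self, marker: usize, kind: Kind) {
        self.events[marker] = Event::Start(kind);
        self.events.push(Event::Finish);
    }
}

pub fn parse_example() -> Vec<Event> {
    let mut p = Parser::default();
    let file = p.start();
    let item = p.start(); // could still turn out to be a fn or a struct
    let attr = p.start();
    p.token(Kind::Pound);
    p.complete(attr, Kind::Attr);
    p.token(Kind::PubKw);
    p.token(Kind::FnKw); // only now do we know this item is a function
    p.token(Kind::Name);
    p.complete(item, Kind::FnDef);
    p.complete(file, Kind::File);
    p.events
}
```

The attribute and pub events end up inside the FnDef node even though they were emitted before the fn keyword was seen, because only a flat list had to be patched and no tree had to be restructured.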

The parser also does not know about whitespace tokens: it's the job of the next layer to assign whitespace and comments to nodes. However, the parser can remap contextual tokens, like >> or union, so it has access to the text.

And at last, the TreeBuilder converts a flat stream of events into a tree structure. It also should be responsible for attaching comments and rebalancing the tree, but it does not do this yet :)
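
A rough sketch of what this last stage could look like, assuming a simplified event type and a nested Node enum instead of the real tree. The names build_tree, Node, Kind and Event are hypothetical. The sketch also shows the builder re-attaching whitespace that the parser never saw; trailing whitespace after the last token is simply dropped here.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    File,
    FnDef,
    FnKw,
    Name,
    Whitespace,
}

/// Events as the parser would emit them: no whitespace in sight.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Start(Kind), // open an internal node
    Token,       // consume the next non-whitespace token
    Finish,      // close the most recently opened node
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Node {
    Leaf(Kind, String),
    Inner(Kind, Vec<Node>),
}

/// Folds a flat event stream into a tree. `tokens` is the full lexer
/// output, whitespace included. Returns None on malformed input.
pub fn build_tree(events: &[Event], tokens: &[(Kind, &str)]) -> Option<Node> {
    let mut stack: Vec<(Kind, Vec<Node>)> = Vec::new();
    let mut pos = 0;
    let mut root = None;
    for event in events {
        match *event {
            Event::Start(kind) => stack.push((kind, Vec::new())),
            Event::Token => {
                let children = &mut stack.last_mut()?.1;
                // Attach whitespace that precedes the token to the current node.
                while tokens.get(pos)?.0 == Kind::Whitespace {
                    children.push(Node::Leaf(Kind::Whitespace, tokens[pos].1.to_string()));
                    pos += 1;
                }
                let (kind, text) = tokens[pos];
                children.push(Node::Leaf(kind, text.to_string()));
                pos += 1;
            }
            Event::Finish => {
                let (kind, children) = stack.pop()?;
                let node = Node::Inner(kind, children);
                match stack.last_mut() {
                    Some(parent) => parent.1.push(node),
                    None => root = Some(node),
                }
            }
        }
    }
    root
}
```

Where exactly whitespace and comments should attach (to the preceding node, the following one, or the parent) is a policy decision that lives entirely in this layer, so it can change without touching the parser.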

Validator

The parser and lexer intentionally accept a lot of invalid code. The idea is to post-process the tree and do proper error reporting, literal conversion and quick-fix suggestions there. There is no design/implementation for this yet.

AST

Nothing yet, see AstNode in fall.
