Add more aliases for Unicode confusable chars
Building upon #29837, this PR:
* added aliases for space characters,
* distinguished square brackets from parens, and
* added common CJK punctuation characters as aliases.
This will especially help CJK users who may have forgotten to switch off IME when coding.
Current code leads to messages like "... use a \xHH escape: \u{e4}"
which is confusing.
The printed span already points to the offending character, which
should be enough to identify the non-ASCII problem.
Fixes: #29088
syntax: Merge PathParsingMode::NoTypesAllowed and PathParsingMode::ImportPrefix
syntax: Rename PathParsingMode and its variants to better express their purpose
syntax: Remove obsolete error message about 'self lifetime
syntax: Remove ALLOW_MODULE_PATHS workaround
syntax/resolve: Adjust some error messages
resolve: Compare unhygienic (not renamed) names with keywords::Invalid, invalid identifiers may appear to be valid after renaming
This aligns with unicode recommendations and should be stable for all future
unicode releases. See http://unicode.org/reports/tr31/#R3.
This renames `libsyntax::lexer::is_whitespace` to `is_pattern_whitespace`
so potentially breaks users of libsyntax.
Given this code:
fn main() {
let _ = 'abcd';
}
The compiler would give a message like:
error: character literal may only contain one codepoint: ';
let _ = 'abcd';
^~
With this change, the message now displays:
error: character literal may only contain one codepoint: 'abcd'
let _ = 'abcd'
^~~~~~
Fixes#30033
If you try to put something that's bigger than a char into a char
literal, you get an error:
fn main() {
let c = 'ஶ்ரீ';
}
error: unterminated character constant:
This is a very compiler-centric message. Yes, it's technically
'unterminated', but that's not what you, the user did wrong.
Instead, this commit changes it to
error: character literal may only contain one codepoint
As this actually tells you what went wrong.
Fixes#28851
Previously, `/**/` was incorrectly regarded as a doc comment because it
starts with `/**` and ends with `*/`. However, this caused an ICE
because some code assumed that the length of a doc comment is at least
5. This commit adds an additional check to `is_block_doc_comment` that
tests the length of the input.
Fixes#28844.
This commit is an implementation of [RFC 1212][rfc] which tweaks the behavior of
the `str::lines` and `BufRead::lines` iterators. Both iterators now account for
`\r\n` sequences in addition to `\n`, allowing for less surprising behavior
across platforms (especially in the `BufRead` case). Splitting *only* on the
`\n` character can still be achieved with `split('\n')` in both cases.
The `str::lines_any` function is also now deprecated as `str::lines` is a
drop-in replacement for it.
[rfc]: https://github.com/rust-lang/rfcs/blob/master/text/1212-line-endings.mdCloses#28032
Inspired by the now-mysteriously-closed https://github.com/rust-lang/rust/pull/26782.
This PR introduces better error messages when unicode escapes have invalid format (e.g. `\uFFFF`). It also makes rustc always tell the user that escape may not be used in byte-strings and bytes and fixes some spans to not include unecessary characters and include escape backslash in some others.
This improves diagnostic messages when \u escape is used incorrectly and { is
missing. Instead of saying “unknown character escape: u”, it will now report
that unicode escape sequence is incomplete and suggest what the correct syntax
is.
This allows compiling entire crates from memory or preprocessing source files before they are tokenized.
Minor API refactoring included, which is a [breaking-change] for libsyntax users:
* `ParseSess::{next_node_id, reserve_node_ids}` moved to rustc's `Session`
* `new_parse_sess` -> `ParseSess::new`
* `new_parse_sess_special_handler` -> `ParseSess::with_span_handler`
* `mk_span_handler` -> `SpanHandler::new`
* `default_handler` -> `Handler::new`
* `mk_handler` -> `Handler::with_emitter`
* `string_to_filemap(sess source, path)` -> `sess.codemap().new_filemap(path, source)`
This changes the `ToTokens` implementations for expressions, statements,
etc. with almost-trivial ones that produce `Interpolated(*Nt(...))`
pseudo-tokens. In this way, quasiquote now works the same way as macros
do: already-parsed AST fragments are used as-is, not reparsed.
The `ToSource` trait is removed. Quasiquote no longer involves
pretty-printing at all, which removes the need for the
`encode_with_hygiene` hack. All associated machinery is removed.
A new `Nonterminal` is added, NtArm, which the parser now interpolates.
This is just for quasiquote, not macros (although it could be in the
future).
`ToTokens` is no longer implemented for `Arg` (although this could be
added again) and `Generics` (which I don't think makes sense).
This breaks any compiler extensions that relied on the ability of
`ToTokens` to turn AST fragments back into inspectable token trees. For
this reason, this closes#16987.
As such, this is a [breaking-change].
Fixes#16472.
Fixes#15962.
Fixes#17397.
Fixes#16617.
Changes the style guidelines regarding unit tests to recommend using a
sub-module named "tests" instead of "test" for unit tests as "test"
might clash with imports of libtest.