Use unchecked vec indexing since the vector bounds are checked by the
loop. Iterators are not easy to use in this case since we skip 1-4 bytes
each lap. This part of the commit speeds up is_utf8 for ASCII input.
Check codepoint ranges by checking the byte ranges manually instead of
computing a full decoding for multibyte encodings. This is easy to read
and corresponds to the UTF-8 syntax in the RFC.
No changes to what we accept. A comment notes that surrogate halves are
accepted.
Before:
test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3)
test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5)
After:
test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1)
test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3)
An improvement upon the previous pull #8133
Use unchecked vec indexing since the vector bounds are checked by the
loop. Iterators are not easy to use in this case since we skip 1-4 bytes
each lap. This part of the commit speeds up is_utf8 for ASCII input.
Check codepoint ranges by checking the byte ranges manually instead of
computing a full decoding for multibyte encodings. This is easy to read
and corresponds to the UTF-8 syntax in the RFC.
No changes to what we accept. A comment notes that surrogate halves are
accepted.
Before:
test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3)
test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5)
After:
test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1)
test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3)
The method .into_owned() is meant to be used as an optimization when you
need to get a ~str from a Str, but don't want to unnecessarily copy it
if it's already a ~str.
This is meant to ease functions that look like
fn foo<S: Str>(strs: &[S])
Previously they could work with the strings as slices using .as_slice(),
but producing ~str required copying the string, even if the vector
turned out be a &[~str] already.
Change the former repetition::
for 5.times { }
to::
do 5.times { }
.times() cannot be broken with `break` or `return` anymore; for those
cases, use a numerical range loop instead.
An 'overlong encoding' is a codepoint encoded non-minimally using the
utf-8 format. Denying these enforce each codepoint to have only one
valid representation in utf-8.
An example is byte sequence 0xE0 0x80 0x80 which could be interpreted as
U+0, but it's an overlong encoding since the canonical form is just
0x00.
Another example is 0xE0 0x80 0xAF which was previously accepted and is
an overlong encoding of the solidus "/". Directory traversal characters
like / and . form the most compelling argument for why this commit is
security critical.
Factor out common UTF-8 decoding expressions as macros. This commit will
partly duplicate UTF-8 decoding, so it is now present in both
fn is_utf8() and .char_range_at(); the latter using an assumption of
a valid str.
Bytes 0xC0, 0xC1 can only be used to start 2-byte codepoint encodings,
that are 'overlong encodings' of codepoints below 128.
The reference given in a comment -- https://tools.ietf.org/html/rfc3629
-- does in fact already exclude these bytes, so no additional comment
should be needed in the code.
Renamed bytes_iter to byte_iter to match other iterators
Refactored str Iterators to use DoubleEnded Iterators and typedefs instead of wrapper structs
Reordered the Iterator section
Whitespace fixup
Moved clunky `each_split_within` function to the one place in the tree where it's actually needed
Replaced all block doccomments in str with line doccomments
Drop the "Iterator" suffix for the the structs in std::iterator.
Filter, Zip, Chain etc. are shorter type names for when iterator
pipelines need their types written out in full in return value types, so
it's easier to read and write. the iterator module already forms enough
namespace.
This moves the raw struct layout of closures, vectors, boxes, and strings into a
new `unstable::raw` module. This is meant to be a centralized location to find
information for the layout of these values.
As safe method, `repr`, is provided to convert a rust value to its raw
representation. Unsafe methods to convert back are not provided because they are
rarely used and too numerous to write an implementation for each (not much of a
common pattern).
This is a cleanup pull request that does:
* removes `os::as_c_charp`
* moves `str::as_buf` and `str::as_c_str` into `StrSlice`
* converts some functions from `StrSlice::as_buf` to `StrSlice::as_c_str`
* renames `StrSlice::as_buf` to `StrSlice::as_imm_buf` (and adds `StrSlice::as_mut_buf` to match `vec.rs`.
* renames `UniqueStr::as_bytes_with_null_consume` to `UniqueStr::to_bytes`
* and other misc cleanups and minor optimizations
The theory is simple, the immutable iterators simply hold state
variables (indicies or pointers) into frozen containers. We can freely
clone these iterators, just like we can clone borrowed pointers.
VecIterator needs a manual impl to handle the lifetime struct member.
Moves multibyte code to it's own function to make char_range_at
easier to inline, and faster for single and multibyte chars.
Benchmarked reading example.json 100 times, 1.18s before, 1.08s
after.
It's broken/unmaintained and needs to be rewritten to avoid managed
pointers and needless copies. A full rewrite is necessary and the API
will need to be redone so it's not worth keeping this around.
Closes#2236, #2744