Commit Graph

136 Commits

Author SHA1 Message Date
bors
f7c4359a2c auto merge of #8237 : blake2-ppc/rust/faster-utf8, r=brson
Use unchecked vec indexing since the vector bounds are checked by the
loop. Iterators are not easy to use in this case since we skip 1-4 bytes
each lap. This part of the commit speeds up is_utf8 for ASCII input.

Check codepoint ranges by checking the byte ranges manually instead of
computing a full decoding for multibyte encodings. This is easy to read
and corresponds to the UTF-8 syntax in the RFC.

No changes to what we accept. A comment notes that surrogate halves are
accepted.

Before:

	test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3)
	test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5)

After:
	test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1)
	test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3)

An improvement upon the previous pull #8133
2013-08-04 07:10:56 -07:00
Daniel Micay
1008945528 remove obsolete foreach keyword
this has been replaced by `for`
2013-08-03 22:48:02 -04:00
blake2-ppc
0504d7e57b std: Speed up str::is_utf8
Use unchecked vec indexing since the vector bounds are checked by the
loop. Iterators are not easy to use in this case since we skip 1-4 bytes
each lap. This part of the commit speeds up is_utf8 for ASCII input.

Check codepoint ranges by checking the byte ranges manually instead of
computing a full decoding for multibyte encodings. This is easy to read
and corresponds to the UTF-8 syntax in the RFC.

No changes to what we accept. A comment notes that surrogate halves are
accepted.

Before:

	test str::bench::is_utf8_100_ascii ... bench: 165 ns/iter (+/- 3)
	test str::bench::is_utf8_100_multibyte ... bench: 218 ns/iter (+/- 5)

After:
	test str::bench::is_utf8_100_ascii ... bench: 130 ns/iter (+/- 1)
	test str::bench::is_utf8_100_multibyte ... bench: 156 ns/iter (+/- 3)
2013-08-02 23:20:57 +02:00
Kevin Ballard
aa94dfa625 str: Add method .into_owned(self) -> ~str to Str
The method .into_owned() is meant to be used as an optimization when you
need to get a ~str from a Str, but don't want to unnecessarily copy it
if it's already a ~str.

This is meant to ease functions that look like

  fn foo<S: Str>(strs: &[S])

Previously they could work with the strings as slices using .as_slice(),
but producing ~str required copying the string, even if the vector
turned out be a &[~str] already.
2013-08-01 15:54:58 -07:00
blake2-ppc
78cde5b9fb std: Change Times trait to use do instead of for
Change the former repetition::

    for 5.times { }

to::

    do 5.times { }

.times() cannot be broken with `break` or `return` anymore; for those
cases, use a numerical range loop instead.
2013-08-01 16:54:22 +02:00
Daniel Micay
1fc4db2d08 migrate many for loops to foreach 2013-08-01 05:34:55 -04:00
Daniel Micay
dabd476203 make in and foreach get treated as keywords 2013-08-01 00:21:13 -04:00
blake2-ppc
8f9014c159 std: Mark the static constants in str.rs as private
static variables are pub by default, which is not reflected in our code
(we need to use priv).
2013-07-30 19:34:54 +02:00
blake2-ppc
aa89325cb0 std: Add from_bytes test for utf-8 using codepoints above 0xffff 2013-07-30 19:16:12 +02:00
blake2-ppc
b4ff95599a std: Deny overlong encodings in UTF-8
An 'overlong encoding' is a codepoint encoded non-minimally using the
utf-8 format. Denying these enforce each codepoint to have only one
valid representation in utf-8.

An example is byte sequence 0xE0 0x80 0x80 which could be interpreted as
U+0, but it's an overlong encoding since the canonical form is just
0x00.

Another example is 0xE0 0x80 0xAF which was previously accepted and is
an overlong encoding of the solidus "/". Directory traversal characters
like / and . form the most compelling argument for why this commit is
security critical.

Factor out common UTF-8 decoding expressions as macros. This commit will
partly duplicate UTF-8 decoding, so it is now present in both
fn is_utf8() and .char_range_at(); the latter using an assumption of
a valid str.
2013-07-30 19:16:12 +02:00
blake2-ppc
6dd185930d std: Disallow bytes 0xC0, 0xC1 (192, 193) in utf-8
Bytes 0xC0, 0xC1 can only be used to start 2-byte codepoint encodings,
that are 'overlong encodings' of codepoints below 128.

The reference given in a comment -- https://tools.ietf.org/html/rfc3629
-- does in fact already exclude these bytes, so no additional comment
should be needed in the code.
2013-07-30 17:25:29 +02:00
bors
576f395ddf auto merge of #8121 : thestinger/rust/offset, r=alexcrichton
Closes #8118, #7136

~~~rust
extern mod extra;

use std::vec;
use std::ptr;

fn bench_from_elem(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let v: ~[u8] = vec::from_elem(1024, 0u8);
    }
}

fn bench_set_memory(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let mut v: ~[u8] = vec::with_capacity(1024);
        unsafe {
            let vp = vec::raw::to_mut_ptr(v);
            ptr::set_memory(vp, 0, 1024);
            vec::raw::set_len(&mut v, 1024);
        }
    }
}

fn bench_vec_repeat(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let v: ~[u8] = ~[0u8, ..1024];
    }
}
~~~

Before:

    test bench_from_elem ... bench: 415 ns/iter (+/- 17)
    test bench_set_memory ... bench: 85 ns/iter (+/- 4)
    test bench_vec_repeat ... bench: 83 ns/iter (+/- 3)

After:

    test bench_from_elem ... bench: 84 ns/iter (+/- 2)
    test bench_set_memory ... bench: 84 ns/iter (+/- 5)
    test bench_vec_repeat ... bench: 84 ns/iter (+/- 3)
2013-07-30 07:01:19 -07:00
Marvin Löbel
e33fca9ffe Added str::char_offset_iter() and str::rev_char_offset_iter()
Renamed bytes_iter to byte_iter to match other iterators
Refactored str Iterators to use DoubleEnded Iterators and typedefs instead of wrapper structs
Reordered the Iterator section
Whitespace fixup
Moved clunky `each_split_within` function to the one place in the tree where it's actually needed
Replaced all block doccomments in str with line doccomments
2013-07-30 12:55:48 +02:00
Daniel Micay
ef870d37a5 implement pointer arithmetic with GEP
Closes #8118, #7136

~~~rust
extern mod extra;

use std::vec;
use std::ptr;

fn bench_from_elem(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let v: ~[u8] = vec::from_elem(1024, 0u8);
    }
}

fn bench_set_memory(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let mut v: ~[u8] = vec::with_capacity(1024);
        unsafe {
            let vp = vec::raw::to_mut_ptr(v);
            ptr::set_memory(vp, 0, 1024);
            vec::raw::set_len(&mut v, 1024);
        }
    }
}

fn bench_vec_repeat(b: &mut extra::test::BenchHarness) {
    do b.iter {
        let v: ~[u8] = ~[0u8, ..1024];
    }
}
~~~

Before:

    test bench_from_elem ... bench: 415 ns/iter (+/- 17)
    test bench_set_memory ... bench: 85 ns/iter (+/- 4)
    test bench_vec_repeat ... bench: 83 ns/iter (+/- 3)

After:

    test bench_from_elem ... bench: 84 ns/iter (+/- 2)
    test bench_set_memory ... bench: 84 ns/iter (+/- 5)
    test bench_vec_repeat ... bench: 84 ns/iter (+/- 3)
2013-07-30 02:50:31 -04:00
blake2-ppc
5307d3674e std: Implement Extendable for hashmap, str and trie 2013-07-30 02:32:38 +02:00
blake2-ppc
4b45f47881 std: Rename Iterator adaptor types to drop the -Iterator suffix
Drop the "Iterator" suffix for the the structs in std::iterator.
Filter, Zip, Chain etc. are shorter type names for when iterator
pipelines need their types written out in full in return value types, so
it's easier to read and write. the iterator module already forms enough
namespace.
2013-07-29 04:20:56 +02:00
blake2-ppc
4849a42bf6 std: Implement FromIterator for ~str
FromIterator initially only implemented for Iterator<char>, which is the
type of the main iterator.
2013-07-29 02:40:28 +02:00
jmgrosen
a0f0f3012e Refactored vec and str iterators to remove prefixes 2013-07-28 13:37:35 -07:00
bors
5157e05049 auto merge of #8036 : sfackler/rust/container-impls, r=msullivan
A couple of implementations of Container::is_empty weren't exactly
self.len() == 0 so I left them alone (e.g. Treemap).
2013-07-27 11:16:31 -07:00
Alex Crichton
5aaaca0c6a Consolidate raw representations of rust values
This moves the raw struct layout of closures, vectors, boxes, and strings into a
new `unstable::raw` module. This is meant to be a centralized location to find
information for the layout of these values.

As safe method, `repr`, is provided to convert a rust value to its raw
representation. Unsafe methods to convert back are not provided because they are
rarely used and too numerous to write an implementation for each (not much of a
common pattern).
2013-07-26 09:53:03 -07:00
Steven Fackler
feb18fe8da Added default impls for container methods
A couple of implementations of Container::is_empty weren't exactly
self.len() == 0 so I left them alone (e.g. Treemap).
2013-07-25 15:17:30 -07:00
bors
330378d1a1 auto merge of #7996 : erickt/rust/cleanup-strs, r=erickt
This is a cleanup pull request that does:

* removes `os::as_c_charp`
* moves `str::as_buf` and `str::as_c_str` into `StrSlice`
* converts some functions from `StrSlice::as_buf` to `StrSlice::as_c_str`
* renames `StrSlice::as_buf` to `StrSlice::as_imm_buf` (and adds `StrSlice::as_mut_buf` to match `vec.rs`.
* renames `UniqueStr::as_bytes_with_null_consume` to `UniqueStr::to_bytes`
* and other misc cleanups and minor optimizations
2013-07-24 13:25:36 -07:00
Birunthan Mohanathas
d047cf1ec6 Change 'print(fmt!(...))' to printf!/printfln! in src/lib* 2013-07-24 09:45:20 -04:00
Erick Tryzelaar
9c3679a9a2 std: make str::append move self
This eliminates a copy and fixes a FIXME.
2013-07-23 16:57:00 -07:00
Erick Tryzelaar
bbedbc0450 std: inline str::with_capacity and vec::with_capacity 2013-07-23 16:57:00 -07:00
Erick Tryzelaar
cced3c9013 std: simplify str::as_imm_buf and vec::as_{imm,mut}_buf 2013-07-23 16:57:00 -07:00
Erick Tryzelaar
037a5b1af4 str: move as_mut_buf into OwnedStr, and make it self 2013-07-23 16:56:58 -07:00
Erick Tryzelaar
31b77aecfc std: remove str::to_owned and str::raw::slice_bytes_owned 2013-07-23 16:56:23 -07:00
Erick Tryzelaar
cc9666f68f std: rename str.as_buf to as_imm_buf, add str.as_mut_buf 2013-07-23 16:56:22 -07:00
Erick Tryzelaar
cf75330807 std: add test for str::as_c_str 2013-07-23 16:56:22 -07:00
Erick Tryzelaar
7af56bb921 std: move StrUtil::as_c_str into StrSlice 2013-07-23 16:56:22 -07:00
Erick Tryzelaar
9fdec67a67 std: move str::as_buf into StrSlice 2013-07-23 16:56:22 -07:00
Erick Tryzelaar
9ad815e063 std: rename str.as_bytes_with_null_consume to str.to_bytes_with_null 2013-07-23 16:56:17 -07:00
Graydon Hoare
978e5d94bc std: wrap "long" utf8 lines. 2013-07-23 16:02:14 -07:00
Graydon Hoare
e5cbede103 std: add preliminary str benchmark. 2013-07-22 16:56:10 -07:00
Daniel Micay
ed67cdb73c new snapshot 2013-07-22 01:09:48 -04:00
bors
fe3f75ff8e auto merge of #7932 : blake2-ppc/rust/str-clear, r=huonw
~str and @str need separate implementations for use in generic
functions, where it will not automatically use the impl on &str.

fixes issue #7900
2013-07-21 15:28:38 -07:00
blake2-ppc
24b6901b26 std: Implement Clone for VecIterator and iterators using it
The theory is simple, the immutable iterators simply hold state
variables (indicies or pointers) into frozen containers. We can freely
clone these iterators, just like we can clone borrowed pointers.

VecIterator needs a manual impl to handle the lifetime struct member.
2013-07-20 20:30:57 +02:00
blake2-ppc
3509f9d5ae str: Implement Container for ~str, @str and Mutable for ~str
~str and @str need separate implementations for use in generic
functions, where it will not automatically use the impl on &str.
2013-07-20 19:28:38 +02:00
Patrick Walton
99b33f7219 librustc: Remove all uses of "copy". 2013-07-17 14:57:51 -07:00
Daniel Micay
e118555ce6 remove headers from unique vectors 2013-07-15 23:57:27 -04:00
Gary Linscott
5aee5a11e3 Optimize is_utf8
Manually unroll the multibyte loops, and optimize for the single
byte chars.
2013-07-11 14:23:15 -04:00
Gary Linscott
179637304a char_range_at perf work
Moves multibyte code to it's own function to make char_range_at
easier to inline, and faster for single and multibyte chars.

Benchmarked reading example.json 100 times, 1.18s before, 1.08s
after.
2013-07-11 14:23:14 -04:00
bors
30c8aac677 auto merge of #7612 : thestinger/rust/utf8, r=huonw 2013-07-08 16:10:53 -07:00
Daniel Micay
44770ae3a8 Merge pull request #7595 from thestinger/iterator
remove some method resolve workarounds
2013-07-08 01:42:07 -07:00
Daniel Micay
641aec7407 remove some method resolve workarounds 2013-07-07 19:51:13 -04:00
Daniel Micay
01833de7ea remove extra::rope
It's broken/unmaintained and needs to be rewritten to avoid managed
pointers and needless copies. A full rewrite is necessary and the API
will need to be redone so it's not worth keeping this around.

Closes #2236, #2744
2013-07-06 17:06:30 -04:00
Daniel Micay
51eb1e14d4 str: stop encoding invalid out-of-range char 2013-07-05 21:07:37 -04:00
Huon Wilson
cdea73cf5b Convert vec::{as_imm_buf, as_mut_buf} to methods. 2013-07-04 00:46:50 +10:00
Huon Wilson
c437a16c5d rustc: add a lint to enforce uppercase statics. 2013-07-01 17:52:57 +10:00