Commit Graph

58 Commits

Author SHA1 Message Date
Tobias Bucher
68efea08fa Restore char::escape_default and add char::escape instead 2016-07-26 15:15:00 +02:00
Tobias Bucher
e7d16580f5 Escape fewer Unicode codepoints in Debug impl of str
Use the same procedure as Python to determine whether a character is
printable, described in [PEP 3138]. In particular, this means that the
following character classes are escaped:

- Cc (Other, Control)
- Cf (Other, Format)
- Cs (Other, Surrogate), even though they can't appear in Rust strings
- Co (Other, Private Use)
- Cn (Other, Not Assigned)
- Zl (Separator, Line)
- Zp (Separator, Paragraph)
- Zs (Separator, Space), except for the ASCII space `' '` (`0x20`)

This allows for user-friendly inspection of strings that are not
English (e.g. compare `"\u{e9}\u{e8}\u{ea}"` to `"éèê"`).

Fixes #34318.

[PEP 3138]: https://www.python.org/dev/peps/pep-3138/
2016-07-23 00:18:44 +02:00
Srinivas Reddy Thatiparthy
c605480521 clean up for test cases 2016-06-09 08:20:08 +05:30
Srinivas Reddy Thatiparthy
c6ed7adf7a remove redundant assert statements 2016-06-09 08:12:31 +05:30
Alex Crichton
b64c9d5670 std: Clean out old unstable + deprecated APIs
These should all have been deprecated for at least one cycle, so this commit
cleans them all out.
2016-05-30 20:46:32 -07:00
Steve Klabnik
657cae03e9 Rollup merge of #32869 - bluss:char-boundary-test, r=brson
Add test for is_char_boundary

Add test for is_char_boundary

Apparently there was no test for this method. This test is rather simple, not exhaustive.
2016-04-14 14:49:09 -04:00
Alex Crichton
552eda70d3 std: Stabilize APIs for the 1.9 release
This commit applies all stabilizations, renamings, and deprecations that the
library team has decided on for the upcoming 1.9 release. All tracking issues
have gone through a cycle-long "final comment period" and the specific APIs
stabilized/deprecated are:

Stable

* `std::panic`
* `std::panic::catch_unwind` (renamed from `recover`)
* `std::panic::resume_unwind` (renamed from `propagate`)
* `std::panic::AssertUnwindSafe` (renamed from `AssertRecoverSafe`)
* `std::panic::UnwindSafe` (renamed from `RecoverSafe`)
* `str::is_char_boundary`
* `<*const T>::as_ref`
* `<*mut T>::as_ref`
* `<*mut T>::as_mut`
* `AsciiExt::make_ascii_uppercase`
* `AsciiExt::make_ascii_lowercase`
* `char::decode_utf16`
* `char::DecodeUtf16`
* `char::DecodeUtf16Error`
* `char::DecodeUtf16Error::unpaired_surrogate`
* `BTreeSet::take`
* `BTreeSet::replace`
* `BTreeSet::get`
* `HashSet::take`
* `HashSet::replace`
* `HashSet::get`
* `OsString::with_capacity`
* `OsString::clear`
* `OsString::capacity`
* `OsString::reserve`
* `OsString::reserve_exact`
* `OsStr::is_empty`
* `OsStr::len`
* `std::os::unix::thread`
* `RawPthread`
* `JoinHandleExt`
* `JoinHandleExt::as_pthread_t`
* `JoinHandleExt::into_pthread_t`
* `HashSet::hasher`
* `HashMap::hasher`
* `CommandExt::exec`
* `File::try_clone`
* `SocketAddr::set_ip`
* `SocketAddr::set_port`
* `SocketAddrV4::set_ip`
* `SocketAddrV4::set_port`
* `SocketAddrV6::set_ip`
* `SocketAddrV6::set_port`
* `SocketAddrV6::set_flowinfo`
* `SocketAddrV6::set_scope_id`
* `<[T]>::copy_from_slice`
* `ptr::read_volatile`
* `ptr::write_volatile`
* The `#[deprecated]` attribute
* `OpenOptions::create_new`

Deprecated

* `std::raw::Slice` - use raw parts of `slice` module instead
* `std::raw::Repr` - use raw parts of `slice` module instead
* `str::char_range_at` - use slicing plus `chars()` plus `len_utf8`
* `str::char_range_at_reverse` - use slicing plus `chars().rev()` plus `len_utf8`
* `str::char_at` - use slicing plus `chars()`
* `str::char_at_reverse` - use slicing plus `chars().rev()`
* `str::slice_shift_char` - use `chars()` plus `Chars::as_str`
* `CommandExt::session_leader` - use `before_exec` instead.

Closes #27719
cc #27751 (deprecating the `Slice` bits)
Closes #27754
Closes #27780
Closes #27809
Closes #27811
Closes #27830
Closes #28050
Closes #29453
Closes #29791
Closes #29935
Closes #30014
Closes #30752
Closes #31262
cc #31398 (still need to deal with `before_exec`)
Closes #31405
Closes #31572
Closes #31755
Closes #31756
2016-04-11 08:57:53 -07:00
Ulrik Sverdrup
f0a1ea27cc Add test for is_char_boundary 2016-04-10 20:09:26 +02:00
Alex Crichton
48d5fe9ec5 std: Change encode_utf{8,16} to return iterators
Currently these have non-traditional APIs which take a buffer and report how
much was filled in, but they're not necessarily ergonomic to use. Returning an
iterator which *also* exposes an underlying slice shouldn't result in any
performance loss as it's just a lazy version of the same implementation, and
it's also much more ergonomic!

cc #27784
2016-03-22 10:25:30 -07:00
Alex Crichton
0d5cfd9117 mk: Distribute fewer TARGET_CRATES
Right now everything in TARGET_CRATES is built by default for all non-fulldeps
tests and is distributed by default for all target standard library packages.
Currenly this includes a number of unstable crates which are rarely used such as
`graphviz` and `rbml`>

This commit trims down the set of `TARGET_CRATES`, moves a number of tests to
`*-fulldeps` as a result, and trims down the dependencies of libtest so we can
distribute fewer crates in the `rust-std` packages.
2016-03-07 13:05:12 -08:00
Ulrik Sverdrup
4594f0f67a Fix panic on string slicing error to truncate the string
The string may be arbitrarily long, but we want to limit the panic
message to a reasonable length. Truncate the string if it is too long
(simply to char boundary).

Also add details to the start <= end message. I think it's ok to flesh
out the code here, since it's in a cold function.
2016-03-05 18:11:52 +01:00
Marvin Löbel
dd67e55c10 Changed std::pattern::Pattern impl on &'a &'a str to &'a &'b str
in order to allow a bit more felixibility in how to use it.
2016-03-01 17:53:51 +01:00
Tobias Bucher
b27b8f63bc Add tests for Cow::from for strings, vectors and slices 2016-02-03 20:45:30 +01:00
bors
e7e4ecc522 Auto merge of #30740 - bluss:ascii-is-the-best, r=brson
Add fast path for ASCII in UTF-8 validation

This speeds up the ASCII case (and long stretches of ASCII in otherwise
mixed UTF-8 data) when checking UTF-8 validity.

Benchmark results suggest that on purely ASCII input, we can improve
throughput (megabytes verified / second) by a factor of 13 to 14 (smallish input).
On XML and mostly English language input (en.wikipedia XML dump),
throughput improves by a factor 7 (large input).

On mostly non-ASCII input, performance increases slightly or is the
same.

The UTF-8 validation is rewritten to use indexed access; since all
access is preceded by a (mandatory for validation) length check, bounds
checks are statically elided by LLVM and this formulation is in fact the best
for performance. A previous version had losses due to slice to iterator
conversions.

A large credit to Björn Steinbrink who improved this patch immensely,
writing this second version.

Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3.

Old code is `regular`, this PR is called `fast`.

Datasets:

- `ascii` is just ASCII (2.5 kB)
- `cyr` is cyrillic script with ascii spaces (5 kB)
- `dewik10` is 10MB of a de.wikipedia XML dump
- `enwik8` is 100MB of an en.wikipedia XML dump
- `jawik10` is 10MB of a ja.wikipedia XML dump

```
test from_utf8_ascii_fast        ... bench:         140 ns/iter (+/- 4) = 18221 MB/s
test from_utf8_ascii_regular     ... bench:       1,932 ns/iter (+/- 19) = 1320 MB/s
test from_utf8_cyr_fast          ... bench:      10,025 ns/iter (+/- 245) = 511 MB/s
test from_utf8_cyr_regular       ... bench:      10,944 ns/iter (+/- 795) = 468 MB/s
test from_utf8_dewik10_fast      ... bench:   6,017,909 ns/iter (+/- 105,755) = 1740 MB/s
test from_utf8_dewik10_regular   ... bench:  11,669,493 ns/iter (+/- 264,045) = 891 MB/s
test from_utf8_enwik8_fast       ... bench:  14,085,692 ns/iter (+/- 1,643,316) = 7000 MB/s
test from_utf8_enwik8_regular    ... bench:  93,657,410 ns/iter (+/- 5,353,353) = 1000 MB/s
test from_utf8_jawik10_fast      ... bench:  29,154,073 ns/iter (+/- 4,659,534) = 340 MB/s
test from_utf8_jawik10_regular   ... bench:  29,112,917 ns/iter (+/- 2,475,123) = 340 MB/s
```

Co-authored-by: Björn Steinbrink <bsteinbr@gmail.com>
2016-01-16 01:18:48 +00:00
Ulrik Sverdrup
11e3de39d9 Add fast path for ASCII in UTF-8 validation
This speeds up the ascii case (and long stretches of ascii in otherwise
mixed UTF-8 data) when checking UTF-8 validity.

Benchmark results suggest that on purely ASCII input, we can improve
throughput (megabytes verified / second) by a factor of 13 to 14!
On xml and mostly english language input (en.wikipedia xml dump),
throughput increases by a factor 7.

On mostly non-ASCII input, performance increases slightly or is the
same.

The UTF-8 validation is rewritten to use indexed access; since all
access is preceded by a (mandatory for validation) length check, they
are statically elided by llvm and this formulation is in fact the best
for performance. A previous version had losses due to slice to iterator
conversions.

A large credit to Björn Steinbrink who improved this patch immensely,
writing this second version.

Benchmark results on x86-64 (Sandy Bridge) compiled with -C opt-level=3.

Old code is `regular`, this PR is called `fast`.

Datasets:

- `ascii` is just ascii (2.5 kB)
- `cyr` is cyrillic script with ascii spaces (5 kB)
- `dewik10` is 10MB of a de.wikipedia xml dump
- `enwik10` is 100MB of an en.wikipedia xml dump
- `jawik10` is 10MB of a ja.wikipedia xml dump

```
test from_utf8_ascii_fast        ... bench:         140 ns/iter (+/- 4) = 18221 MB/s
test from_utf8_ascii_regular     ... bench:       1,932 ns/iter (+/- 19) = 1320 MB/s
test from_utf8_cyr_fast          ... bench:      10,025 ns/iter (+/- 245) = 511 MB/s
test from_utf8_cyr_regular       ... bench:      12,250 ns/iter (+/- 437) = 418 MB/s
test from_utf8_dewik10_fast      ... bench:   6,017,909 ns/iter (+/- 105,755) = 1740 MB/s
test from_utf8_dewik10_regular   ... bench:  11,669,493 ns/iter (+/- 264,045) = 891 MB/s
test from_utf8_enwik8_fast       ... bench:  14,085,692 ns/iter (+/- 1,643,316) = 7000 MB/s
test from_utf8_enwik8_regular    ... bench:  93,657,410 ns/iter (+/- 5,353,353) = 1000 MB/s
test from_utf8_jawik10_fast      ... bench:  29,154,073 ns/iter (+/- 4,659,534) = 340 MB/s
test from_utf8_jawik10_regular   ... bench:  29,112,917 ns/iter (+/- 2,475,123) = 340 MB/s
```

Co-authored-by: Björn Steinbrink <bsteinbr@gmail.com>
2016-01-12 21:57:04 +01:00
William Throwe
e7f3d6eddd Let str::replace take a pattern
It appears this was left out of RFC #528 because it might be useful to
also generalize the second argument in some way.  That doesn't seem to
prevent generalizing the first argument now, however.

This is a [breaking-change] because it could cause type-inference to
fail where it previously succeeded.
2015-12-07 22:08:33 -05:00
Kevin Butler
83b308e585 Add assertions to test_total_ord for str 2015-10-24 19:53:42 +01:00
Kevin Butler
49c78789ce Remove unnecessary String allocations from str tests 2015-10-24 19:53:33 +01:00
Cristi Cobzarenco
4b308b44e1 typos: fix a grabbag of typos all over the place 2015-10-08 19:49:31 +01:00
Alex Crichton
d5f2d3b177 std: Update MatchIndices to return a subslice
This commit updates the `MatchIndices` and `RMatchIndices` iterators to follow
the same pattern as the `chars` and `char_indices` iterators. The `matches`
iterator currently yield `&str` elements, so the `MatchIndices` iterator now
yields the index of the match as well as the `&str` that matched (instead of
start/end indexes).

cc #27743
2015-09-25 09:29:23 -07:00
Alex Crichton
48615a68fb std: Account for CRLF in {str, BufRead}::lines
This commit is an implementation of [RFC 1212][rfc] which tweaks the behavior of
the `str::lines` and `BufRead::lines` iterators. Both iterators now account for
`\r\n` sequences in addition to `\n`, allowing for less surprising behavior
across platforms (especially in the `BufRead` case). Splitting *only* on the
`\n` character can still be achieved with `split('\n')` in both cases.

The `str::lines_any` function is also now deprecated as `str::lines` is a
drop-in replacement for it.

[rfc]: https://github.com/rust-lang/rfcs/blob/master/text/1212-line-endings.md

Closes #28032
2015-09-03 23:01:41 -07:00
bors
b0f77ba26a Auto merge of #28101 - ijks:24214-str-bytes, r=alexcrichton
Specifically, `count`, `last`, and `nth` are implemented to use the
methods of the underlying slice iterator.

Partially closes #24214.
2015-08-31 09:15:55 +00:00
Daan Rijks
dacf2725ec Add overrides to iterator methods for str::Bytes
Specifically, `count`, `last`, and `nth` are implemented to use the
methods of the underlying slice iterator.

Partially closes #24214.
2015-08-30 17:32:50 +02:00
bors
de67d62c6b Auto merge of #27474 - bluss:twoway-reverse, r=brson
StrSearcher: Implement the complete reverse case for the two way algorithm

Fix quadratic behavior in StrSearcher in reverse search with periodic
needles.

This commit adds the missing pieces for the "short period" case in
reverse search. The short case will show up when the needle is literally
periodic, for example "abababab".

Two way uses a "critical factorization" of the needle: x = u v.

Searching matches v first, if mismatch at character k, skip k forward.
Matching u, if mismatch, skip period(x) forward.

To avoid O(mn) behavior after mismatch in u, memorize the already
matched prefix.

The short period case requires that |u| < period(x).

For the reverse search we need to compute a different critical
factorization x = u' v' where |v'| < period(x), because we are searching
for the reversed needle. A short v' also benefits the algorithm in
general.

The reverse critical factorization is computed quickly by using the same
maximal suffix algorithm, but terminating as soon as we have a location
with local period equal to period(x).

This adds extra fields crit_pos_back and memory_back for the reverse
case. The new overhead for TwoWaySearcher::new is low, and additionally
I think the "short period" case is uncommon in many applications of
string search.

The maximal_suffix methods were updated in documentation and the
algorithms updated to not use !0 and wrapping add, variable left is now
1 larger, offset 1 smaller.

Use periodicity when computing byteset: in the periodic case, just
iterate over one period instead of the whole needle.

Example before (rfind) after (twoway_rfind) benchmark shows the removal
of quadratic behavior.

needle: "ab" * 100, haystack: ("bb" + "ab" * 100) * 100

```
test periodic::rfind           ... bench:   1,926,595 ns/iter (+/- 11,390) = 10 MB/s
test periodic::twoway_rfind    ... bench:      51,740 ns/iter (+/- 66) = 386 MB/s
```
2015-08-18 02:02:57 +00:00
bors
e2bebf32fa Auto merge of #27696 - bluss:into-boxed-str, r=alexcrichton
Rename String::into_boxed_slice -> into_boxed_str

This is the name that was decided in rust-lang/rfcs#1152, and it's
better if we say “boxed str” for `Box<str>`.

The old name `String::into_boxed_slice` is deprecated.
2015-08-14 01:06:37 +00:00
Ulrik Sverdrup
bec64090a7 Rename String::into_boxed_slice -> into_boxed_str
This is the name that was decided in rust-lang/rfcs#1152, and it's
better if we say “boxed str” for `Box<str>`.

The old name `String::into_boxed_slice` is deprecated.
2015-08-13 14:02:00 +02:00
Alex Crichton
8d90d3f368 Remove all unstable deprecated functionality
This commit removes all unstable and deprecated functions in the standard
library. A release was recently cut (1.3) which makes this a good time for some
spring cleaning of the deprecated functions.
2015-08-12 14:55:17 -07:00
Ulrik Sverdrup
c5a1d8c3db StrSearcher: Add tests for rfind(&str)
Add tests for .rfind(&str), using the reverse searcher case for
substring search.
2015-08-02 20:08:35 +02:00
Alexis Beingessner
3e954a8cb2 implement Clone for Box<str>, closes #27323
This is a minor [breaking-change], as it changes what
`boxed_str.to_owned()` does (previously it would deref to `&str` and
call `to_owned` on that to get a `String`). However `Box<str>` is such an
exceptionally rare type that this is not expected to be a serious
concern. Also a `Box<str>` can be freely converted to a `String` to
obtain the previous behaviour anyway.
2015-07-29 18:43:01 -07:00
bors
dd46cf8b22 Auto merge of #26241 - SimonSapin:derefmut-for-string, r=alexcrichton
See https://github.com/rust-lang/rfcs/issues/1157
2015-07-13 23:47:06 +00:00
Simon Sapin
7469914e96 Add str::split_at_mut 2015-07-13 16:21:43 +02:00
bors
05d8767289 Auto merge of #26957 - wesleywiser:rename_connect_to_join, r=alexcrichton
Fixes #26900
2015-07-12 22:05:59 +00:00
Jonathan Reem
69521affbb Add String::into_boxed_slice and Box<str>::into_string
Implements merged RFC 1152.

Closes #26697.
2015-07-11 21:31:56 -07:00
Wesley Wiser
93ddee6cee Change some instances of .connect() to .join() 2015-07-10 19:40:46 -04:00
Ulrik Sverdrup
b890b7bbc7 StrSearcher: Update substring search to use the Two Way algorithm
To improve our substring search performance, revive the two way searcher
and adapt it to the Pattern API.

Fixes #25483, a performance bug: that particular case now completes faster
in optimized rust than in ruby (but they share the same order of magnitude).

Much thanks to @gereeter who helped me understand the reverse case
better and wrote the comment explaining `next_back` in the code.

I had quickcheck to fuzz test forward and reverse searching thoroughly.

The two way searcher implements both forward and reverse search,
but not double ended search. The forward and reverse parts of the two
way searcher are completely independent.

The two way searcher algorithm has very small, constant space overhead,
requiring no dynamic allocation. Our implementation is relatively fast,
especially due to the `byteset` addition to the algorithm, which speeds
up many no-match cases.

A bad case for the two way algorithm is:

```
let haystack = (0..10_000).map(|_| "dac").collect::<String>();
let needle = (0..100).map(|_| "bac").collect::<String>());
```

For this particular case, two way is not much faster than the naive
implementation it replaces.
2015-06-21 19:58:50 +02:00
Alex Crichton
ce1a965cf5 Fallout in tests and docs from feature renamings 2015-06-17 09:07:16 -07:00
bors
fbb13543fc Auto merge of #25839 - bluss:str-split-at-impl, r=alexcrichton
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).

Also a minor cleanup in the collections::str module. Remove redundant slicing of self.
2015-06-11 00:22:27 +00:00
Ulrik Sverdrup
d43bf53948 Add str::split_at
Implement RFC rust-lang/rfcs#1123

Add str method str::split_at(mid: usize) -> (&str, &str).
2015-06-10 09:15:07 +02:00
bors
f06e026578 Auto merge of #26039 - SimonSapin:case-mapping, r=alexcrichton
* Add “complex” mappings to `char::to_lowercase` and `char::to_uppercase`, making them yield sometimes more than on `char`: #25800. `str::to_lowercase` and `str::to_uppercase` are affected as well.
* Add `char::to_titlecase`, since it’s the same algorithm (just different data). However this does **not** add `str::to_titlecase`, as that would require UAX#29 Unicode Text Segmentation which we decided not to include in of `std`: https://github.com/rust-lang/rfcs/pull/1054 I made `char::to_titlecase` immediately `#[stable]`, since it’s so similar to `char::to_uppercase` that’s already stable. Let me know if it should be `#[unstable]` for a while.
* Add a special case for upper-case Sigma in word-final position in `str::to_lowercase`: #26035. This is the only language-independent conditional mapping currently in `SpecialCasing.txt`.
* Stabilize `str::to_lowercase` and `str::to_uppercase`. The `&self -> String` on `str` signature seems straightforward enough, and the only relevant issue I’ve found is #24536 about naming. But `char` already has stable methods with the same name, and deprecating them for a rename doesn’t seem worth it.

r? @alexcrichton
2015-06-09 20:00:32 +00:00
Simon Sapin
c57a4124ff Address a review comment and fix a bootstrapping issue 2015-06-08 19:50:28 +02:00
Simon Sapin
c160192f5f Replace usage of String::from_str with String:from 2015-06-08 16:55:35 +02:00
Simon Sapin
f901086b0d Correctly map upper-case Sigma to lower-case in word-final position. Fix #26035. 2015-06-06 12:37:11 +02:00
kwantam
c361e13d71 implement rfc 1054: split_whitespace() fn, deprecate words()
For now, words() is left in (but deprecated), and Words is a type alias for
struct SplitWhitespace.

Also cleaned up references to s.words() throughout codebase.

Closes #15628
2015-04-21 15:31:51 -04:00
kwantam
29d1252e4d deprecate Unicode functions that will be moved to crates.io
This patch
1. renames libunicode to librustc_unicode,
2. deprecates several pieces of libunicode (see below), and
3. removes references to deprecated functions from
   librustc_driver and libsyntax. This may change pretty-printed
   output from these modules in cases involving wide or combining
   characters used in filenames, identifiers, etc.

The following functions are marked deprecated:

1. char.width() and str.width():
   --> use unicode-width crate

2. str.graphemes() and str.grapheme_indices():
   --> use unicode-segmentation crate

3. str.nfd_chars(), str.nfkd_chars(), str.nfc_chars(), str.nfkc_chars(),
   char.compose(), char.decompose_canonical(), char.decompose_compatible(),
   char.canonical_combining_class():
   --> use unicode-normalization crate
2015-04-16 17:03:05 -04:00
Alex Crichton
f329030b09 std: Stabilize the Utf8Error type
The meaning of each variant of this enum was somewhat ambiguous and it's uncler
that we wouldn't even want to add more enumeration values in the future. As a
result this error has been altered to instead become an opaque structure.
Learning about the "first invalid byte index" is still an unstable feature, but
the type itself is now stable.
2015-04-10 16:07:46 -07:00
bors
dd6c4a8f15 Auto merge of #23293 - tbu-:pr_additive_multiplicative, r=alexcrichton
Previously it could not be implemented for types outside `libcore/iter.rs` due
to coherence issues.
2015-04-08 00:42:10 +00:00
Tobias Bucher
97f24a8596 Make sum and product inherent methods on Iterator
In addition to being nicer, this also allows you to use `sum` and `product` for
iterators yielding custom types aside from the standard integers.

Due to removing the `AdditiveIterator` and `MultiplicativeIterator` trait, this
is a breaking change.

[breaking-change]
2015-04-08 00:26:35 +02:00
Marvin Löbel
fbba28e246 Added smoke tests for new methods.
Fixed bug in existing StrSearcher impl
2015-04-05 18:52:58 +02:00
Marvin Löbel
c29559d28a Moved coretest::str tests into collectiontest::str 2015-04-05 18:52:58 +02:00
Alex Crichton
e98dce3e00 std: Changing the meaning of the count to splitn
This commit is an implementation of [RFC 979][rfc] which changes the meaning of
the count parameter to the `splitn` function on strings and slices. The
parameter now means the number of items that are returned from the iterator, not
the number of splits that are made.

[rfc]: https://github.com/rust-lang/rfcs/pull/979

Closes #23911
[breaking-change]
2015-04-01 13:29:42 -07:00