2014-08-01 18:31:22 -05:00
% The Guide to Rust Strings

Strings are an important concept to master in any programming language. If you
come from a managed language background, you may be surprised at the complexity
of string handling in a systems programming language. Efficient access and
allocation of memory for a dynamically sized structure involves a lot of
details. Luckily, Rust has lots of tools to help us here.

A **string** is a sequence of Unicode scalar values encoded as a stream of
UTF-8 bytes. All strings are guaranteed to be validly encoded UTF-8 sequences.
Additionally, strings are not null-terminated and can contain null bytes.

Rust has two main types of strings: `&str` and `String`.

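
The encoding guarantees described above can be observed directly. The following is a minimal sketch, assuming a current stable Rust toolchain; the helper name `utf8_bytes` is made up for this example and is not part of the standard library.

```rust
/// Hypothetical helper: copies out the UTF-8 bytes that back `s`.
fn utf8_bytes(s: &str) -> Vec<u8> {
    s.as_bytes().to_vec()
}

fn main() {
    // A null byte is an ordinary character inside a Rust string;
    // it does not terminate it.
    let with_nul = "a\0b";
    println!("{:?}", utf8_bytes(with_nul));

    // U+00E9 (e with acute accent) is a single scalar value,
    // but it is stored as two UTF-8 bytes.
    let accented = "\u{e9}";
    println!("{:?}", utf8_bytes(accented));
}
```
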

# &str

The first kind is a `&str`. This is called a 'string slice'.
String literals are of the type `&str`:

```{rust}
let string = "Hello there.";
```

Like any Rust reference, a string slice has an associated lifetime. A string
literal is a `&'static str`. A string slice can be written without an explicit
lifetime in many cases, such as in function arguments. In these cases the
lifetime will be inferred:

```{rust}
fn takes_slice(slice: &str) {
    println!("Got: {}", slice);
}
```

Like vector slices, string slices are simply a pointer plus a length. This
means that they're a 'view' into an already-allocated string, such as a
`&'static str` or a `String`.

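
To make the 'view' idea concrete, here is a small sketch, assuming a current stable Rust toolchain. The function `first_word` is a made-up helper for illustration; it returns a slice that points into the caller's string rather than copying it.

```rust
/// Hypothetical helper: returns the text before the first space as a
/// borrowed view into `text`. Nothing is copied or allocated.
fn first_word(text: &str) -> &str {
    match text.find(' ') {
        Some(end) => &text[..end],
        None => text,
    }
}

fn main() {
    let owned = String::from("Hello there.");
    let word = first_word(&owned);

    // The slice records only a start address and its own length.
    println!("{} ({} bytes)", word, word.len());

    // It starts at the same address as the `String`'s buffer.
    println!("same start: {}", word.as_ptr() == owned.as_ptr());
}
```
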
# String

A `String` is a heap-allocated string. This string is growable, and is also
guaranteed to be UTF-8.

```{rust}
let mut s = "Hello".to_string();
println!("{}", s);

s.push_str(", world.");
println!("{}", s);
```

You can borrow a `String` as a `&str` with the `as_str()` method (named
`as_slice()` before Rust 1.0):

```{rust}
fn takes_slice(slice: &str) {
    println!("Got: {}", slice);
}

fn main() {
    let s = "Hello".to_string();
    takes_slice(s.as_str());
}
```

You can also get a `&str` from a stack-allocated array of bytes:

```{rust}
use std::str;

let x: &[u8] = &[b'a', b'b'];
// `from_utf8` returns a `Result`; `unwrap()` panics if the bytes are not valid UTF-8.
let stack_str: &str = str::from_utf8(x).unwrap();
```

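
Because `str::from_utf8` validates its input, calling `unwrap()` is only safe when the bytes are known to be valid. The sketch below, assuming a current stable Rust toolchain, handles the error case instead. The function name and the placeholder text are arbitrary choices for this example.

```rust
use std::str;

/// Hypothetical helper: decodes `bytes` as UTF-8 and falls back to a
/// fixed placeholder when the bytes are not valid UTF-8.
fn decode_or_placeholder(bytes: &[u8]) -> &str {
    match str::from_utf8(bytes) {
        Ok(s) => s,
        Err(_) => "<invalid utf-8>",
    }
}

fn main() {
    // A byte array that lives on the stack.
    let good = [b'a', b'b'];
    println!("{}", decode_or_placeholder(&good));

    // 0xC3 starts a two-byte sequence, so on its own it is incomplete.
    let bad = [0xC3u8];
    println!("{}", decode_or_placeholder(&bad));
}
```
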
# Best Practices

## `String` vs. `&str`

In general, you should prefer `String` when you need ownership, and `&str` when
you just need to borrow a string. This is very similar to using `Vec<T>` vs. `&[T]`,
and `T` vs. `&T` in general.

This means starting off with this:

```{rust,ignore}
fn foo(s: &str) {
```

and only moving to this:

```{rust,ignore}
fn foo(s: String) {
```

if you have a good reason. It's not polite to hold on to ownership you don't
need, and it can make your lifetimes more complex.

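
As an illustration of that guideline, the sketch below contrasts a function that only reads a string with one that has a reason to own it. It assumes a current stable Rust toolchain, and the names `shout`, `Greeting`, and `make_greeting` are invented for this example.

```rust
/// Only reads the string, so borrowing is enough.
fn shout(s: &str) -> String {
    s.to_uppercase()
}

/// Stores the string, which is a good reason to take ownership.
struct Greeting {
    text: String,
}

fn make_greeting(text: String) -> Greeting {
    Greeting { text }
}

fn main() {
    let name = String::from("hello");

    // `name` is only borrowed here, so it can still be used afterwards.
    let loud = shout(&name);
    println!("{} -> {}", name, loud);

    // `name` is moved into the struct and cannot be used after this line.
    let greeting = make_greeting(name);
    println!("{}", greeting.text);
}
```
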
## Generic functions

To write a function that's generic over types of strings, use `&str`.

```{rust}
fn some_string_length(x: &str) -> usize {
    x.len()
}

fn main() {
    let s = "Hello, world";

    println!("{}", some_string_length(s));

    let s = "Hello, world".to_string();

    println!("{}", some_string_length(s.as_str()));
}
```

Both of these lines will print `12`.

## Comparisons

To compare a `String` to a constant string, prefer `as_str()`...

```{rust}
fn compare(x: String) {
    if x.as_str() == "Hello" {
        println!("yes");
    }
}
```

... over `to_string()`:

```{rust}
fn compare(x: String) {
    if x == "Hello".to_string() {
        println!("yes");
    }
}
```

Converting a `String` to a `&str` is cheap, but converting the `&str` to a
`String` involves an allocation.

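
In current stable Rust, `String` also implements `PartialEq<&str>`, so the comparison can be written without any explicit conversion. The sketch below shows that form next to the allocating one; the function names are made up for this example.

```rust
/// Compares against a literal without allocating.
fn is_hello(x: String) -> bool {
    // `String == &str` compares the bytes in place.
    x == "Hello"
}

/// Builds a temporary `String` from the literal just to compare it.
fn is_hello_allocating(x: String) -> bool {
    x == "Hello".to_string()
}

fn main() {
    println!("{}", is_hello("Hello".to_string()));
    println!("{}", is_hello_allocating("Hello".to_string()));
}
```

Both functions return the same answer; the first simply avoids the extra heap allocation.
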
## Indexing strings

You may be tempted to try to access a certain character of a `String`, like
this:

```{rust,ignore}
let s = "hello".to_string();

println!("{}", s[0]);
```

This does not compile, and that is on purpose. In the world of UTF-8, direct
indexing is basically never what you want to do. The reason is that each
character can take a variable number of bytes, so finding the nth character
means iterating through the characters before it anyway, which is an O(n)
operation.

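
If you really do need the nth code point, you can ask for it explicitly, which makes the linear walk visible in the code. This is a minimal sketch, assuming a current stable Rust toolchain; `nth_char` and `byte_offset_of_char` are hypothetical helper names.

```rust
/// Returns the `n`th `char` of `s` by walking the string from the start.
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

/// Returns the byte offset at which the `n`th `char` starts.
fn byte_offset_of_char(s: &str, n: usize) -> Option<usize> {
    s.char_indices().nth(n).map(|(offset, _)| offset)
}

fn main() {
    // U+00E9 takes two bytes, so char positions and byte offsets diverge.
    let s = "h\u{e9}llo";

    println!("{:?}", nth_char(s, 2));
    println!("{:?}", byte_offset_of_char(s, 2));
    println!("{} bytes, {} chars", s.len(), s.chars().count());
}
```
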
There are three basic levels of Unicode (and its encodings):

- code units, the underlying data type used to store everything
- code points/Unicode scalar values (`char`)
- graphemes (visible characters)

Rust provides iterators for each of these situations:

- `.bytes()` will iterate over the underlying bytes
- `.chars()` will iterate over the code points
- `.graphemes()` will iterate over each grapheme

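
The first two levels can be counted with the standard library alone. The sketch below assumes a current stable Rust toolchain and uses a made-up helper, `unit_counts`. It also counts UTF-16 code units to show that the code unit depends on the encoding. Grapheme counting is left out because it needs a Unicode segmentation library.

```rust
/// Hypothetical helper: returns
/// (UTF-8 code units, UTF-16 code units, code points) for `s`.
fn unit_counts(s: &str) -> (usize, usize, usize) {
    (
        s.bytes().count(),
        s.encode_utf16().count(),
        s.chars().count(),
    )
}

fn main() {
    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // a single visible character made of two code points.
    let accented = "e\u{301}";
    println!("{:?}", unit_counts(accented));

    // U+1F600 is one code point outside the Basic Multilingual Plane.
    let emoji = "\u{1F600}";
    println!("{:?}", unit_counts(emoji));
}
```
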
Usually, the `graphemes()` method on `&str` is what you want. In current Rust
it is provided by the `UnicodeSegmentation` trait from the external
`unicode-segmentation` crate rather than by the standard library:

```{rust}
// Needed in current Rust, where `graphemes()` lives in an external crate.
use unicode_segmentation::UnicodeSegmentation;

let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé";

for l in s.graphemes(true) {
    println!("{}", l);
}
```

This prints:

```{text}
u͔
n͈̰̎
i̙̮͚̦
c͚̉
o̼̩̰͗
d͔̆̓ͥ
é
```

Note that `l` has the type `&str` here, since a single grapheme can consist of
multiple code points, so a `char` wouldn't be appropriate.

Each visible character is printed in turn, as you'd expect: first "u͔", then
"n͈̰̎", etc. If you want each individual code point of each grapheme, you can use `.chars()`:

```{rust}
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé";

for l in s.chars() {
    println!("{}", l);
}
```

This prints:

```{text}
u
͔
n
̎
͈
̰
i
̙
̮
͚
̦
c
̉
͚
o
͗
̼
̩
̰
d
̆
̓
ͥ
͔
e
́
```

You can see how some of them are combining characters, and therefore the output
looks a bit odd.

If you want the individual byte representation of each code point, you can use
`.bytes()`:

```{rust}
let s = "u͔n͈̰̎i̙̮͚̦c͚̉o̼̩̰͗d͔̆̓ͥé";

for l in s.bytes() {
    println!("{}", l);
}
```

This will print:

```{text}
117
205
148
110
204
142
205
136
204
176
105
204
153
204
174
205
154
204
166
99
204
137
205
154
111
205
151
204
188
204
169
204
176
100
204
134
205
131
205
165
205
148
101
204
129
```

Many more bytes than graphemes!

# Other Documentation

* [the `&str` API documentation](std/str/index.html)
* [the `String` API documentation](std/string/index.html)