Auto merge of #89582 - jkugelman:optimize-file-read-to-end, r=joshtriplett

Optimize File::read_to_end and read_to_string

Reading a file into an empty vector or string buffer can incur unnecessary `read` syscalls and memory re-allocations as the buffer "warms up" and grows to its final size. This is perhaps a necessary evil with generic readers, but files can be read in smarter by checking the file size and reserving that much capacity.

`std::fs::read` and `std::fs::read_to_string` already perform this optimization: they open the file, reads its metadata, and call `with_capacity` with the file size. This ensures that the buffer does not need to be resized and an initial string of small `read` syscalls.

However, if a user opens the `File` themselves and calls `file.read_to_end` or `file.read_to_string` they do not get this optimization.

```rust
let mut buf = Vec::new();
file.read_to_end(&mut buf)?;
```

I searched through this project's codebase and even here are a *lot* of examples of this. They're found all over in unit tests, which isn't a big deal, but there are also several real instances in the compiler and in Cargo. I've documented the ones I found in a comment here:

https://github.com/rust-lang/rust/issues/89516#issuecomment-934423999

Most telling, the documentation for both the `Read` trait and the `Read::read_to_end` method both show this exact pattern as examples of how to use readers. What this says to me is that this shouldn't be solved by simply fixing the instances of it in this codebase. If it's here it's certain to be prevalent in the wider Rust ecosystem.

To that end, this commit adds specializations of `read_to_end` and `read_to_string` directly on `File`. This way it's no longer a minor footgun to start with an empty buffer when reading a file in.

A nice side effect of this change is that code that accesses a `File` as `impl Read` or `dyn Read` will benefit. For example, this code from `compiler/rustc_serialize/src/json.rs`:

```rust
pub fn from_reader(rdr: &mut dyn Read) -> Result<Json, BuilderError> {
    let mut contents = Vec::new();
    match rdr.read_to_end(&mut contents) {
```

Related changes:

- I also added specializations to `BufReader` to delegate to `self.inner`'s methods. That way it can call `File`'s optimized  implementations if the inner reader is a file.

- The private `std::io::append_to_string` function is now marked `unsafe`.

- `File::read_to_string` being more efficient means that the performance note for `io::read_to_string` can be softened. I've added `@camelid's` suggested wording from https://github.com/rust-lang/rust/issues/80218#issuecomment-936806502.

r? `@joshtriplett`
This commit is contained in:
bors 2021-10-09 05:24:47 +00:00
commit 910692de74
5 changed files with 146 additions and 46 deletions

View File

@ -198,19 +198,10 @@ pub struct DirBuilder {
recursive: bool, recursive: bool,
} }
/// Indicates how large a buffer to pre-allocate before reading the entire file.
fn initial_buffer_size(file: &File) -> usize {
// Don't worry about `usize` overflow because reading will fail regardless
// in that case.
file.metadata().map(|m| m.len() as usize).unwrap_or(0)
}
/// Read the entire contents of a file into a bytes vector. /// Read the entire contents of a file into a bytes vector.
/// ///
/// This is a convenience function for using [`File::open`] and [`read_to_end`] /// This is a convenience function for using [`File::open`] and [`read_to_end`]
/// with fewer imports and without an intermediate variable. It pre-allocates a /// with fewer imports and without an intermediate variable.
/// buffer based on the file size when available, so it is generally faster than
/// reading into a vector created with [`Vec::new()`].
/// ///
/// [`read_to_end`]: Read::read_to_end /// [`read_to_end`]: Read::read_to_end
/// ///
@ -237,7 +228,7 @@ fn initial_buffer_size(file: &File) -> usize {
pub fn read<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> { pub fn read<P: AsRef<Path>>(path: P) -> io::Result<Vec<u8>> {
fn inner(path: &Path) -> io::Result<Vec<u8>> { fn inner(path: &Path) -> io::Result<Vec<u8>> {
let mut file = File::open(path)?; let mut file = File::open(path)?;
let mut bytes = Vec::with_capacity(initial_buffer_size(&file)); let mut bytes = Vec::new();
file.read_to_end(&mut bytes)?; file.read_to_end(&mut bytes)?;
Ok(bytes) Ok(bytes)
} }
@ -247,9 +238,7 @@ fn inner(path: &Path) -> io::Result<Vec<u8>> {
/// Read the entire contents of a file into a string. /// Read the entire contents of a file into a string.
/// ///
/// This is a convenience function for using [`File::open`] and [`read_to_string`] /// This is a convenience function for using [`File::open`] and [`read_to_string`]
/// with fewer imports and without an intermediate variable. It pre-allocates a /// with fewer imports and without an intermediate variable.
/// buffer based on the file size when available, so it is generally faster than
/// reading into a string created with [`String::new()`].
/// ///
/// [`read_to_string`]: Read::read_to_string /// [`read_to_string`]: Read::read_to_string
/// ///
@ -278,7 +267,7 @@ fn inner(path: &Path) -> io::Result<Vec<u8>> {
pub fn read_to_string<P: AsRef<Path>>(path: P) -> io::Result<String> { pub fn read_to_string<P: AsRef<Path>>(path: P) -> io::Result<String> {
fn inner(path: &Path) -> io::Result<String> { fn inner(path: &Path) -> io::Result<String> {
let mut file = File::open(path)?; let mut file = File::open(path)?;
let mut string = String::with_capacity(initial_buffer_size(&file)); let mut string = String::new();
file.read_to_string(&mut string)?; file.read_to_string(&mut string)?;
Ok(string) Ok(string)
} }
@ -615,6 +604,15 @@ fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
} }
} }
/// Indicates how much extra capacity is needed to read the rest of the file.
fn buffer_capacity_required(mut file: &File) -> usize {
let size = file.metadata().map(|m| m.len()).unwrap_or(0);
let pos = file.stream_position().unwrap_or(0);
// Don't worry about `usize` overflow because reading will fail regardless
// in that case.
size.saturating_sub(pos) as usize
}
#[stable(feature = "rust1", since = "1.0.0")] #[stable(feature = "rust1", since = "1.0.0")]
impl Read for File { impl Read for File {
fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> { fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
@ -635,6 +633,18 @@ unsafe fn initializer(&self) -> Initializer {
// SAFETY: Read is guaranteed to work on uninitialized memory // SAFETY: Read is guaranteed to work on uninitialized memory
unsafe { Initializer::nop() } unsafe { Initializer::nop() }
} }
// Reserves space in the buffer based on the file size when available.
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
buf.reserve(buffer_capacity_required(self));
io::default_read_to_end(self, buf)
}
// Reserves space in the buffer based on the file size when available.
fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize> {
buf.reserve(buffer_capacity_required(self));
io::default_read_to_string(self, buf)
}
} }
#[stable(feature = "rust1", since = "1.0.0")] #[stable(feature = "rust1", since = "1.0.0")]
impl Write for File { impl Write for File {
@ -681,6 +691,18 @@ unsafe fn initializer(&self) -> Initializer {
// SAFETY: Read is guaranteed to work on uninitialized memory // SAFETY: Read is guaranteed to work on uninitialized memory
unsafe { Initializer::nop() } unsafe { Initializer::nop() }
} }
// Reserves space in the buffer based on the file size when available.
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
buf.reserve(buffer_capacity_required(self));
io::default_read_to_end(self, buf)
}
// Reserves space in the buffer based on the file size when available.
fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize> {
buf.reserve(buffer_capacity_required(self));
io::default_read_to_string(self, buf)
}
} }
#[stable(feature = "rust1", since = "1.0.0")] #[stable(feature = "rust1", since = "1.0.0")]
impl Write for &File { impl Write for &File {

View File

@ -308,6 +308,51 @@ fn is_read_vectored(&self) -> bool {
unsafe fn initializer(&self) -> Initializer { unsafe fn initializer(&self) -> Initializer {
self.inner.initializer() self.inner.initializer()
} }
// The inner reader might have an optimized `read_to_end`. Drain our buffer and then
// delegate to the inner implementation.
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> io::Result<usize> {
let nread = self.cap - self.pos;
buf.extend_from_slice(&self.buf[self.pos..self.cap]);
self.discard_buffer();
Ok(nread + self.inner.read_to_end(buf)?)
}
// The inner reader might have an optimized `read_to_end`. Drain our buffer and then
// delegate to the inner implementation.
fn read_to_string(&mut self, buf: &mut String) -> io::Result<usize> {
// In the general `else` case below we must read bytes into a side buffer, check
// that they are valid UTF-8, and then append them to `buf`. This requires a
// potentially large memcpy.
//
// If `buf` is empty--the most common case--we can leverage `append_to_string`
// to read directly into `buf`'s internal byte buffer, saving an allocation and
// a memcpy.
if buf.is_empty() {
// `append_to_string`'s safety relies on the buffer only being appended to since
// it only checks the UTF-8 validity of new data. If there were existing content in
// `buf` then an untrustworthy reader (i.e. `self.inner`) could not only append
// bytes but also modify existing bytes and render them invalid. On the other hand,
// if `buf` is empty then by definition any writes must be appends and
// `append_to_string` will validate all of the new bytes.
unsafe { crate::io::append_to_string(buf, |b| self.read_to_end(b)) }
} else {
// We cannot append our byte buffer directly onto the `buf` String as there could
// be an incomplete UTF-8 sequence that has only been partially read. We must read
// everything into a side buffer first and then call `from_utf8` on the complete
// buffer.
let mut bytes = Vec::new();
self.read_to_end(&mut bytes)?;
let string = crate::str::from_utf8(&bytes).map_err(|_| {
io::Error::new_const(
io::ErrorKind::InvalidData,
&"stream did not contain valid UTF-8",
)
})?;
*buf += string;
Ok(string.len())
}
}
} }
#[stable(feature = "rust1", since = "1.0.0")] #[stable(feature = "rust1", since = "1.0.0")]

View File

@ -243,6 +243,28 @@ fn seek(&mut self, _: SeekFrom) -> io::Result<u64> {
assert_eq!(reader.buffer().len(), 0); assert_eq!(reader.buffer().len(), 0);
} }
#[test]
fn test_buffered_reader_read_to_end_consumes_buffer() {
let data: &[u8] = &[0, 1, 2, 3, 4, 5, 6, 7];
let mut reader = BufReader::with_capacity(3, data);
let mut buf = Vec::new();
assert_eq!(reader.fill_buf().ok(), Some(&[0, 1, 2][..]));
assert_eq!(reader.read_to_end(&mut buf).ok(), Some(8));
assert_eq!(&buf, &[0, 1, 2, 3, 4, 5, 6, 7]);
assert!(reader.buffer().is_empty());
}
#[test]
fn test_buffered_reader_read_to_string_consumes_buffer() {
let data: &[u8] = "deadbeef".as_bytes();
let mut reader = BufReader::with_capacity(3, data);
let mut buf = String::new();
assert_eq!(reader.fill_buf().ok(), Some("dea".as_bytes()));
assert_eq!(reader.read_to_string(&mut buf).ok(), Some(8));
assert_eq!(&buf, "deadbeef");
assert!(reader.buffer().is_empty());
}
#[test] #[test]
fn test_buffered_writer() { fn test_buffered_writer() {
let inner = Vec::new(); let inner = Vec::new();

View File

@ -316,11 +316,12 @@ fn drop(&mut self) {
} }
} }
// A few methods below (read_to_string, read_line) will append data into a // Several `read_to_string` and `read_line` methods in the standard library will
// `String` buffer, but we need to be pretty careful when doing this. The // append data into a `String` buffer, but we need to be pretty careful when
// implementation will just call `.as_mut_vec()` and then delegate to a // doing this. The implementation will just call `.as_mut_vec()` and then
// byte-oriented reading method, but we must ensure that when returning we never // delegate to a byte-oriented reading method, but we must ensure that when
// leave `buf` in a state such that it contains invalid UTF-8 in its bounds. // returning we never leave `buf` in a state such that it contains invalid UTF-8
// in its bounds.
// //
// To this end, we use an RAII guard (to protect against panics) which updates // To this end, we use an RAII guard (to protect against panics) which updates
// the length of the string when it is dropped. This guard initially truncates // the length of the string when it is dropped. This guard initially truncates
@ -334,11 +335,10 @@ fn drop(&mut self) {
// 2. We're passing a raw buffer to the function `f`, and it is expected that // 2. We're passing a raw buffer to the function `f`, and it is expected that
// the function only *appends* bytes to the buffer. We'll get undefined // the function only *appends* bytes to the buffer. We'll get undefined
// behavior if existing bytes are overwritten to have non-UTF-8 data. // behavior if existing bytes are overwritten to have non-UTF-8 data.
fn append_to_string<F>(buf: &mut String, f: F) -> Result<usize> pub(crate) unsafe fn append_to_string<F>(buf: &mut String, f: F) -> Result<usize>
where where
F: FnOnce(&mut Vec<u8>) -> Result<usize>, F: FnOnce(&mut Vec<u8>) -> Result<usize>,
{ {
unsafe {
let mut g = Guard { len: buf.len(), buf: buf.as_mut_vec() }; let mut g = Guard { len: buf.len(), buf: buf.as_mut_vec() };
let ret = f(g.buf); let ret = f(g.buf);
if str::from_utf8(&g.buf[g.len..]).is_err() { if str::from_utf8(&g.buf[g.len..]).is_err() {
@ -349,7 +349,6 @@ fn append_to_string<F>(buf: &mut String, f: F) -> Result<usize>
g.len = g.buf.len(); g.len = g.buf.len();
ret ret
} }
}
} }
// This uses an adaptive system to extend the vector when it fills. We want to // This uses an adaptive system to extend the vector when it fills. We want to
@ -361,7 +360,7 @@ fn append_to_string<F>(buf: &mut String, f: F) -> Result<usize>
// //
// Because we're extending the buffer with uninitialized data for trusted // Because we're extending the buffer with uninitialized data for trusted
// readers, we need to make sure to truncate that if any of this panics. // readers, we need to make sure to truncate that if any of this panics.
fn read_to_end<R: Read + ?Sized>(r: &mut R, buf: &mut Vec<u8>) -> Result<usize> { pub(crate) fn default_read_to_end<R: Read + ?Sized>(r: &mut R, buf: &mut Vec<u8>) -> Result<usize> {
let start_len = buf.len(); let start_len = buf.len();
let start_cap = buf.capacity(); let start_cap = buf.capacity();
let mut g = Guard { len: buf.len(), buf }; let mut g = Guard { len: buf.len(), buf };
@ -426,6 +425,22 @@ fn read_to_end<R: Read + ?Sized>(r: &mut R, buf: &mut Vec<u8>) -> Result<usize>
} }
} }
pub(crate) fn default_read_to_string<R: Read + ?Sized>(
r: &mut R,
buf: &mut String,
) -> Result<usize> {
// Note that we do *not* call `r.read_to_end()` here. We are passing
// `&mut Vec<u8>` (the raw contents of `buf`) into the `read_to_end`
// method to fill it up. An arbitrary implementation could overwrite the
// entire contents of the vector, not just append to it (which is what
// we are expecting).
//
// To prevent extraneously checking the UTF-8-ness of the entire buffer
// we pass it to our hardcoded `default_read_to_end` implementation which
// we know is guaranteed to only read data into the end of the buffer.
unsafe { append_to_string(buf, |b| default_read_to_end(r, b)) }
}
pub(crate) fn default_read_vectored<F>(read: F, bufs: &mut [IoSliceMut<'_>]) -> Result<usize> pub(crate) fn default_read_vectored<F>(read: F, bufs: &mut [IoSliceMut<'_>]) -> Result<usize>
where where
F: FnOnce(&mut [u8]) -> Result<usize>, F: FnOnce(&mut [u8]) -> Result<usize>,
@ -716,7 +731,7 @@ unsafe fn initializer(&self) -> Initializer {
/// [`std::fs::read`]: crate::fs::read /// [`std::fs::read`]: crate::fs::read
#[stable(feature = "rust1", since = "1.0.0")] #[stable(feature = "rust1", since = "1.0.0")]
fn read_to_end(&mut self, buf: &mut Vec<u8>) -> Result<usize> { fn read_to_end(&mut self, buf: &mut Vec<u8>) -> Result<usize> {
read_to_end(self, buf) default_read_to_end(self, buf)
} }
/// Read all bytes until EOF in this source, appending them to `buf`. /// Read all bytes until EOF in this source, appending them to `buf`.
@ -759,16 +774,7 @@ fn read_to_end(&mut self, buf: &mut Vec<u8>) -> Result<usize> {
/// [`std::fs::read_to_string`]: crate::fs::read_to_string /// [`std::fs::read_to_string`]: crate::fs::read_to_string
#[stable(feature = "rust1", since = "1.0.0")] #[stable(feature = "rust1", since = "1.0.0")]
fn read_to_string(&mut self, buf: &mut String) -> Result<usize> { fn read_to_string(&mut self, buf: &mut String) -> Result<usize> {
// Note that we do *not* call `.read_to_end()` here. We are passing default_read_to_string(self, buf)
// `&mut Vec<u8>` (the raw contents of `buf`) into the `read_to_end`
// method to fill it up. An arbitrary implementation could overwrite the
// entire contents of the vector, not just append to it (which is what
// we are expecting).
//
// To prevent extraneously checking the UTF-8-ness of the entire buffer
// we pass it to our hardcoded `read_to_end` implementation which we
// know is guaranteed to only read data into the end of the buffer.
append_to_string(buf, |b| read_to_end(self, b))
} }
/// Read the exact number of bytes required to fill `buf`. /// Read the exact number of bytes required to fill `buf`.
@ -1005,6 +1011,11 @@ fn take(self, limit: u64) -> Take<Self>
/// need more control over performance, and in those cases you should definitely use /// need more control over performance, and in those cases you should definitely use
/// [`Read::read_to_string`] directly. /// [`Read::read_to_string`] directly.
/// ///
/// Note that in some special cases, such as when reading files, this function will
/// pre-allocate memory based on the size of the input it is reading. In those
/// cases, the performance should be as good as if you had used
/// [`Read::read_to_string`] with a manually pre-allocated buffer.
///
/// # Errors /// # Errors
/// ///
/// This function forces you to handle errors because the output (the `String`) /// This function forces you to handle errors because the output (the `String`)
@ -2201,7 +2212,7 @@ fn read_line(&mut self, buf: &mut String) -> Result<usize> {
// Note that we are not calling the `.read_until` method here, but // Note that we are not calling the `.read_until` method here, but
// rather our hardcoded implementation. For more details as to why, see // rather our hardcoded implementation. For more details as to why, see
// the comments in `read_to_end`. // the comments in `read_to_end`.
append_to_string(buf, |b| read_until(self, b'\n', b)) unsafe { append_to_string(buf, |b| read_until(self, b'\n', b)) }
} }
/// Returns an iterator over the contents of this reader split on the byte /// Returns an iterator over the contents of this reader split on the byte

View File

@ -290,7 +290,7 @@ fn bench_read_to_end(b: &mut test::Bencher) {
b.iter(|| { b.iter(|| {
let mut lr = repeat(1).take(10000000); let mut lr = repeat(1).take(10000000);
let mut vec = Vec::with_capacity(1024); let mut vec = Vec::with_capacity(1024);
super::read_to_end(&mut lr, &mut vec) super::default_read_to_end(&mut lr, &mut vec)
}); });
} }