Add benchmark and fast path for BufReader::read_exact
At work, we have a wrapper type that implements this optimization. It would be nice if the standard library were fast by default, so that such wrappers were unnecessary.
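
For readers unfamiliar with the idea, here is a minimal sketch of what such a wrapper might look like. It assumes the fast path means "if the internal buffer already holds enough bytes, copy them out directly and skip the generic read loop". The type name `FastExactReader` is hypothetical and chosen for illustration only; the wrapper we use and the change in this PR may differ in detail.

```rust
use std::io::{self, BufRead, BufReader, Read};

/// Hypothetical wrapper type (illustrative name, not part of std or this PR).
pub struct FastExactReader<R> {
    inner: BufReader<R>,
}

impl<R: Read> FastExactReader<R> {
    pub fn new(inner: R) -> Self {
        Self { inner: BufReader::new(inner) }
    }
}

impl<R: Read> Read for FastExactReader<R> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        self.inner.read(buf)
    }

    fn read_exact(&mut self, buf: &mut [u8]) -> io::Result<()> {
        let n = buf.len();
        // Assumed fast path: the whole request is already buffered,
        // so a single slice copy satisfies it.
        let available = self.inner.buffer();
        if available.len() >= n {
            buf.copy_from_slice(&available[..n]);
            self.inner.consume(n);
            return Ok(());
        }
        // Slow path: fall back to the regular implementation, which
        // refills the buffer and loops as needed.
        self.inner.read_exact(buf)
    }
}
```

The fast path only helps when reads are small relative to the buffer capacity, which is the case the benchmark below exercises.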
Before:
```
test io::buffered::tests::bench_buffered_reader_small_reads ... bench: 7,670 ns/iter (+/- 45)
```
After (about 42% less time per iteration):
```
test io::buffered::tests::bench_buffered_reader_small_reads ... bench: 4,457 ns/iter (+/- 41)
```