move tracked-pointer checking on item pop into a cold helper function
Before:
```
Benchmark 1: cargo miri run --manifest-path bench-cargo-miri/serde1/Cargo.toml
Time (mean ± σ): 6.729 s ± 0.050 s [User: 6.608 s, System: 0.124 s]
Range (min … max): 6.665 s … 6.799 s 5 runs
Benchmark 2: cargo miri run --manifest-path bench-cargo-miri/unicode/Cargo.toml
Time (mean ± σ): 20.923 s ± 0.271 s [User: 20.386 s, System: 0.537 s]
Range (min … max): 20.580 s … 21.165 s 5 runs
```
After:
```
Benchmark 1: cargo miri run --manifest-path bench-cargo-miri/serde1/Cargo.toml
Time (mean ± σ): 6.562 s ± 0.023 s [User: 6.430 s, System: 0.135 s]
Range (min … max): 6.544 s … 6.594 s 5 runs
Benchmark 2: cargo miri run --manifest-path bench-cargo-miri/unicode/Cargo.toml
Time (mean ± σ): 20.375 s ± 0.228 s [User: 19.964 s, System: 0.413 s]
Range (min … max): 20.201 s … 20.736 s 5 runs
```
Nothing major, but we'll take it I guess. 🤷
Fixes https://github.com/rust-lang/miri/issues/2132
Optimizing Stacked Borrows (part 2): Shrink Item
This moves protectors out of `Item`, storing them in two places: a global `HashSet` containing all currently-protected tags, and a `Vec<SbTag>` on each `Frame` so that when we return from a function we know which tags to remove from the protected set.
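Roughly, the bookkeeping looks like this; a minimal sketch with illustrative names, not Miri's actual types:
```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct SbTag(u64);

/// Global (per-machine) set of all currently-protected tags.
#[derive(Default)]
struct GlobalState {
    protected_tags: HashSet<SbTag>,
}

/// Per-frame extra data: the tags protected by this call.
#[derive(Default)]
struct FrameExtra {
    protected_tags: Vec<SbTag>,
}

impl GlobalState {
    /// A call starts protecting `tag`: record it in both places.
    fn add_protector(&mut self, frame: &mut FrameExtra, tag: SbTag) {
        self.protected_tags.insert(tag);
        frame.protected_tags.push(tag);
    }

    /// The call returned: remove all of its protectors from the global set.
    fn end_call(&mut self, frame: FrameExtra) {
        for tag in frame.protected_tags {
            self.protected_tags.remove(&tag);
        }
    }

    /// Popping an item now needs only a set lookup, not a walk over frames.
    fn is_protected(&self, tag: SbTag) -> bool {
        self.protected_tags.contains(&tag)
    }
}

fn main() {
    let mut global = GlobalState::default();
    let mut frame = FrameExtra::default();
    global.add_protector(&mut frame, SbTag(1));
    assert!(global.is_protected(SbTag(1)));
    global.end_call(frame);
    assert!(!global.is_protected(SbTag(1)));
}
```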
This also bit-packs the 64-bit tag and the 2-bit permission together when they are stored in memory. This means we theoretically run out of tags sooner, but I doubt that limit will ever be hit.
Together these optimizations reduce the memory footprint of Miri by ~66% when executing programs that stress Stacked Borrows. For example, a test that does nothing but panic, run with isolation off, currently peaks at ~19 GB; with this PR it peaks at ~6.2 GB.
To-do
- [x] Enforce the 62-bit limit
- [x] Decide if there is a better order to pack the tag and permission in
- [x] Wait for `UnsafeCell` to become infectious, or express offsets + tags in the global protector set
Benchmarks before:
```
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/backtraces/Cargo.toml
Time (mean ± σ): 8.948 s ± 0.253 s [User: 8.752 s, System: 0.158 s]
Range (min … max): 8.619 s … 9.279 s 5 runs
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/mse/Cargo.toml
Time (mean ± σ): 2.129 s ± 0.037 s [User: 1.849 s, System: 0.248 s]
Range (min … max): 2.086 s … 2.176 s 5 runs
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/serde1/Cargo.toml
Time (mean ± σ): 3.334 s ± 0.017 s [User: 3.211 s, System: 0.103 s]
Range (min … max): 3.315 s … 3.352 s 5 runs
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/serde2/Cargo.toml
Time (mean ± σ): 3.316 s ± 0.038 s [User: 3.207 s, System: 0.095 s]
Range (min … max): 3.282 s … 3.375 s 5 runs
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/unicode/Cargo.toml
Time (mean ± σ): 6.391 s ± 0.323 s [User: 5.928 s, System: 0.412 s]
Range (min … max): 6.090 s … 6.917 s 5 runs
```
After:
```
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/backtraces/Cargo.toml
Time (mean ± σ): 6.955 s ± 0.051 s [User: 6.807 s, System: 0.132 s]
Range (min … max): 6.900 s … 7.038 s 5 runs
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/mse/Cargo.toml
Time (mean ± σ): 1.784 s ± 0.012 s [User: 1.627 s, System: 0.156 s]
Range (min … max): 1.772 s … 1.797 s 5 runs
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/serde1/Cargo.toml
Time (mean ± σ): 2.505 s ± 0.095 s [User: 2.311 s, System: 0.096 s]
Range (min … max): 2.405 s … 2.603 s 5 runs
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/serde2/Cargo.toml
Time (mean ± σ): 2.449 s ± 0.031 s [User: 2.306 s, System: 0.100 s]
Range (min … max): 2.395 s … 2.467 s 5 runs
Benchmark 1: cargo +miri miri run --manifest-path bench-cargo-miri/unicode/Cargo.toml
Time (mean ± σ): 3.667 s ± 0.110 s [User: 3.498 s, System: 0.140 s]
Range (min … max): 3.564 s … 3.814 s 5 runs
```
The decrease in system time is probably due to spending less time in the page fault handler.
stacked_borrows now has an item module and its own FrameExtra. The
item module shields the primary logic of Stacked Borrows from the
implementation of Item (which is a bunch of bit-packing tricks), and
the dedicated FrameExtra separates Stacked Borrows more cleanly from
the interpreter itself.
The new strategy for checking protectors also makes some subtle
performance tradeoffs, so they are now documented in Stack::item_popped,
because that function is the primary beneficiary of them and it touches
every aspect of them.
Separating the actual CallId that is protecting a Tag from the Tag
also makes it inconvenient to reproduce exactly the same protector
errors, so this takes the opportunity to use some slightly cleaner
English in those errors. We need to make some change; might as well
make it good.
Previously, Item was a struct of a NonZeroU64, an Option which was
usually unset or irrelevant, and a 4-variant enum. So collectively, the
size of an Item was 24 bytes, but only 8 bytes were used for the most
part.
So this takes advantage of the fact that it is probably impossible to
exhaust the total space of SbTags, and steals 3 bits from the tag to
pack the whole struct into a single u64. This bit-packing reduces peak
memory usage by ~3x when Miri becomes memory-bound. We also get CPU
performance improvements of varying size, because not only are we
simply accessing less memory, we can now compare two Vec<Item>s with a
memcmp, because Item has no padding.
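The packing itself is just a few masks and shifts. Here is a minimal sketch of the idea; the exact layout (61-bit tag, 2-bit permission, 1-bit protector flag) is an assumption for illustration, not necessarily the layout Miri settled on:
```rust
/// 4-variant permission, fits in 2 bits.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
enum Permission {
    Unique = 0,
    SharedReadWrite = 1,
    SharedReadOnly = 2,
    Disabled = 3,
}

/// Tag (61 bits), permission (2 bits), and protector flag (1 bit) packed
/// into one u64. With no padding, comparing two Vec<Item>s can lower to
/// a memcmp.
#[derive(Clone, Copy, PartialEq, Eq)]
struct Item(u64);

const TAG_BITS: u32 = 61;
const TAG_MASK: u64 = (1 << TAG_BITS) - 1;
const PERM_SHIFT: u32 = TAG_BITS;
const PROT_SHIFT: u32 = 63;

impl Item {
    fn new(tag: u64, perm: Permission, protected: bool) -> Item {
        // This is where the tag-space limit gets enforced.
        assert!(tag <= TAG_MASK, "tag space exhausted");
        Item(tag | ((perm as u64) << PERM_SHIFT) | ((protected as u64) << PROT_SHIFT))
    }

    fn tag(self) -> u64 {
        self.0 & TAG_MASK
    }

    fn perm(self) -> Permission {
        match (self.0 >> PERM_SHIFT) & 0b11 {
            0 => Permission::Unique,
            1 => Permission::SharedReadWrite,
            2 => Permission::SharedReadOnly,
            _ => Permission::Disabled,
        }
    }

    fn protected(self) -> bool {
        (self.0 >> PROT_SHIFT) == 1
    }
}

fn main() {
    let item = Item::new(42, Permission::SharedReadOnly, true);
    assert_eq!(item.tag(), 42);
    assert_eq!(item.perm(), Permission::SharedReadOnly);
    assert!(item.protected());
}
```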
handle Box with allocators
This is the Miri side of https://github.com/rust-lang/rust/pull/98847.
Thanks `@DrMeepster` for doing most of the work of getting this test case to pass in Miri. :)
Optimizing Stacked Borrows (part 1?): Cache locations of Tags in a Borrow Stack
Before this PR, a profile of Miri under almost any workload points quite squarely at these regions of code as being incredibly hot (each being ~40% of cycles):
dadcbebfbd/src/stacked_borrows.rs (L259-L269)
dadcbebfbd/src/stacked_borrows.rs (L362-L369)
This code is one of at least three reasons that stacked borrows analysis is super-linear: These are both linear in the number of borrows in the stack and they are positioned along the most commonly-taken paths.
I'm addressing the first loop (which is in `Stack::find_granting`) by adding a very very simple sort of LRU cache implemented on a `VecDeque`, which maps recently-looked-up tags to their position in the stack. For `Untagged` access we fall back to the same sort of linear search. But as far as I can tell there are never enough `Untagged` items to be significant.
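The cache is deliberately tiny. Here is a minimal sketch of the idea, with an illustrative capacity and names that are not Miri's actual ones:
```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, PartialEq, Eq)]
struct SbTag(u64);

/// Maps recently looked-up tags to their position in the borrow stack.
/// The most recently used entry sits at the front.
struct FindCache {
    entries: VecDeque<(SbTag, usize)>,
}

impl FindCache {
    const CAPACITY: usize = 32;

    fn new() -> Self {
        FindCache { entries: VecDeque::with_capacity(Self::CAPACITY) }
    }

    /// On a hit, move the entry to the front so hot tags stay cheap.
    fn lookup(&mut self, tag: SbTag) -> Option<usize> {
        let idx = self.entries.iter().position(|&(t, _)| t == tag)?;
        let entry = self.entries.remove(idx).unwrap();
        self.entries.push_front(entry);
        Some(entry.1)
    }

    /// Record the result of a full linear search through the stack.
    fn insert(&mut self, tag: SbTag, pos: usize) {
        if self.entries.len() == Self::CAPACITY {
            self.entries.pop_back();
        }
        self.entries.push_front((tag, pos));
    }
}

fn main() {
    let mut cache = FindCache::new();
    cache.insert(SbTag(7), 3);
    assert_eq!(cache.lookup(SbTag(7)), Some(3));
    assert_eq!(cache.lookup(SbTag(8)), None);
}
```
One caveat the sketch glosses over: cached positions go stale whenever the stack is mutated, so a real implementation must adjust or clear the cache on insertions and pops.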
I'm addressing the second loop by keeping track of the region of stack where there could be items granting `Permission::Unique`. This optimization is incredibly effective because `Read` access tends to dominate and many trips through this code path now skip the loop entirely.
These optimizations result in pretty enormous improvements:
Without raw pointer tagging, `mse` 34.5s -> 2.4s, `serde1` 5.6s -> 3.6s
With raw pointer tagging, `mse` 35.3s -> 2.4s, `serde1` 5.7s -> 3.6s
And there is hardly any impact on memory usage:
Memory usage on `mse` 844 MB -> 848 MB, `serde1` 184 MB -> 184 MB (jitter on these is a few MB).
Support (stat/fstat/lstat)64 on macOS
"In order to accommodate advanced capabilities of newer file systems,
the struct stat, struct statfs, and struct dirent data structures
were updated in Mac OSX 10.5."
"TRANSITIONAL DESCRIPTION (NOW DEPRECATED)
The fstat64, lstat64 and stat64 routines are equivalent to their
corresponding non-64-suffixed routine, when 64-bit inodes are in
effect. They were added before there was support for the symbol
variants, and so are now deprecated. Instead of using these, set
the _DARWIN_USE_64_BIT_INODE macro before including header files to
force 64-bit inode support. The stat64 structure used by these deprecated routines is the same
as the stat structure when 64-bit inodes are in effect (see above)."
"HISTORY
An lstat() function call appeared in 4.2BSD. The stat64(),
fstat64(), and lstat64() system calls first appeared in Mac OS X
10.5 (Leopard) and are now deprecated in favor of the corresponding
symbol variants. The fstatat() system call appeared in OS X 10.10"
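In other words, the 64-suffixed routines can be treated as plain aliases of their non-suffixed counterparts. A minimal standalone sketch of that aliasing idea follows; the function name and dispatch shape are illustrative, not Miri's actual shim code:
```rust
/// Map each 64-suffixed macOS stat routine onto its plain counterpart,
/// since the two are equivalent once 64-bit inodes are in effect.
fn resolve_stat_shim(link_name: &str) -> Option<&'static str> {
    Some(match link_name {
        "stat" | "stat64" => "stat",
        "fstat" | "fstat64" => "fstat",
        "lstat" | "lstat64" => "lstat",
        _ => return None,
    })
}

fn main() {
    assert_eq!(resolve_stat_shim("lstat64"), Some("lstat"));
    assert_eq!(resolve_stat_shim("open"), None);
}
```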
"In order to accommodate advanced capabilities of newer file systems,
the struct stat, struct statfs, and struct dirent data structures
were updated in Mac OSX 10.5."
"TRANSITIONAL DESCRIPTION (NOW DEPRECATED)
The fstat64, lstat64 and stat64 routines are equivalent to their
corresponding non-64-suffixed routine, when 64-bit inodes are in
effect. They were added before there was support for the symbol
variants, and so are now deprecated. Instead of using these, set
the _DARWIN_USE_64_BIT_INODE macro before including header files to
force 64-bit inode support.
The stat64 structure used by these deprecated routines is the same
as the stat structure when 64-bit inodes are in effect (see above)."
"HISTORY
An lstat() function call appeared in 4.2BSD. The stat64(),
fstat64(), and lstat64() system calls first appeared in Mac OS X
10.5 (Leopard) and are now deprecated in favor of the corresponding
symbol variants. The fstatat() system call appeared in OS X 10.10"
This adds a very simple LRU-like cache which stores the locations of
often-used tags. While the implementation is very simple, the cache hit
rate is incredible at ~99.9% on most programs, and often the element at
position 0 in the cache has a hit rate of 90%. So the sub-optimality of
this cache basically vanishes into the noise in a profile.
Additionally, we keep a range which denotes where there might be an item
granting Unique permission in the stack, so that when we invalidate
Uniques we do not need to scan much of the stack, and often scan nothing
at all.
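A minimal sketch of that range bookkeeping on the read-access path; the names and exact invariant handling are illustrative, not Miri's actual code:
```rust
use std::ops::Range;

#[derive(Clone, Copy, PartialEq, Eq)]
enum Permission {
    Unique,
    SharedReadWrite,
    SharedReadOnly,
    Disabled,
}

struct Item {
    perm: Permission,
}

struct Stack {
    items: Vec<Item>,
    /// Invariant: no item outside this range grants Unique.
    unique_range: Range<usize>,
}

impl Stack {
    /// On a read access, all Unique items above the granting item must
    /// be disabled. Reads dominate, so the common case is that the
    /// intersection below is empty and the loop never runs.
    fn disable_uniques_above(&mut self, granting: usize) {
        let start = self.unique_range.start.max(granting + 1);
        let end = self.unique_range.end.min(self.items.len());
        if start < end {
            for item in &mut self.items[start..end] {
                if item.perm == Permission::Unique {
                    item.perm = Permission::Disabled;
                }
            }
        }
        // Everything above `granting` is now Unique-free, so the range
        // can shrink accordingly.
        self.unique_range.end = self.unique_range.end.min(granting + 1);
    }
}

fn main() {
    let mut stack = Stack {
        items: vec![
            Item { perm: Permission::SharedReadOnly },
            Item { perm: Permission::Unique },
        ],
        unique_range: 1..2,
    };
    stack.disable_uniques_above(0);
    assert!(stack.items[1].perm == Permission::Disabled);
    assert!(stack.unique_range.is_empty());
}
```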