Simplify a few functions in rustc_data_structures
- drop `try!()` where it's superfluous
- change `try!()` to `?`
- squash a `push` with `push_str`
- refactor a push loop into an iterator
Rollup of bare_trait_objects PRs
All deny attributes were moved into bootstrap so they can be disabled with a line of config.
Warnings for external tools are allowed and it's up to the tool's maintainer to keep it warnings free.
r? @Mark-Simulacrum
cc @ljedrz @kennytm
Speed up `SparseBitMatrix` use in `RegionValues`.
In practice, these matrices range from 10% to 90%+ full once they are
filled in, so the dense representation is better.
This reduces the runtime of Check Nll builds of `inflate` by 32%, and
several other benchmarks by 1--5%.
It also increases max-rss of `clap-rs` by 30% and a couple of others by
up to 5%, while decreasing max-rss of `coercions` by 14%. I think the
speed-ups justify the max-rss increases.
r? @nikomatsakis
Using a `BTreeMap` to represent rows in the bit matrix is really slow.
This patch changes things so that each row is represented by a
`BitVector`. This is a less sparse representation, but a much faster
one.
As a result, `SparseBitSet` and `SparseChunk` can be removed.
Other minor changes in this patch.
- It renames `BitVector::insert()` as `merge()`, which matches the
terminology in the other classes in bitvec.rs.
- It removes `SparseBitMatrix::is_subset()`, which is unused.
- It reinstates `RegionValueElements::num_elements()`, which #52190 had
removed.
- It removes a low-value `debug!` call in `SparseBitMatrix::add()`.
Avoid most allocations in `Canonicalizer`.
Extra allocations are a significant cost of NLL, and the most common
ones come from within `Canonicalizer`. In particular, `canonical_var()`
contains this code:
indices
.entry(kind)
.or_insert_with(|| {
let cvar1 = variables.push(info);
let cvar2 = var_values.push(kind);
assert_eq!(cvar1, cvar2);
cvar1
})
.clone()
`variables` and `var_values` are `Vec`s. `indices` is a `HashMap` used
to track what elements have been inserted into `var_values`. If `kind`
hasn't been seen before, `indices`, `variables` and `var_values` all get
a new element. (The number of elements in each container is always the
same.) This results in lots of allocations.
In practice, most of the time these containers only end up holding a few
elements. This PR changes them to avoid heap allocations in the common
case, by changing the `Vec`s to `SmallVec`s and only using `indices`
once enough elements are present. (When the number of elements is small,
a direct linear search of `var_values` is as good or better than a
hashmap lookup.)
The changes to `variables` are straightforward and contained within
`Canonicalizer`. The changes to `indices` are more complex but also
contained within `Canonicalizer`. The changes to `var_values` are more
intrusive because they require defining a new type
`SmallCanonicalVarValues` -- which is to `CanonicalVarValues` as
`SmallVec` is to `Vec -- and passing stack-allocated values of that type
in from outside.
All this speeds up a number of NLL "check" builds, the best by 2%.
r? @nikomatsakis
`BitSlice` fixes
`propagate_bits_into_entry_set_for` and `BitSlice::bitwise` are hot for some benchmarks under NLL. I tried and failed to speed them up. (Increasing the size of `bit_slice::Word` from `usize` to `u128` caused a slowdown, even though decreasing the size of `bitvec::Word` from `u128` to `u64` also caused a slowdown. Weird.)
Anyway, along the way I fixed up several problems in and around the `BitSlice` code.
r? @nikomatsakis
Extra allocations are a significant cost of NLL, and the most common
ones come from within `Canonicalizer`. In particular, `canonical_var()`
contains this code:
indices
.entry(kind)
.or_insert_with(|| {
let cvar1 = variables.push(info);
let cvar2 = var_values.push(kind);
assert_eq!(cvar1, cvar2);
cvar1
})
.clone()
`variables` and `var_values` are `Vec`s. `indices` is a `HashMap` used
to track what elements have been inserted into `var_values`. If `kind`
hasn't been seen before, `indices`, `variables` and `var_values` all get
a new element. (The number of elements in each container is always the
same.) This results in lots of allocations.
In practice, most of the time these containers only end up holding a few
elements. This PR changes them to avoid heap allocations in the common
case, by changing the `Vec`s to `SmallVec`s and only using `indices`
once enough elements are present. (When the number of elements is small,
a direct linear search of `var_values` is as good or better than a
hashmap lookup.)
The changes to `variables` are straightforward and contained within
`Canonicalizer`. The changes to `indices` are more complex but also
contained within `Canonicalizer`. The changes to `var_values` are more
intrusive because they require defining a new type
`SmallCanonicalVarValues` -- which is to `CanonicalVarValues` as
`SmallVec` is to `Vec -- and passing stack-allocated values of that type
in from outside.
All this speeds up a number of NLL "check" builds, the best by 2%.
nll experiment: compute SCCs instead of iterative region solving
This is an attempt to speed up region solving by replacing the current iterative dataflow with a SCC computation. The idea is to detect cycles (SCCs) amongst region constraints and then compute just one value per cycle. The graph with all cycles removed is of course a DAG, so we can then solve constraints "bottom up" once the liveness values are known.
I kinda ran out of time this morning so the last commit is a bit sloppy but I wanted to get this posted, let travis run on it, and maybe do a perf run, before I clean it up.
The strategy is this:
- we compute SCCs once all outlives constraints are known
- we allocate a set of values **per region** for storing liveness
- we allocate a set of values **per SCC** for storing the final values
- when we add a liveness constraint to the region R, we also add it
to the final value of the SCC to which R belongs
- then we can apply the constraints by just walking the DAG for the
SCCs and union'ing the children (which have their liveness
constraints within)
There are a few intermediate refactorings that I really ought to have
broken out into their own commits:
- reverse the constraint graph so that `R1: R2` means `R1 -> R2` and
not `R2 -> R1`. This fits better with the SCC computation and new
style of inference (`->` now means "take value from" and not "push
value into")
- this does affect some of the UI tests, since they traverse the
graph, but mostly the artificial ones and they don't necessarily
seem worse
- put some things (constraint set, etc) into `Rc`. This lets us root
them to permit mutation and iteration. It also guarantees they don't
change, which is critical to the correctness of the algorithm.
- Generalize various helpers that previously operated only on points
to work on any sort of region element.
In multiple ways:
- Two calls to `bits_to_string()` passed in byte lengths rather than bit
lengths, which meant only 1/8th of the `BitSlice` was printed.
- `bit_str`'s purpose is entirely mysterious. I removed it and changed
its callers to print the indices in the obvious way.
- `bits_to_string`'s inner loop was totally wrong, such that it printed
entirely bogus results.
- `bits_to_string` now also adds a '|' between words, which makes the
output easier to read, e.g.:
`[ff-ff-ff-ff-ff-ff-ff-ff|ff-ff-ff-ff-ff-ff-ff-07]`.
Currently `Word` is `usize`, and there are various places in the code
that assume this.
This patch mostly just changes `usize` occurrences to `Word`. Most of
the changes were found as compile errors when I changed `Word` to a type
other than `usize`, but there was one non-obvious case in
librustc_mir/dataflow/mod.rs that caused bounds check failures before I
fixed it.
The current situation is something of a mess.
- `IdxSetBuf` derefs to `IdxSet`.
- `IdxSetBuf` implements `Clone`, and therefore has a provided `clone_from`
method, which does allocation and so is expensive.
- `IdxSet` has a `clone_from` method that is non-allocating and therefore
cheap, but this method is not from the `Clone` trait.
As a result, if you have an `IdxSetBuf` called `b`, if you call
`b.clone_from(b2)` you'll get the expensive `IdxSetBuf` method, but if you call
`(*b).clone_from(b2)` you'll get the cheap `IdxSetBuf` method.
`liveness_of_locals()` does the former, presumably unintentionally, and
therefore does lots of unnecessary allocations.
Having a `clone_from` method that isn't from the `Clone` trait is a bad idea in
general, so this patch renames it as `overwrite`. This avoids the unnecessary
allocations in `liveness_of_locals()`, speeding up most NLL benchmarks, the
best by 1.5%. It also means that calls of the form `(*b).clone_from(b2)` can be
rewritten as `b.overwrite(b2)`.
Obligation forest cleanup
While looking at this code I was scratching my head about whether a node could appear in both `parent` and `dependents`. Turns out it can, but it's not useful to do so, so this PR cleans things up so it's no longer possible.
Improve memoization and refactor NLL type check
I have a big branch that is refactoring NLL type check with the goal of introducing canonicalization-based memoization for all of the operations it does. This PR contains an initial prefix of that branch which, I believe, stands alone. It does introduce a few smaller optimizations of its own:
- Skip operations that are trivially a no-op
- Cache the results of the dropck-outlives computations done by liveness
- Skip resetting unifications if nothing changed
r? @pnkfelix
Speed up obligation forest code
Here are the rustc-perf benchmarks that get at least a 1% speedup on one or more of their runs with these patches applied:
```
inflate-check
avg: -8.7% min: -12.1% max: 0.0%
inflate
avg: -5.9% min: -8.6% max: 1.1%
inflate-opt
avg: -1.5% min: -2.0% max: -0.3%
clap-rs-check
avg: -0.6% min: -1.9% max: 0.5%
coercions
avg: -0.2%? min: -1.3%? max: 0.6%?
serde-opt
avg: -0.6% min: -1.0% max: 0.1%
coercions-check
avg: -0.4%? min: -1.0%? max: -0.0%?
```