Sync portable-simd to remove autosplats
This PR syncs portable-simd in up to a8385522ad in order to address the type inference breakages documented on nightly in https://github.com/rust-lang/rust/issues/90904 by removing the vector + scalar binary operations (called "autosplats", "broadcasting", or "rank promotion", depending on who you ask) that allow `{scalar} + &'_ {scalar}` to fail in some cases, because it becomes possible the programmer may have meant `{scalar} + &'_ {vector}`.
A few quality-of-life improvements make their way in as well:
- Lane counts can now go to 64, as LLVM seems to have fixed their miscompilation for those.
- `{i,u}8x64` to `__m512i` is now available.
- a bunch of `#[must_use]` notes appear throughout the module.
- Some implementations, mostly instances of `impl core::ops::{Op}<Simd> for Simd` that aren't `{vector} + {vector}` (e.g. `{vector} + &'_ {vector}`), leverage some generics and `where` bounds now to make them easier to understand by reducing a dozen implementations into one (and make it possible for people to open the docs on less burly devices).
- And some internal-only improvements.
None of these changes should affect a beta backport, only actual users of `core::simd` (and most aren't even visible in the programmatic sense), though I can extract an even more minimal changeset for beta if necessary. It seemed simpler to just keep moving forward.
Avoid string validation in rustc_serialize, check a marker byte instead
Since the serialization format isn't self-describing we need a way to detect when encoder and decoder don't match up. But for strings it doesn't have to be utf8 validation, which currently does cost a few percent of performance.
Instead we can use a marker byte at the end to be reasonably sure that we're dealing with a string and it wasn't overwritten in some way.
Support AVR for inline asm!
A first pass at support for the AVR platform in inline `asm!`. Passes the initial compiler tests, have not yet done more complete verification.
In particular, the register classes could use a lot more fleshing out, this draft PR so far only includes the most basic.
cc `@Amanieu` `@dylanmckay`
Update books
## nomicon
1 commits in c6b4bf831e9a40aec34f53067d20634839a6778b..49681ea4a9fa81173dbe9ffed74b4d4a35eae9e3
2021-11-09 02:30:56 +0900 to 2021-11-24 16:27:28 +0900
- Clarify that drop flag fields only apply to older Rust versions (rust-lang/nomicon#324)
## reference
2 commits in c0f222da23568477155991d391c9ce918e381351..954f3d441ad880737a13e241108f791a4d2a38cd
2021-11-22 10:30:57 -0800 to 2021-11-29 11:11:30 -0800
- Say that bare trait objects are rejected in the 2021 edition (rust-lang/reference#1111)
- Update 'Subtyping and Variance' example to use `dyn Trait` syntax (rust-lang/reference#1110)
## book
5 commits in a5e0c5b2c5f9054be3b961aea2c7edfeea591de8..5f9358faeb1f46e19b8a23a21e79fd7fe150491e
2021-11-19 17:06:19 -0500 to 2021-12-05 21:33:16 -0500
- 1.57
- Update to 1.56
- Snapshot of ch 11 for nostarch
- Clarify how to check for an error in tests returning Result
- Update book repo links for default branch rename
## rust-by-example
1 commits in 43f82530210b83cf888282b207ed13d5893da9b2..1ca6a7bd1d73edc4a3e6c7d6a40f5d4b66c1e517
2021-11-21 22:31:50 -0300 to 2021-11-23 17:48:53 -0300
- Removed `u32` at the end of ints (rust-lang/rust-by-example#1477)
## rustc-dev-guide
10 commits in a2fc9635029c04e692474965a6606f8e286d539a..a374e7d8bb6b79de45b92295d06b4ac0ef35bc09
2021-11-18 13:31:13 -0500 to 2021-12-03 09:26:47 -0800
- Update LLVM coverage mapping format version supported by rustc (rust-lang/rustc-dev-guide#1267)
- Improve 'Running tests manually' section
- Fix some links
- Update for review comments.
- Document rustfix-only-machine-applicable
- Apply suggestions from pierwill
- Document more compiletest headers.
- make it compile with 1.56.0 no warning
- make it compile with 1.56.0
- make it compile with 1.56.0
## edition-guide
1 commits in 8e0ec8c77d8b28b86159fdee9d33a758225ecf9c..beea0a3cdc3885375342fd010f9ad658e6a5e09a
2021-11-12 06:30:23 -0800 to 2021-12-05 07:06:45 -0800
- Fix typo (neccesary -> necessary) (rust-lang/edition-guide#274)
Suggest try_reserve in try_reserve_exact
During developing #91529 , I found that `try_reserve_exact` suggests `reserve` for further insertions. I think it's a mistake by copy&paste, `try_reserve` is better here.
Remove a dead code path.
It is neither documented nor can I see any way it could ever be reached.
Also, no tests fail when turning that arm into an ICE
Add `array::IntoIter::{empty, from_raw_parts}`
`array::IntoIter` has a bunch of really handy logic for dealing with partial arrays, but it's currently hamstrung by only being creatable from a fully-initialized array.
This PR adds two new constructors:
- a safe & const `empty`, since `[].into_iter()` can only give `IntoIter<T, 0>`, not `IntoIter<T, N>`.
- an unsafe `from_raw_parts`, to allow experimentation with new uses.
(Slice & vec iterators don't need `from_raw_parts` because you `from_raw_parts` the slice or vec instead, but there's no useful way to made a `<[T; N]>::from_raw_parts`, so I think this is a reasonable place to have one.)
Fix AnonConst ICE
I am not sure if this is even the correct place to fix this issue, but i went down the path where the generic args came from and i wasn't able to find a clear cause for this down there. But if anybody has a suggestion what i should do, just tell me.
This fixes: https://github.com/rust-lang/rust/issues/91267
Add test for evaluate_obligation: Ok(EvaluatedToOkModuloRegions) ICE
Adds the minimial repro test case from #85360. The fix for #85360 was
supposed to be #85868 however the repro was resolved in the 2021-07-05
nightly while #85868 didn't land until 2021-09-03. The reason for that
is d34a3a401b **also** resolves that
issue.
To test if #85868 actually fixes#85360, I reverted
d34a3a401b and found that #85868 does
indeed resolve#85360.
With that question resolved, add a test case to our incremental test
suite for the original Ok(EvaluatedToOkModuloRegions) ICE.
Thanks to ````@lqd```` for helping track this down!
We already use the object crate for generating uncompressed .rmeta
metadata object files. This switches the generation of compressed
.rustc object files to use the object crate as well. These have
slightly different requirements in that .rmeta should be completely
excluded from any final compilation artifacts, while .rustc should
be part of shared objects, but not loaded into memory.
The primary motivation for this change is #90326: In LLVM 14, the
current way of setting section flags (and in particular, preventing
the setting of SHF_ALLOC) will no longer work. There are other ways
we could work around this, but switching to the object crate seems
like the most elegant, as we already use it for .rmeta, and as it
makes this independent of the codegen backend. In particular, we
don't need separate handling in codegen_llvm and codegen_gcc.
codegen_cranelift should be able to reuse the implementation as
well, though I have omitted that here, as it is not based on
codegen_ssa.
This change mostly extracts the existing code for .rmeta handling
to allow using it for .rustc as well, and adjust the codegen
infrastructure to handle the metadata object file separately: We
no longer create a backend-specific module for it, and directly
produce the compiled module instead.
This does not fix#90326 by itself yet, as .llvmbc will need to be
handled separately.
Replace dominators algorithm with simple Lengauer-Tarjan
This PR replaces our dominators implementation with that of the simple Lengauer-Tarjan algorithm, which is (to my knowledge and research) the currently accepted 'best' algorithm. The more complex variant has higher constant time overheads, and Semi-NCA (which is arguably a variant of Lengauer-Tarjan too) is not the preferred variant by the first paper cited in the documentation comments: simple Lengauer-Tarjan "is less sensitive to pathological instances, we think it should be preferred where performance guarantees are important" - which they are for us.
This work originally arose from noting that the keccak benchmark spent a considerable portion of its time (both instructions and cycles) in the dominator computations, which sparked an interest in potentially optimizing that code. The current algorithm largely proves slow on long "parallel" chains where the nearest common ancestor lookup (i.e., the intersect function) does not quickly identify a root; it is also inherently a pointer-chasing algorithm so is relatively slow on modern CPUs due to needing to hit memory - though usually in cache - in a tight loop, which still costs several cycles.
This was replaced with a bitset-based algorithm, previously studied in literature but implemented directly from dataflow equations in our case, which proved to be a significant speed up on the keccak benchmark: 20% instruction count wins, as can be seen in [this performance report](https://perf.rust-lang.org/compare.html?start=377d1a984cd2a53327092b90aa1d8b7e22d1e347&end=542da47ff78aa462384062229dad0675792f2638). This algorithm is also relatively simple in comparison to other algorithms and is easy to understand. However, these performance results showed a regression on a number of other benchmarks, and I was unable to get the bitsets to perform well enough that those regressions could be fully mitigated. The implementation "attempt" is seen here in the first commit, and is intended to be kept primarily so that future optimizers do not repeat that path (or can easily refer to the attempt).
The final version of this PR chooses the simple Lengauer-Tarjan algorithm, and implements it along with a number of optimizations found in literature. The current implementation is a slight improvement for many benchmarks, with keccak still being an outlier at ~20%. The implementation in this PR first implements the most basic variant of the algorithm directly from the pseudocode on page 16, physical, or 28 in the PDF of the first paper ("Linear-Time Algorithms for Dominators and Related Problems"). This is then followed by a number of commits which update the implementation to apply various performance improvements, as suggested by the paper. Finally, the last commit annotates the implementation with a number of comments, mostly drawn from the paper, which intend to help readers understand what is going on - these are incomplete without the paper, but writing them certainly helped my understanding. They may be helpful if future optimization attempts are attempted, so I chose to add them in.
When we point at a binding to suggest giving it a type, erase all the
type for ADTs that have been resolved, leaving only the ones that could
not be inferred. For small shallow types this is not a problem, but for
big nested types with lots of params, this can otherwise cause a lot of
unnecessary visual output.
Update Clippy
Since RLS is now already broken #91543 , we shouldn't be blocked by it anymore. I plan to do the RLS update once new rustc-ap packages are released.
r? `@Manishearth`
This largely avoids remapping from and to the 'real' indices, with the exception
of predecessor lookup and the final merge back, and is conceptually better.
As the paper indicates, the unprocessed vertices in the DFS tree and processed
vertices are disjoint, and we can use them in the same space, tracking only the index
of the split.
This replaces the previous implementation with the simple variant of
Lengauer-Tarjan, which performs better in the general case. Performance on the
keccak benchmark is about equivalent between the two, but we don't see
regressions (and indeed see improvements) on other benchmarks, even on a
partially optimized implementation.
The implementation here follows that of the pseudocode in "Linear-Time
Algorithms for Dominators and Related Problems" thesis by Loukas Georgiadis. The
next few commits will optimize the implementation as suggested in the thesis.
Several related works are cited in the comments within the implementation, as
well.
Implement the simple Lengauer-Tarjan algorithm
This replaces the previous implementation (from #34169), which has not been
optimized since, with the simple variant of Lengauer-Tarjan which performs
better in the general case. A previous attempt -- not kept in commit history --
attempted a replacement with a bitset-based implementation, but this led to
regressions on perf.rust-lang.org benchmarks and equivalent wins for the keccak
benchmark, so was rejected.
The implementation here follows that of the pseudocode in "Linear-Time
Algorithms for Dominators and Related Problems" thesis by Loukas Georgiadis. The
next few commits will optimize the implementation as suggested in the thesis.
Several related works are cited in the comments within the implementation, as
well.
On the keccak benchmark, we were previously spending 15% of our cycles computing
the NCA / intersect function; this function is quite expensive, especially on
modern CPUs, as it chases pointers on every iteration in a tight loop. With this
commit, we spend ~0.05% of our time in dominator computation.