This change is needed for compiler-builtins to check for this feature
when implementing memcpy/memset. See:
https://github.com/rust-lang/compiler-builtins/pull/365
The change just does compile-time detection. I think that runtime
detection will have to come in a follow-up CL to std-detect.
Like all the CPU feature flags, this just references #44839
Signed-off-by: Joe Richey <joerichey@google.com>
Update affected ui & incremental tests to use a user declared variable
bindings instead of temporaries. The former are preserved because of
debuginfo, the latter are not.
The simplify locals implementation uses two different visitors to update
the locals use counts. The DeclMarker calculates the initial use counts.
The StatementDeclMarker updates the use counts as statements are being
removed from the block.
Replace them with a single visitor that can operate in either mode,
ensuring consistency of behaviour.
Additionally use exhaustive match to clarify what is being optimized.
No functional changes intended.
Optimise align_offset for stride=1 further
`stride == 1` case can be computed more efficiently through `-p (mod
a)`. That, then translates to a nice and short sequence of LLVM
instructions:
%address = ptrtoint i8* %p to i64
%negptr = sub i64 0, %address
%offset = and i64 %negptr, %a_minus_one
And produces pretty much ideal code-gen when this function is used in
isolation.
Typical use of this function will, however, involve use of
the result to offset a pointer, i.e.
%aligned = getelementptr inbounds i8, i8* %p, i64 %offset
This still looks very good, but LLVM does not really translate that to
what would be considered ideal machine code (on any target). For example
that's the codegen we obtain for an unknown alignment:
; x86_64
dec rsi
mov rax, rdi
neg rax
and rax, rsi
add rax, rdi
In particular negating a pointer is not something that’s going to be
optimised for in the design of CISC architectures like x86_64. They
are much better at offsetting pointers. And so we’d love to utilize this
ability and produce code that's more like this:
; x86_64
lea rax, [rsi + rdi - 1]
neg rsi
and rax, rsi
To achieve this we need to give LLVM an opportunity to apply its
various peep-hole optimisations that it does during DAG selection. In
particular, the `and` instruction appears to be a major inhibitor here.
We cannot, sadly, get rid of this load-bearing operation, but we can
reorder operations such that LLVM has more to work with around this
instruction.
One such ordering is proposed in #75579 and results in LLVM IR that
looks broadly like this:
; using add enables `lea` and similar CISCisms
%offset_ptr = add i64 %address, %a_minus_one
%mask = sub i64 0, %a
%masked = and i64 %offset_ptr, %mask
; can be folded with `gepi` that may follow
%offset = sub i64 %masked, %address
…and generates the intended x86_64 machine code.
One might also wonder how the increased amount of code would impact a
RISC target. Turns out not much:
; aarch64 previous ; aarch64 new
sub x8, x1, #1 add x8, x1, x0
neg x9, x0 sub x8, x8, #1
and x8, x9, x8 neg x9, x1
add x0, x0, x8 and x0, x8, x9
(and similarly for ppc, sparc, mips, riscv, etc)
The only target that seems to do worse is… wasm32.
Onto actual measurements – the best way to evaluate snipets like these
is to use llvm-mca. Much like Aarch64 assembly would allow to suspect,
there isn’t any performance difference to be found. Both snippets
execute in same number of cycles for the CPUs I tried. On x86_64,
we get throughput improvement of >50%!
Fixes#75579
Rollup of 10 pull requests
Successful merges:
- #74477 (`#[deny(unsafe_op_in_unsafe_fn)]` in sys/wasm)
- #77836 (transmute_copy: explain that alignment is handled correctly)
- #78126 (Properly define va_arg and va_list for aarch64-apple-darwin)
- #78137 (Initialize tracing subscriber in compiletest tool)
- #78161 (Add issue template link to IRLO)
- #78214 (Tweak match arm semicolon removal suggestion to account for futures)
- #78247 (Fix#78192)
- #78252 (Add codegen test for #45964)
- #78268 (Do not try to report on closures to avoid ICE)
- #78295 (Add some regression tests)
Failed merges:
r? `@ghost`
Tweak match arm semicolon removal suggestion to account for futures
* Tweak and extend "use `.await`" suggestions
* Suggest removal of semicolon on prior match arm
* Account for `impl Future` when suggesting semicolon removal
* Silence some errors when encountering `await foo()?` as can't be certain what the intent was
*Thanks to https://twitter.com/a_hoverbear/status/1318960787105353728 for pointing this out!*
Initialize tracing subscriber in compiletest tool
The logging in compiletest was migrated from log crate to a tracing, but
the initialization code was never changed, so logging is non-functional.
Initialize tracing subscriber using default settings.
Properly define va_arg and va_list for aarch64-apple-darwin
From [Apple][]:
> Because of these changes, the type `va_list` is an alias for `char*`,
> and not for the struct type in the generic procedure call standard.
With this change `/x.py test --stage 1 src/test/ui/abi/variadic-ffi`
passes.
Fixes#78092
[Apple]: https://developer.apple.com/documentation/xcode/writing_arm64_code_for_apple_platforms
transmute_copy: explain that alignment is handled correctly
The doc comment currently is somewhat misleading because if it actually transmuted `&T` to `&U`, a higher-aligned `U` would be problematic.
`#[deny(unsafe_op_in_unsafe_fn)]` in sys/wasm
This is part of #73904.
This encloses unsafe operations in unsafe fn in `libstd/sys/wasm`.
@rustbot modify labels: F-unsafe-block-in-unsafe-fn
The associated_items(def_id) call
allocates internally.
Previously, we'd have called it for
each pair, so we'd have had O(n^2)
many calls. By precomputing the
associated items, we avoid
repeating so many allocations.
The only instance where this precomputation
would be a regression is if there's only
one inherent impl block for the type,
as the inner loop then doesn't run.
In that instance, we just early return.
Also, use SmallVec to avoid doing an
allocation at all if the number is small
(the case for most impl blocks out there).
Hex bin digit grouping
This revives and updates an old pr (#3391) for the current API.
Closes#2538.
---
*Please keep the line below*
changelog: Add [`unusual_byte_groupings`] lint.