`#[deny(unsafe_op_in_unsafe_fn)]` in sys/wasm
This is part of #73904.
This encloses unsafe operations in unsafe fn in `libstd/sys/wasm`.
@rustbot modify labels: F-unsafe-block-in-unsafe-fn
revise Hermit's mutex interface to support the behaviour of StaticMutex
rust-lang/rust#77147 simplifies things by splitting this Mutex type into two types matching the two use cases: StaticMutex and MovableMutex. To support the new behavior of StaticMutex, we move part of the mutex implementation into libstd.
The interface to the OS changed. Consequently, I removed a few functions, which aren't longer needed.
If pthread mutex initialization fails, the failure will go unnoticed unless
debug assertions are enabled. Any subsequent use of mutex will also silently
fail, since return values from lock & unlock operations are similarly checked
only through debug assertions.
In some implementations the mutex initialization requires a memory
allocation and so it does fail in practice.
Check that initialization succeeds to ensure that mutex guarantees
mutual exclusion.
Use posix_spawn() on unix if program is a path
Previously `Command::spawn` would fall back to the non-posix_spawn based
implementation if the `PATH` environment variable was possibly changed.
On systems with a modern (g)libc `posix_spawn()` can be significantly
faster. If program is a path itself the `PATH` environment variable is
not used for the lookup and it should be safe to use the
`posix_spawnp()` method. [1]
We found this, because we have a cli application that effectively runs a
lot of subprocesses. It would sometimes noticeably hang while printing
output. Profiling showed that the process was spending the majority of
time in the kernel's `copy_page_range` function while spawning
subprocesses. During this time the process is completely blocked from
running, explaining why users were reporting the cli app hanging.
Through this we discovered that `std::process::Command` has a fast and
slow path for process execution. The fast path is backed by
`posix_spawnp()` and the slow path by fork/exec syscalls being called
explicitly. Using fork for process creation is supposed to be fast, but
it slows down as your process uses more memory. It's not because the
kernel copies the actual memory from the parent, but it does need to
copy the references to it (see `copy_page_range` above!). We ended up
using the slow path, because the command spawn implementation in falls
back to the slow path if it suspects the PATH environment variable was
changed.
Here is a smallish program demonstrating the slowdown before this code
change:
```
use std::process::Command;
use std::time::Instant;
fn main() {
let mut args = std::env::args().skip(1);
if let Some(size) = args.next() {
// Allocate some memory
let _xs: Vec<_> = std::iter::repeat(0)
.take(size.parse().expect("valid number"))
.collect();
let mut command = Command::new("/bin/sh");
command
.arg("-c")
.arg("echo hello");
if args.next().is_some() {
println!("Overriding PATH");
command.env("PATH", std::env::var("PATH").expect("PATH env var"));
}
let now = Instant::now();
let child = command
.spawn()
.expect("failed to execute process");
println!("Spawn took: {:?}", now.elapsed());
let output = child.wait_with_output().expect("failed to wait on process");
println!("Output: {:?}", output);
} else {
eprintln!("Usage: prog [size]");
std::process::exit(1);
}
()
}
```
Running it and passing different amounts of elements to use to allocate
memory shows that the time taken for `spawn()` can differ quite
significantly. In latter case the `posix_spawnp()` implementation is 30x
faster:
```
$ cargo run --release 10000000
...
Spawn took: 324.275µs
hello
$ cargo run --release 10000000 changepath
...
Overriding PATH
Spawn took: 2.346809ms
hello
$ cargo run --release 100000000
...
Spawn took: 387.842µs
hello
$ cargo run --release 100000000 changepath
...
Overriding PATH
Spawn took: 13.434677ms
hello
```
[1]: 5f72f9800b/posix/execvpe.c (L81)
Cleanup cloudabi mutexes and condvars
This gets rid of lots of unnecessary unsafety.
All the AtomicU32s were wrapped in UnsafeCell or UnsafeCell<MaybeUninit>, and raw pointers were used to get to the AtomicU32 inside. This change cleans that up by using AtomicU32 directly.
Also replaces a UnsafeCell<u32> by a safer Cell<u32>.
@rustbot modify labels: +C-cleanup
Static mutex is static
StaticMutex is only ever used with as a static (as the name already suggests). So it doesn't have to be generic over a lifetime, but can simply assume 'static.
This 'static lifetime guarantees the object is never moved, so this is no longer a manually checked requirement for unsafe calls to lock().
@rustbot modify labels: +T-libs +A-concurrency +C-cleanup
Use futex-based thread-parker for Wasm32.
This uses the existing `sys_common/thread_parker/futex.rs` futex-based thread parker (that was already used for Linux) for wasm32 as well (if the wasm32 atomics target feature is enabled, which is not the case by default).
Wasm32 provides the basic futex operations as instructions: https://webassembly.github.io/threads/syntax/instructions.html
These are now exposed from `sys::futex::{futex_wait, futex_wake}`, just like on Linux. So, `thread_parker/futex.rs` stays completely unmodified.
StaticMutex is only ever used with as a static (as the name already
suggests). So it doesn't have to be generic over a lifetime, but can
simply assume 'static.
This 'static lifetime guarantees the object is never moved, so this is
no longer a manually checked requirement for unsafe calls to lock().
Remove unsafety from sys/unsupported and add deny(unsafe_op_in_unsafe_fn).
Replacing `UnsafeCell`s by a `Cell`s simplifies things and makes the mutex and rwlock implementations safe. Other than that, only unsafety in strlen() contained unsafe code.
@rustbot modify labels: +F-unsafe-block-in-unsafe-fn +C-cleanup
Implement `AsRawFd` for `StdinLock` etc. on WASI.
WASI implements `AsRawFd` for `Stdin`, `Stdout`, and `Stderr`, so
implement it for `StdinLock`, `StdoutLock`, and `StderrLock` as well.
r? @alexcrichton
Avoid SeqCst or static mut in mach_timebase_info and QueryPerformanceFrequency caches
This patch went through a couple iterations but the end result is replacing a pattern where an `AtomicUsize` (updated with many SeqCst ops) guards a `static mut` with a single `AtomicU64` that is known to use 0 as a value indicating that it is not initialized.
The code in both places exists to cache values used in the conversion of Instants to Durations on macOS, iOS, and Windows.
I have no numbers to prove that this improves performance (It seems a little futile to benchmark something like this), but it's much simpler, safer, and in practice we'd expect it to be faster everywhere where Relaxed operations on AtomicU64 are cheaper than SeqCst operations on AtomicUsize, which is a lot of places.
Anyway, it also removes a bunch of unsafe code and greatly simplifies the logic, so IMO that alone would be worth it unless it was a regression.
If you want to take a look at the assembly output though, see https://godbolt.org/z/rbr6vn for x86_64, https://godbolt.org/z/cqcbqv for aarch64 (Note that this just the output of the mac side, but i'd expect the windows part to be the same and don't feel like doing another godbolt for it). There are several versions of this function in the godbolt:
- `info_new`: version in the current patch
- `info_less_new`: version in initial PR
- `info_original`: version currently in the tree
- `info_orig_but_better_orderings`: a version that just tries to change the original code's orderings from SeqCst to the (probably) minimal orderings required for soundness/correctness.
The biggest concern I have here is if we can use AtomicU64, or if there are targets that dont have it that this code supports. AFAICT: no. (If that changes in the future, it's easy enough to do something different for them)
r? `@Amanieu` because he caught a couple issues last time I tried to do a patch reducing orderings 😅
---
<details>
<summary>I rewrote this whole message so the original is inside here</summary>
I happened to notice the code we use for caching the result of mach_timebase_info uses SeqCst exclusively.
However, thinking a little more, it's actually pretty easy to avoid the static mut by packing the timebase info into an AtomicU64.
This entirely avoids needing to do the compare_exchange. The AtomicU64 can be read/written using Relaxed ops, which on current macos/ios platforms (x86_64/aarch64) have no overhead compared to direct loads/stores. This simplifies the code and makes it a lot safer too.
I have no numbers to prove that this improves performance (It seems a little futile to benchmark something like this), although it should do that on both targets it applies to.
That said, it also removes a bunch of unsafe code and simplifies the logic (arguably at least — there are only two states now, initialized or not), so I think it's a net win even without concrete numbers.
If you want to take a look at the assembly output though, see below. It has the new version, the original, and a version of the original with lower Orderings (which is still worse than the version in this PR)
- godbolt.org/z/obfqf9 x86_64-apple-darwin
- godbolt.org/z/Wz5cWc aarch64-unknown-linux-gnu (godbolt can't do aarch64-apple-ios but that doesn't matter here)
A different (and more efficient) option than this would be to just use the AtomicU64 and use the knowledge that after initialization the denominator should be nonzero... That felt like it's relying on too many things I'm not confident in, so I didn't want to do that.
</details>