Any single-threaded task benchmark will spend a good chunk of time in `kqueue()` on osx and `epoll()` on linux, and the reason for this is that each time a task is terminated it will hit the syscall. When a task terminates, it context switches back to the scheduler thread, and the scheduler thread falls out of `run_sched_once` whenever it figures out that it did some work.
If we know that `epoll()` will return nothing, then we can continue to do work locally (only while there's work to be done). We must fall back to `epoll()` whenever there's active I/O in order to check whether it's ready or not, but without that (which is largely the case in benchmarks), we can prevent the costly syscall and can get a nice speedup.
I've separated the commits into preparation for this change and then the change itself, the last commit message has more details.
The old method of building up a list of items and threading it through
all of the decorators was unwieldy and not really scalable as
non-deriving ItemDecorators become possible. The API is now that the
decorator gets an immutable reference to the item it's attached to, and
a callback that it can pass new items to. If we want to add syntax
extensions that can modify the item they're attached to, we can add that
later, but I think it'll have to be separate from ItemDecorator to avoid
strange ordering issues.
These commits pick off some low-hanging fruit which were slowing down spawning green threads. The major speedup comes from fixing a bug in stack caching where we never used any cached stacks!
The program I used to benchmark is at the end. It was compiled with `rustc --opt-level=3 bench.rs --test` and run as `RUST_THREADS=1 ./bench --bench`. I chose to use `RUST_THREADS=1` due to #11730 as the profiles I was getting interfered too much when all the schedulers were in play (and shouldn't be after #11730 is fixed). All of the units below are in ns/iter as reported by `--bench` (lower is better).
| | green | native | raw |
| ------------- | ----- | ------ | ------ |
| osx before | 12699 | 24030 | 19734 |
| linux before | 10223 | 125983 | 122647 |
| osx after | 3847 | 25771 | 20835 |
| linux after | 2631 | 135398 | 122765 |
Note that this is *not* a benchmark of spawning green tasks vs native tasks. I put in the native numbers just to get a ballpark of where green tasks are. This is benchmark is *clearly* benefiting from stack caching. Also, OSX is clearly not 5x faster than linux, I think my VM is just much slower.
All in all, this ended up being a nice 4x speedup for spawning a green task when you're using a cached stack.
```rust
extern mod extra;
extern mod native;
use std::rt:🧵:Thread;
#[bench]
fn green(bh: &mut extra::test::BenchHarness) {
let (p, c) = SharedChan::new();
bh.iter(|| {
let c = c.clone();
spawn(proc() {
c.send(());
});
p.recv();
});
}
#[bench]
fn native(bh: &mut extra::test::BenchHarness) {
let (p, c) = SharedChan::new();
bh.iter(|| {
let c = c.clone();
native::task::spawn(proc() {
c.send(());
});
p.recv();
});
}
#[bench]
fn raw(bh: &mut extra::test::BenchHarness) {
bh.iter(|| {
Thread::start(proc() {}).join()
});
}
```
Two unfortunate allocations were wrapping a proc() in a proc() with
GreenTask::build_start_wrapper, and then boxing this proc in a ~proc() inside of
Context::new(). Both of these allocations were a direct result from two
conditions:
1. The Context::new() function has a nice api of taking a procedure argument to
start up a new context with. This inherently required an allocation by
build_start_wrapper because extra code needed to be run around the edges of a
user-provided proc() for a new task.
2. The initial bootstrap code only understood how to pass one argument to the
next function. By modifying the assembly and entry points to understand more
than one argument, more information is passed through in registers instead of
allocating a pointer-sized context.
This is sadly where I end up throwing mips under a bus because I have no idea
what's going on in the mips context switching code and don't know how to modify
it.
Closes#7767
cc #11389
Instead, use an enum to allow running both a procedure and sending the task
result over a channel. I expect the common case to be sending on a channel (e.g.
task::try), so don't require an extra allocation in the common case.
cc #11389
The condition was the wrong direction and it also didn't take equality into
account. Tests were added for both cases.
For the small benchmark of `task::try(proc() {}).unwrap()`, this takes the
iteration time on OSX from 15119 ns/iter to 6179 ns/iter (timed with
RUST_THREADS=1)
cc #11389
The first setp for #9880 is to add a new `crate` keyword. This PR does exactly that. I took a chance to refactor `parse_item_foreign_mod` and I broke it down into 2 separate methods to isolate each feature.
The next step will be to push a new stage0 snapshot and then get rid of all `extern mod` around the code.
If you were writing to something along the lines of `self.foo` then with the new
closure rules it meant that you were borrowing `self` for the entirety of the
closure, meaning that you couldn't format other fields of `self` at the same
time as writing to a buffer contained in `self`.
By lifting the borrow outside of the closure the borrow checker can better
understand that you're only borrowing one of the fields at a time. This had to
use type ascription as well in order to preserve trait object coercions.
Previously crates like `green` and `native` would still depend on their
parents when running `make check-stage2-green NO_REBUILD=1`, this
ensures that they only depend on their source files.
Also, apply NO_REBUILD to the crate doc tests, so, for example,
`check-stage2-doc-std` will use an already compiled `rustdoc` directly.
It asserted that the previous count was always nonnegative, but DISCONNECTED is
a valid value for it to see. In order to continue to remember to store
DISCONNECTED after DISCONNECTED was seen, I also added a helper method.
Closes#12226
The `id` shouldn't be changed by external code, and exposing it publicly
allows to be accidentally changed.
Also, remove the first element special case in the `select!` macro.
::num::bigint, Remove a source of O(n^2) running time in `fn shr_bits`.
I'll cut to the chase: On my laptop, this brings the running time on
`pidigits 2000` (from src/test/bench/shootout-pidigits.rs) from this:
```
% time ./pidigits 2000 > /dev/null
real 0m7.695s
user 0m7.690s
sys 0m0.005s
```
to this:
```
% time ./pidigits 2000 > /dev/null
real 0m0.322s
user 0m0.318s
sys 0m0.004s
```
The previous code was building up a vector by repeatedly making a
fresh copy for each element that was unshifted onto the front,
yielding quadratic running time. This fixes that by building up the
vector in reverse order (pushing elements onto the end) and then
reversing it.
(Another option would be to build up a zero-initialized vector of the
desired length and then installing all of the shifted result elements
into their target index, but this was easier to hack up quickly, and
yields the desired asymptotic improvement. I have been thinking of
adding a `vec::from_fn_rev` to handle this case, maybe I will try that
this weekend.)
Externally loaded libraries are able to do things that cause references
to them to survive past the expansion phase (e.g. creating @-box cycles,
launching a task or storing something in task local data). As such, the
library has to stay loaded for the lifetime of the process.
This patch gets rid of ObsoleteExternModAttributesInParens and
ObsoleteNamedExternModule since the replacement of `extern mod` with
`extern crate` avoids those cases and raises different errors. Both have
been around for at least a version which makes this a good moment to get
rid of them.
This patch adds a new keyword `crate` which is intended to replace mod
in the context of `extern mod` as part of the issue #9880. The patch
doesn't replace all `extern mod` cases since it is necessary to first
push a new snapshot 0.
The implementation could've been less invasive than this. However I
preferred to take this chance to split the `parse_item_foreign_mod`
method and pull the `extern crate` part out of there, hence the new
method `parse_item_foreign_crate`.
This patch replaces all `crate` usage with `krate` before introducing the
new keyword. This ensures that after introducing the keyword, there
won't be any compilation errors.
krate might not be the most expressive substitution for crate but it's a
very close abbreviation for it. `module` was already used in several
places already.
While working on #11363 I stumbled over a couple of ignored tests, that seem to be fixed or invalid.
* src/test/run-pass/issue-3559.rs was fixed in #4726
* src/test/compile-fail/borrowck-call-sendfn.rs was fixed in #2978
* update src/test/compile-fail/issue-5500-1.rs to work with current Rust (I'm not 100% sure if the original condition is tested as mentioned in #5500, but I think so)
* removed src/test/compile-fail/issue-5500.rs because it is tested in
src/test/run-fail/issue-5500.rs (they are the same test cases, I just renamed src/test/run-fail/addr-of-bot.rs to be consistent with the other issue name
* src/test/run-pass/issue-3559.rs was fixed in #4726
* src/test/compile-fail/borrowck-call-sendfn.rs was fixed in #2978
* update src/test/compile-fail/issue-5500-1.rs to work with current Rust
* removed src/test/compile-fail/issue-5500.rs because it is tested in
src/test/run-fail/issue-5500.rs
* src/test/compile-fail/view-items-at-top.rs fixed
* #897 fixed
* compile-fail/issue-6762.rs issue was closed as dup of #6801
* deleted compile-fail/issue-2074.rs because it became irelevant and is
irrelevant #2074, a test covering this was added in
4f92f452bd
Currently, a scheduler will hit epoll() or kqueue() at the end of *every task*.
The reason is that the scheduler will context switch back to the scheduler task,
terminate the previous task, and then return from run_sched_once. In doing so,
the scheduler will poll for any active I/O.
This shows up painfully in benchmarks that have no I/O at all. For example, this
benchmark:
for _ in range(0, 1000000) {
spawn(proc() {});
}
In this benchmark, the scheduler is currently wasting a good chunk of its time
hitting epoll() when there's always active work to be done (run with
RUST_THREADS=1).
This patch uses the previous two commits to alter the scheduler's behavior to
only return from run_sched_once if no work could be found when trying really
really hard. If there is active I/O, this commit will perform the same as
before, falling back to epoll() to check for I/O completion (to not starve I/O
tasks).
In the benchmark above, I got the following numbers:
12.554s on today's master
3.861s with #12172 applied
2.261s with both this and #12172 applied
cc #8341
This is in preparation for running do_work in a loop while there are no active
I/O handles. This changes the do_work and interpret_message_queue methods to
return a triple where the last element is a boolean flag as to whether work was
done or not.
This commit preserves the same behavior as before, it simply re-structures the
code in preparation for future work.
The green scheduler can optimize its runtime based on this by deciding to not go
to sleep in epoll() if there is no active I/O and there is a task to be stolen.
This is implemented for librustuv by keeping a count of the number of tasks
which are currently homed. If a task is homed, and then performs a blocking I/O
operation, the count will be nonzero while the task is blocked. The homing count
is intentionally 0 when there are I/O handles, but no handles currently blocked.
The reason for this is that epoll() would only be used to wake up the scheduler
anyway.
The crux of this change was to have a `HomingMissile` contain a mutable borrowed
reference back to the `HomeHandle`. The rest of the change was just dealing with
this fallout. This reference is used to decrement the homed handle count in a
HomingMissile's destructor.
Also note that the count maintained is not atomic because all of its
increments/decrements/reads are all on the same I/O thread.
This adopts the rules posted in #10432:
1. If a seek position is negative, then an error is generated
2. Seeks beyond the end-of-file are allowed. Future writes will fill the gap
with data and future reads will return errors.
3. Seeks within the bounds of a file are fine.
Closes#10432
Cleans up a few issues with `fourcc`:
* Corrects the endianness in the docs example
* Removes `#[cfg(not(test))]` (bors might not build this on Windows. If the build fails, I'll re-add it)
* Adds a FIXME referencing the LLVM assert issue we encountered with bors builds on Windows (Same error as #10872)
Loadable syntax extensions don't work when cross compiling (see #12102), so the
fourcc tests all need to be ignored. They're valuable tests, so they shouldn't
be outright ignored, so they're now flagged with ignore-cross-compile