PTX support, take 2
- You can generate PTX using `--emit=asm` and the right (custom) target. Which
then you can run on a NVIDIA GPU.
- You can compile `core` to PTX. [Xargo] also works and it can compile some
other crates like `collections` (but I doubt all of those make sense on a GPU)
[Xargo]: https://github.com/japaric/xargo
- You can create "global" functions, which can be "called" by the host, using
the `"ptx-kernel"` ABI, e.g. `extern "ptx-kernel" fn kernel() { .. }`. Every
other function is a "device" function and can only be called by the GPU.
- Intrinsics like `__syncthreads()` and `blockIdx.x` are available as
`"platform-intrinsics"`. These intrinsics are *not* in the `core` crate but
any Rust user can create "bindings" to them using an `extern
"platform-intrinsics"` block. See example at the end.
- Trying to emit PTX with `-g` (debuginfo); you get an LLVM error. But I don't
think PTX can contain debuginfo anyway so `-g` should be ignored and a warning
should be printed ("`-g` doesn't work with this target" or something).
- "Single source" support. You *can't* write a single source file that contains
both host and device code. I think that should be possible to implement that
outside the compiler using compiler plugins / build scripts.
- The equivalent to CUDA `__shared__` which it's used to declare memory that's
shared between the threads of the same block. This could be implemented using
attributes: `#[shared] static mut SCRATCH_MEMORY: [f32; 64]` but hasn't been
implemented yet.
- Built-in targets. This PR doesn't add targets to the compiler just yet but one
can create custom targets to be able to emit PTX code (see the example at the
end). The idea is to have people experiment with this feature before
committing to it (built-in targets are "insta-stable")
- All functions must be "inlined". IOW, the `.rlib` must always contain the LLVM
bitcode of all the functions of the crate it was produced from. Otherwise, you
end with "undefined references" in the final PTX code but you won't get *any*
linker error because no linker is involved. IOW, you'll hit a runtime error
when loading the PTX into the GPU. The workaround is to use `#[inline]` on
non-generic functions and to never use `#[inline(never)]` but this may not
always be possible because e.g. you could be relying on third party code.
- Should `--emit=asm` generate a `.ptx` file instead of a `.s` file?
TL;DR Use Xargo to turn a crate into a PTX module (a `.s` file). Then pass that
PTX module, as a string, to the GPU and run it.
The full code is in [this repository]. This section gives an overview of how to
run Rust code on a NVIDIA GPU.
[this repository]: https://github.com/japaric/cuda
- Create a custom target. Here's the 64-bit NVPTX target (NOTE: the comments
are not valid because this is supposed to be a JSON file; remove them before
you use this file):
``` js
// nvptx64-nvidia-cuda.json
{
"arch": "nvptx64", // matches LLVM
"cpu": "sm_20", // "oldest" compute capability supported by LLVM
"data-layout": "e-i64:64-v16:16-v32:32-n16:32:64",
"llvm-target": "nvptx64-nvidia-cuda",
"max-atomic-width": 0, // LLVM errors with any other value :-(
"os": "cuda", // matches LLVM
"panic-strategy": "abort",
"target-endian": "little",
"target-pointer-width": "64",
"target-vendor": "nvidia", // matches LLVM -- not required
}
```
(There's a 32-bit target specification in the linked repository)
- Write a kernel
``` rust
extern "platform-intrinsic" {
fn nvptx_block_dim_x() -> i32;
fn nvptx_block_idx_x() -> i32;
fn nvptx_thread_idx_x() -> i32;
}
/// Copies an array of `n` floating point numbers from `src` to `dst`
pub unsafe extern "ptx-kernel" fn memcpy(dst: *mut f32,
src: *const f32,
n: usize) {
let i = (nvptx_block_dim_x() as isize)
.wrapping_mul(nvptx_block_idx_x() as isize)
.wrapping_add(nvptx_thread_idx_x() as isize);
if (i as usize) < n {
*dst.offset(i) = *src.offset(i);
}
}
```
- Emit PTX code
```
$ xargo rustc --target nvptx64-nvidia-cuda --release -- --emit=asm
Compiling core v0.0.0 (file://..)
(..)
Compiling nvptx-builtins v0.1.0 (https://github.com/japaric/nvptx-builtins)
Compiling kernel v0.1.0
$ cat target/nvptx64-nvidia-cuda/release/deps/kernel-*.s
//
// Generated by LLVM NVPTX Back-End
//
.version 3.2
.target sm_20
.address_size 64
// .globl memcpy
.visible .entry memcpy(
.param .u64 memcpy_param_0,
.param .u64 memcpy_param_1,
.param .u64 memcpy_param_2
)
{
.reg .pred %p<2>;
.reg .s32 %r<5>;
.reg .s64 %rd<12>;
ld.param.u64 %rd7, [memcpy_param_2];
mov.u32 %r1, %ntid.x;
mov.u32 %r2, %ctaid.x;
mul.wide.s32 %rd8, %r2, %r1;
mov.u32 %r3, %tid.x;
cvt.s64.s32 %rd9, %r3;
add.s64 %rd10, %rd9, %rd8;
setp.ge.u64 %p1, %rd10, %rd7;
@%p1 bra LBB0_2;
ld.param.u64 %rd3, [memcpy_param_0];
ld.param.u64 %rd4, [memcpy_param_1];
cvta.to.global.u64 %rd5, %rd4;
cvta.to.global.u64 %rd6, %rd3;
shl.b64 %rd11, %rd10, 2;
add.s64 %rd1, %rd6, %rd11;
add.s64 %rd2, %rd5, %rd11;
ld.global.u32 %r4, [%rd2];
st.global.u32 [%rd1], %r4;
LBB0_2:
ret;
}
```
- Run it on the GPU
``` rust
// `kernel.ptx` is the `*.s` file we got in the previous step
const KERNEL: &'static str = include_str!("kernel.ptx");
driver::initialize()?;
let device = Device(0)?;
let ctx = device.create_context()?;
let module = ctx.load_module(KERNEL)?;
let kernel = module.function("memcpy")?;
let h_a: Vec<f32> = /* create some random data */;
let h_b = vec![0.; N];
let d_a = driver::allocate(bytes)?;
let d_b = driver::allocate(bytes)?;
// Copy from host to GPU
driver::copy(h_a, d_a)?;
// Run `memcpy` on the GPU
kernel.launch(d_b, d_a, N)?;
// Copy from GPU to host
driver::copy(d_b, h_b)?;
// Verify
assert_eq!(h_a, h_b);
// `d_a`, `d_b`, `h_a`, `h_b` are dropped/freed here
```
---
cc @alexcrichton @brson @rkruppe
> What has changed since #34195?
- `core` now can be compiled into PTX. Which makes it very easy to turn `no_std`
crates into "kernels" with the help of Xargo.
- There's now a way, the `"ptx-kernel"` ABI, to generate "global" functions. The
old PR required a manual step (it was hack) to "convert" "device" functions
into "global" functions. (Only "global" functions can be launched by the host)
- Everything is unstable. There are not "insta stable" built-in targets this
time (\*). The users have to use a custom target to experiment with this
feature. Also, PTX instrinsics, like `__syncthreads` and `blockIdx.x`, are now
implemented as `"platform-intrinsics"` so they no longer live in the `core`
crate.
(\*) I'd actually like to have in-tree targets because that makes this target
more discoverable, removes the need to lug around .json files, etc.
However, bundling a target with the compiler immediately puts it in the path
towards stabilization. Which gives us just two cycles to find and fix any
problem with the target specification. Afterwards, it becomes hard to tweak
the specification because that could be a breaking change.
A possible solution could be "unstable built-in targets". Basically, to use an
unstable target, you'll have to also pass `-Z unstable-options` to the compiler.
And unstable targets, being unstable, wouldn't be available on stable.
> Why should this be merged?
- To let people experiment with the feature out of tree. Having easy access to
the feature (in every nightly) allows this. I also think that, as it is, it
should be possible to start prototyping type-safe single source support using
build scripts, macros and/or plugins.
- It's a straightforward implementation. No different that adding support for
any other architecture.
Add some more repos to cargotest
From suggestions at https://users.rust-lang.org/t/what-stable-rust-applications-do-you-use-frequently/7618
This adds some applications which use stable Rust and come with their own lockfiles in their respective trees.
ripgrep, xsv, and bins have 33 unique dependencies between them.
I alphabetized the list by project name because that seems tidier.
r? @brson
appveyor: Attempt to debug flaky test runs
This commit is an attempt to debug #38620 since we're unable to reproduce it
locally. It follows the [advice] of those with AppVeyor to use the `handle.exe`
tool to try to debug what processes have a handle to the file open.
This won't be guaranteed to actually help us, but hopefully it'll diagnose
something at some point?
[advice]: http://help.appveyor.com/discussions/questions/2898
This commit is an attempt to debug #38620 since we're unable to reproduce it
locally. It follows the [advice] of those with AppVeyor to use the `handle.exe`
tool to try to debug what processes have a handle to the file open.
This won't be guaranteed to actually help us, but hopefully it'll diagnose
something at some point?
[advice]: http://help.appveyor.com/discussions/questions/2898
Incrementing the `Archive::child_iterator` fetches and validates the next child.
This can trigger an error, which we previously checked on the *next* call to `LLVMRustArchiveIteratorNext()`.
This means we ignore the last error if we stop iterating halfway through.
This is harmless (we don't access the child, after all) but LLVM 4.0 calls `abort()` if *any* error goes unchecked, even a success value.
This means that basically any rustc invocation that opens an archive and searches through it would die.
The solution implemented here is to change the order of operations, such that
advancing the iterator and fetching the newly-validated iterator happens in the same `Next()` call.
This keeps the error handling behavior as before but ensures all `Error`s get checked.
This commit relegates all pretty tests to not get run by default and rather get
run as part of an "aux" test suite. This "aux" suite is renamed from the old
"cargotest" suite to just collect tests that don't need to run everywhere but
should at least pass on Unix/Windows.
* Update to ripgrep HEAD because the previous rev would still change the lock
file when `cargo build` was issued.
* Remove `bins` as it depends on OpenSSL on Windows, which won't work on our
bots
* Update rev of tokei to get a rev that doesn't change the lockfile
propagate TIME_DEPTH to the helper threads for -Z time-passes
Currently, the timing measurements for LLVM passes and the like don't come out indented, which messes up `perf.rust-lang.org`.
r? @nrc
travis: Attempt to fix Android flakiness
There's been some flaky runs on Travis where the Android emulator is having
problems staying alive... presumably? For example:
* https://travis-ci.org/rust-lang/rust/jobs/186736745
This commit spawns the emulator in the same way as buildbot with `nohup` to hope
that it goes into the background successfully, followed by a `wait-for-device`
command. I'm not actually sure if this'll fix the problems we're seeing, but I
figure it can't hurt to test out.
In an ongoing effort to optimize the runtime of the Android cross builder this
commit updates the pretty test suites to run only for host platforms, not for
target platforms as well. This means we'll still keep running all the suites but
we'll only run them for configured hosts, not for configured targets. This
notably means that we won't be running these suites on Android or musl targets,
for example.
This commit switches the rustbuild build system to compiling the
compiler twice for a normal bootstrap rather than the historical three
times.
Rust is a bootstrapped language which means that a previous version of
the compiler is used to build the next version of the compiler. Over
time, however, we change many parts of compiler artifacts such as the
metadata format, symbol names, etc. These changes make artifacts from
one compiler incompatible from another compiler. Consequently if a
compiler wants to be able to use some artifacts then it itself must have
compiled the artifacts.
Historically the rustc build system has achieved this by compiling the
compiler three times:
* An older compiler (stage0) is downloaded to kick off the chain.
* This compiler now compiles a new compiler (stage1)
* The stage1 compiler then compiles another compiler (stage2)
* Finally, the stage2 compiler needs libraries to link against, so it
compiles all the libraries again.
This entire process amounts in compiling the compiler three times.
Additionally, this process always guarantees that the Rust source tree
can compile itself because the stage2 compiler (created by a freshly
created compiler) would successfully compile itself again. This
property, ensuring Rust can compile itself, is quite important!
In general, though, this third compilation is not required for general
purpose development on the compiler. The third compiler (stage2) can
reuse the libraries that were created during the second compile. In
other words, the second compilation can produce both a compiler and the
libraries that compiler will use. These artifacts *must* be compatible
due to the way plugins work today anyway, and they were created by the
same source code so they *should* be compatible as well.
So given all that, this commit switches the default build process to
only compile the compiler three times, avoiding this third compilation
by copying artifacts from the previous one. Along the way a new entry in
the Travis matrix was also added to ensure that our full bootstrap can
succeed. This entry does not run tests, though, as it should not be
necessary.
To restore the old behavior of a full bootstrap (three compiles) you can
either pass:
./configure --enable-full-bootstrap
or if you're using config.toml:
[build]
full-bootstrap = true
Overall this will hopefully be an easy 33% win in build times of the
compiler. If we do 33% less work we should be 33% faster! This in turn
should affect cycle times and such on Travis and AppVeyor positively as
well as making it easier to work on the compiler itself.
(Minor typo fix.)
Since the word `i32` starts with a vowel, the indefinite article should use "an", not "a" \[[1](http://www.dictionary.com/browse/an)\]. (Previously there was one instance of "an i32" and two instances of "a i32", so at least something is wrong!) Since I believe that "an" is the correct form, I aligned everything with that.
[10/n] Split constants and functions' arguments into disjoint bodies.
_This is part of a series ([prev](https://github.com/rust-lang/rust/pull/38053) | [next]()) of patches designed to rework rustc into an out-of-order on-demand pipeline model for both better feature support (e.g. [MIR-based](https://github.com/solson/miri) early constant evaluation) and incremental execution of compiler passes (e.g. type-checking), with beneficial consequences to IDE support as well.
If any motivation is unclear, please ask for additional PR description clarifications or code comments._
<hr>
Finishes the signature-body split started in #37918, namely:
* `trait` items are separated just like `impl` items were, for uniformity, closing #37712
* `static`s, `const`s (including associated ones), `enum` discriminants and array lengths get bodies
* even the count in "repeat expressions", i.e. `n` in `[x; n]`, which fixes#24414
* arguments' patterns are moved to the bodies, with the types staying in `FnDecl`
* `&self` now desugars to `self: &Self` instead of `self: &_` (similarly for other `self` forms)
* `astconv`'s and metadata's (for rustdoc) informative uses are explicitly ignored for the purposes of the dep graph. this could be fixed in the future by hashing the exact information being extracted about the arguments as opposed to generating a dependency on *the whole body*
Resetting the terminal should first try `sgr0` (as per the comment), not
`sg0` which I believe to be a typo.
This will at least fix rustc output in Emacs terminals (e.g., ansi-term)
with `TERM=eterm-color` which does not provide the next fallback `sgr`. In
such a terminal, the final fallback `op` (`\e[39;49`) is used which
resets only colors, not all attributes. This causes all text to be
printed in bold from the first string printed in bold by rustc onwards,
including the terminal prompt and the output from all following commands.
The typo seems to have been introduced by #29999
A new option is introduced under the `[llvm]` section of `config.toml`,
`targets`, for overriding the list of LLVM targets to build support for.
The option is passed through to LLVM configure script. Also notes are
added about the implications of (ab)using the option; since the default
is not changed, and users of the option are expected to know what
they're doing anyway (as every porter should), the impact should be
minimal.
Fixes#38200.
rustbuild: Hotfix to unbreak nightly
Fixes an oversight unnoticed in #38468 that eventually broke nightly packaging. I didn't realize this until some moments ago, when I finally found out the failure is actually deterministic. Many apologies for eating 3 nightlies during the holidays.
r? @alexcrichton
Fix argument handling on Redox
After switching the start code to be handled in libc, we are no longer passing in slices as arguments into the libstd main function. This means that handling had to be rewritten to match the unix way of doing things.
Additional commits on this branch are going to be merged in this PR: https://github.com/rust-lang/rust/pull/38577#issuecomment-269138394
Refactor mir::dataflow: remove Ctxt associated type from BitDenotation trait
Refactor mir::dataflow: remove Ctxt associated type from BitDenotation trait
I no longer remember why I needed this (or thought I did). The way
that the `BitDenotation` is passed around in all existing use cases
(and planned future ones), the thing that were in the `Ctxt` can just
be part of `Self` instead.
(I think ariel had been pushing me to do this back when I first put in
this infrastructure; it took me a while to see how much of pain the
`Ctxt` was causing.)
On Android we only have one test thread for supposed problems with concurrency
and the remote debugger. Not all of our suites require one concurrency, however,
and suites like compile-fail or pretty can be much faster if they're
parallelized on Travis.
This commit only sets the test threads to one on Android for suites which
actually run code, and other suites aren't tampered with.
Teach `rustdoc --test` about `--sysroot`, pass it when testing rust
This permits rustdoc tests to work in stage0.
Logical continuation of #36586.
Snippet from https://github.com/rust-lang/rust/issues/38575#issuecomment-269090724:
> it should actually be possible to run all the libstd tests immediately after creating std of stage0-out - there's no reason to build librustc at all if you've just made a change to (for example) libcollections, `./x.py test src/libcollections --stage 0 -v --incremental` should just work
This PR makes it so (or appears to in my testing).
r? @alexcrichton
Add a min_atomic_width target option, like max_atomic_width
Rationale: some ISAs, e.g. OR1K, do not have atomic instructions
for byte and halfword access, and at the same time do not have
a fixed endianness, which makes it unreasonable to implement these
through word-sized atomic accesses.
`arr` is the actual list of targets participating in steps construction,
but due to #38468 the hosts array now consists of only the build triple
for the `dist` steps, hence all non-build-triple targets are lost for
the host-only rules.
Fix this by using the original non-shadowed hosts array in `arr`
calculation. This should unbreak the nightly packaging process.
Fixes#38637.