Enable parallel codegen (2 units) by default when --opt-level is 0 or 1. This
gives a minor speedup on large crates (~10%), with only a tiny slowdown (~2%)
for small ones (which usually build in under a second regardless). The current
default (no parallelization) is used when the user requests optimization
(--opt-level 2 or 3), and when the user has enabled LTO (which is incompatible
with parallel codegen).
This commit also changes the rust build system to use parallel codegen
when appropriate. This means codegen-units=4 for stage0 always, and
also for stage1 and stage2 when configured with --disable-optimize.
(Other settings use codegen-units=1 for stage1 and stage2, to get
maximum performance for release binaries.) The build system also sets
codegen-units=1 for compiletest tests (compiletest does its own
parallelization) and uses the same setting as stage2 for crate tests.