Auto merge of #127226 - mat-1:optimize-siphash-round, r=nnethercote

Optimize SipHash by reordering compress instructions

This PR optimizes hashing by changing the order of instructions in the sip.rs `compress` macro so that the CPU can parallelize it better. The new order is taken directly from Fig 2.1 in [the SipHash paper](https://eprint.iacr.org/2012/351.pdf) (but with the XORs moved, which makes it a little faster). I attempted to optimize it further after this, but I think this may be the optimal instruction order. Note that this shouldn't change the behavior of hashing at all; only statements that don't depend on each other were reordered.
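
To make the data-dependency argument concrete, here is the new order written out as a standalone function (a sketch: the free-function form, the chain labels, and the comments are mine, not part of the patch):

```rust
// Sketch only: in the real code this is the `compress!` macro over the
// hasher's state words; a free function is used here for illustration.
// Chain A runs over (v0, v1) and chain B over (v2, v3). Until the two
// cross-adds, adjacent statements touch disjoint words, so an
// out-of-order CPU can issue one instruction from each chain per cycle.
fn sip_round(v0: &mut u64, v1: &mut u64, v2: &mut u64, v3: &mut u64) {
    *v0 = v0.wrapping_add(*v1); // A
    *v2 = v2.wrapping_add(*v3); // B, independent of the line above
    *v1 = v1.rotate_left(13); // A
    *v1 ^= *v0; // A
    *v3 = v3.rotate_left(16); // B
    *v3 ^= *v2; // B
    *v0 = v0.rotate_left(32); // A
    *v2 = v2.wrapping_add(*v1); // chains cross: v2 absorbs the mixed v1
    *v0 = v0.wrapping_add(*v3); // chains cross: v0 absorbs the mixed v3
    *v1 = v1.rotate_left(17);
    *v1 ^= *v2;
    *v3 = v3.rotate_left(21);
    *v3 ^= *v0;
    *v2 = v2.rotate_left(32);
}
```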

It appears that the current order hasn't changed since its original implementation from 2012 (commit fada46c421), which doesn't look like it was written with data dependencies in mind.

Running `./x bench library/core --stage 0 --test-args hash` before and after this change shows the following results:

Before:
```
benchmarks:
    hash::sip::bench_bytes_4             7.20/iter +/- 0.70
    hash::sip::bench_bytes_7             9.01/iter +/- 0.35
    hash::sip::bench_bytes_8             8.12/iter +/- 0.10
    hash::sip::bench_bytes_a_16         10.07/iter +/- 0.44
    hash::sip::bench_bytes_b_32         13.46/iter +/- 0.71
    hash::sip::bench_bytes_c_128        37.75/iter +/- 0.48
    hash::sip::bench_long_str          121.18/iter +/- 3.01
    hash::sip::bench_str_of_8_bytes     11.20/iter +/- 0.25
    hash::sip::bench_str_over_8_bytes   11.20/iter +/- 0.26
    hash::sip::bench_str_under_8_bytes   9.89/iter +/- 0.59
    hash::sip::bench_u32                 9.57/iter +/- 0.44
    hash::sip::bench_u32_keyed           6.97/iter +/- 0.10
    hash::sip::bench_u64                 8.63/iter +/- 0.07
```
After:
```
benchmarks:
    hash::sip::bench_bytes_4             6.64/iter +/- 0.14
    hash::sip::bench_bytes_7             8.19/iter +/- 0.07
    hash::sip::bench_bytes_8             8.59/iter +/- 0.68
    hash::sip::bench_bytes_a_16          9.73/iter +/- 0.49
    hash::sip::bench_bytes_b_32         12.70/iter +/- 0.06
    hash::sip::bench_bytes_c_128        32.38/iter +/- 0.20
    hash::sip::bench_long_str          102.99/iter +/- 0.82
    hash::sip::bench_str_of_8_bytes     10.71/iter +/- 0.21
    hash::sip::bench_str_over_8_bytes   11.73/iter +/- 0.17
    hash::sip::bench_str_under_8_bytes  10.33/iter +/- 0.41
    hash::sip::bench_u32                10.41/iter +/- 0.29
    hash::sip::bench_u32_keyed           9.50/iter +/- 0.30
    hash::sip::bench_u64                 8.44/iter +/- 1.09
```
I ran this on my own machine, so there's some noise, but you can tell that at least `bench_long_str` is significantly faster (121.18 → 102.99, roughly 18%).

Also, I noticed that the compiler uses the same `compress` function as the library, so I took the liberty of copying this change over there too.

Thanks `@semisol` for porting SipHash to another project, which led me to notice this issue in Rust, and for helping investigate. <3
Merged by bors on 2024-07-04 04:03:45 +00:00 as commit f6fa358a18.
2 changed files with 12 additions and 10 deletions.

In `compiler/rustc_data_structures/src/sip128.rs`:

```diff
@@ -70,18 +70,19 @@ macro_rules! compress {
     ($state:expr) => {{ compress!($state.v0, $state.v1, $state.v2, $state.v3) }};
     ($v0:expr, $v1:expr, $v2:expr, $v3:expr) => {{
         $v0 = $v0.wrapping_add($v1);
+        $v2 = $v2.wrapping_add($v3);
         $v1 = $v1.rotate_left(13);
         $v1 ^= $v0;
-        $v0 = $v0.rotate_left(32);
-        $v2 = $v2.wrapping_add($v3);
         $v3 = $v3.rotate_left(16);
         $v3 ^= $v2;
-        $v0 = $v0.wrapping_add($v3);
-        $v3 = $v3.rotate_left(21);
-        $v3 ^= $v0;
+        $v0 = $v0.rotate_left(32);
         $v2 = $v2.wrapping_add($v1);
+        $v0 = $v0.wrapping_add($v3);
         $v1 = $v1.rotate_left(17);
         $v1 ^= $v2;
+        $v3 = $v3.rotate_left(21);
+        $v3 ^= $v0;
         $v2 = $v2.rotate_left(32);
     }};
 }
```

In `library/core/src/hash/sip.rs`:

```diff
@@ -76,18 +76,19 @@ macro_rules! compress {
     ($state:expr) => {{ compress!($state.v0, $state.v1, $state.v2, $state.v3) }};
     ($v0:expr, $v1:expr, $v2:expr, $v3:expr) => {{
         $v0 = $v0.wrapping_add($v1);
+        $v2 = $v2.wrapping_add($v3);
         $v1 = $v1.rotate_left(13);
         $v1 ^= $v0;
-        $v0 = $v0.rotate_left(32);
-        $v2 = $v2.wrapping_add($v3);
         $v3 = $v3.rotate_left(16);
         $v3 ^= $v2;
-        $v0 = $v0.wrapping_add($v3);
-        $v3 = $v3.rotate_left(21);
-        $v3 ^= $v0;
+        $v0 = $v0.rotate_left(32);
         $v2 = $v2.wrapping_add($v1);
+        $v0 = $v0.wrapping_add($v3);
         $v1 = $v1.rotate_left(17);
         $v1 ^= $v2;
+        $v3 = $v3.rotate_left(21);
+        $v3 ^= $v0;
         $v2 = $v2.rotate_left(32);
     }};
 }
```
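
Both hunks are the same reorder. As a quick sanity check that the behavior really is unchanged, the two statement orders can be run side by side on pseudo-random state words; a minimal sketch (a hypothetical test, not from the PR, reusing `sip_round` from the sketch near the top as the new order):

```rust
#[test]
fn reordered_round_matches_original() {
    // Pre-PR statement order, for comparison.
    fn sip_round_old(v0: &mut u64, v1: &mut u64, v2: &mut u64, v3: &mut u64) {
        *v0 = v0.wrapping_add(*v1);
        *v1 = v1.rotate_left(13);
        *v1 ^= *v0;
        *v0 = v0.rotate_left(32);
        *v2 = v2.wrapping_add(*v3);
        *v3 = v3.rotate_left(16);
        *v3 ^= *v2;
        *v0 = v0.wrapping_add(*v3);
        *v3 = v3.rotate_left(21);
        *v3 ^= *v0;
        *v2 = v2.wrapping_add(*v1);
        *v1 = v1.rotate_left(17);
        *v1 ^= *v2;
        *v2 = v2.rotate_left(32);
    }

    // A tiny xorshift generator keeps the sketch dependency-free.
    let mut s = 0x9e3779b97f4a7c15u64;
    let mut next = || {
        s ^= s << 13;
        s ^= s >> 7;
        s ^= s << 17;
        s
    };
    for _ in 0..1000 {
        let (mut a0, mut a1, mut a2, mut a3) = (next(), next(), next(), next());
        let (mut b0, mut b1, mut b2, mut b3) = (a0, a1, a2, a3);
        sip_round_old(&mut a0, &mut a1, &mut a2, &mut a3);
        sip_round(&mut b0, &mut b1, &mut b2, &mut b3); // new order, from the sketch above
        assert_eq!((a0, a1, a2, a3), (b0, b1, b2, b3));
    }
}
```

If the reorder had violated a data dependency, the two states would diverge on essentially the first random input.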