0b3311c260
Instead of reading a byte at a time in a loop, we copy the relevant bytes into a temporary vector of size eight. We can then read the value from the temporary vector with a single u64 read. LLVM seems to optimize this almost scarily well.
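
A minimal sketch of the idea, not the actual code from this commit. The function names `read_le_loop` and `read_le_copy`, the little-endian byte order, and the use of a fixed-size array as the temporary are all assumptions made for illustration.

```rust
/// Baseline: assemble the value one byte at a time in a loop.
/// (Hypothetical helper; little-endian order is assumed.)
fn read_le_loop(bytes: &[u8]) -> u64 {
    assert!(bytes.len() <= 8);
    let mut value = 0u64;
    for (i, &b) in bytes.iter().enumerate() {
        // Place byte i at bit offset 8 * i.
        value |= (b as u64) << (8 * i);
    }
    value
}

/// Copy the relevant bytes into a zeroed eight-byte temporary,
/// then read the whole value with a single u64 read.
/// (Hypothetical helper; little-endian order is assumed.)
fn read_le_copy(bytes: &[u8]) -> u64 {
    assert!(bytes.len() <= 8);
    let mut tmp = [0u8; 8];
    // Unused high bytes stay zero, so short inputs are zero-extended.
    tmp[..bytes.len()].copy_from_slice(bytes);
    u64::from_le_bytes(tmp)
}
```

Both functions return the same value for any input of up to eight bytes, for example `read_le_copy(&[0x01, 0x02, 0x03])` yields `0x030201`. The copy version has no per-byte shift loop, which may be what gives the optimizer room to emit a plain load.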