perf: buffer SipHasher128
This is an attempt to improve Siphasher128 performance by buffering input. Although it reduces instruction count, I'm not confident the effect on wall times, or lack-thereof, is worth the change.
---
Additional notes not reflected in source comments:
* Implementation choices were guided by a combination of results from rustc-perf and micro-benchmarks, mostly the former.
* ~~I tried a couple of different struct layouts that might be more cache friendly with no obvious effect.~~ Update: a particular struct layout was chosen, but it's not critical to performance. See comments in source and discussion below.
* I suspect that buffering would be important to a SIMD-accelerated algorithm, but from what I've read and my own tests, SipHash does not seem very amenable to SIMD acceleration, at least by SSE.