Use fewer divisions in display u128/i128
This PR is an absolute mess, and I need to test if it improves the speed of fmt::Display for u128/i128, but I think it's correct.
It is hopefully more efficient because it cuts a u128 into at most 2 u64s and chunks by 1e16 instead of just 1e4.
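A rough sketch of the chunking idea, not the code in this PR: each u128 division by 1e16 peels off one 16-digit group that fits in a u64, so the per-digit work happens in cheaper 64-bit arithmetic. The function name `u128_to_decimal` is made up for illustration, and the sketch leans on the standard formatter for the u64 groups. Because `u128::MAX` has 39 digits, this particular version needs up to three groups; the split in the PR may differ.

```rust
/// Hypothetical helper for illustration only (not the implementation in this PR).
/// Converts a u128 to its decimal string using 16-digit groups.
fn u128_to_decimal(mut n: u128) -> String {
    // 1e16: each remainder is below 10^16 and therefore fits in a u64.
    const CHUNK: u128 = 10_000_000_000_000_000;

    // u128::MAX has 39 decimal digits, so three 16-digit groups are enough.
    let mut groups = [0u64; 3];
    let mut len = 0;
    loop {
        groups[len] = (n % CHUNK) as u64; // one u128 division per 16 digits
        n /= CHUNK;
        len += 1;
        if n == 0 {
            break;
        }
    }

    // The most significant group is printed without padding; every lower
    // group is zero-padded to exactly 16 digits.
    let mut out = groups[len - 1].to_string();
    for g in groups[..len - 1].iter().rev() {
        out.push_str(&format!("{:016}", g));
    }
    out
}
```

The result of `u128_to_decimal(x)` should match `x.to_string()` for any `x`, which makes it easy to check against the existing implementation.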
Also, I specialized the implementations for uints to always be treated as non-negative, because it bothered me that the sign was checked at all.
Do not merge until I benchmark it and also clean up the god-awful mess of spaghetti.
Based on prior work in #44583
cc: `@Dylan-DPC`
Due to work on `itoa` and the suggestion in the original issue:
r? `@dtolnay`