Optimize decimal formatting of 128-bit integers
## Description
This PR optimizes the `udivmod_1e19` function, which is used for formatting 128-bit integers, based on the algorithm provided in \[1\]. This optimization improves performance of formatting 128-bit integers, especially on 64-bit architectures. It also slightly reduces the output binary size.
## Assembler comparison
https://godbolt.org/z/YrG5zY
## Performance
#### previous results
```
test fmt::write_u128_max ... bench: 552 ns/iter (+/- 4)
test fmt::write_u128_min ... bench: 125 ns/iter (+/- 2)
```
#### new results
```
test fmt::write_u128_max ... bench: 205 ns/iter (+/- 13)
test fmt::write_u128_min ... bench: 129 ns/iter (+/- 5)
```
## Reference
\[1\] T. Granlund and P. Montgomery, “Division by Invariant Integers Using Multiplication” in Proc. of the SIGPLAN94 Conference on Programming Language Design and Implementation, 1994, pp. 61–72