Remove the UnicodeVersion struct containing
major, minor and update fields and replace it with
a 3-tuple containing the version number.
As the value of each field is limited to 255,
use u8 to store them.
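
As a rough illustration of the change above, the version could be exposed as a plain tuple constant. The constant name `UNICODE_VERSION`, the helper `version_string`, and the value 13.0.0 are assumptions made for this sketch, not taken from the commit.

```rust
/// (major, minor, update); each component fits in a u8.
/// Name and value are illustrative assumptions.
pub const UNICODE_VERSION: (u8, u8, u8) = (13, 0, 0);

/// Hypothetical helper: render the tuple as "major.minor.update".
pub fn version_string(version: (u8, u8, u8)) -> String {
    let (major, minor, update) = version;
    format!("{}.{}.{}", major, minor, update)
}
```

Tuples compare lexicographically, so a check such as `UNICODE_VERSION >= (12, 1, 0)` works without a custom `Ord` implementation.
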
In practice, for the two data sets that still use the bitset encoding (uppercase
and lowercase) this is not a significant win, so just drop it entirely. Dropping it
costs us about 5 bytes, but the complexity it removes is nontrivial.
This arranges for the sparser sets (everything except lowercase and uppercase) to be
encoded significantly more compactly. However, it is also a performance
trade-off (roughly 3x slower than the bitset encoding). The roughly 40% size
reduction is deemed sufficiently important to merit this performance loss,
particularly as it is unlikely that this code is hot anywhere (and if it is,
paying the memory cost for a bitset that directly represents the data seems
worthwhile).

Alphabetic     : 1599 bytes (- 937 bytes)
Case_Ignorable :  949 bytes (- 822 bytes)
Cased          :  359 bytes (- 429 bytes)
Cc             :    9 bytes (-  15 bytes)
Grapheme_Extend:  813 bytes (- 675 bytes)
Lowercase      :  863 bytes
N              :  419 bytes (- 619 bytes)
Uppercase      :  776 bytes
White_Space    :   37 bytes (-  46 bytes)
Total table sizes: 5824 bytes (-3543 bytes)

LLVM seems to at least sometimes optimize better when the length comes directly
from the `len()` of the array vs. an equivalent integer.
Also, this allows easier copy/pasting of the function into compiler explorer for
experimentation.
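
A minimal sketch of the pattern, assuming a hypothetical `bit_is_set` lookup over an array of 64-bit words. It only shows the bound being read from the array itself; it makes no claim about what LLVM will actually emit.

```rust
/// Hypothetical lookup over a table of 64-bit words.
/// The bound comes from `words.len()` rather than a separately passed integer.
pub fn bit_is_set<const N: usize>(words: &[u64; N], bit: usize) -> bool {
    let idx = bit / 64;
    if idx >= words.len() {
        // Out of range: treat as "not in the set".
        return false;
    }
    (words[idx] >> (bit % 64)) & 1 != 0
}
```

Because the function takes everything it needs as parameters, it can be pasted into Compiler Explorer on its own.
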
We find that it is common for large ranges of chars to be false -- and that
means that it is plausibly common for us to ask about a word that is entirely
empty. Therefore, we should make sure that we do not need to rotate bits or
otherwise perform some operation to map to the zero word; canonicalize it first
if possible.
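
One way to picture this, as a sketch rather than the generator's actual code: when choosing which words to store raw, visit the all-zero word first so that it always becomes a canonical entry. The helpers `is_mapped_from` and `pick_canonical` are hypothetical, and the only mappings assumed here are a rotation or a bitwise inversion.

```rust
/// True if `word` can be produced from `base` by rotating it left
/// by 0..=63 bits, or by inverting all of its bits.
fn is_mapped_from(base: u64, word: u64) -> bool {
    (0..64u32).any(|r| base.rotate_left(r) == word) || !base == word
}

/// Choose the words to store raw. Sorting ascending puts the all-zero
/// word (if present) first, so it is always selected as canonical and
/// never has to be reconstructed from another word.
pub fn pick_canonical(mut words: Vec<u64>) -> Vec<u64> {
    words.sort();
    words.dedup();
    let mut canonical: Vec<u64> = Vec::new();
    for w in words {
        // Keep `w` raw only if no already-chosen word maps to it.
        if !canonical.iter().any(|&base| is_mapped_from(base, w)) {
            canonical.push(w);
        }
    }
    canonical
}
```

With this ordering, the all-ones word ends up mapped from zero (by inversion) instead of the other way around, so a lookup that hits the zero word needs no extra work.
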
Previously, all words in the (deduplicated) bitset would be stored raw -- a full
64 bits (8 bytes). Now, those words that are equivalent to others through a
specific mapping are stored separately and "mapped" to the original when
loading; this shrinks the table sizes significantly, as each mapped word is
stored in 2 bytes (a 4x decrease from the previous).
The new encoding is also potentially non-optimal: the "mapped" byte is
frequently repeated, as in practice many mapped words use the same base word.
Currently we only support two forms of mapping: rotation and inversion. Note
that these are both guaranteed to map transitively if at all, and supporting
mappings for which this is not true may require a more interesting algorithm for
choosing the optimal pairing.
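
The loading side of this scheme can be sketched as follows. The byte layout is an assumption made for illustration (bit 6 of the mapping byte means "invert", the low 6 bits hold a left-rotation amount), as are the names `INVERT` and `load_word`; the commit does not spell out the exact format.

```rust
/// Assumed flag in the mapping byte: invert the base word.
pub const INVERT: u8 = 1 << 6;

/// Load word `idx` from a table split into raw canonical words (8 bytes
/// each) and mapped entries (2 bytes each: base index, mapping byte).
pub fn load_word(canonical: &[u64], mapped: &[(u8, u8)], idx: usize) -> u64 {
    if let Some(&word) = canonical.get(idx) {
        // Stored raw: no mapping to undo.
        return word;
    }
    let (base, mapping) = mapped[idx - canonical.len()];
    let mut word = canonical[base as usize];
    if mapping & INVERT != 0 {
        word = !word;
    }
    // The low 6 bits give the rotation amount (0..=63).
    word.rotate_left((mapping & 0x3F) as u32)
}
```

Each mapped entry takes 2 bytes instead of 8, which is where the 4x saving per mapped word comes from.
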

Updated sizes:
Alphabetic     : 2622 bytes (- 414 bytes)
Case_Ignorable : 1803 bytes (- 330 bytes)
Cased          :  808 bytes (- 126 bytes)
Cc             :   32 bytes
Grapheme_Extend: 1508 bytes (- 252 bytes)
Lowercase      :  901 bytes (-  84 bytes)
N              : 1064 bytes (- 156 bytes)
Uppercase      :  838 bytes (-  96 bytes)
White_Space    :   91 bytes (-   6 bytes)
Total table sizes: 9667 bytes (-1464 bytes)

This avoids wasting a small amount of space for some of the data sets.
The chunk resizing is a side effect of, but not directly related to, the
changes in this commit.

Alphabetic     : 3036 bytes
Case_Ignorable : 2133 bytes (- 3 bytes)
Cased          :  934 bytes
Cc             :   32 bytes
Grapheme_Extend: 1760 bytes (-14 bytes)
Lowercase      :  985 bytes
N              : 1220 bytes (- 5 bytes)
Uppercase      :  934 bytes
White_Space    :   97 bytes
Total table sizes: 11131 bytes (-22 bytes)

Currently the test file takes a while to compile -- 30 seconds or so -- but
since it's not going to be committed, and is just for local testing, that seems
fine.
Try chunk sizes between 1 and 64, selecting the one which minimizes the number
of bytes used. 16, the previous constant, turned out to be a rather good choice,
with 5 of the 9 data sets still using it.

Alphabetic     : 3036 bytes (- 19 bytes)
Case_Ignorable : 2136 bytes
Cased          :  934 bytes
Cc             :   32 bytes (- 11 bytes)
Grapheme_Extend: 1774 bytes
Lowercase      :  985 bytes
N              : 1225 bytes (- 41 bytes)
Uppercase      :  934 bytes
White_Space    :   97 bytes (- 43 bytes)
Total table sizes: 11153 bytes (-114 bytes)

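
The chunk-size search described above can be sketched as a brute-force loop. The cost model here is a simplifying assumption, not the generator's real accounting: one byte per entry in each distinct chunk, plus one byte per chunk reference, with the input padded with index 0 to a whole number of chunks. The names `encoded_size` and `best_chunk_len` are hypothetical.

```rust
use std::collections::HashSet;

/// Bytes needed for `indices` at a given chunk length, under the
/// simplified cost model: distinct chunks are stored once, and every
/// chunk position costs one byte to reference its stored chunk.
pub fn encoded_size(indices: &[u8], chunk_len: usize) -> usize {
    // Pad with index 0 so the data splits into whole chunks.
    let mut padded = indices.to_vec();
    while padded.len() % chunk_len != 0 {
        padded.push(0);
    }
    let unique: HashSet<&[u8]> = padded.chunks(chunk_len).collect();
    unique.len() * chunk_len + padded.len() / chunk_len
}

/// Try every chunk length from 1 to 64 and keep the cheapest one.
pub fn best_chunk_len(indices: &[u8]) -> usize {
    (1..=64)
        .min_by_key(|&len| encoded_size(indices, len))
        .expect("the range 1..=64 is never empty")
}
```

Since the search space is only 64 candidates per data set, trying all of them is cheap compared with generating the tables themselves.
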
If the unicode-downloads folder already exists, we likely just fetched the data,
so don't make any further network requests. Unicode versions are released rarely
enough that this doesn't matter much in practice.
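
A minimal sketch of that check, assuming a hypothetical helper `fetch_if_missing` that receives the download step as a closure; the actual download logic is not shown in the commit message and is left to the caller here.

```rust
use std::io;
use std::path::Path;

/// Hypothetical helper: run `fetch` only when `dir` does not exist yet.
/// Returns Ok(true) if `fetch` ran, Ok(false) if it was skipped.
pub fn fetch_if_missing<F>(dir: &Path, fetch: F) -> io::Result<bool>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    if dir.exists() {
        // Assume an earlier run already populated the folder.
        return Ok(false);
    }
    fetch(dir)?;
    Ok(true)
}
```

A caller would pass `Path::new("unicode-downloads")` as `dir`; deleting that folder forces a fresh download on the next run.
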