mikros/rust - rust - Gitea.pterpstra.com

Author	SHA1	Message	Date
Lzu Tao	fff822fead	Migrate to numeric associated consts	2020-06-10 01:35:47 +00:00
Pyfisch	7f4048c710	Store UNICODE_VERSION as a tuple. Remove the UnicodeVersion struct containing major, minor and update fields and replace it with a 3-tuple containing the version number. As the value of each field is limited to 255, use u8 to store them.	2020-04-11 12:56:25 +02:00
Mark Rousskov	ad679a7f43	Update the documentation comment	2020-03-27 19:02:23 -04:00
Mark Rousskov	b6bc906004	Remove separate encoding for a single nonzero-mapping byte. In practice, for the two data sets that still use the bitset encoding (uppercase and lowercase), this is not a significant win, so just drop it entirely. It costs us about 5 bytes, and the complexity is nontrivial.	2020-03-27 19:02:23 -04:00
Mark Rousskov	9c1ceece20	Add skip list based implementation for smaller encoding. This arranges for the sparser sets (everything except lower and uppercase) to be encoded in a significantly smaller context. However, it is also a performance trade-off (roughly 3x slower than the bitset encoding). The 40% size reduction is deemed to be sufficiently important to merit this performance loss, particularly as it is unlikely that this code is hot anywhere (and if it is, paying the memory cost for a bitset that directly represents the data seems worthwhile). Alphabetic: 1599 bytes (-937 bytes); Case_Ignorable: 949 bytes (-822 bytes); Cased: 359 bytes (-429 bytes); Cc: 9 bytes (-15 bytes); Grapheme_Extend: 813 bytes (-675 bytes); Lowercase: 863 bytes; N: 419 bytes (-619 bytes); Uppercase: 776 bytes; White_Space: 37 bytes (-46 bytes). Total table sizes: 5824 bytes (-3543 bytes)	2020-03-27 19:02:23 -04:00
Mark Rousskov	33b9e6f5cf	Add richer printing	2020-03-24 16:24:47 -04:00
Mark Rousskov	af243d4d91	Avoid relying on const parameters to function. LLVM seems to at least sometimes optimize better when the length comes directly from the `len()` of the array vs. an equivalent integer. Also, this allows easier copy/pasting of the function into compiler explorer for experimentation.	2020-03-21 18:01:50 -04:00
Mark Rousskov	a7ec6f8fe0	Arrange for zero to be canonical. We find that it is common for large ranges of chars to be false -- and that means that it is plausibly common for us to ask about a word that is entirely empty. Therefore, we should make sure that we do not need to rotate bits or otherwise perform some operation to map to the zero word; canonicalize it first if possible.	2020-03-21 17:53:18 -04:00
Mark Rousskov	233ab2f168	Push the byte of LAST_CHUNK_MAP into the array. This optimizes slightly better. Alphabetic: 2536 bytes; Case_Ignorable: 1771 bytes; Cased: 788 bytes; Cc: 24 bytes; Grapheme_Extend: 1488 bytes; Lowercase: 863 bytes; N: 1038 bytes; Uppercase: 776 bytes; White_Space: 83 bytes. Total table sizes: 9367 bytes (-18 bytes; 2 bytes per set)	2020-03-21 17:51:40 -04:00
Mark Rousskov	5f71d98f90	Deduplicate test and primary range_search definitions. This ensures that what we test is what we get for final results as well.	2020-03-21 15:21:31 -04:00
Mark Rousskov	7b29b70d6e	Add a right shift mapping. This saves fewer bytes, by far, and is likely not the best operator to choose. But for now, it works -- a better choice may arise later. Alphabetic: 2538 bytes (-84 bytes); Case_Ignorable: 1773 bytes (-30 bytes); Cased: 790 bytes (-18 bytes); Cc: 26 bytes (-6 bytes); Grapheme_Extend: 1490 bytes (-18 bytes); Lowercase: 865 bytes (-36 bytes); N: 1040 bytes (-24 bytes); Uppercase: 778 bytes (-60 bytes); White_Space: 85 bytes (-6 bytes). Total table sizes: 9385 bytes (-282 bytes)	2020-03-21 12:14:26 -04:00
Mark Rousskov	b0e121d9d5	Shrink bitset words through functional mapping. Previously, all words in the (deduplicated) bitset would be stored raw -- a full 64 bits (8 bytes). Now, those words that are equivalent to others through a specific mapping are stored separately and "mapped" to the original when loading; this shrinks the table sizes significantly, as each mapped word is stored in 2 bytes (a 4x decrease from the previous). The new encoding is also potentially non-optimal: the "mapped" byte is frequently repeated, as in practice many mapped words use the same base word. Currently we only support two forms of mapping: rotation and inversion. Note that these are both guaranteed to map transitively if at all, and supporting mappings for which this is not true may require a more interesting algorithm for choosing the optimal pairing. Updated sizes: Alphabetic: 2622 bytes (-414 bytes); Case_Ignorable: 1803 bytes (-330 bytes); Cased: 808 bytes (-126 bytes); Cc: 32 bytes; Grapheme_Extend: 1508 bytes (-252 bytes); Lowercase: 901 bytes (-84 bytes); N: 1064 bytes (-156 bytes); Uppercase: 838 bytes (-96 bytes); White_Space: 91 bytes (-6 bytes). Total table sizes: 9667 bytes (-1,464 bytes)	2020-03-21 11:22:00 -04:00
Mark Rousskov	6c7691a37b	Pre-pop zero chunks before mapping LAST_CHUNK_MAP. This avoids wasting a small amount of space for some of the data sets. The chunk resizing is caused by but not directly related to changes in this commit. Alphabetic: 3036 bytes; Case_Ignorable: 2133 bytes (-3 bytes); Cased: 934 bytes; Cc: 32 bytes; Grapheme_Extend: 1760 bytes (-14 bytes); Lowercase: 985 bytes; N: 1220 bytes (-5 bytes); Uppercase: 934 bytes; White_Space: 97 bytes. Total table sizes: 11131 bytes (-22 bytes)	2020-03-20 18:38:08 -04:00
Mark Rousskov	580a6342ef	Generate tests for Unicode property data. Currently the test file takes a while to compile -- 30 seconds or so -- but since it's not going to be committed, and is just for local testing, that seems fine.	2020-03-20 12:11:13 -04:00
Mark Rousskov	7c4baedb3a	Dynamically choose best chunk size. Try chunk sizes between 1 and 64, selecting the one which minimizes the number of bytes used. 16, the previous constant, turned out to be a rather good choice, with 5/9 of the datasets still using it. Alphabetic: 3036 bytes (-19 bytes); Case_Ignorable: 2136 bytes; Cased: 934 bytes; Cc: 32 bytes (-11 bytes); Grapheme_Extend: 1774 bytes; Lowercase: 985 bytes; N: 1225 bytes (-41 bytes); Uppercase: 934 bytes; White_Space: 97 bytes (-43 bytes). Total table sizes: 11153 bytes (-114 bytes)	2020-03-20 12:11:13 -04:00
Mark Rousskov	903f67d599	Avoid re-fetching Unicode data. If the unicode-downloads folder already exists, we likely just fetched the data, so don't make any further network requests. Unicode versions are released rarely enough that this doesn't matter much in practice.	2020-03-20 12:11:13 -04:00
Matthias Krüger	d3e5177f81	Use .next() instead of .nth(0) on iterators.	2020-03-03 03:15:03 +01:00
Mark Rousskov	064f8885d5	Add unicode table generator	2020-01-14 19:11:15 -05:00

18 Commits
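
Commit b0e121d9d5 describes storing some 64-bit bitset words as a "mapping" of another base word, with rotation and inversion as the two supported mappings. The sketch below only illustrates that idea and is not the generator's actual code: the `Mapping` enum and the `find_mapping` function are hypothetical names, the choice of a left rotation is an assumption, and the 2-byte on-disk encoding mentioned in the commit is not reproduced here.

```rust
/// Hypothetical description of how a word is derived from a base word.
/// The commit names rotation and inversion; this type is an illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mapping {
    /// The word equals the base word rotated left by this many bits (assumed direction).
    Rotate(u32),
    /// The word equals the bitwise complement of the base word.
    Invert,
}

impl Mapping {
    /// Reconstruct the full 64-bit word from its base word.
    fn apply(self, base: u64) -> u64 {
        match self {
            Mapping::Rotate(n) => base.rotate_left(n),
            Mapping::Invert => !base,
        }
    }
}

/// Try to express `word` as a mapping of `base`.
/// Returns `None` when neither inversion nor any rotation by 1..=63 bits matches.
fn find_mapping(base: u64, word: u64) -> Option<Mapping> {
    if !base == word {
        return Some(Mapping::Invert);
    }
    (1..64).map(Mapping::Rotate).find(|m| m.apply(base) == word)
}

fn main() {
    let base: u64 = 0x0000_0000_0000_00FF;

    // A rotated copy of the base word can be stored as a small mapping.
    let rotated = base.rotate_left(8);
    assert_eq!(find_mapping(base, rotated), Some(Mapping::Rotate(8)));

    // So can its complement.
    assert_eq!(find_mapping(base, !base), Some(Mapping::Invert));

    // An unrelated word has no mapping and would have to be stored raw.
    assert_eq!(find_mapping(base, 0x1234_5678_9ABC_DEF0), None);

    // Applying the mapping recovers the original word.
    let m = find_mapping(base, rotated).unwrap();
    assert_eq!(m.apply(base), rotated);
}
```

A word that has no mapping to any base word would still need to be stored as a full 8 bytes, which is consistent with the commit's remark that only words "equivalent to others through a specific mapping" benefit.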