#!/usr/bin/env python
#
# Copyright 2011-2013 The Rust Project Developers. See the COPYRIGHT
# file at the top-level directory of this distribution and at
# http://rust-lang.org/COPYRIGHT.
#
# Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
# http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
# <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
# option. This file may not be copied, modified, or distributed
# except according to those terms.
Add libunicode; move unicode functions from core

- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
  - Unicode-aware functions now live in libunicode
    - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
      is_uppercase, is_whitespace, is_alphanumeric, is_control,
      is_digit, to_uppercase, to_lowercase
  - added width method in UnicodeChar trait
    - determines printed width of character in columns, or None if it is
      a non-NULL control character
    - takes a boolean argument indicating whether the present context is
      CJK or not (characters with 'A'mbiguous widths are double-wide in
      CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
  (libunicode)
  - functionality formerly in StrSlice that relied upon Unicode
    functionality from Char is now in UnicodeStrSlice
    - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
  - also moved Words type alias into libunicode because words method is
    in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
  libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
  combining the libunicode functionality with the Char functionality
  from libcore
  - thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str

The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:

    extern crate unicode;
    use unicode::UnicodeChar;
    use unicode::UnicodeStrSlice;
    use unicode::Words; // if you want to use the words() method

NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.

closes #15224

[breaking-change]

2014-06-30 17:04:10 -04:00
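
The width rule described in the commit message above can be sketched in Python, the language of this script. This is only an illustration, not the libunicode implementation: `char_width` is a hypothetical helper, it reads the East Asian Width property from Python's standard `unicodedata` module instead of from generated tables, and treating NULL and combining marks (Mn, Me) as zero-width is an assumption.

```python
import unicodedata


def char_width(ch, is_cjk):
    """Approximate the printed width of `ch` in terminal columns.

    Hypothetical helper for illustration; not part of unicode.py or libunicode.
    """
    if ch == "\0":
        # Assumption: NULL is the one control character that gets a width (0).
        return 0
    category = unicodedata.category(ch)
    if category == "Cc":
        # Non-NULL control characters have no printed width.
        return None
    if category in ("Mn", "Me"):
        # Assumption: non-spacing and enclosing marks occupy no column.
        return 0
    east_asian = unicodedata.east_asian_width(ch)
    if east_asian in ("W", "F"):
        # Wide and fullwidth characters always take two columns.
        return 2
    if east_asian == "A":
        # Ambiguous-width characters are double-wide only in CJK contexts.
        return 2 if is_cjk else 1
    return 1
```

Calling `char_width(ch, True)` and `char_width(ch, False)` on the same character shows the effect of the boolean argument: only characters whose East Asian Width is 'A' change between the two calls.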

# This script uses the following Unicode tables:
# - DerivedCoreProperties.txt
# - DerivedNormalizationProps.txt
# - EastAsianWidth.txt
# - auxiliary/GraphemeBreakProperty.txt
# - PropList.txt
# - ReadMe.txt
# - Scripts.txt
# - UnicodeData.txt
#
# Since this should not require frequent updates, we just store this
# out-of-line and check the unicode.rs file into git.

import fileinput, re, os, sys, operator, math

preamble = '''// Copyright 2012-2016 The Rust Project Developers. See the COPYRIGHT
// file at the top-level directory of this distribution and at
// http://rust-lang.org/COPYRIGHT.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

// NOTE: The following code was generated by "./unicode.py", do not edit directly

#![allow(missing_docs, non_upper_case_globals, non_snake_case)]
'''

# Mapping taken from Table 12 of:
# http://www.unicode.org/reports/tr44/#General_Category_Values
expanded_categories = {
    'Lu': ['LC', 'L'], 'Ll': ['LC', 'L'], 'Lt': ['LC', 'L'],
    'Lm': ['L'], 'Lo': ['L'],
    'Mn': ['M'], 'Mc': ['M'], 'Me': ['M'],
    # 'No' belongs to the grouped category 'N' (N = Nd | Nl | No)
    'Nd': ['N'], 'Nl': ['N'], 'No': ['N'],
    'Pc': ['P'], 'Pd': ['P'], 'Ps': ['P'], 'Pe': ['P'],
    'Pi': ['P'], 'Pf': ['P'], 'Po': ['P'],
    'Sm': ['S'], 'Sc': ['S'], 'Sk': ['S'], 'So': ['S'],
    'Zs': ['Z'], 'Zl': ['Z'], 'Zp': ['Z'],
    'Cc': ['C'], 'Cf': ['C'], 'Cs': ['C'], 'Co': ['C'], 'Cn': ['C'],
}

# these are the surrogate codepoints, which are not valid Rust characters
surrogate_codepoints = (0xd800, 0xdfff)

def fetch(f):
    # download the table with curl unless a local copy already exists
    if not os.path.exists(os.path.basename(f)):
        os.system("curl -O http://www.unicode.org/Public/UNIDATA/%s"
                  % f)

    # give up if the download did not produce the file
    if not os.path.exists(os.path.basename(f)):
        sys.stderr.write("cannot load %s\n" % f)
        sys.exit(1)

def is_surrogate(n):
    return surrogate_codepoints[0] <= n <= surrogate_codepoints[1]

def load_unicode_data(f):
    fetch(f)
    gencats = {}
    to_lower = {}
    to_upper = {}
    to_title = {}
    combines = {}
    canon_decomp = {}
    compat_decomp = {}

    udict = {}
    range_start = -1
    for line in fileinput.input(f):
        data = line.split(';')
        if len(data) != 15:
            continue
        cp = int(data[0], 16)
        if is_surrogate(cp):
            continue
        if range_start >= 0:
            for i in range(range_start, cp):
                udict[i] = data
            range_start = -1
        if data[1].endswith(", First>"):
            range_start = cp
            continue
        udict[cp] = data
for code in udict:
|
2016-09-17 23:06:45 -07:00
|
|
|
(code_org, name, gencat, combine, bidi,
|
2015-03-03 18:35:41 +01:00
|
|
|
decomp, deci, digit, num, mirror,
|
2016-09-17 23:10:12 -07:00
|
|
|
old, iso, upcase, lowcase, titlecase) = udict[code]
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
|
2014-02-26 13:49:56 +01:00
|
|
|
# generate char to char direct common and simple conversions
|
|
|
|
# uppercase to lowercase
|
2015-06-05 16:23:51 +02:00
|
|
|
if lowcase != "" and code_org != lowcase:
|
2015-06-05 17:40:09 +02:00
|
|
|
to_lower[code] = (int(lowcase, 16), 0, 0)
|
2014-02-26 13:49:56 +01:00
|
|
|
|
|
|
|
# lowercase to uppercase
|
2015-06-05 16:23:51 +02:00
|
|
|
if upcase != "" and code_org != upcase:
|
2015-06-05 17:40:09 +02:00
|
|
|
to_upper[code] = (int(upcase, 16), 0, 0)
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2015-06-05 19:20:09 +02:00
|
|
|
# title case
|
|
|
|
if titlecase.strip() != "" and code_org != titlecase:
|
|
|
|
to_title[code] = (int(titlecase, 16), 0, 0)
|
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# store decomposition, if given
|
2011-12-29 17:24:04 -08:00
|
|
|
if decomp != "":
|
|
|
|
if decomp.startswith('<'):
|
|
|
|
seq = []
|
|
|
|
for i in decomp.split()[1:]:
|
|
|
|
seq.append(int(i, 16))
|
|
|
|
compat_decomp[code] = seq
|
|
|
|
else:
|
|
|
|
seq = []
|
|
|
|
for i in decomp.split():
|
|
|
|
seq.append(int(i, 16))
|
|
|
|
canon_decomp[code] = seq
|
|
|
|
|
2014-06-30 17:04:10 -04:00
|
|
|
# place letter in categories as appropriate
|
2014-07-11 17:23:45 -04:00
|
|
|
for cat in [gencat, "Assigned"] + expanded_categories.get(gencat, []):
|
2014-06-30 17:04:10 -04:00
|
|
|
if cat not in gencats:
|
|
|
|
gencats[cat] = []
|
|
|
|
gencats[cat].append(code)
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2014-06-30 17:04:10 -04:00
|
|
|
# record combining class, if any
|
|
|
|
if combine != "0":
|
|
|
|
if combine not in combines:
|
|
|
|
combines[combine] = []
|
|
|
|
combines[combine].append(code)
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2014-07-11 17:23:45 -04:00
|
|
|
# generate Not_Assigned from Assigned
|
|
|
|
gencats["Cn"] = gen_unassigned(gencats["Assigned"])
|
|
|
|
# Assigned is not a real category
|
|
|
|
del gencats["Assigned"]
|
|
|
|
# Other contains Not_Assigned
|
|
|
|
gencats["C"].extend(gencats["Cn"])
|
2014-06-30 17:04:10 -04:00
|
|
|
gencats = group_cats(gencats)
|
|
|
|
combines = to_combines(group_cats(combines))
|
2011-12-29 17:24:04 -08:00
|
|
|
|
2015-06-05 19:20:09 +02:00
|
|
|
return (canon_decomp, compat_decomp, gencats, combines, to_upper, to_lower, to_title)
|
2013-08-11 01:57:59 +02:00
|
|
|
|
2015-06-05 19:20:09 +02:00
|
|
|
def load_special_casing(f, to_upper, to_lower, to_title):
|
2015-06-05 17:40:09 +02:00
|
|
|
fetch(f)
|
|
|
|
for line in fileinput.input(f):
|
|
|
|
data = line.split('#')[0].split(';')
|
|
|
|
if len(data) == 5:
|
|
|
|
code, lower, title, upper, _comment = data
|
|
|
|
elif len(data) == 6:
|
|
|
|
code, lower, title, upper, condition, _comment = data
|
|
|
|
if condition.strip(): # Only keep unconditional mappings
|
|
|
|
continue
|
|
|
|
else:
|
|
|
|
continue
|
|
|
|
code = code.strip()
|
|
|
|
lower = lower.strip()
|
|
|
|
title = title.strip()
|
|
|
|
upper = upper.strip()
|
|
|
|
key = int(code, 16)
|
2015-06-05 19:20:09 +02:00
|
|
|
for (map_, values) in [(to_lower, lower), (to_upper, upper), (to_title, title)]:
|
2015-06-05 17:40:09 +02:00
|
|
|
if values != code:
|
|
|
|
values = [int(i, 16) for i in values.split()]
|
|
|
|
for _ in range(len(values), 3):
|
|
|
|
values.append(0)
|
|
|
|
assert len(values) == 3
|
|
|
|
map_[key] = values
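As an illustration of the unconditional-mapping filter above, here is a minimal sketch that parses a single line in the SpecialCasing.txt field layout. The helper name `parse_special_casing_line` and the sample lines in the usage note are assumptions made for illustration; they are not part of the script.

```python
def parse_special_casing_line(line):
    """Return (code point, {kind: [cp, cp, cp]}), or None if the line is skipped."""
    # Drop the trailing comment, then split into ';'-separated fields.
    data = line.split('#')[0].split(';')
    if len(data) == 5:
        code, lower, title, upper, _rest = data
    elif len(data) == 6:
        code, lower, title, upper, condition, _rest = data
        if condition.strip():
            # Conditional mappings (e.g. locale-specific ones) are ignored.
            return None
    else:
        # Blank lines and comment-only lines end up here.
        return None
    code = code.strip()
    mappings = {}
    for kind, values in (("lower", lower), ("title", title), ("upper", upper)):
        values = values.strip()
        if values != code:
            cps = [int(v, 16) for v in values.split()]
            # Pad with zeros to a fixed width of three code points.
            cps += [0] * (3 - len(cps))
            assert len(cps) == 3
            mappings[kind] = cps
    return int(code, 16), mappings
```

For a line such as `00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S`, the sketch keeps the title and upper mappings and drops the lower one, since it equals the code point itself. A line with a non-empty condition field, such as `Final_Sigma`, is skipped entirely.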
|
|
|
|
|
2014-06-30 17:04:10 -04:00
|
|
|
def group_cats(cats):
|
|
|
|
cats_out = {}
|
|
|
|
for cat in cats:
|
|
|
|
cats_out[cat] = group_cat(cats[cat])
|
|
|
|
return cats_out
|
|
|
|
|
|
|
|
def group_cat(cat):
|
|
|
|
cat_out = []
|
|
|
|
letters = sorted(set(cat))
|
|
|
|
cur_start = letters.pop(0)
|
|
|
|
cur_end = cur_start
|
|
|
|
for letter in letters:
|
|
|
|
assert letter > cur_end, \
|
|
|
|
"cur_end: %s, letter: %s" % (hex(cur_end), hex(letter))
|
|
|
|
if letter == cur_end + 1:
|
|
|
|
cur_end = letter
|
2013-08-11 01:57:59 +02:00
|
|
|
else:
|
2014-06-30 17:04:10 -04:00
|
|
|
cat_out.append((cur_start, cur_end))
|
|
|
|
cur_start = cur_end = letter
|
|
|
|
cat_out.append((cur_start, cur_end))
|
|
|
|
return cat_out
|
|
|
|
|
|
|
|
def ungroup_cat(cat):
|
|
|
|
cat_out = []
|
|
|
|
for (lo, hi) in cat:
|
|
|
|
while lo <= hi:
|
|
|
|
cat_out.append(lo)
|
|
|
|
lo += 1
|
|
|
|
return cat_out
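The two helpers above are inverses of each other: one collapses code points into inclusive `(lo, hi)` ranges, the other expands ranges back into code points. The sketch below restates the idea under hypothetical names (`group`, `ungroup`). Unlike `group_cat`, it tolerates an empty input, which is a deliberate simplification and not the script's behaviour.

```python
def group(code_points):
    """Collapse code points into sorted, inclusive (lo, hi) ranges."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            # Extend the current run by one code point.
            ranges[-1] = (ranges[-1][0], cp)
        else:
            # Start a new run.
            ranges.append((cp, cp))
    return ranges


def ungroup(ranges):
    """Expand inclusive (lo, hi) ranges back into individual code points."""
    return [cp for lo, hi in ranges for cp in range(lo, hi + 1)]


ranges = group([0x41, 0x42, 0x43, 0x61, 0x30])
# A round trip merges ranges that touch each other.
merged = group(ungroup([(0x41, 0x43), (0x44, 0x46)]))
```

Here `ranges` is `[(0x30, 0x30), (0x41, 0x43), (0x61, 0x61)]` and `merged` is `[(0x41, 0x46)]`, which is why a round trip through both helpers can serve as a range-merging step.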
|
|
|
|
|
2014-07-11 17:23:45 -04:00
|
|
|
def gen_unassigned(assigned):
|
|
|
|
assigned = set(assigned)
|
|
|
|
return ([i for i in range(0, 0xd800) if i not in assigned] +
|
|
|
|
[i for i in range(0xe000, 0x110000) if i not in assigned])
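`gen_unassigned` takes the complement of the assigned set over the whole code space while leaving out 0xD800..0xDFFF, the UTF-16 surrogate range, presumably because a Rust `char` can never hold a surrogate. A minimal sketch of the same computation, with the hypothetical names `SURROGATES` and `unassigned`:

```python
# UTF-16 surrogates; these are not valid Unicode scalar values.
SURROGATES = range(0xD800, 0xE000)


def unassigned(assigned, limit=0x110000):
    """Return every non-surrogate code point below limit that is not assigned."""
    assigned = set(assigned)
    return [cp for cp in range(limit)
            if cp not in assigned and cp not in SURROGATES]


result = unassigned({0x41, 0x42})
```

With only two code points assigned, `result` holds `0x110000 - 0x800 - 2` entries: the full code space minus the 0x800 surrogates and the two assigned values.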
|
|
|
|
|
2014-06-30 17:04:10 -04:00
|
|
|
def to_combines(combs):
|
|
|
|
combs_out = []
|
|
|
|
for comb in combs:
|
|
|
|
for (lo, hi) in combs[comb]:
|
|
|
|
combs_out.append((lo, hi, comb))
|
|
|
|
combs_out.sort(key=lambda comb: comb[0])
|
|
|
|
return combs_out
|
2013-08-11 01:57:59 +02:00
|
|
|
|
2014-06-30 17:04:10 -04:00
|
|
|
def format_table_content(f, content, indent):
|
|
|
|
line = " "*indent
|
|
|
|
first = True
|
|
|
|
for chunk in content.split(","):
|
|
|
|
if len(line) + len(chunk) < 98:
|
|
|
|
if first:
|
|
|
|
line += chunk
|
|
|
|
else:
|
|
|
|
line += ", " + chunk
|
|
|
|
first = False
|
|
|
|
else:
|
|
|
|
f.write(line + ",\n")
|
|
|
|
line = " "*indent + chunk
|
|
|
|
f.write(line)
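The wrapping logic in `format_table_content` is easier to follow when it returns lines instead of writing to a file. The sketch below does that under a hypothetical name (`wrap_chunks`) and with the width passed in as a parameter in place of the fixed 98. It keeps the same comparison as the function above, which does not count the `", "` separator, so a line can run slightly past the limit.

```python
def wrap_chunks(chunks, indent, width):
    """Pack chunks into comma-separated lines, starting a new line when full."""
    lines = []
    line = " " * indent
    first = True
    for chunk in chunks:
        if len(line) + len(chunk) < width:
            # Only the very first chunk is added without a separator.
            line += chunk if first else ", " + chunk
            first = False
        else:
            # Flush the full line with a trailing comma and start a new one.
            lines.append(line + ",")
            line = " " * indent + chunk
    lines.append(line)
    return lines


wrapped = wrap_chunks(["aa", "bb", "cc"], indent=2, width=8)
```

With these inputs `wrapped` contains two lines: the first holds `aa, bb` followed by a trailing comma, and the second holds `cc`, each indented by two spaces.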
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2013-11-26 06:15:45 +01:00
|
|
|
def load_properties(f, interestingprops):
|
2011-12-23 18:48:08 -08:00
|
|
|
fetch(f)
|
2013-11-26 06:15:45 +01:00
|
|
|
props = {}
|
2015-04-16 15:38:35 -04:00
|
|
|
re1 = re.compile(r"^ *([0-9A-F]+) *; *(\w+)")
|
|
|
|
re2 = re.compile(r"^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)")
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2015-04-06 19:42:18 -04:00
|
|
|
for line in fileinput.input(os.path.basename(f)):
|
2011-12-23 18:48:08 -08:00
|
|
|
prop = None
|
|
|
|
d_lo = 0
|
|
|
|
d_hi = 0
|
|
|
|
m = re1.match(line)
|
|
|
|
if m:
|
|
|
|
d_lo = m.group(1)
|
|
|
|
d_hi = m.group(1)
|
|
|
|
prop = m.group(2)
|
|
|
|
else:
|
|
|
|
m = re2.match(line)
|
|
|
|
if m:
|
|
|
|
d_lo = m.group(1)
|
|
|
|
d_hi = m.group(2)
|
|
|
|
prop = m.group(3)
|
|
|
|
else:
|
|
|
|
continue
|
2014-06-30 17:04:10 -04:00
|
|
|
if interestingprops and prop not in interestingprops:
|
2011-12-23 18:48:08 -08:00
|
|
|
continue
|
|
|
|
d_lo = int(d_lo, 16)
|
|
|
|
d_hi = int(d_hi, 16)
|
2013-11-26 06:15:45 +01:00
|
|
|
if prop not in props:
|
|
|
|
props[prop] = []
|
|
|
|
props[prop].append((d_lo, d_hi))
|
2015-04-16 15:38:35 -04:00
|
|
|
|
|
|
|
# optimize if possible
|
|
|
|
for prop in props:
|
|
|
|
props[prop] = group_cat(ungroup_cat(props[prop]))
|
|
|
|
|
2013-11-26 06:15:45 +01:00
|
|
|
return props
|
2011-12-23 18:48:08 -08:00
|
|
|
|
|
|
|
def escape_char(c):
|
2015-06-05 17:40:09 +02:00
|
|
|
return "'\\u{%x}'" % c if c != 0 else "'\\0'"
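To make the output format concrete, the sketch below repeats the expression from `escape_char` under a hypothetical name (`escape`) and formats one range the way the default `pfun` of `emit_table` does. The sample code points are chosen for illustration only.

```python
def escape(c):
    """Format a code point as a Rust char literal; NUL gets the short form."""
    return "'\\u{%x}'" % c if c != 0 else "'\\0'"


samples = [escape(0), escape(0x41), escape(0x1F600)]
# One table entry: a (lo, hi) pair of char literals.
pair = "(%s,%s)" % (escape(0x41), escape(0x5A))
```

The literals in `samples` are `'\0'`, `'\u{41}'` and `'\u{1f600}'`, and `pair` is `('\u{41}','\u{5a}')`. The hex digits come out lowercase because of the `%x` format.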
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2013-01-08 08:44:31 -08:00
|
|
|
def emit_bsearch_range_table(f):
|
|
|
|
f.write("""
|
2015-10-25 11:19:14 +01:00
|
|
|
fn bsearch_range_table(c: char, r: &'static [(char, char)]) -> bool {
|
2014-11-28 11:57:41 -05:00
|
|
|
use core::cmp::Ordering::{Equal, Less, Greater};
|
2015-10-25 11:19:14 +01:00
|
|
|
r.binary_search_by(|&(lo, hi)| {
|
2016-01-04 17:35:06 +01:00
|
|
|
if c < lo {
|
|
|
|
Greater
|
2015-10-25 11:19:14 +01:00
|
|
|
} else if hi < c {
|
|
|
|
Less
|
|
|
|
} else {
|
2016-01-04 17:35:06 +01:00
|
|
|
Equal
|
2015-10-25 11:19:14 +01:00
|
|
|
}
|
|
|
|
})
|
|
|
|
.is_ok()
|
2014-05-12 19:56:41 +02:00
|
|
|
}\n
|
2014-06-30 17:04:10 -04:00
|
|
|
""")
|
|
|
|
|
|
|
|
def emit_table(f, name, t_data, t_type="&'static [(char, char)]", is_pub=True,
|
|
|
|
pfun=lambda x: "(%s,%s)" % (escape_char(x[0]), escape_char(x[1]))):
|
|
|
|
pub_string = ""
|
|
|
|
if is_pub:
|
|
|
|
pub_string = "pub "
|
2015-02-27 15:36:53 +01:00
|
|
|
f.write(" %sconst %s: %s = &[\n" % (pub_string, name, t_type))
|
2014-06-30 17:04:10 -04:00
|
|
|
data = ""
|
|
|
|
first = True
|
|
|
|
for dat in t_data:
|
|
|
|
if not first:
|
|
|
|
data += ","
|
|
|
|
first = False
|
|
|
|
data += pfun(dat)
|
|
|
|
format_table_content(f, data, 8)
|
|
|
|
f.write("\n ];\n\n")
|
2013-01-08 08:44:31 -08:00
|
|
|
|
2016-04-19 12:25:28 -07:00
|
|
|
def emit_trie_lookup_range_table(f):
|
|
|
|
f.write("""
|
2016-04-20 21:56:35 -07:00
|
|
|
|
|
|
|
// BoolTrie is a trie for representing a set of Unicode codepoints. It is
|
|
|
|
// implemented with postfix compression (sharing of identical child nodes),
|
|
|
|
// which gives both compact size and fast lookup.
|
|
|
|
//
|
|
|
|
// The space of Unicode codepoints is divided into 3 subareas, each
|
|
|
|
// represented by a trie with different depth. In the first (0..0x800), there
|
|
|
|
// is no trie structure at all; each u64 entry corresponds to a bitvector
|
|
|
|
// effectively holding 64 bool values.
|
|
|
|
//
|
|
|
|
// In the second (0x800..0x10000), each child of the root node represents a
|
|
|
|
// 64-wide subrange, but instead of storing the full 64-bit value of the leaf,
|
|
|
|
// the trie stores an 8-bit index into a shared table of leaf values. This
|
|
|
|
// exploits the fact that in reasonable sets, many such leaves can be shared.
|
|
|
|
//
|
|
|
|
// In the third (0x10000..0x110000), each child of the root node represents a
|
|
|
|
// 4096-wide subrange, and the trie stores an 8-bit index into a 64-byte slice
|
|
|
|
// of a child tree. Each of these 64 bytes represents an index into the table
|
|
|
|
// of shared 64-bit leaf values. This exploits the sparse structure in the
|
|
|
|
// non-BMP range of most Unicode sets.
|
2016-04-19 12:25:28 -07:00
|
|
|
pub struct BoolTrie {
|
|
|
|
// 0..0x800 (corresponding to 1 and 2 byte utf-8 sequences)
|
|
|
|
r1: [u64; 32], // leaves
|
|
|
|
|
|
|
|
// 0x800..0x10000 (corresponding to 3 byte utf-8 sequences)
|
2016-04-20 21:56:35 -07:00
|
|
|
r2: [u8; 992], // first level
|
2016-04-19 12:25:28 -07:00
|
|
|
r3: &'static [u64], // leaves
|
|
|
|
|
|
|
|
// 0x10000..0x110000 (corresponding to 4 byte utf-8 sequences)
|
2016-04-20 21:56:35 -07:00
|
|
|
r4: [u8; 256], // first level
|
2016-04-19 12:25:28 -07:00
|
|
|
r5: &'static [u8], // second level
|
|
|
|
r6: &'static [u64], // leaves
|
|
|
|
}
|
|
|
|
|
|
|
|
fn trie_range_leaf(c: usize, bitmap_chunk: u64) -> bool {
|
|
|
|
((bitmap_chunk >> (c & 63)) & 1) != 0
|
|
|
|
}
|
|
|
|
|
|
|
|
fn trie_lookup_range_table(c: char, r: &'static BoolTrie) -> bool {
|
|
|
|
let c = c as usize;
|
|
|
|
if c < 0x800 {
|
2016-04-19 12:52:23 -07:00
|
|
|
trie_range_leaf(c, r.r1[c >> 6])
|
2016-04-19 12:25:28 -07:00
|
|
|
} else if c < 0x10000 {
|
2016-04-20 21:56:35 -07:00
|
|
|
let child = r.r2[(c >> 6) - 0x20];
|
2016-04-19 12:25:28 -07:00
|
|
|
trie_range_leaf(c, r.r3[child as usize])
|
|
|
|
} else {
|
2016-04-20 21:56:35 -07:00
|
|
|
let child = r.r4[(c >> 12) - 0x10];
|
2016-04-19 12:25:28 -07:00
|
|
|
let leaf = r.r5[((child as usize) << 6) + ((c >> 6) & 0x3f)];
|
|
|
|
trie_range_leaf(c, r.r6[leaf as usize])
|
|
|
|
}
|
2017-01-02 13:52:20 +01:00
|
|
|
}
|
|
|
|
|
|
|
|
pub struct SmallBoolTrie {
|
|
|
|
r1: &'static [u8], // first level
|
|
|
|
r2: &'static [u64], // leaves
|
|
|
|
}
|
|
|
|
|
|
|
|
impl SmallBoolTrie {
|
|
|
|
fn lookup(&self, c: char) -> bool {
|
|
|
|
let c = c as usize;
|
|
|
|
match self.r1.get(c >> 6) {
|
|
|
|
Some(&child) => trie_range_leaf(c, self.r2[child as usize]),
|
|
|
|
None => false,
|
|
|
|
}
|
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2016-04-19 12:25:28 -07:00
|
|
|
""")
|
|
|
|
|
|
|
|
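# Split rawdata into chunksize-wide chunks and deduplicate them: root[i] is the
# index of chunk i among the unique chunks, which are stored back to back in child_data.
def compute_trie(rawdata, chunksize):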
|
|
|
|
root = []
|
|
|
|
childmap = {}
|
|
|
|
child_data = []
|
2017-11-16 12:39:20 -05:00
|
|
|
for i in range(len(rawdata) // chunksize):
|
2016-04-19 12:25:28 -07:00
|
|
|
data = rawdata[i * chunksize: (i + 1) * chunksize]
|
|
|
|
child = '|'.join(map(str, data))
|
|
|
|
if child not in childmap:
|
|
|
|
childmap[child] = len(childmap)
|
|
|
|
child_data.extend(data)
|
|
|
|
root.append(childmap[child])
|
|
|
|
return (root, child_data)
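To make the deduplication concrete, here is a small round-trip sketch. The `expand` helper is hypothetical (it is not part of this script), and `compute_trie` is repeated in condensed form, keyed on tuples rather than joined strings, only so the snippet runs on its own.

```python
def compute_trie(rawdata, chunksize):
    """Condensed copy of the function above, keyed on tuples instead of strings."""
    root, childmap, child_data = [], {}, []
    for i in range(len(rawdata) // chunksize):
        data = tuple(rawdata[i * chunksize:(i + 1) * chunksize])
        if data not in childmap:
            childmap[data] = len(childmap)
            child_data.extend(data)
        root.append(childmap[data])
    return root, child_data


def expand(root, child_data, chunksize):
    """Hypothetical inverse: rebuild the flat input from (root, child_data)."""
    out = []
    for child in root:
        out.extend(child_data[child * chunksize:(child + 1) * chunksize])
    return out


raw = [7, 7, 0, 0, 7, 7, 0, 0]
root, child_data = compute_trie(raw, 2)
print(root)        # [0, 1, 0, 1]
print(child_data)  # [7, 7, 0, 0]
```

Four input chunks collapse to two stored chunks, and `expand(root, child_data, 2)` gives back `raw`.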
|
|
|
|
|
|
|
|
def emit_bool_trie(f, name, t_data, is_pub=True):
|
|
|
|
CHUNK = 64
|
2016-09-17 23:10:12 -07:00
|
|
|
rawdata = [False] * 0x110000
|
2016-04-19 12:25:28 -07:00
|
|
|
for (lo, hi) in t_data:
|
|
|
|
for cp in range(lo, hi + 1):
|
|
|
|
rawdata[cp] = True
|
|
|
|
|
|
|
|
# convert to bitmap chunks of 64 bits each
|
|
|
|
chunks = []
|
2017-11-16 12:39:20 -05:00
|
|
|
for i in range(0x110000 // CHUNK):
|
2016-04-19 12:25:28 -07:00
|
|
|
chunk = 0
|
|
|
|
for j in range(CHUNK):
|
|
|
|
if rawdata[i * CHUNK + j]:
|
|
|
|
chunk |= 1 << j
|
|
|
|
chunks.append(chunk)
|
|
|
|
|
|
|
|
pub_string = ""
|
|
|
|
if is_pub:
|
|
|
|
pub_string = "pub "
|
|
|
|
f.write(" %sconst %s: &'static super::BoolTrie = &super::BoolTrie {\n" % (pub_string, name))
|
|
|
|
f.write(" r1: [\n")
|
2017-11-16 12:39:20 -05:00
|
|
|
data = ','.join('0x%016x' % chunk for chunk in chunks[0:0x800 // CHUNK])
|
2016-04-19 12:25:28 -07:00
|
|
|
format_table_content(f, data, 12)
|
|
|
|
f.write("\n ],\n")
|
|
|
|
|
|
|
|
# 0x800..0x10000 trie
|
2017-11-16 12:39:20 -05:00
|
|
|
(r2, r3) = compute_trie(chunks[0x800 // CHUNK : 0x10000 // CHUNK], 64 // CHUNK)
|
2016-04-19 12:25:28 -07:00
|
|
|
f.write(" r2: [\n")
|
2016-04-20 21:56:35 -07:00
|
|
|
data = ','.join(str(node) for node in r2)
|
2016-04-19 12:25:28 -07:00
|
|
|
format_table_content(f, data, 12)
|
|
|
|
f.write("\n ],\n")
|
|
|
|
f.write(" r3: &[\n")
|
|
|
|
data = ','.join('0x%016x' % chunk for chunk in r3)
|
|
|
|
format_table_content(f, data, 12)
|
|
|
|
f.write("\n ],\n")
|
|
|
|
|
|
|
|
# 0x10000..0x110000 trie
|
2017-11-16 12:39:20 -05:00
|
|
|
(mid, r6) = compute_trie(chunks[0x10000 // CHUNK : 0x110000 // CHUNK], 64 // CHUNK)
|
2016-04-19 12:25:28 -07:00
|
|
|
(r4, r5) = compute_trie(mid, 64)
|
|
|
|
f.write(" r4: [\n")
|
2016-04-20 21:56:35 -07:00
|
|
|
data = ','.join(str(node) for node in r4)
|
2016-04-19 12:25:28 -07:00
|
|
|
format_table_content(f, data, 12)
|
|
|
|
f.write("\n ],\n")
|
|
|
|
f.write(" r5: &[\n")
|
|
|
|
data = ','.join(str(node) for node in r5)
|
|
|
|
format_table_content(f, data, 12)
|
|
|
|
f.write("\n ],\n")
|
|
|
|
f.write(" r6: &[\n")
|
|
|
|
data = ','.join('0x%016x' % chunk for chunk in r6)
|
|
|
|
format_table_content(f, data, 12)
|
|
|
|
f.write("\n ],\n")
|
|
|
|
|
|
|
|
f.write(" };\n\n")
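The table layout written by `emit_bool_trie` and the emitted Rust `trie_lookup_range_table` can be modelled together in plain Python. This is a sketch for checking the index arithmetic, not code from the script: `build_bool_trie` and `trie_lookup` are hypothetical names, and the sample ranges are arbitrary.

```python
def compute_trie(rawdata, chunksize):
    """Deduplicate fixed-size chunks; return (indices, unique chunk data)."""
    root, childmap, child_data = [], {}, []
    for i in range(len(rawdata) // chunksize):
        data = tuple(rawdata[i * chunksize:(i + 1) * chunksize])
        if data not in childmap:
            childmap[data] = len(childmap)
            child_data.extend(data)
        root.append(childmap[data])
    return root, child_data


def build_bool_trie(ranges):
    """Build the six tables r1..r6 for inclusive (lo, hi) codepoint ranges."""
    chunks = [0] * (0x110000 // 64)
    for lo, hi in ranges:
        for cp in range(lo, hi + 1):
            chunks[cp >> 6] |= 1 << (cp & 63)
    r1 = chunks[:0x800 // 64]                                   # 32 raw leaves
    r2, r3 = compute_trie(chunks[0x800 // 64:0x10000 // 64], 1)  # 992 indices
    mid, r6 = compute_trie(chunks[0x10000 // 64:], 1)            # leaf indices
    r4, r5 = compute_trie(mid, 64)                               # 256 indices
    return r1, r2, r3, r4, r5, r6


def trie_lookup(c, trie):
    """Python mirror of the emitted Rust lookup, for an integer codepoint c."""
    r1, r2, r3, r4, r5, r6 = trie
    if c < 0x800:
        leaf = r1[c >> 6]
    elif c < 0x10000:
        leaf = r3[r2[(c >> 6) - 0x20]]
    else:
        child = r4[(c >> 12) - 0x10]
        leaf = r6[r5[(child << 6) + ((c >> 6) & 0x3f)]]
    return ((leaf >> (c & 63)) & 1) != 0


trie = build_bool_trie([(0x41, 0x5A), (0x4E00, 0x4E05), (0x1F600, 0x1F64F)])
print(trie_lookup(0x41, trie), trie_lookup(0x4E06, trie), trie_lookup(0x1F600, trie))
# True False True
```

The model does not check that the deduplicated indices fit in a `u8`, which the Rust field types `[u8; 992]` and `[u8; 256]` require.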
|
|
|
|
|
2017-01-02 13:52:20 +01:00
|
|
|
def emit_small_bool_trie(f, name, t_data, is_pub=True):
|
2017-11-16 12:39:20 -05:00
|
|
|
last_chunk = max(hi // 64 for (lo, hi) in t_data)
|
2017-01-02 13:52:20 +01:00
|
|
|
n_chunks = last_chunk + 1
|
|
|
|
chunks = [0] * n_chunks
|
|
|
|
for (lo, hi) in t_data:
|
|
|
|
for cp in range(lo, hi + 1):
|
2017-11-16 12:39:20 -05:00
|
|
|
if cp // 64 >= len(chunks):
|
|
|
|
print(cp, cp // 64, len(chunks), lo, hi)
|
|
|
|
chunks[cp // 64] |= 1 << (cp & 63)
|
2017-01-02 13:52:20 +01:00
|
|
|
|
|
|
|
pub_string = ""
|
|
|
|
if is_pub:
|
|
|
|
pub_string = "pub "
|
|
|
|
f.write(" %sconst %s: &'static super::SmallBoolTrie = &super::SmallBoolTrie {\n"
|
|
|
|
% (pub_string, name))
|
|
|
|
|
|
|
|
(r1, r2) = compute_trie(chunks, 1)
|
|
|
|
|
|
|
|
f.write(" r1: &[\n")
|
|
|
|
data = ','.join(str(node) for node in r1)
|
|
|
|
format_table_content(f, data, 12)
|
|
|
|
f.write("\n ],\n")
|
|
|
|
|
|
|
|
f.write(" r2: &[\n")
|
|
|
|
data = ','.join('0x%016x' % node for node in r2)
|
|
|
|
format_table_content(f, data, 12)
|
|
|
|
f.write("\n ],\n")
|
|
|
|
|
|
|
|
f.write(" };\n\n")
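The small variant keeps only one index level and stops at the last chunk that contains a set bit, so anything past the end of `r1` is simply absent. A minimal Python model of that behaviour follows. The function names are hypothetical, and the ranges are the `White_Space` code points typed in by hand rather than loaded from `PropList.txt`.

```python
def build_small_bool_trie(ranges):
    """Return (r1, r2): per-chunk leaf indices and the unique 64-bit leaves."""
    n_chunks = max(hi // 64 for _, hi in ranges) + 1
    chunks = [0] * n_chunks
    for lo, hi in ranges:
        for cp in range(lo, hi + 1):
            chunks[cp // 64] |= 1 << (cp & 63)
    r1, r2, index = [], [], {}
    for chunk in chunks:
        if chunk not in index:
            index[chunk] = len(r2)
            r2.append(chunk)
        r1.append(index[chunk])
    return r1, r2


def small_lookup(c, r1, r2):
    """Mirror of SmallBoolTrie::lookup: out-of-range chunks are not members."""
    i = c >> 6
    if i >= len(r1):
        return False
    return ((r2[r1[i]] >> (c & 63)) & 1) != 0


WHITE_SPACE = [(0x09, 0x0D), (0x20, 0x20), (0x85, 0x85), (0xA0, 0xA0),
               (0x1680, 0x1680), (0x2000, 0x200A), (0x2028, 0x2029),
               (0x202F, 0x202F), (0x205F, 0x205F), (0x3000, 0x3000)]
r1, r2 = build_small_bool_trie(WHITE_SPACE)
print(len(r1), small_lookup(0x20, r1, r2), small_lookup(0x10000, r1, r2))
# 193 True False
```

For a sparse set like this one, most of `r1` points at the single all-zero leaf.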
|
|
|
|
|
2015-04-12 13:24:19 +12:00
|
|
|
def emit_property_module(f, mod, tbl, emit):
|
2013-01-08 08:44:31 -08:00
|
|
|
f.write("pub mod %s {\n" % mod)
|
2015-04-12 13:24:19 +12:00
|
|
|
for cat in sorted(emit):
|
2017-01-02 13:52:20 +01:00
|
|
|
if cat in ["Cc", "White_Space", "Pattern_White_Space"]:
|
|
|
|
emit_small_bool_trie(f, "%s_table" % cat, tbl[cat])
|
|
|
|
f.write(" pub fn %s(c: char) -> bool {\n" % cat)
|
|
|
|
f.write(" %s_table.lookup(c)\n" % cat)
|
|
|
|
f.write(" }\n\n")
|
|
|
|
else:
|
|
|
|
emit_bool_trie(f, "%s_table" % cat, tbl[cat])
|
|
|
|
f.write(" pub fn %s(c: char) -> bool {\n" % cat)
|
|
|
|
f.write(" super::trie_lookup_range_table(c, %s_table)\n" % cat)
|
|
|
|
f.write(" }\n\n")
|
2014-06-30 17:04:10 -04:00
|
|
|
f.write("}\n\n")
|
2013-01-08 08:44:31 -08:00
|
|
|
|
2015-06-05 19:20:09 +02:00
|
|
|
def emit_conversions_module(f, to_upper, to_lower, to_title):
|
2014-05-12 19:56:41 +02:00
|
|
|
f.write("pub mod conversions {")
|
2014-02-26 13:49:56 +01:00
|
|
|
f.write("""
|
2014-11-28 11:57:41 -05:00
|
|
|
use core::option::Option;
|
|
|
|
use core::option::Option::{Some, None};
|
2014-02-26 13:49:56 +01:00
|
|
|
|
2015-06-05 17:40:09 +02:00
|
|
|
pub fn to_lower(c: char) -> [char; 3] {
|
2015-06-05 16:23:51 +02:00
|
|
|
match bsearch_case_table(c, to_lowercase_table) {
|
2016-01-04 17:27:51 +01:00
|
|
|
None => [c, '\\0', '\\0'],
|
|
|
|
Some(index) => to_lowercase_table[index].1,
|
2014-02-26 13:49:56 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-06-05 17:40:09 +02:00
|
|
|
pub fn to_upper(c: char) -> [char; 3] {
|
2015-06-05 16:23:51 +02:00
|
|
|
match bsearch_case_table(c, to_uppercase_table) {
|
2015-06-05 17:40:09 +02:00
|
|
|
None => [c, '\\0', '\\0'],
|
2016-01-04 17:27:51 +01:00
|
|
|
Some(index) => to_uppercase_table[index].1,
|
2014-02-26 13:49:56 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-06-05 17:40:09 +02:00
|
|
|
fn bsearch_case_table(c: char, table: &'static [(char, [char; 3])]) -> Option<usize> {
|
2016-01-04 17:29:41 +01:00
|
|
|
table.binary_search_by(|&(key, _)| key.cmp(&c)).ok()
|
2014-02-26 13:49:56 +01:00
|
|
|
}
|
2014-05-12 19:56:41 +02:00
|
|
|
|
2014-06-30 17:04:10 -04:00
|
|
|
""")
|
2015-06-05 19:20:09 +02:00
|
|
|
t_type = "&'static [(char, [char; 3])]"
|
|
|
|
pfun = lambda x: "(%s,[%s,%s,%s])" % (
|
|
|
|
escape_char(x[0]), escape_char(x[1][0]), escape_char(x[1][1]), escape_char(x[1][2]))
|
2015-06-05 16:23:51 +02:00
|
|
|
emit_table(f, "to_lowercase_table",
|
2017-11-16 12:39:20 -05:00
|
|
|
sorted(to_lower.items(), key=operator.itemgetter(0)),
|
2015-06-05 19:20:09 +02:00
|
|
|
is_pub=False, t_type = t_type, pfun=pfun)
|
2015-06-05 16:23:51 +02:00
|
|
|
emit_table(f, "to_uppercase_table",
|
2017-11-16 12:39:20 -05:00
|
|
|
sorted(to_upper.items(), key=operator.itemgetter(0)),
|
2015-06-05 19:20:09 +02:00
|
|
|
is_pub=False, t_type = t_type, pfun=pfun)
|
2014-06-30 17:04:10 -04:00
|
|
|
f.write("}\n\n")
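The emitted `to_lower`/`to_upper` functions binary-search a sorted table of `(char, [char; 3])` pairs and fall back to the input character padded with NULs. The same logic in Python might look like the sketch below. The three-entry table is a hand-picked illustration, not generated data, and `to_upper` here is a stand-in for the Rust function.

```python
import bisect

# Illustrative entries only; the real table is built from UnicodeData.txt
# and SpecialCasing.txt. Entries must be sorted by key for the search to work.
TO_UPPERCASE_TABLE = [
    ("a", ["A", "\0", "\0"]),
    ("\u00df", ["S", "S", "\0"]),       # LATIN SMALL LETTER SHARP S
    ("\u0149", ["\u02bc", "N", "\0"]),  # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE
]
KEYS = [key for key, _ in TO_UPPERCASE_TABLE]


def to_upper(c):
    """Return a 3-slot mapping for c, NUL-padded, like the Rust version."""
    i = bisect.bisect_left(KEYS, c)
    if i < len(KEYS) and KEYS[i] == c:
        return TO_UPPERCASE_TABLE[i][1]
    return [c, "\0", "\0"]


def strip_padding(mapping):
    """Drop the NUL padding to get the actual replacement string."""
    return "".join(ch for ch in mapping if ch != "\0")


print(strip_padding(to_upper("\u00df")))  # SS
print(strip_padding(to_upper("z")))       # z (not in the sample table)
```

The fixed width of three slots covers the longest mappings handled here; a caller has to skip the trailing `'\0'` entries.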
|
2011-12-29 17:24:04 -08:00
|
|
|
|
2014-07-25 22:31:21 +02:00
|
|
|
def emit_norm_module(f, canon, compat, combine, norm_props):
|
2017-11-16 12:39:20 -05:00
|
|
|
canon_keys = sorted(canon.keys())
|
2011-12-29 17:24:04 -08:00
|
|
|
|
2017-11-16 12:39:20 -05:00
|
|
|
compat_keys = sorted(compat.keys())
|
2014-05-12 22:25:38 +02:00
|
|
|
|
2014-07-25 22:31:21 +02:00
|
|
|
canon_comp = {}
|
|
|
|
comp_exclusions = norm_props["Full_Composition_Exclusion"]
|
|
|
|
for char in canon_keys:
|
2017-11-16 12:39:20 -05:00
|
|
|
if any(lo <= char <= hi for lo, hi in comp_exclusions):
|
2014-07-25 22:31:21 +02:00
|
|
|
continue
|
|
|
|
decomp = canon[char]
|
|
|
|
if len(decomp) == 2:
|
2017-11-16 12:39:20 -05:00
|
|
|
if decomp[0] not in canon_comp:
|
2014-07-25 22:31:21 +02:00
|
|
|
canon_comp[decomp[0]] = []
|
|
|
|
canon_comp[decomp[0]].append( (decomp[1], char) )
|
2017-11-16 12:39:20 -05:00
|
|
|
canon_comp_keys = sorted(canon_comp.keys())
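The loop above inverts the canonical decomposition map into a composition map keyed on the first code point, skipping excluded characters and anything that does not decompose into exactly two code points. A standalone sketch with a few hand-written entries (the sample data and the `build_canon_comp` name are assumptions for illustration):

```python
def build_canon_comp(canon, comp_exclusions):
    """Map first code point -> [(second code point, composed code point), ...]."""
    canon_comp = {}
    for char in sorted(canon):
        # Excluded characters must never be produced by composition.
        if any(lo <= char <= hi for lo, hi in comp_exclusions):
            continue
        decomp = canon[char]
        if len(decomp) == 2:
            canon_comp.setdefault(decomp[0], []).append((decomp[1], char))
    return canon_comp


# Sample data: A-grave, A-acute, a two-code-point decomposition that is
# excluded, and a singleton decomposition.
canon = {
    0x00C0: [0x0041, 0x0300],
    0x00C1: [0x0041, 0x0301],
    0x0344: [0x0308, 0x0301],
    0x212B: [0x00C5],
}
comp_exclusions = [(0x0344, 0x0344)]

print(build_canon_comp(canon, comp_exclusions))
# {65: [(768, 192), (769, 193)]}
```

Only the two `A` + combining-mark pairs survive: `0x0344` is filtered by the exclusion range and `0x212B` by the length check.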
|
2014-07-25 22:31:21 +02:00
|
|
|
|
2014-06-30 17:04:10 -04:00
|
|
|
if __name__ == "__main__":
|
2014-07-11 17:23:45 -04:00
|
|
|
r = "tables.rs"
|
2014-05-12 19:56:41 +02:00
|
|
|
if os.path.exists(r):
|
2014-06-30 17:04:10 -04:00
|
|
|
os.remove(r)
|
2014-05-12 19:56:41 +02:00
|
|
|
with open(r, "w") as rf:
|
2014-06-30 17:04:10 -04:00
|
|
|
# write the file's preamble
|
2014-05-12 19:56:41 +02:00
|
|
|
rf.write(preamble)
|
2013-06-29 11:19:14 +10:00
|
|
|
|
2014-06-30 17:04:10 -04:00
|
|
|
# download and parse all the data
|
2014-10-13 13:51:43 +01:00
|
|
|
fetch("ReadMe.txt")
|
|
|
|
with open("ReadMe.txt") as readme:
|
|
|
|
pattern = r"for Version (\d+)\.(\d+)\.(\d+) of the Unicode"
|
|
|
|
unicode_version = re.search(pattern, readme.read()).groups()
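The version extraction can be tried in isolation. In this sketch the `sample` text is made up to resemble the relevant line of `ReadMe.txt`, and the pattern is written as a raw string so the backslashes reach the regex engine unchanged.

```python
import re

# Made-up stand-in for the contents of ReadMe.txt.
sample = "# This directory contains data files\n# for Version 10.0.0 of the Unicode Standard.\n"

pattern = r"for Version (\d+)\.(\d+)\.(\d+) of the Unicode"
unicode_version = re.search(pattern, sample).groups()

# groups() returns a tuple of strings, which plugs straight into %-formatting.
print("major: %s, minor: %s, micro: %s" % unicode_version)
# major: 10, minor: 0, micro: 0
```

If the text does not match, `re.search` returns `None` and the `.groups()` call raises `AttributeError`, so a changed `ReadMe.txt` layout fails loudly rather than silently.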
|
|
|
|
rf.write("""
|
2017-06-30 17:23:55 -06:00
|
|
|
/// Represents a Unicode Version.
|
|
|
|
///
|
|
|
|
/// See also: <http://www.unicode.org/versions/>
|
|
|
|
#[derive(Clone, Copy, Debug, Eq, Ord, PartialEq, PartialOrd)]
|
|
|
|
pub struct UnicodeVersion {
|
|
|
|
/// Major version.
|
|
|
|
pub major: u32,
|
|
|
|
|
|
|
|
/// Minor version.
|
|
|
|
pub minor: u32,
|
|
|
|
|
|
|
|
/// Micro (or Update) version.
|
|
|
|
pub micro: u32,
|
|
|
|
|
|
|
|
// Private field to keep struct expandable.
|
|
|
|
_priv: (),
|
|
|
|
}
|
|
|
|
|
|
|
|
/// The version of [Unicode](http://www.unicode.org/) that the Unicode parts of
|
|
|
|
/// `CharExt` and `UnicodeStrPrelude` traits are based on.
|
|
|
|
pub const UNICODE_VERSION: UnicodeVersion = UnicodeVersion {
|
|
|
|
major: %s,
|
|
|
|
minor: %s,
|
|
|
|
micro: %s,
|
|
|
|
_priv: (),
|
|
|
|
};
|
2014-10-13 13:51:43 +01:00
|
|
|
""" % unicode_version)
|
2014-06-30 17:04:10 -04:00
|
|
|
(canon_decomp, compat_decomp, gencats, combines,
|
2015-06-05 19:20:09 +02:00
|
|
|
to_upper, to_lower, to_title) = load_unicode_data("UnicodeData.txt")
|
|
|
|
load_special_casing("SpecialCasing.txt", to_upper, to_lower, to_title)
|
2015-06-06 12:34:24 +02:00
|
|
|
want_derived = ["XID_Start", "XID_Continue", "Alphabetic", "Lowercase", "Uppercase",
|
|
|
|
"Cased", "Case_Ignorable"]
|
2015-04-12 13:24:19 +12:00
|
|
|
derived = load_properties("DerivedCoreProperties.txt", want_derived)
|
2014-06-30 17:04:10 -04:00
|
|
|
scripts = load_properties("Scripts.txt", [])
|
|
|
|
props = load_properties("PropList.txt",
|
2015-11-12 02:43:43 +00:00
|
|
|
["White_Space", "Join_Control", "Noncharacter_Code_Point", "Pattern_White_Space"])
|
2014-07-25 22:31:21 +02:00
|
|
|
norm_props = load_properties("DerivedNormalizationProps.txt",
|
|
|
|
["Full_Composition_Exclusion"])
|
2014-06-30 17:04:10 -04:00
|
|
|
|
2016-04-19 12:25:28 -07:00
|
|
|
# trie_lookup_table is used in all the property modules below
|
|
|
|
emit_trie_lookup_range_table(rf)
|
|
|
|
# emit_bsearch_range_table(rf)
|
2014-06-30 17:04:10 -04:00
|
|
|
|
2015-04-12 13:24:19 +12:00
|
|
|
# category tables
|
2014-06-30 17:04:10 -04:00
|
|
|
for (name, cat, pfuns) in ("general_category", gencats, ["N", "Cc"]), \
|
|
|
|
("derived_property", derived, want_derived), \
|
2015-11-12 02:43:43 +00:00
|
|
|
("property", props, ["White_Space", "Pattern_White_Space"]):
|
2014-06-30 17:04:10 -04:00
|
|
|
    emit_property_module(rf, name, cat, pfuns)
|
|
|
|
|
|
|
|
# normalizations and conversions module
|
2014-07-25 22:31:21 +02:00
|
|
|
emit_norm_module(rf, canon_decomp, compat_decomp, combines, norm_props)
|
2015-06-05 19:20:09 +02:00
|
|
|
emit_conversions_module(rf, to_upper, to_lower, to_title)
|
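The property modules emitted above are built from tables of code point ranges. As a rough illustration of how such a table can be queried, here is a minimal sketch of a binary-search membership test over sorted, non-overlapping inclusive ranges. The names `in_range_table` and `WHITE_SPACE_SAMPLE` are hypothetical and not part of this script, the sample table is only a small subset of the `White_Space` property, and the script itself emits a trie-based lookup rather than this approach.

```python
from bisect import bisect_right


def in_range_table(cp, table):
    """Return True if code point `cp` lies in one of the (lo, hi) ranges.

    `table` is assumed to be sorted by `lo`, with inclusive,
    non-overlapping ranges.
    """
    # Collect the range starts so we can binary-search on them.
    starts = [lo for lo, _hi in table]
    # Index of the last range whose start is <= cp (or -1 if none).
    i = bisect_right(starts, cp) - 1
    if i < 0:
        return False
    lo, hi = table[i]
    return lo <= cp <= hi


# Hypothetical miniature table: a subset of the White_Space property.
WHITE_SPACE_SAMPLE = [
    (0x0009, 0x000D),  # tab through carriage return
    (0x0020, 0x0020),  # space
    (0x0085, 0x0085),  # next line
    (0x00A0, 0x00A0),  # no-break space
    (0x2000, 0x200A),  # en quad through hair space
]

if __name__ == "__main__":
    for ch in (" ", "\n", "a", "\u2003"):
        print(repr(ch), in_range_table(ord(ch), WHITE_SPACE_SAMPLE))
```

A sorted range list keeps the table compact because most properties cover long runs of adjacent code points. A trie trades some of that compactness for lookups that avoid the logarithmic search.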