2011-12-23 18:48:08 -08:00
|
|
|
#!/usr/bin/env python
|
2014-02-02 11:47:02 +01:00
|
|
|
#
|
|
|
|
# Copyright 2011-2013 The Rust Project Developers. See the COPYRIGHT
|
|
|
|
# file at the top-level directory of this distribution and at
|
|
|
|
# http://rust-lang.org/COPYRIGHT.
|
|
|
|
#
|
|
|
|
# Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
|
|
|
|
# http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
|
|
|
|
# <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
|
|
|
|
# option. This file may not be copied, modified, or distributed
|
|
|
|
# except according to those terms.
|
2011-12-23 18:48:08 -08:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# This script uses the following Unicode tables:
|
|
|
|
# - DerivedCoreProperties.txt
|
2015-04-06 19:42:18 -04:00
|
|
|
# - DerivedNormalizationProps.txt
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# - EastAsianWidth.txt
|
2015-04-06 19:42:18 -04:00
|
|
|
# - auxiliary/GraphemeBreakProperty.txt
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# - PropList.txt
|
2015-04-06 19:42:18 -04:00
|
|
|
# - ReadMe.txt
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# - Scripts.txt
|
|
|
|
# - UnicodeData.txt
|
2011-12-23 18:48:08 -08:00
|
|
|
#
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# Since this should not require frequent updates, we just store this
|
|
|
|
# out-of-line and check the unicode.rs file into git.
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2014-02-26 13:49:56 +01:00
|
|
|
import fileinput, re, os, sys, operator
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2015-04-16 15:38:35 -04:00
|
|
|
preamble = '''// Copyright 2012-2015 The Rust Project Developers. See the COPYRIGHT
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
// file at the top-level directory of this distribution and at
|
|
|
|
// http://rust-lang.org/COPYRIGHT.
|
|
|
|
//
|
|
|
|
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
|
|
|
|
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
|
|
|
|
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
|
|
|
|
// option. This file may not be copied, modified, or distributed
|
|
|
|
// except according to those terms.
|
|
|
|
|
|
|
|
// NOTE: The following code was generated by "src/etc/unicode.py", do not edit directly
|
|
|
|
|
2014-11-05 12:04:26 -02:00
|
|
|
#![allow(missing_docs, non_upper_case_globals, non_snake_case)]
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
'''
|
|
|
|
|
|
|
|
# Mapping taken from Table 12 from:
|
|
|
|
# http://www.unicode.org/reports/tr44/#General_Category_Values
|
|
|
|
expanded_categories = {
|
|
|
|
'Lu': ['LC', 'L'], 'Ll': ['LC', 'L'], 'Lt': ['LC', 'L'],
|
|
|
|
'Lm': ['L'], 'Lo': ['L'],
|
|
|
|
'Mn': ['M'], 'Mc': ['M'], 'Me': ['M'],
|
|
|
|
'Nd': ['N'], 'Nl': ['N'], 'No': ['No'],
|
|
|
|
'Pc': ['P'], 'Pd': ['P'], 'Ps': ['P'], 'Pe': ['P'],
|
|
|
|
'Pi': ['P'], 'Pf': ['P'], 'Po': ['P'],
|
|
|
|
'Sm': ['S'], 'Sc': ['S'], 'Sk': ['S'], 'So': ['S'],
|
|
|
|
'Zs': ['Z'], 'Zl': ['Z'], 'Zp': ['Z'],
|
|
|
|
'Cc': ['C'], 'Cf': ['C'], 'Cs': ['C'], 'Co': ['C'], 'Cn': ['C'],
|
|
|
|
}
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2015-04-06 19:42:18 -04:00
|
|
|
# these are the surrogate codepoints, which are not valid rust characters
|
|
|
|
surrogate_codepoints = (0xd800, 0xdfff)
|
2014-07-11 17:23:45 -04:00
|
|
|
|
2011-12-23 18:48:08 -08:00
|
|
|
def fetch(f):
|
2015-04-06 19:42:18 -04:00
|
|
|
if not os.path.exists(os.path.basename(f)):
|
2011-12-23 18:48:08 -08:00
|
|
|
os.system("curl -O http://www.unicode.org/Public/UNIDATA/%s"
|
|
|
|
% f)
|
|
|
|
|
2015-04-06 19:42:18 -04:00
|
|
|
if not os.path.exists(os.path.basename(f)):
|
2011-12-23 18:48:08 -08:00
|
|
|
sys.stderr.write("cannot load %s" % f)
|
|
|
|
exit(1)
|
|
|
|
|
2015-03-03 18:35:41 +01:00
|
|
|
def is_surrogate(n):
|
2015-04-06 19:42:18 -04:00
|
|
|
return surrogate_codepoints[0] <= n <= surrogate_codepoints[1]
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2011-12-29 17:24:04 -08:00
|
|
|
def load_unicode_data(f):
|
2011-12-23 18:48:08 -08:00
|
|
|
fetch(f)
|
|
|
|
gencats = {}
|
2015-06-05 16:23:51 +02:00
|
|
|
to_lower = {}
|
|
|
|
to_upper = {}
|
2015-06-05 19:20:09 +02:00
|
|
|
to_title = {}
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
combines = {}
|
2011-12-29 17:24:04 -08:00
|
|
|
canon_decomp = {}
|
|
|
|
compat_decomp = {}
|
2014-02-26 13:49:56 +01:00
|
|
|
|
2015-03-03 18:35:41 +01:00
|
|
|
udict = {};
|
|
|
|
range_start = -1;
|
2011-12-23 18:48:08 -08:00
|
|
|
for line in fileinput.input(f):
|
2015-03-03 18:35:41 +01:00
|
|
|
data = line.split(';');
|
|
|
|
if len(data) != 15:
|
2011-12-23 18:48:08 -08:00
|
|
|
continue
|
2015-03-03 18:35:41 +01:00
|
|
|
cp = int(data[0], 16);
|
|
|
|
if is_surrogate(cp):
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
continue
|
2015-03-03 18:35:41 +01:00
|
|
|
if range_start >= 0:
|
|
|
|
for i in xrange(range_start, cp):
|
|
|
|
udict[i] = data;
|
|
|
|
range_start = -1;
|
|
|
|
if data[1].endswith(", First>"):
|
|
|
|
range_start = cp;
|
|
|
|
continue;
|
|
|
|
udict[cp] = data;
|
|
|
|
|
|
|
|
for code in udict:
|
|
|
|
[code_org, name, gencat, combine, bidi,
|
|
|
|
decomp, deci, digit, num, mirror,
|
|
|
|
old, iso, upcase, lowcase, titlecase ] = udict[code];
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
|
2014-02-26 13:49:56 +01:00
|
|
|
# generate char to char direct common and simple conversions
|
|
|
|
# uppercase to lowercase
|
2015-06-05 16:23:51 +02:00
|
|
|
if lowcase != "" and code_org != lowcase:
|
2015-06-05 17:40:09 +02:00
|
|
|
to_lower[code] = (int(lowcase, 16), 0, 0)
|
2014-02-26 13:49:56 +01:00
|
|
|
|
|
|
|
# lowercase to uppercase
|
2015-06-05 16:23:51 +02:00
|
|
|
if upcase != "" and code_org != upcase:
|
2015-06-05 17:40:09 +02:00
|
|
|
to_upper[code] = (int(upcase, 16), 0, 0)
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2015-06-05 19:20:09 +02:00
|
|
|
# title case
|
|
|
|
if titlecase.strip() != "" and code_org != titlecase:
|
|
|
|
to_title[code] = (int(titlecase, 16), 0, 0)
|
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# store decomposition, if given
|
2011-12-29 17:24:04 -08:00
|
|
|
if decomp != "":
|
|
|
|
if decomp.startswith('<'):
|
|
|
|
seq = []
|
|
|
|
for i in decomp.split()[1:]:
|
|
|
|
seq.append(int(i, 16))
|
|
|
|
compat_decomp[code] = seq
|
|
|
|
else:
|
|
|
|
seq = []
|
|
|
|
for i in decomp.split():
|
|
|
|
seq.append(int(i, 16))
|
|
|
|
canon_decomp[code] = seq
|
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# place letter in categories as appropriate
|
2014-07-11 17:23:45 -04:00
|
|
|
for cat in [gencat, "Assigned"] + expanded_categories.get(gencat, []):
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
if cat not in gencats:
|
|
|
|
gencats[cat] = []
|
|
|
|
gencats[cat].append(code)
|
2011-12-23 18:48:08 -08:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# record combining class, if any
|
|
|
|
if combine != "0":
|
|
|
|
if combine not in combines:
|
|
|
|
combines[combine] = []
|
|
|
|
combines[combine].append(code)
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2014-07-11 17:23:45 -04:00
|
|
|
# generate Not_Assigned from Assigned
|
|
|
|
gencats["Cn"] = gen_unassigned(gencats["Assigned"])
|
|
|
|
# Assigned is not a real category
|
|
|
|
del(gencats["Assigned"])
|
|
|
|
# Other contains Not_Assigned
|
|
|
|
gencats["C"].extend(gencats["Cn"])
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
gencats = group_cats(gencats)
|
|
|
|
combines = to_combines(group_cats(combines))
|
2011-12-29 17:24:04 -08:00
|
|
|
|
2015-06-05 19:20:09 +02:00
|
|
|
return (canon_decomp, compat_decomp, gencats, combines, to_upper, to_lower, to_title)
|
2013-08-11 01:57:59 +02:00
|
|
|
|
2015-06-05 19:20:09 +02:00
|
|
|
def load_special_casing(f, to_upper, to_lower, to_title):
|
2015-06-05 17:40:09 +02:00
|
|
|
fetch(f)
|
|
|
|
for line in fileinput.input(f):
|
|
|
|
data = line.split('#')[0].split(';')
|
|
|
|
if len(data) == 5:
|
|
|
|
code, lower, title, upper, _comment = data
|
|
|
|
elif len(data) == 6:
|
|
|
|
code, lower, title, upper, condition, _comment = data
|
|
|
|
if condition.strip(): # Only keep unconditional mappins
|
|
|
|
continue
|
|
|
|
else:
|
|
|
|
continue
|
|
|
|
code = code.strip()
|
|
|
|
lower = lower.strip()
|
|
|
|
title = title.strip()
|
|
|
|
upper = upper.strip()
|
|
|
|
key = int(code, 16)
|
2015-06-05 19:20:09 +02:00
|
|
|
for (map_, values) in [(to_lower, lower), (to_upper, upper), (to_title, title)]:
|
2015-06-05 17:40:09 +02:00
|
|
|
if values != code:
|
|
|
|
values = [int(i, 16) for i in values.split()]
|
|
|
|
for _ in range(len(values), 3):
|
|
|
|
values.append(0)
|
|
|
|
assert len(values) == 3
|
|
|
|
map_[key] = values
|
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
def group_cats(cats):
|
|
|
|
cats_out = {}
|
|
|
|
for cat in cats:
|
|
|
|
cats_out[cat] = group_cat(cats[cat])
|
|
|
|
return cats_out
|
|
|
|
|
|
|
|
def group_cat(cat):
|
|
|
|
cat_out = []
|
|
|
|
letters = sorted(set(cat))
|
|
|
|
cur_start = letters.pop(0)
|
|
|
|
cur_end = cur_start
|
|
|
|
for letter in letters:
|
|
|
|
assert letter > cur_end, \
|
|
|
|
"cur_end: %s, letter: %s" % (hex(cur_end), hex(letter))
|
|
|
|
if letter == cur_end + 1:
|
|
|
|
cur_end = letter
|
2013-08-11 01:57:59 +02:00
|
|
|
else:
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
cat_out.append((cur_start, cur_end))
|
|
|
|
cur_start = cur_end = letter
|
|
|
|
cat_out.append((cur_start, cur_end))
|
|
|
|
return cat_out
|
|
|
|
|
|
|
|
def ungroup_cat(cat):
|
|
|
|
cat_out = []
|
|
|
|
for (lo, hi) in cat:
|
|
|
|
while lo <= hi:
|
|
|
|
cat_out.append(lo)
|
|
|
|
lo += 1
|
|
|
|
return cat_out
|
|
|
|
|
2014-07-11 17:23:45 -04:00
|
|
|
def gen_unassigned(assigned):
|
|
|
|
assigned = set(assigned)
|
|
|
|
return ([i for i in range(0, 0xd800) if i not in assigned] +
|
|
|
|
[i for i in range(0xe000, 0x110000) if i not in assigned])
|
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
def to_combines(combs):
|
|
|
|
combs_out = []
|
|
|
|
for comb in combs:
|
|
|
|
for (lo, hi) in combs[comb]:
|
|
|
|
combs_out.append((lo, hi, comb))
|
|
|
|
combs_out.sort(key=lambda comb: comb[0])
|
|
|
|
return combs_out
|
2013-08-11 01:57:59 +02:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
def format_table_content(f, content, indent):
|
|
|
|
line = " "*indent
|
|
|
|
first = True
|
|
|
|
for chunk in content.split(","):
|
|
|
|
if len(line) + len(chunk) < 98:
|
|
|
|
if first:
|
|
|
|
line += chunk
|
|
|
|
else:
|
|
|
|
line += ", " + chunk
|
|
|
|
first = False
|
|
|
|
else:
|
|
|
|
f.write(line + ",\n")
|
|
|
|
line = " "*indent + chunk
|
|
|
|
f.write(line)
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2013-11-26 06:15:45 +01:00
|
|
|
def load_properties(f, interestingprops):
|
2011-12-23 18:48:08 -08:00
|
|
|
fetch(f)
|
2013-11-26 06:15:45 +01:00
|
|
|
props = {}
|
2015-04-16 15:38:35 -04:00
|
|
|
re1 = re.compile("^ *([0-9A-F]+) *; *(\w+)")
|
|
|
|
re2 = re.compile("^ *([0-9A-F]+)\.\.([0-9A-F]+) *; *(\w+)")
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2015-04-06 19:42:18 -04:00
|
|
|
for line in fileinput.input(os.path.basename(f)):
|
2011-12-23 18:48:08 -08:00
|
|
|
prop = None
|
|
|
|
d_lo = 0
|
|
|
|
d_hi = 0
|
|
|
|
m = re1.match(line)
|
|
|
|
if m:
|
|
|
|
d_lo = m.group(1)
|
|
|
|
d_hi = m.group(1)
|
|
|
|
prop = m.group(2)
|
|
|
|
else:
|
|
|
|
m = re2.match(line)
|
|
|
|
if m:
|
|
|
|
d_lo = m.group(1)
|
|
|
|
d_hi = m.group(2)
|
|
|
|
prop = m.group(3)
|
|
|
|
else:
|
|
|
|
continue
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
if interestingprops and prop not in interestingprops:
|
2011-12-23 18:48:08 -08:00
|
|
|
continue
|
|
|
|
d_lo = int(d_lo, 16)
|
|
|
|
d_hi = int(d_hi, 16)
|
2013-11-26 06:15:45 +01:00
|
|
|
if prop not in props:
|
|
|
|
props[prop] = []
|
|
|
|
props[prop].append((d_lo, d_hi))
|
2015-04-16 15:38:35 -04:00
|
|
|
|
|
|
|
# optimize if possible
|
|
|
|
for prop in props:
|
|
|
|
props[prop] = group_cat(ungroup_cat(props[prop]))
|
|
|
|
|
2013-11-26 06:15:45 +01:00
|
|
|
return props
|
2011-12-23 18:48:08 -08:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# load all widths of want_widths, except those in except_cats
|
|
|
|
def load_east_asian_width(want_widths, except_cats):
|
|
|
|
f = "EastAsianWidth.txt"
|
|
|
|
fetch(f)
|
|
|
|
widths = {}
|
|
|
|
re1 = re.compile("^([0-9A-F]+);(\w+) +# (\w+)")
|
|
|
|
re2 = re.compile("^([0-9A-F]+)\.\.([0-9A-F]+);(\w+) +# (\w+)")
|
|
|
|
|
|
|
|
for line in fileinput.input(f):
|
|
|
|
width = None
|
|
|
|
d_lo = 0
|
|
|
|
d_hi = 0
|
|
|
|
cat = None
|
|
|
|
m = re1.match(line)
|
|
|
|
if m:
|
|
|
|
d_lo = m.group(1)
|
|
|
|
d_hi = m.group(1)
|
|
|
|
width = m.group(2)
|
|
|
|
cat = m.group(3)
|
|
|
|
else:
|
|
|
|
m = re2.match(line)
|
|
|
|
if m:
|
|
|
|
d_lo = m.group(1)
|
|
|
|
d_hi = m.group(2)
|
|
|
|
width = m.group(3)
|
|
|
|
cat = m.group(4)
|
|
|
|
else:
|
|
|
|
continue
|
|
|
|
if cat in except_cats or width not in want_widths:
|
|
|
|
continue
|
|
|
|
d_lo = int(d_lo, 16)
|
|
|
|
d_hi = int(d_hi, 16)
|
|
|
|
if width not in widths:
|
|
|
|
widths[width] = []
|
|
|
|
widths[width].append((d_lo, d_hi))
|
|
|
|
return widths
|
|
|
|
|
2011-12-23 18:48:08 -08:00
|
|
|
def escape_char(c):
|
2015-06-05 17:40:09 +02:00
|
|
|
return "'\\u{%x}'" % c if c != 0 else "'\\0'"
|
2011-12-23 18:48:08 -08:00
|
|
|
|
2013-01-08 08:44:31 -08:00
|
|
|
def emit_bsearch_range_table(f):
|
|
|
|
f.write("""
|
2014-03-01 07:40:38 +01:00
|
|
|
fn bsearch_range_table(c: char, r: &'static [(char,char)]) -> bool {
|
2014-11-28 11:57:41 -05:00
|
|
|
use core::cmp::Ordering::{Equal, Less, Greater};
|
2014-12-11 09:44:17 -08:00
|
|
|
use core::slice::SliceExt;
|
2015-02-27 15:36:53 +01:00
|
|
|
r.binary_search_by(|&(lo,hi)| {
|
2014-03-01 07:40:38 +01:00
|
|
|
if lo <= c && c <= hi { Equal }
|
|
|
|
else if hi < c { Less }
|
|
|
|
else { Greater }
|
2015-02-27 15:36:53 +01:00
|
|
|
}).is_ok()
|
2014-05-12 19:56:41 +02:00
|
|
|
}\n
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
""")
|
|
|
|
|
|
|
|
def emit_table(f, name, t_data, t_type = "&'static [(char, char)]", is_pub=True,
|
|
|
|
pfun=lambda x: "(%s,%s)" % (escape_char(x[0]), escape_char(x[1]))):
|
|
|
|
pub_string = ""
|
|
|
|
if is_pub:
|
|
|
|
pub_string = "pub "
|
2015-02-27 15:36:53 +01:00
|
|
|
f.write(" %sconst %s: %s = &[\n" % (pub_string, name, t_type))
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
data = ""
|
|
|
|
first = True
|
|
|
|
for dat in t_data:
|
|
|
|
if not first:
|
|
|
|
data += ","
|
|
|
|
first = False
|
|
|
|
data += pfun(dat)
|
|
|
|
format_table_content(f, data, 8)
|
|
|
|
f.write("\n ];\n\n")
|
2013-01-08 08:44:31 -08:00
|
|
|
|
2015-04-12 13:24:19 +12:00
|
|
|
def emit_property_module(f, mod, tbl, emit):
|
2013-01-08 08:44:31 -08:00
|
|
|
f.write("pub mod %s {\n" % mod)
|
2015-04-12 13:24:19 +12:00
|
|
|
for cat in sorted(emit):
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
emit_table(f, "%s_table" % cat, tbl[cat])
|
2015-04-12 13:24:19 +12:00
|
|
|
f.write(" pub fn %s(c: char) -> bool {\n" % cat)
|
|
|
|
f.write(" super::bsearch_range_table(c, %s_table)\n" % cat)
|
|
|
|
f.write(" }\n\n")
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
f.write("}\n\n")
|
2013-01-08 08:44:31 -08:00
|
|
|
|
2015-06-05 19:20:09 +02:00
|
|
|
def emit_conversions_module(f, to_upper, to_lower, to_title):
|
2014-05-12 19:56:41 +02:00
|
|
|
f.write("pub mod conversions {")
|
2014-02-26 13:49:56 +01:00
|
|
|
f.write("""
|
2014-11-28 11:57:41 -05:00
|
|
|
use core::cmp::Ordering::{Equal, Less, Greater};
|
2014-12-11 09:44:17 -08:00
|
|
|
use core::slice::SliceExt;
|
2014-11-28 11:57:41 -05:00
|
|
|
use core::option::Option;
|
|
|
|
use core::option::Option::{Some, None};
|
2015-02-27 15:36:53 +01:00
|
|
|
use core::result::Result::{Ok, Err};
|
2014-02-26 13:49:56 +01:00
|
|
|
|
2015-06-05 17:40:09 +02:00
|
|
|
pub fn to_lower(c: char) -> [char; 3] {
|
2015-06-05 16:23:51 +02:00
|
|
|
match bsearch_case_table(c, to_lowercase_table) {
|
2015-06-05 17:40:09 +02:00
|
|
|
None => [c, '\\0', '\\0'],
|
2015-06-05 16:23:51 +02:00
|
|
|
Some(index) => to_lowercase_table[index].1
|
2014-02-26 13:49:56 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-06-05 17:40:09 +02:00
|
|
|
pub fn to_upper(c: char) -> [char; 3] {
|
2015-06-05 16:23:51 +02:00
|
|
|
match bsearch_case_table(c, to_uppercase_table) {
|
2015-06-05 17:40:09 +02:00
|
|
|
None => [c, '\\0', '\\0'],
|
2015-06-05 16:23:51 +02:00
|
|
|
Some(index) => to_uppercase_table[index].1
|
2014-02-26 13:49:56 +01:00
|
|
|
}
|
|
|
|
}
|
|
|
|
|
2015-06-05 17:40:09 +02:00
|
|
|
fn bsearch_case_table(c: char, table: &'static [(char, [char; 3])]) -> Option<usize> {
|
2015-02-27 15:36:53 +01:00
|
|
|
match table.binary_search_by(|&(key, _)| {
|
2014-02-26 13:49:56 +01:00
|
|
|
if c == key { Equal }
|
|
|
|
else if key < c { Less }
|
|
|
|
else { Greater }
|
2014-08-06 20:48:25 -07:00
|
|
|
}) {
|
2015-02-27 15:36:53 +01:00
|
|
|
Ok(i) => Some(i),
|
|
|
|
Err(_) => None,
|
2014-08-06 20:48:25 -07:00
|
|
|
}
|
2014-02-26 13:49:56 +01:00
|
|
|
}
|
2014-05-12 19:56:41 +02:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
""")
|
2015-06-05 19:20:09 +02:00
|
|
|
t_type = "&'static [(char, [char; 3])]"
|
|
|
|
pfun = lambda x: "(%s,[%s,%s,%s])" % (
|
|
|
|
escape_char(x[0]), escape_char(x[1][0]), escape_char(x[1][1]), escape_char(x[1][2]))
|
2015-06-05 16:23:51 +02:00
|
|
|
emit_table(f, "to_lowercase_table",
|
2015-06-05 17:40:09 +02:00
|
|
|
sorted(to_lower.iteritems(), key=operator.itemgetter(0)),
|
2015-06-05 19:20:09 +02:00
|
|
|
is_pub=False, t_type = t_type, pfun=pfun)
|
2015-06-05 16:23:51 +02:00
|
|
|
emit_table(f, "to_uppercase_table",
|
2015-06-05 17:40:09 +02:00
|
|
|
sorted(to_upper.iteritems(), key=operator.itemgetter(0)),
|
2015-06-05 19:20:09 +02:00
|
|
|
is_pub=False, t_type = t_type, pfun=pfun)
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
f.write("}\n\n")
|
2011-12-29 17:24:04 -08:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
def emit_charwidth_module(f, width_table):
|
|
|
|
f.write("pub mod charwidth {\n")
|
2014-11-28 11:57:41 -05:00
|
|
|
f.write(" use core::option::Option;\n")
|
|
|
|
f.write(" use core::option::Option::{Some, None};\n")
|
2014-12-11 09:44:17 -08:00
|
|
|
f.write(" use core::slice::SliceExt;\n")
|
2015-02-27 15:36:53 +01:00
|
|
|
f.write(" use core::result::Result::{Ok, Err};\n")
|
2013-08-07 20:48:10 +02:00
|
|
|
f.write("""
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
fn bsearch_range_value_table(c: char, is_cjk: bool, r: &'static [(char, char, u8, u8)]) -> u8 {
|
2014-11-28 11:57:41 -05:00
|
|
|
use core::cmp::Ordering::{Equal, Less, Greater};
|
2015-02-27 15:36:53 +01:00
|
|
|
match r.binary_search_by(|&(lo, hi, _, _)| {
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
if lo <= c && c <= hi { Equal }
|
|
|
|
else if hi < c { Less }
|
2013-08-07 20:48:10 +02:00
|
|
|
else { Greater }
|
|
|
|
}) {
|
2015-02-27 15:36:53 +01:00
|
|
|
Ok(idx) => {
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
let (_, _, r_ncjk, r_cjk) = r[idx];
|
|
|
|
if is_cjk { r_cjk } else { r_ncjk }
|
2013-08-07 20:48:10 +02:00
|
|
|
}
|
2015-02-27 15:36:53 +01:00
|
|
|
Err(_) => 1
|
2013-08-07 20:48:10 +02:00
|
|
|
}
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
}
|
2013-08-07 20:48:10 +02:00
|
|
|
""")
|
2013-08-11 01:57:59 +02:00
|
|
|
|
2014-05-12 22:25:38 +02:00
|
|
|
f.write("""
|
2015-02-15 00:09:40 +03:00
|
|
|
pub fn width(c: char, is_cjk: bool) -> Option<usize> {
|
|
|
|
match c as usize {
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
_c @ 0 => Some(0), // null is zero width
|
|
|
|
cu if cu < 0x20 => None, // control sequences have no width
|
|
|
|
cu if cu < 0x7F => Some(1), // ASCII
|
|
|
|
cu if cu < 0xA0 => None, // more control sequences
|
2015-02-15 00:09:40 +03:00
|
|
|
_ => Some(bsearch_range_value_table(c, is_cjk, charwidth_table) as usize)
|
2014-05-12 22:25:38 +02:00
|
|
|
}
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
}
|
2014-05-12 22:25:38 +02:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
""")
|
2011-12-29 17:24:04 -08:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
f.write(" // character width table. Based on Markus Kuhn's free wcwidth() implementation,\n")
|
|
|
|
f.write(" // http://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c\n")
|
|
|
|
emit_table(f, "charwidth_table", width_table, "&'static [(char, char, u8, u8)]", is_pub=False,
|
|
|
|
pfun=lambda x: "(%s,%s,%s,%s)" % (escape_char(x[0]), escape_char(x[1]), x[2], x[3]))
|
2014-07-11 17:23:45 -04:00
|
|
|
f.write("}\n\n")
|
2011-12-29 17:24:04 -08:00
|
|
|
|
2014-07-25 22:31:21 +02:00
|
|
|
def emit_norm_module(f, canon, compat, combine, norm_props):
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
canon_keys = canon.keys()
|
|
|
|
canon_keys.sort()
|
2011-12-29 17:24:04 -08:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
compat_keys = compat.keys()
|
|
|
|
compat_keys.sort()
|
2014-05-12 22:25:38 +02:00
|
|
|
|
2014-07-25 22:31:21 +02:00
|
|
|
canon_comp = {}
|
|
|
|
comp_exclusions = norm_props["Full_Composition_Exclusion"]
|
|
|
|
for char in canon_keys:
|
|
|
|
if True in map(lambda (lo, hi): lo <= char <= hi, comp_exclusions):
|
|
|
|
continue
|
|
|
|
decomp = canon[char]
|
|
|
|
if len(decomp) == 2:
|
|
|
|
if not canon_comp.has_key(decomp[0]):
|
|
|
|
canon_comp[decomp[0]] = []
|
|
|
|
canon_comp[decomp[0]].append( (decomp[1], char) )
|
|
|
|
canon_comp_keys = canon_comp.keys()
|
|
|
|
canon_comp_keys.sort()
|
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
def remove_from_wtable(wtable, val):
|
|
|
|
wtable_out = []
|
|
|
|
while wtable:
|
|
|
|
if wtable[0][1] < val:
|
|
|
|
wtable_out.append(wtable.pop(0))
|
|
|
|
elif wtable[0][0] > val:
|
|
|
|
break
|
|
|
|
else:
|
|
|
|
(wt_lo, wt_hi, width, width_cjk) = wtable.pop(0)
|
|
|
|
if wt_lo == wt_hi == val:
|
|
|
|
continue
|
|
|
|
elif wt_lo == val:
|
|
|
|
wtable_out.append((wt_lo+1, wt_hi, width, width_cjk))
|
|
|
|
elif wt_hi == val:
|
|
|
|
wtable_out.append((wt_lo, wt_hi-1, width, width_cjk))
|
|
|
|
else:
|
|
|
|
wtable_out.append((wt_lo, val-1, width, width_cjk))
|
|
|
|
wtable_out.append((val+1, wt_hi, width, width_cjk))
|
|
|
|
if wtable:
|
|
|
|
wtable_out.extend(wtable)
|
|
|
|
return wtable_out
|
|
|
|
|
2014-07-11 17:23:45 -04:00
|
|
|
|
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
def optimize_width_table(wtable):
|
|
|
|
wtable_out = []
|
|
|
|
w_this = wtable.pop(0)
|
|
|
|
while wtable:
|
|
|
|
if w_this[1] == wtable[0][0] - 1 and w_this[2:3] == wtable[0][2:3]:
|
|
|
|
w_tmp = wtable.pop(0)
|
|
|
|
w_this = (w_this[0], w_tmp[1], w_tmp[2], w_tmp[3])
|
|
|
|
else:
|
|
|
|
wtable_out.append(w_this)
|
|
|
|
w_this = wtable.pop(0)
|
|
|
|
wtable_out.append(w_this)
|
|
|
|
return wtable_out
|
2014-05-12 19:56:41 +02:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
if __name__ == "__main__":
|
2014-07-11 17:23:45 -04:00
|
|
|
r = "tables.rs"
|
2014-05-12 19:56:41 +02:00
|
|
|
if os.path.exists(r):
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
os.remove(r)
|
2014-05-12 19:56:41 +02:00
|
|
|
with open(r, "w") as rf:
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# write the file's preamble
|
2014-05-12 19:56:41 +02:00
|
|
|
rf.write(preamble)
|
2013-06-29 11:19:14 +10:00
|
|
|
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
# download and parse all the data
|
2014-10-13 13:51:43 +01:00
|
|
|
fetch("ReadMe.txt")
|
|
|
|
with open("ReadMe.txt") as readme:
|
|
|
|
pattern = "for Version (\d+)\.(\d+)\.(\d+) of the Unicode"
|
|
|
|
unicode_version = re.search(pattern, readme.read()).groups()
|
|
|
|
rf.write("""
|
|
|
|
/// The version of [Unicode](http://www.unicode.org/)
|
2015-02-27 15:36:53 +01:00
|
|
|
/// that the unicode parts of `CharExt` and `UnicodeStrPrelude` traits are based on.
|
2015-02-15 00:09:40 +03:00
|
|
|
pub const UNICODE_VERSION: (u64, u64, u64) = (%s, %s, %s);
|
2014-10-13 13:51:43 +01:00
|
|
|
""" % unicode_version)
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
(canon_decomp, compat_decomp, gencats, combines,
|
2015-06-05 19:20:09 +02:00
|
|
|
to_upper, to_lower, to_title) = load_unicode_data("UnicodeData.txt")
|
|
|
|
load_special_casing("SpecialCasing.txt", to_upper, to_lower, to_title)
|
2015-06-06 12:34:24 +02:00
|
|
|
want_derived = ["XID_Start", "XID_Continue", "Alphabetic", "Lowercase", "Uppercase",
|
|
|
|
"Cased", "Case_Ignorable"]
|
2015-04-12 13:24:19 +12:00
|
|
|
derived = load_properties("DerivedCoreProperties.txt", want_derived)
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
scripts = load_properties("Scripts.txt", [])
|
|
|
|
props = load_properties("PropList.txt",
|
|
|
|
["White_Space", "Join_Control", "Noncharacter_Code_Point"])
|
2014-07-25 22:31:21 +02:00
|
|
|
norm_props = load_properties("DerivedNormalizationProps.txt",
|
|
|
|
["Full_Composition_Exclusion"])
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
|
|
|
|
# bsearch_range_table is used in all the property modules below
|
|
|
|
emit_bsearch_range_table(rf)
|
|
|
|
|
2015-04-12 13:24:19 +12:00
|
|
|
# category tables
|
Add libunicode; move unicode functions from core
- created new crate, libunicode, below libstd
- split Char trait into Char (libcore) and UnicodeChar (libunicode)
- Unicode-aware functions now live in libunicode
- is_alphabetic, is_XID_start, is_XID_continue, is_lowercase,
is_uppercase, is_whitespace, is_alphanumeric, is_control,
is_digit, to_uppercase, to_lowercase
- added width method in UnicodeChar trait
- determines printed width of character in columns, or None if it is
a non-NULL control character
- takes a boolean argument indicating whether the present context is
CJK or not (characters with 'A'mbiguous widths are double-wide in
CJK contexts, single-wide otherwise)
- split StrSlice into StrSlice (libcore) and UnicodeStrSlice
(libunicode)
- functionality formerly in StrSlice that relied upon Unicode
functionality from Char is now in UnicodeStrSlice
- words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right
- also moved Words type alias into libunicode because words method is
in UnicodeStrSlice
- unified Unicode tables from libcollections, libcore, and libregex into
libunicode
- updated unicode.py in src/etc to generate aforementioned tables
- generated new tables based on latest Unicode data
- added UnicodeChar and UnicodeStrSlice traits to prelude
- libunicode is now the collection point for the std::char module,
combining the libunicode functionality with the Char functionality
from libcore
- thus, moved doc comment for char from core::char to unicode::char
- libcollections remains the collection point for std::str
The Unicode-aware functions that previously lived in the Char and
StrSlice traits are no longer available to programs that only use
libcore. To regain use of these methods, include the libunicode crate
and use the UnicodeChar and/or UnicodeStrSlice traits:
extern crate unicode;
use unicode::UnicodeChar;
use unicode::UnicodeStrSlice;
use unicode::Words; // if you want to use the words() method
NOTE: this does *not* impact programs that use libstd, since UnicodeChar
and UnicodeStrSlice have been added to the prelude.
closes #15224
[breaking-change]
2014-06-30 17:04:10 -04:00
|
|
|
for (name, cat, pfuns) in ("general_category", gencats, ["N", "Cc"]), \
|
|
|
|
("derived_property", derived, want_derived), \
|
|
|
|
("property", props, ["White_Space"]):
|
|
|
|
emit_property_module(rf, name, cat, pfuns)
|
|
|
|
|
|
|
|
# normalizations and conversions module
|
2014-07-25 22:31:21 +02:00
|
|
|
emit_norm_module(rf, canon_decomp, compat_decomp, combines, norm_props)
|
2015-06-05 19:20:09 +02:00
|
|
|
emit_conversions_module(rf, to_upper, to_lower, to_title)
|