rust/src/librustc_unicode/lib.rs

// Copyright 2012-2014 The Rust Project Developers. See the COPYRIGHT
// file at the top-level directory of this distribution and at
// http://rust-lang.org/COPYRIGHT.
//
// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or
// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license
// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your
// option. This file may not be copied, modified, or distributed
// except according to those terms.

//! # The Unicode Library
//!
//! Unicode-intensive functions for `char` and `str` types.
//!
//! This crate provides a collection of Unicode-related functionality,
//! including decompositions, conversions, etc., and provides traits
//! implementing these functions for the `char` and `str` types.
//!
//! The functionality included here is only that which is necessary to
//! provide for basic string-related manipulations. This crate does not
//! (yet) aim to provide a full set of Unicode tables.

#![crate_name = "rustc_unicode"]
#![unstable(feature = "unicode", issue = "27783")]
#![crate_type = "rlib"]
#![doc(html_logo_url = "https://www.rust-lang.org/logos/rust-logo-128x128-blk-v2.png",
       html_favicon_url = "https://doc.rust-lang.org/favicon.ico",
       html_root_url = "https://doc.rust-lang.org/nightly/",
       html_playground_url = "https://play.rust-lang.org/",
       issue_tracker_base_url = "https://github.com/rust-lang/rust/issues/",
       test(no_crate_inject, attr(allow(unused_variables), deny(warnings))))]
#![cfg_attr(not(stage0), deny(warnings))]
#![no_std]

#![feature(core_char_ext)]
#![feature(lang_items)]
#![feature(staged_api)]
#![feature(unicode)]

mod tables;
mod u_str;
pub mod char;

#[allow(deprecated)]
pub mod str {
    pub use u_str::{UnicodeStr, SplitWhitespace};
    pub use u_str::{utf8_char_width, is_utf16};
    pub use u_str::{Utf16Encoder};
}

// For use in libcollections, not re-exported in libstd.
pub mod derived_property {
    pub use tables::derived_property::{Cased, Case_Ignorable};
}

// For use in libsyntax
pub mod property {
    pub use tables::property::Pattern_White_Space;
}
Add libunicode; move unicode functions from core - created new crate, libunicode, below libstd - split Char trait into Char (libcore) and UnicodeChar (libunicode) - Unicode-aware functions now live in libunicode - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase, is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit, to_uppercase, to_lowercase - added width method in UnicodeChar trait - determines printed width of character in columns, or None if it is a non-NULL control character - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise) - split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode) - functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right - also moved Words type alias into libunicode because words method is in UnicodeStrSlice - unified Unicode tables from libcollections, libcore, and libregex into libunicode - updated unicode.py in src/etc to generate aforementioned tables - generated new tables based on latest Unicode data - added UnicodeChar and UnicodeStrSlice traits to prelude - libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore - thus, moved doc comment for char from core::char to unicode::char - libcollections remains the collection point for std::str The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits: extern crate unicode; use unicode::UnicodeChar; use unicode::UnicodeStrSlice; use unicode::Words; // if you want to use the words() method NOTE: this does not impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude. closes #15224 [breaking-change] 2014-06-30 16:04:10 -05:00			`// Copyright 2012-2014 The Rust Project Developers. See the COPYRIGHT`
			`// file at the top-level directory of this distribution and at`
			`// http://rust-lang.org/COPYRIGHT.`
			`//`
			`// Licensed under the Apache License, Version 2.0 <LICENSE-APACHE or`
			`// http://www.apache.org/licenses/LICENSE-2.0> or the MIT license`
			`// <LICENSE-MIT or http://opensource.org/licenses/MIT>, at your`
			`// option. This file may not be copied, modified, or distributed`
			`// except according to those terms.`

			`//! # The Unicode Library`
			`//!`
			//! Unicode-intensive functions for `char` and `str` types.
			`//!`
			`//! This crate provides a collection of Unicode-related functionality,`
			`//! including decompositions, conversions, etc., and provides traits`
			//! implementing these functions for the `char` and `str` types.
			`//!`
			`//! The functionality included here is only that which is necessary to`
			`//! provide for basic string-related manipulations. This crate does not`
			`//! (yet) aim to provide a full set of Unicode tables.`

deprecate Unicode functions that will be moved to crates.io This patch 1. renames libunicode to librustc_unicode, 2. deprecates several pieces of libunicode (see below), and 3. removes references to deprecated functions from librustc_driver and libsyntax. This may change pretty-printed output from these modules in cases involving wide or combining characters used in filenames, identifiers, etc. The following functions are marked deprecated: 1. char.width() and str.width(): --> use unicode-width crate 2. str.graphemes() and str.grapheme_indices(): --> use unicode-segmentation crate 3. str.nfd_chars(), str.nfkd_chars(), str.nfc_chars(), str.nfkc_chars(), char.compose(), char.decompose_canonical(), char.decompose_compatible(), char.canonical_combining_class(): --> use unicode-normalization crate 2015-04-14 14:52:37 -05:00			`#![crate_name = "rustc_unicode"]`
rustc_unicode: Add issues for unstable features 2015-08-13 00:19:39 -05:00			`#![unstable(feature = "unicode", issue = "27783")]`
Add libunicode; move unicode functions from core - created new crate, libunicode, below libstd - split Char trait into Char (libcore) and UnicodeChar (libunicode) - Unicode-aware functions now live in libunicode - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase, is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit, to_uppercase, to_lowercase - added width method in UnicodeChar trait - determines printed width of character in columns, or None if it is a non-NULL control character - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise) - split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode) - functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right - also moved Words type alias into libunicode because words method is in UnicodeStrSlice - unified Unicode tables from libcollections, libcore, and libregex into libunicode - updated unicode.py in src/etc to generate aforementioned tables - generated new tables based on latest Unicode data - added UnicodeChar and UnicodeStrSlice traits to prelude - libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore - thus, moved doc comment for char from core::char to unicode::char - libcollections remains the collection point for std::str The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits: extern crate unicode; use unicode::UnicodeChar; use unicode::UnicodeStrSlice; use unicode::Words; // if you want to use the words() method NOTE: this does not impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude. closes #15224 [breaking-change] 2014-06-30 16:04:10 -05:00			`#![crate_type = "rlib"]`
Use https URLs to refer to rust-lang.org where appropriate. Also fixes a few outdated links. 2015-08-09 16:15:05 -05:00			`#![doc(html_logo_url = "https://www.rust-lang.org/logos/rust-logo-128x128-blk-v2.png",`
libs: Move favicon URLs to HTTPS Helps prevent mixed content warnings if accessing docs over HTTPS. Closes #25459 2015-05-15 18:04:01 -05:00			`html_favicon_url = "https://doc.rust-lang.org/favicon.ico",`
Use https URLs to refer to rust-lang.org where appropriate. Also fixes a few outdated links. 2015-08-09 16:15:05 -05:00			`html_root_url = "https://doc.rust-lang.org/nightly/",`
			`html_playground_url = "https://play.rust-lang.org/",`
rustdoc: Added issue_tracker_base_url annotations to crates 2015-08-16 10:22:50 -05:00			`issue_tracker_base_url = "https://github.com/rust-lang/rust/issues/",`
librustc_unicode: deny warnings in doctests 2015-11-03 10:12:25 -06:00			`test(no_crate_inject, attr(allow(unused_variables), deny(warnings))))]`
mk: Move from `-D warnings` to `#![deny(warnings)]` This commit removes the `-D warnings` flag being passed through the makefiles to all crates to instead be a crate attribute. We want these attributes always applied for all our standard builds, and this is more amenable to Cargo-based builds as well. Note that all `deny(warnings)` attributes are gated with a `cfg(stage0)` attribute currently to match the same semantics we have today 2016-01-21 17:26:19 -06:00			`#![cfg_attr(not(stage0), deny(warnings))]`
Add libunicode; move unicode functions from core - created new crate, libunicode, below libstd - split Char trait into Char (libcore) and UnicodeChar (libunicode) - Unicode-aware functions now live in libunicode - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase, is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit, to_uppercase, to_lowercase - added width method in UnicodeChar trait - determines printed width of character in columns, or None if it is a non-NULL control character - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise) - split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode) - functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right - also moved Words type alias into libunicode because words method is in UnicodeStrSlice - unified Unicode tables from libcollections, libcore, and libregex into libunicode - updated unicode.py in src/etc to generate aforementioned tables - generated new tables based on latest Unicode data - added UnicodeChar and UnicodeStrSlice traits to prelude - libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore - thus, moved doc comment for char from core::char to unicode::char - libcollections remains the collection point for std::str The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits: extern crate unicode; use unicode::UnicodeChar; use unicode::UnicodeStrSlice; use unicode::Words; // if you want to use the words() method NOTE: this does not impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude. closes #15224 [breaking-change] 2014-06-30 16:04:10 -05:00			`#![no_std]`
core: Split apart the global `core` feature This commit shards the broad `core` feature of the libcore library into finer grained features. This split groups together similar APIs and enables tracking each API separately, giving a better sense of where each feature is within the stabilization process. A few minor APIs were deprecated along the way: * Iterator::reverse_in_place * marker::NoCopy 2015-06-09 13:18:03 -05:00
			`#![feature(core_char_ext)]`
			`#![feature(lang_items)]`
			`#![feature(staged_api)]`
std: Change `encode_utf{8,16}` to return iterators Currently these have non-traditional APIs which take a buffer and report how much was filled in, but they're not necessarily ergonomic to use. Returning an iterator which also exposes an underlying slice shouldn't result in any performance loss as it's just a lazy version of the same implementation, and it's also much more ergonomic! cc #27784 2016-03-11 13:01:46 -06:00			`#![feature(unicode)]`
Add libunicode; move unicode functions from core - created new crate, libunicode, below libstd - split Char trait into Char (libcore) and UnicodeChar (libunicode) - Unicode-aware functions now live in libunicode - is_alphabetic, is_XID_start, is_XID_continue, is_lowercase, is_uppercase, is_whitespace, is_alphanumeric, is_control, is_digit, to_uppercase, to_lowercase - added width method in UnicodeChar trait - determines printed width of character in columns, or None if it is a non-NULL control character - takes a boolean argument indicating whether the present context is CJK or not (characters with 'A'mbiguous widths are double-wide in CJK contexts, single-wide otherwise) - split StrSlice into StrSlice (libcore) and UnicodeStrSlice (libunicode) - functionality formerly in StrSlice that relied upon Unicode functionality from Char is now in UnicodeStrSlice - words, is_whitespace, is_alphanumeric, trim, trim_left, trim_right - also moved Words type alias into libunicode because words method is in UnicodeStrSlice - unified Unicode tables from libcollections, libcore, and libregex into libunicode - updated unicode.py in src/etc to generate aforementioned tables - generated new tables based on latest Unicode data - added UnicodeChar and UnicodeStrSlice traits to prelude - libunicode is now the collection point for the std::char module, combining the libunicode functionality with the Char functionality from libcore - thus, moved doc comment for char from core::char to unicode::char - libcollections remains the collection point for std::str The Unicode-aware functions that previously lived in the Char and StrSlice traits are no longer available to programs that only use libcore. To regain use of these methods, include the libunicode crate and use the UnicodeChar and/or UnicodeStrSlice traits: extern crate unicode; use unicode::UnicodeChar; use unicode::UnicodeStrSlice; use unicode::Words; // if you want to use the words() method NOTE: this does not impact programs that use libstd, since UnicodeChar and UnicodeStrSlice have been added to the prelude. closes #15224 [breaking-change] 2014-06-30 16:04:10 -05:00
			`mod tables;`
			`mod u_str;`
std: Stabilize more of the `char` module This commit performs another pass over the `std::char` module for stabilization. Some minor cleanup is performed such as migrating documentation from libcore to libunicode (where the `std`-facing trait resides) as well as a slight reorganiation in libunicode itself. Otherwise, the stability modifications made are: * `char::from_digit` is now stable * `CharExt::is_digit` is now stable * `CharExt::to_digit` is now stable * `CharExt::to_{lower,upper}case` are now stable after being modified to return an iterator over characters. While the implementation today has not changed this should allow us to implement the full set of case conversions in unicode where some characters can map to multiple when doing an upper or lower case mapping. * `StrExt::to_{lower,upper}case` was added as unstable for a convenience of not having to worry about characters expanding to more characters when you just want the whole string to get into upper or lower case. This is a breaking change due to the change in the signatures of the `CharExt::to_{upper,lower}case` methods. Code can be updated to use functions like `flat_map` or `collect` to handle the difference. [breaking-change] 2015-03-05 20:23:57 -06:00			`pub mod char;`
add Graphemes iterator; tidy unicode exports - Graphemes and GraphemeIndices structs implement iterators over grapheme clusters analogous to the Chars and CharOffsets for chars in a string. Iterator and DoubleEndedIterator are available for both. - tidied up the exports for libunicode. crate root exports are now moved into more appropriate module locations: - UnicodeStrSlice, Words, Graphemes, GraphemeIndices are in str module - UnicodeChar exported from char instead of crate root - canonical_combining_class is exported from str rather than crate root Since libunicode's exports have changed, programs that previously relied on the old export locations will need to change their `use` statements to reflect the new ones. See above for more information on where the new exports live. closes #7043 [breaking-change] 2014-07-11 16:23:45 -05:00
Refactor low-level UTF-16 decoding. * Rename `utf16_items` to `decode_utf16`. "Items" is meaningless. * Move it to `rustc_unicode::char`, exposed in `std::char`. * Generalize it to any `u16` iterable, not just `&[u16]`. * Make it yield `Result` instead of a custom `Utf16Item` enum that was isomorphic to `Result`. This enable using the `FromIterator for Result` impl. * Add a `REPLACEMENT_CHARACTER` constant. * Document how `result.unwrap_or(REPLACEMENT_CHARACTER)` replaces `Utf16Item::to_char_lossy`. 2015-08-13 11:39:46 -05:00			`#[allow(deprecated)]`
add Graphemes iterator; tidy unicode exports - Graphemes and GraphemeIndices structs implement iterators over grapheme clusters analogous to the Chars and CharOffsets for chars in a string. Iterator and DoubleEndedIterator are available for both. - tidied up the exports for libunicode. crate root exports are now moved into more appropriate module locations: - UnicodeStrSlice, Words, Graphemes, GraphemeIndices are in str module - UnicodeChar exported from char instead of crate root - canonical_combining_class is exported from str rather than crate root Since libunicode's exports have changed, programs that previously relied on the old export locations will need to change their `use` statements to reflect the new ones. See above for more information on where the new exports live. closes #7043 [breaking-change] 2014-07-11 16:23:45 -05:00			`pub mod str {`
Remove all unstable deprecated functionality This commit removes all unstable and deprecated functions in the standard library. A release was recently cut (1.3) which makes this a good time for some spring cleaning of the deprecated functions. 2015-08-11 19:27:05 -05:00			`pub use u_str::{UnicodeStr, SplitWhitespace};`
std: Remove deprecated functionality from 1.5 This is a standard "clean out libstd" commit which removes all 1.5-and-before deprecated functionality as it's now all been deprecated for at least one entire cycle. 2015-12-02 19:07:29 -06:00			`pub use u_str::{utf8_char_width, is_utf16};`
			`pub use u_str::{Utf16Encoder};`
add Graphemes iterator; tidy unicode exports - Graphemes and GraphemeIndices structs implement iterators over grapheme clusters analogous to the Chars and CharOffsets for chars in a string. Iterator and DoubleEndedIterator are available for both. - tidied up the exports for libunicode. crate root exports are now moved into more appropriate module locations: - UnicodeStrSlice, Words, Graphemes, GraphemeIndices are in str module - UnicodeChar exported from char instead of crate root - canonical_combining_class is exported from str rather than crate root Since libunicode's exports have changed, programs that previously relied on the old export locations will need to change their `use` statements to reflect the new ones. See above for more information on where the new exports live. closes #7043 [breaking-change] 2014-07-11 16:23:45 -05:00			`}`
Correctly map upper-case Sigma to lower-case in word-final position. Fix #26035. 2015-06-06 05:34:24 -05:00
			`// For use in libcollections, not re-exported in libstd.`
			`pub mod derived_property {`
			`pub use tables::derived_property::{Cased, Case_Ignorable};`
			`}`
libsyntax: accept only whitespace with the PATTERN_WHITE_SPACE property This aligns with unicode recommendations and should be stable for all future unicode releases. See http://unicode.org/reports/tr31/#R3. This renames `libsyntax::lexer::is_whitespace` to `is_pattern_whitespace` so potentially breaks users of libsyntax. 2015-11-11 20:43:43 -06:00
			`// For use in libsyntax`
			`pub mod property {`
			`pub use tables::property::Pattern_White_Space;`
			`}`