Technical Reports |
Version | Unicode 3.2.0 |
Authors | Members of the Editorial Committee |
Date | 2002-03-27 |
This Version | http://www.unicode.org/unicode/reports/tr28/tr28-3 |
Previous Version | N/A |
Latest Version | http://www.unicode.org/unicode/reports/tr28 |
Tracking Number | 3 |
This document defines Version 3.2 of the Unicode Standard.
This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex. It is a stable document and may be used as reference material or cited as a normative reference from another document.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. Note that conformance to a version of the Unicode Standard includes conformance to its Unicode Standard Annexes. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).
Unicode 3.2 is a minor version of the Unicode Standard. It overrides certain features of Unicode 3.1, and adds a significant number of coded characters.
The Unicode Consortium. The Unicode Standard, Version 3.2.0 is defined by The Unicode Standard, Version 3.0 (Reading, MA, Addison-Wesley, 2000. ISBN 0-201-61633-5), as amended by the Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27/) and by the Unicode Standard Annex #28: Unicode 3.2 (http://www.unicode.org/reports/tr28/). |
The Unicode Standard, Version 3.2.0 is defined by the following list. The version numbering and the role of each component are explained in Versions of The Unicode Standard. The symbols in the change status column are explained in the key below. A summary of modifications in the Unicode Character Database for this version can be found in UnicodeCharacterDatabase-3.2.0.html, together with a list of which data files contain normative vs. informative data.
N New in this release
D Data change (possibly also format/text change)
F Data format change (possibly also text change)
T Text annotation change
- Unchanged
The list of contributory data files constituting the Unicode Standard, Version 3.2 can also be found online at Enumerated Versions.
The primary feature of Unicode 3.2 is the addition of 1016 new encoded characters. These additions consist of several Philippine scripts, a large collection of mathematical symbols, and small sets of other letters and symbols.
All of the newly encoded characters in Unicode 3.2 are additions to the Basic Multilingual Plane (BMP).
Complete introductions to the newly encoded scripts and symbols can be found in Article IV, Block Descriptions, below.
Unicode 3.2 also features amended contributory data files, to bring the data files up to date against the expanded repertoire of characters. A summary of the revisions to the data files can be found in Article VII, Unicode Character Database Changes.
All outstanding errata and corrigenda to the Unicode Standard are included in this specification. Major corrigenda having a bearing on conformance to the standard are listed in Article II, Conformance. Other minor errata are listed in Article VI, Errata.
Most notable among the corrigenda to the Standard is a further tightening of the definition of UTF-8, to eliminate irregular UTF-8 and to bring the Unicode specification of UTF-8 more completely into line with other specifications of UTF-8.
The former UTR #21, Case Mappings has been upgraded in status to a Unicode Standard Annex in Unicode 3.2. This means that UAX #21, Case Mappings is now formally a part of the Unicode Standard.
The sections of this document are referred to as “articles” to prevent confusion with references to sections of The Unicode Standard, Version 3.0. In addition, the articles in this document are numbered with Roman numerals, to highlight the distinction. The word “section” always refers to sections of The Unicode Standard, Version 3.0 or to a new section of the standard which is added by this document. Page numbers also refer to The Unicode Standard, Version 3.0.
New or replacement text for the standard is indicated with underlined text, when this new text is a corrigendum of an existing section of the standard.
Deleted text from the standard is indicated with struck-through text.
In instances where entire new sections or subsections are to be added to the standard, as for the block descriptions for newly encoded scripts or symbol sets, new section numbers are provided that interleave reasonably with the existing sections of the published Unicode 3.0 book. For these added sections, the text is not underlined, since the entire sections are new.
In this document, unambiguous dates of the current common era, such as 1999, are unlabeled. In cases of ambiguity, CE is used. Dates before the common era are labeled with BCE.
The definition of transformation formats such as UTF-8 allowed conformant processes to interpret certain sequences called irregular sequences. These irregular sequences are those that would be produced by transforming supplementary code points as if they were a sequence of two surrogate code points.
To tighten the definitions, in Unicode 3.2 such irregular sequences are now illegal.
Note: Some implementations of UTF-8 might still interpret irregular sequences; for those, a separate compatibility encoding scheme, to be distinguished from UTF-8, may be used. See Unicode Technical Report #26, “Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8).” However, CESU-8 is not intended nor recommended as an encoding used for open information exchange.
Terminology to distinguish ill-formed, illegal, and irregular code unit sequences is no longer needed. There are no irregular code unit sequences, and thus all ill-formed code unit sequences are illegal. It is illegal to emit or interpret any ill-formed code unit sequence. Unicode 4.0 will revise the terminology and conformance clauses in light of this. For Unicode 3.2, only the minimal changes required of the text are noted here.
Change C12 in Unicode 3.1 to:
C12 | (a) When a process generates data in a Unicode Transformation Format, it shall not emit ill-formed code unit sequences. (b) When a process interprets data in a Unicode Transformation Format, it shall treat ill-formed code unit sequences as an error condition. (c) A conformant process shall not interpret ill-formed code unit sequences as characters. |
Change the fifth note after C12 in Unicode 3.1 to:
Change Table 3.1B after C12 in Unicode 3.1 by splitting the row U+1000..U+FFFF to exclude the surrogate code points:
Code Points | 1st Byte | 2nd Byte | 3rd Byte | 4th Byte |
---|---|---|---|---|
U+0000..U+007F | 00..7F | | | |
U+0080..U+07FF | C2..DF | 80..BF | | |
U+0800..U+0FFF | E0 | A0..BF | 80..BF | |
U+1000..U+CFFF | E1..EC | 80..BF | 80..BF | |
U+D000..U+D7FF | ED | 80..9F | 80..BF | |
U+D800..U+DFFF | ill-formed | | | |
U+E000..U+FFFF | EE..EF | 80..BF | 80..BF | |
U+10000..U+3FFFF | F0 | 90..BF | 80..BF | 80..BF |
U+40000..U+FFFFF | F1..F3 | 80..BF | 80..BF | 80..BF |
U+100000..U+10FFFF | F4 | 80..8F | 80..BF | 80..BF |
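The revised table can be read directly as a validation procedure. The following is a minimal sketch in Python, not a normative algorithm: the function name `is_well_formed_utf8` and the row layout of `_ROWS` are illustrative choices, and the byte ranges are copied from the table above.

```python
# Each row: (first-byte low, first-byte high, second-byte low, second-byte high, length).
# Third and fourth bytes, where present, are always in 80..BF.
_ROWS = [
    (0x00, 0x7F, None, None, 1),
    (0xC2, 0xDF, 0x80, 0xBF, 2),
    (0xE0, 0xE0, 0xA0, 0xBF, 3),
    (0xE1, 0xEC, 0x80, 0xBF, 3),
    (0xED, 0xED, 0x80, 0x9F, 3),  # second byte capped at 9F: excludes U+D800..U+DFFF
    (0xEE, 0xEF, 0x80, 0xBF, 3),
    (0xF0, 0xF0, 0x90, 0xBF, 4),
    (0xF1, 0xF3, 0x80, 0xBF, 4),
    (0xF4, 0xF4, 0x80, 0x8F, 4),
]


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every byte sequence in data matches a row of the table."""
    i = 0
    while i < len(data):
        first = data[i]
        row = next((r for r in _ROWS if r[0] <= first <= r[1]), None)
        if row is None:
            return False  # C0, C1, F5..FF, or a stray trailing byte
        _, _, lo2, hi2, length = row
        if i + length > len(data):
            return False  # truncated sequence
        if length > 1 and not lo2 <= data[i + 1] <= hi2:
            return False
        if any(not 0x80 <= b <= 0xBF for b in data[i + 2:i + length]):
            return False
        i += length
    return True
```

Under this sketch, the six-byte form `ED A0 80 ED B0 80` (a supplementary code point transformed as two surrogate code points) is rejected, because `A0` falls outside the `80..9F` range allowed after `ED`.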
The text of D21 is replaced by the following text:
D21 Compatibility decomposable character: a character whose compatibility decomposition is not identical to its canonical decomposition. It may also be known as a compatibility precomposed character or a compatibility composite character.
Add the following new text after D23:
D23a Canonical decomposable character: a character which is not identical to its canonical decomposition. It may also be known as a canonical precomposed character or a canonical composite character.
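The two definitions can be checked mechanically against a normalization library. The sketch below uses Python's standard `unicodedata` module, which ships data for a later Unicode version than 3.2, so results for individual characters may differ from the 3.2 database. The function names are illustrative.

```python
import unicodedata


def is_canonical_decomposable(ch: str) -> bool:
    """D23a: the character is not identical to its canonical decomposition."""
    return unicodedata.normalize("NFD", ch) != ch


def is_compatibility_decomposable(ch: str) -> bool:
    """D21: the compatibility decomposition differs from the canonical one."""
    return unicodedata.normalize("NFKD", ch) != unicodedata.normalize("NFD", ch)
```

For example, U+00E9 satisfies only the first test, U+FB01 satisfies only the second, and U+0061 satisfies neither.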
The character U+2060 has been added to the standard to allow unambiguous expression of the word-joining semantics. U+2060 WORD JOINER is now the preferred character to express the word-joining semantics implied by the ZWNBSP. The availability of U+2060 makes it unnecessary to use U+FEFF as a zero-width non-breaking space, allowing U+FEFF to be used solely with the semantic of BOM. For more information, see the subsection on “Word Joiner” in Section 13.2, Layout Controls in this document.
Note: Implementers are strongly encouraged to use the word joiner whenever word-joining semantics are intended.
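As an illustration only, a process that wants to follow this recommendation for existing text might rewrite interior occurrences of U+FEFF as U+2060. The sketch below assumes that a U+FEFF at the very start of the text is a BOM and that every other occurrence was meant as a zero-width non-breaking space; that policy and the function name are assumptions, not requirements of the standard.

```python
BOM = "\uFEFF"
WORD_JOINER = "\u2060"


def prefer_word_joiner(text: str) -> str:
    """Keep a leading U+FEFF (assumed to be a BOM); replace the rest with U+2060."""
    if text.startswith(BOM):
        return BOM + text[1:].replace(BOM, WORD_JOINER)
    return text.replace(BOM, WORD_JOINER)
```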
A number of characters which have special character properties have been added in the Unicode Standard, Version 3.2. To reflect this, the following changes are made to the special character properties listing, on pages 48-50 of The Unicode Standard, Version 3.0:
In the entry for “Line boundary control”, add:
205F MEDIUM MATHEMATICAL SPACE
2060 WORD JOINER
Change the name of the “Joining” entry to “Cursive joining and ligation control”.
Add a new entry called “Grapheme joining” after the renamed entry for “Cursive joining and ligation control” and add to that new entry:
034F COMBINING GRAPHEME JOINER
Add a new entry called “Mathematical expression formatting” after the entry “Bidirectional ordering” and add to that new entry:
2061 FUNCTION APPLICATION
2062 INVISIBLE TIMES
2063 INVISIBLE SEPARATOR
Change the name of the “Alternate formatting” entry to “Deprecated alternate formatting”.
Change the name of the “Syriac abbreviation” entry to “Prefixed format control” and add to that entry:
06DD ARABIC END OF AYAH
Change the name of the “Indic dead-character formation” entry to “Brahmi-derived script dead-character formation” and add to that entry:
1714 TAGALOG SIGN VIRAMA
1734 HANUNOO SIGN PAMUDPOD
Change the name of the “Mongolian variant selectors” entry to “Mongolian variation selectors”.
After the “Mongolian variation selectors” entry add a new entry “Generic variation selectors” and add to that new entry:
FE00 VARIATION SELECTOR-1
FE01 VARIATION SELECTOR-2
FE02 VARIATION SELECTOR-3
FE03 VARIATION SELECTOR-4
FE04 VARIATION SELECTOR-5
FE05 VARIATION SELECTOR-6
FE06 VARIATION SELECTOR-7
FE07 VARIATION SELECTOR-8
FE08 VARIATION SELECTOR-9
FE09 VARIATION SELECTOR-10
FE0A VARIATION SELECTOR-11
FE0B VARIATION SELECTOR-12
FE0C VARIATION SELECTOR-13
FE0D VARIATION SELECTOR-14
FE0E VARIATION SELECTOR-15
FE0F VARIATION SELECTOR-16
Formally speaking, combining marks apply to the preceding grapheme cluster. In most cases, this is the same as applying to the preceding base character. However, in two circumstances there is a difference:
Hangul Syllables. Where a grapheme cluster contains a Hangul syllable, the combining mark applies to the entire syllable. For example, in the following sequence the grave is applied to the entire Hangul syllable, not just the last jamo:
Enclosing Combining Marks. These marks enclose the entire preceding grapheme cluster. For example, in the following sequence the entire Hangul syllable is circled, not just part of it:
This is also true of grapheme clusters composed of elements linked by a Grapheme_Link or combining grapheme joiner. For example, the entire conjunct is circled in the following sequence:
On the other hand, where elements are linked by a Grapheme_Link or combining grapheme joiner, non-enclosing combining marks only apply to the last base character. For example, in the following sequence the nukta applies to the immediately preceding ddha, not to the entire cluster:
For more information, see the subsection on “Combining Grapheme Joiner” in Section 13.2, Layout Controls in this document.
The following text replaces the text and tables for this section on pages 52-53 of The Unicode Standard, Version 3.0:
The Unicode Standard contains both a large set of precomposed modern Hangul syllables and a set of conjoining Hangul jamo, which can be used to encode archaic syllable blocks as well as modern syllable blocks. This section describes how to:
For more information, see the “Hangul Syllables” and “Hangul Jamo” subsections in Section 10.4, Hangul in The Unicode Standard, Version 3.0. Hangul syllables are a special case of grapheme clusters.
The jamo characters can be classified into three sets of characters: choseong (leading consonants, or syllable-initial characters), jungseong (vowels, or syllable-peak characters), and jongseong (trailing consonants, or syllable-final characters). In the following discussion, these jamo are abbreviated as L (leading consonant), V (vowel), and T (trailing consonant); syllable breaks are shown by middle dots “·”; non-syllable breaks are shown by “×”, combining marks are shown by M, and non-jamo are shown by X.
In the following discussion, a syllable refers to a sequence of Korean characters that should be grouped into a single cell for display. This is different from a precomposed Hangul syllable, which consists of any of the characters in the range U+AC00..U+D7A3. Note that a syllable may contain a precomposed Hangul syllable plus other characters.
In rendering, a sequence of jamos is displayed as a series of syllable blocks. The following rules specify how to divide up an arbitrary sequence of jamos (including nonstandard sequences) into these syllable blocks. In these rules, a choseong filler (Lf ) is treated as a choseong character, and a jungseong filler (Vf ) is treated as a jungseong.
The precomposed Hangul syllables are of two types: LV or LVT. In determining the syllable boundaries, the LV behave as if they were a sequence of jamo L V, and the LVT behave as if they were a sequence of jamo L V T.
Within any sequence of characters, a syllable break never occurs between the pairs of characters shown in Table 3-5. In all other cases, there is a syllable break before and after any jamo or precomposed Hangul syllable. Note that like other characters, any combining mark between two conjoining jamos prevents the jamos from forming a syllable.
Table 3-5. Hangul Syllable No-Break Rules
Do Not Break Between | | Examples |
L | L, V, or precomposed Hangul syllable | L × L, L × V, L × LV, L × LVT |
V or LV | V or T | V × V, V × T, LV × V, LV × T |
T or LVT | T | T × T, LVT × T |
Jamo or precomposed Hangul syllable | Combining marks | L × M, V × M, T × M, LV × M, LVT × M |
Note that even in normalization form NFC, a syllable may contain a precomposed Hangul syllable in the middle. An example is “L LVT T”. Each well-formed modern Hangul syllable, however, can be represented in the form L V T? (that is one L, one V and optionally one T), and is a single character in NFC.
For information on the behavior of Hangul compatibility jamo in syllables, see Section 10.4, Hangul in The Unicode Standard, Version 3.0.
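The no-break rules can be sketched as a small function over symbolic tokens. This is an illustration under stated assumptions rather than a reference implementation: the input is a list of the abbreviations used in this section (`L`, `V`, `T`, `Lf`, `Vf`, `LV`, `LVT`, `M`, `X`) instead of real code points, and every pair not listed in the no-break table, including pairs of non-jamo, is treated as a break. The helper `parse_jamo` only understands jamo and filler notation.

```python
import re

_LEAD = {"L", "Lf"}           # choseong, including the choseong filler
_VOWEL = {"V", "Vf"}          # jungseong, including the jungseong filler
_TRAIL = {"T"}                # jongseong
_HANGUL = _LEAD | _VOWEL | _TRAIL | {"LV", "LVT"}


def no_break(a: str, b: str) -> bool:
    """True if the no-break table forbids a syllable break between a and b."""
    if b == "M":
        return a in _HANGUL
    if a in _LEAD:
        return b in _LEAD | _VOWEL | {"LV", "LVT"}
    if a in _VOWEL or a == "LV":
        return b in _VOWEL | _TRAIL
    if a in _TRAIL or a == "LVT":
        return b in _TRAIL
    return False


def syllables(tokens):
    """Group a token list into syllables, breaking wherever no rule forbids it."""
    groups = []
    for tok in tokens:
        if groups and no_break(groups[-1][-1], tok):
            groups[-1].append(tok)
        else:
            groups.append([tok])
    return groups


def parse_jamo(notation: str):
    """Split jamo-only notation such as 'LVfLfV' into tokens (fillers first)."""
    return re.findall(r"Lf|Vf|L|V|T", notation)
```

For instance, `syllables(["L", "LVT", "T"])` yields a single group, matching the “L LVT T” example above, while a combining mark between two jamo, as in `["L", "M", "V"]`, produces two groups.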
A standard Korean syllable block is composed of a sequence of one or more L followed by a sequence of one or more V and a sequence of zero or more T. A sequence of nonstandard syllable blocks can be transformed into a sequence of standard Korean syllable blocks by inserting choseong fillers (Lf ) and jungseong fillers (Vf ).
Using regular expression notation, a standard Korean syllable is thus of the form:
L+ V+ T*
The transformation of a string of text into standard Korean syllables is performed by determining the syllable breaks as explained in the subsection on “Syllable Boundaries” earlier in this section, then inserting one or two fillers as necessary to transform each syllable into a standard Korean syllable. Thus:
L ^V → L Vf ^V
^L V → ^L Lf V
^V T → ^V Lf Vf T
where ^X indicates a character that is not X, or the absence of a character.
Examples. In Table 3-6, the first row shows syllable breaks in a standard sequence, the second row shows syllable breaks in a nonstandard sequence, and the third row shows how the sequence in the second row could be transformed into standard form by inserting fillers into each syllable.
Table 3-6. Syllable Break Examples
No. | Sequence | | Sequence with Syllable Breaks Marked |
1 | LVTLVLVLVfLfVLfVfT | → | LVT · LV · LV · LVf · LfV · LfVfT |
2 | LLTTVVTTVVLLVV | → | LL · TT · VVTT · VV · LLVV |
3 | LLTTVVTTVVLLVV | → | LLVf · LfVfTT · LfVVTT · LfVV · LLVV |
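The filler-insertion step can be sketched for jamo-only input as follows. The sketch works on the L/V/T notation of this section rather than on code points, handles neither precomposed syllables nor combining marks, and uses a regular expression as a shortcut for the syllable-boundary rules. The function name and the compact one-letter encoding of fillers are illustrative assumptions.

```python
import re

# Jamo-only syllables under the no-break rules: L+ (V+ T*)?  |  V+ T*  |  T+
# Fillers are encoded compactly: 'l' stands for Lf and 'v' stands for Vf.
_SYLLABLE = re.compile(r"[Ll]+(?:[Vv]+T*)?|[Vv]+T*|T+")


def standardize(notation: str) -> str:
    """Break jamo notation into syllables and insert fillers where needed."""
    compact = notation.replace("Lf", "l").replace("Vf", "v")
    fixed = []
    for syl in _SYLLABLE.findall(compact):
        if syl[0] in "Ll":
            if not any(c in "Vv" for c in syl):
                syl += "v"        # L ^V   ->  L Vf ^V
        elif syl[0] in "Vv":
            syl = "l" + syl       # ^L V   ->  ^L Lf V
        else:
            syl = "lv" + syl      # ^V T   ->  ^V Lf Vf T
        fixed.append(syl.replace("l", "Lf").replace("v", "Vf"))
    return " · ".join(fixed)
```

Applied to the nonstandard sequence `LLTTVVTTVVLLVV`, the sketch returns `LLVf · LfVfTT · LfVVTT · LfVV · LLVV`; a sequence that is already standard is returned with only its syllable breaks marked.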
Remove the entry for U+06DD ARABIC END OF AYAH from Table 4-3, Combining Classes on page 80 of The Unicode Standard, Version 3.0.
In Corrigendum #3 the canonical mapping for U+F951 has been corrected. For more information, see Unicode Standard Annex #15, “Unicode Normalization Forms”.
Add the following text to page 18 of The Unicode Standard, Version 3.0 just before the subsection on “Convertibility”:
Decompositions
Precomposed characters are formally known as decomposables, because they have decompositions to one or more other characters. There are two types of decompositions:
Thus there are three types of characters, based on their decomposition behavior:
The following figure illustrates these three types. The solid arrows indicate canonical decompositions, and the dotted arrows indicate compatibility decompositions. If an arrow loops back and points to the character itself, that indicates that there is no decomposition of that type (other than in the trivial sense of a character “decomposing” to itself).
The figure illustrates two important things to keep in mind:
For more precise definitions of some of these terms, see Chapter 3, Conformance in The Unicode Standard, Version 3.0.
[Figure: three panels labeled Nondecomposables, Canonical Decomposables, and Compatibility Decomposables]
Add the following text after bullet item 6 on page 125 of The Unicode Standard, Version 3.0:
The rules are applied in order. That is, there is an implicit “otherwise” at the front of each rule following the first. It is possible to construct alternate sets of such rules that are fully equivalent; that is, they have the same effect.
Note: The rules for default grapheme cluster boundaries, default word boundaries and default sentence boundaries are in the process of being superseded by a new Unicode Technical Report #29, Text Boundaries.
Note: The numbering used here for block descriptions and revised text follows The Unicode Standard, Version 3.0 for ease of cross-reference.
Invisible Operators. In mathematics some operators or punctuation are often implied, but not displayed. U+2063 INVISIBLE SEPARATOR or invisible comma is intended for use in index expressions and other mathematical notation where two adjacent variables form a list and are not implicitly multiplied. In mathematical notation, commas are not always explicitly present, but need to be indicated for symbolic calculation software to help it disambiguate a sequence from a multiplication. For example, the double subscript ij in the variable a_ij means a_i,j; that is, the i and j are separate indices and not a single variable with the name ij or even the product of i and j. Accordingly, to represent the implied list separation in the subscript ij, one can insert a nondisplaying invisible separator between the i and the j. In addition, use of the invisible comma would hint to a math layout program to typeset a small space between the variables.
Similarly, an expression like mc² implies that the mass m multiplies the square of the speed c. To represent the implied multiplication in mc², one inserts a nondisplaying U+2062 INVISIBLE TIMES between the m and the c. A related case is the use of U+2061 FUNCTION APPLICATION for an implied function dependence as in f(x + y). To indicate that this is the function f of the quantity x + y and not the expression fx + fy, one can insert the nondisplaying function application symbol between the f and the left parenthesis.
Another example is the expression f ij(cos(ab)), which means the same as fij(cos(a×b)), where × represents multiplication, not the cross product. Note that the spacing between characters may also depend on whether the adjacent variables are part of a list or are to be concatenated, that is, multiplied.
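The three invisible operators can be illustrated with plain strings. The sketch below builds the examples discussed above and uses Python's `unicodedata` module, whose data comes from a later Unicode version, to confirm the character names. The variable and function names are illustrative.

```python
import unicodedata

FUNCTION_APPLICATION = "\u2061"
INVISIBLE_TIMES = "\u2062"
INVISIBLE_SEPARATOR = "\u2063"
_INVISIBLE = {FUNCTION_APPLICATION, INVISIBLE_TIMES, INVISIBLE_SEPARATOR}

# m times c squared, the index list i, j, and f applied to (x + y)
energy = "m" + INVISIBLE_TIMES + "c\u00B2"
indices = "i" + INVISIBLE_SEPARATOR + "j"
call = "f" + FUNCTION_APPLICATION + "(x + y)"


def strip_invisible_operators(text: str) -> str:
    """Return the text as it would display, without the invisible operators."""
    return "".join(ch for ch in text if ch not in _INVISIBLE)


def operator_names(text: str):
    """List the names of the invisible operators present in the text, in order."""
    return [unicodedata.name(ch) for ch in text if ch in _INVISIBLE]
```

The stored strings are one code point longer than what is displayed: `energy` has four code points, while `strip_invisible_operators(energy)` has three.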
A more complete discussion of mathematical notation can be found in Proposed Draft Unicode Technical Report #25, “Unicode Support for Mathematics.”
Commercial Minus. U+2052 COMMERCIAL MINUS SIGN is used in commercial or tax related forms or publications in several European countries, including Germany and Scandinavia. The string “./.” appears to be used as a fallback representation for this character.
The symbol may also appear as a marginal note in letters, denoting enclosures. One variation replaces the top dot with a digit indicating the number of enclosures.
An additional usage of the sign appears in the Finno-Ugric Phonetic Alphabet (FUPA), where it marks a structurally-related borrowed element of different pronunciation. In Finland and a number of other European countries, the dingbats ⁒ and ✓ are used for “correct” and “incorrect” respectively in marking a student’s paper. This contrasts with American practice, for example, where ✓ and ✗ can be used for “correct” and “incorrect” respectively in the same context.
On page 155 of The Unicode Standard, Version 3.0 update the first full paragraph as follows:
This block encodes punctuation marks and symbols primarily used by writing systems that employ Han ideographs. Most of these characters are found in East Asian standards.
Add a new paragraph on page 155 of The Unicode Standard, Version 3.0 to follow the paragraph on U+3006:
U+3008, U+3009 angle brackets are unambiguously wide. The Unicode Standard encodes different characters for use in other contexts, such as mathematics. There are other characters in this block that have the same characteristics, including double angle brackets, tortoise shell brackets, and white square brackets.
With Unicode 3.0 and the concurrent second edition of ISO/IEC 10646-1, the representative glyphs for U+03C6 GREEK SMALL LETTER PHI and U+03D5 GREEK PHI SYMBOL were swapped. In ordinary Greek text, the character U+03C6 is used exclusively, although this character has considerable glyphic variation, sometimes represented with a glyph more like the representative glyph shown for U+03C6 (the “loopy” form) and less often with a glyph more like the representative glyph shown for U+03D5 (the “straight” form).
For mathematical and technical use, the straight form of the small phi is an important symbol and needs to be consistently distinguishable from the loopy form. The straight form phi glyph is used as the representative glyph for the symbol phi at U+03D5 to satisfy this distinction.
The reversed assignment of representative glyphs in versions of the Unicode Standard prior to Unicode 3.0 had the problem that the character explicitly identified as the mathematical symbol did not have the straight form of the character that is the preferred glyph for that use. Furthermore, it made it unnecessarily difficult for general purpose fonts supporting ordinary Greek text to also add support for Greek letters used as mathematical symbols. This resulted from the fact that many of those fonts already used the loopy form glyph for U+03C6, as preferred for Greek body text; to support the phi symbol as well, they would have had to disrupt glyph choices already optimized for Greek text.
When mapping symbol sets or SGML entities to the Unicode Standard, it is important to make sure that codes or entities that require the straight form of the phi symbol be mapped to U+03D5 and not to U+03C6. Mapping to the latter should be reserved for codes or entities that represent the small phi as used in ordinary Greek text.
Fonts used primarily for Greek text may use either glyph form for U+03C6, but fonts that also intend to support technical use of the Greek letters should use the loopy form to ensure appropriate contrast with the straight form used for U+03D5.
End of Ayah. U+06DD ARABIC END OF AYAH graphically encloses a sequence of zero or more digits (of General Category Nd) that follow it in the data stream. The enclosure terminates with any non-digit. For behavior of a similar prefixed formatting control, see the discussion of the Syriac Abbreviation Mark in Section 8.3, Syriac in The Unicode Standard, Version 3.0.
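The extent of the enclosure can be computed by scanning forward from the control character. The following is a minimal sketch, assuming Python's `unicodedata` module as the source of General Category values (its data is from a later Unicode version). The function name is illustrative.

```python
import unicodedata

END_OF_AYAH = "\u06DD"


def enclosed_digits(text: str, index: int) -> str:
    """Return the run of Nd digits that U+06DD at text[index] encloses."""
    if text[index] != END_OF_AYAH:
        raise ValueError("text[index] is not U+06DD ARABIC END OF AYAH")
    end = index + 1
    # The enclosure terminates with the first non-digit (or the end of the text).
    while end < len(text) and unicodedata.category(text[end]) == "Nd":
        end += 1
    return text[index + 1:end]
```

An empty result corresponds to the zero-digit case that the description above allows.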
Characters Whose Use is Discouraged. The use of the following characters is discouraged; they are being considered for possible deprecation in a future version of the Standard. These characters should be avoided in the normal representation of Khmer text:
17A3 KHMER INDEPENDENT VOWEL QAQ
17A4 KHMER INDEPENDENT VOWEL QAA
17B4 KHMER VOWEL INHERENT AQ
17B5 KHMER VOWEL INHERENT AA
17D3 KHMER SIGN BATHAMASAT
17D8 KHMER SIGN BEYYAL
For transliteration of Pali/Sanskrit, U+17A2 KHMER LETTER QA is recommended instead of U+17A3 KHMER INDEPENDENT VOWEL QAQ, and the sequence <U+17A2 KHMER LETTER QA, U+17B6 KHMER VOWEL SIGN AA> is recommended instead of U+17A4 KHMER INDEPENDENT VOWEL QAA.
The use of U+17D3 KHMER SIGN BATHAMASAT is not recommended for representation of Khmer lunar dates; a separate proposal for the full representation of Khmer lunar dates is under development.
U+17D8 KHMER SIGN BEYYAL is not recommended for use in the Khmer word meaning “etc.” It should be spelled out with a sequence of signs and letters instead.
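A text-checking tool might flag the discouraged characters and apply the two substitutions recommended above. The sketch below is illustrative only: the function names are hypothetical, and the substitutions are stated above for transliteration of Pali/Sanskrit, so applying them to arbitrary Khmer text is an assumption. The other four characters have no single-character replacement given here and are only reported.

```python
# Characters whose use is discouraged, as listed above.
_DISCOURAGED = frozenset("\u17A3\u17A4\u17B4\u17B5\u17D3\u17D8")

# Recommended substitutions: QAQ -> QA, and QAA -> <QA, VOWEL SIGN AA>.
_REPLACEMENTS = str.maketrans({"\u17A3": "\u17A2", "\u17A4": "\u17A2\u17B6"})


def find_discouraged(text: str):
    """Return (index, 'U+XXXX') pairs for each discouraged character found."""
    return [(i, f"U+{ord(ch):04X}") for i, ch in enumerate(text) if ch in _DISCOURAGED]


def apply_recommended(text: str) -> str:
    """Replace U+17A3 and U+17A4 with the recommended sequences."""
    return text.translate(_REPLACEMENTS)
```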
Combined Vowels. The Khmer language uses two dependent vowel signs whose Unicode representation consists of a sequence of two code points. These are khmer vowel sign srak om, represented by the sequence <U+17BB KHMER VOWEL SIGN U, U+17C6 KHMER SIGN NIKAHIT> and khmer vowel sign srak aam, represented by the sequence <U+17B6 KHMER VOWEL SIGN AA, U+17C6 KHMER SIGN NIKAHIT>. The nikahit represents the final nasalization of the vowel, shown by the “m” in the transliteration. These dependent vowels are treated as units, for the purposes of enumeration of the “letters” of Khmer, and most importantly for collation. Having these vowels represented by a sequence of two Unicode code points may be unexpected for Khmer implementers. It is important, therefore, to ensure that these sequences are treated as units when implementing Khmer.
Subscript Letters. The Unicode encoding of the Khmer script uses an independent (and invisible) coeng sign to indicate that the following consonant is subscripted, by analogy with the virama model employed for representing conjuncts in Indian scripts. Subscripted independent vowels are encoded in the same manner. This approach uses an artificial coeng sign character which does not exist as a letter or sign in the Khmer script, and therefore departs from the ordinary way that Khmer is conceived of and taught to native Khmer speakers. Consequently, the encoding may not be intuitive to a native user of the Khmer writing system. Ordinarily, the units such as khmer consonant coeng ka are conceived of as independent and unitary subscript letters, rather than as a result of conjunct formation.
To aid Khmer script users, a full listing of all the Khmer subscript letters has been provided in the table, “Additional Khmer Character Names”, together with appropriate names for them which follow preferred Khmer practice. While the Unicode encoding represents both the subscripts and the combined vowel letters with a pair of code points, they must be treated as a unit for most processing purposes. In other words they must function as if they had been encoded as a single character. The combined vowel characters are also included in this list, and should also be treated as a unit in processing.
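Treating these pairs as units can be sketched as a simple segmentation pass. This is an illustration under stated assumptions, not a complete Khmer cluster algorithm: the coeng sign is paired with a following character in the range U+1780..U+17B3 (consonants and independent vowels), the two combined vowels are paired with a following nikahit, and every other code point is returned on its own. The function name is hypothetical.

```python
COENG = "\u17D2"
NIKAHIT = "\u17C6"
_COMBINED_VOWEL_FIRST = {"\u17BB", "\u17B6"}  # VOWEL SIGN U, VOWEL SIGN AA


def khmer_units(text: str):
    """Split text into units, keeping subscripts and combined vowels together."""
    units = []
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        is_subscript = ch == COENG and "\u1780" <= nxt <= "\u17B3"
        is_combined_vowel = ch in _COMBINED_VOWEL_FIRST and nxt == NIKAHIT
        if is_subscript or is_combined_vowel:
            units.append(ch + nxt)
            i += 2
        else:
            units.append(ch)
            i += 1
    return units
```

With this sketch, the sequence <U+1780, U+17D2, U+1780, U+17B6, U+17C6> yields three units: the base consonant, the subscript, and the combined vowel.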
A full Khmer script chart is also provided, showing all of the Khmer characters preferred for modern Khmer usage, including the subscripts and combined vowels. This chart is better for didactic purposes in representing the Khmer script and its Unicode encoding. By contrast, the main Unicode code chart does not reflect the modern reading rules for Khmer, and thereby can give a misleading picture of the structure of the script.
Consonants | | | | | | | | | |
---|---|---|---|---|---|---|---|---|---|
1780 | 1781 | 1782 | 1783 | 1784 | 1785 | 1786 | 1787 | 1788 | 1789 |
178A | 178B | 178C | 178D | 178E | 178F | 1790 | 1791 | 1792 | 1793 |
1794 | 1795 | 1796 | 1797 | 1798 | 1799 | 179A | 179B | 179C | 179D |
179E | 179F | 17A0 | 17A1 | 17A2 | | | | | |
Independent Vowels | | | | | | | | | |
17A5 | 17A6 | 17A7 | 17A9 | 17AA | 17AB | 17AC | 17AD | 17AE | 17AF |
17B0 | 17B1 | 17B3 | | | | | | | |
Dependent Vowel Signs | | | | | | | | | |
17B6 | 17B7 | 17B8 | 17B9 | 17BA | 17BB | 17BC | 17BD | 17BE | 17BF |
17C0 | 17C1 | 17C2 | 17C3 | 17C4 | 17C5 | 17BB 17C6 | 17C6 | 17B6 17C6 | 17C7 |
Subscript Characters | | | | | | | | | |
17D2 1780 | 17D2 1781 | 17D2 1782 | 17D2 1783 | 17D2 1784 | 17D2 1785 | 17D2 1786 | 17D2 1787 | 17D2 1788 | 17D2 1789 |
17D2 178A | 17D2 178B | 17D2 178C | 17D2 178D | 17D2 178E | 17D2 178F | 17D2 1790 | 17D2 1791 | 17D2 1792 | 17D2 1793 |
17D2 1794 | 17D2 1795 | 17D2 1796 | 17D2 1797 | 17D2 1798 | 17D2 1799 | 17D2 179A | 17D2 179B | 17D2 179C | 17D2 179D |
17D2 179E | 17D2 179F | 17D2 17A0 | 17D2 17A2 | 17D2 17A7 | 17D2 17AB | 17D2 17AF | | | |
Various Signs | | | | | | | | | |
17C8 | 17CB | 17CC | 17CD | 17CE | 17CF | 17D0 | 17D1 | 17D4 | 17D5 |
17D6 | 17D7 | 17D9 | 17DA | 17DC | 17DB | 17C9 | 17CA | | |
Digits | | | | | | | | | |
17E0 | 17E1 | 17E2 | 17E3 | 17E4 | 17E5 | 17E6 | 17E7 | 17E8 | 17E9 |
Code | Name |
---|---|
17BB 17C6 | khmer vowel sign srak om |
17B6 17C6 | khmer vowel sign srak am |
17D2 1780 | khmer consonant sign coeng ka |
17D2 1781 | khmer consonant sign coeng kha |
17D2 1782 | khmer consonant sign coeng ko |
17D2 1783 | khmer consonant sign coeng kho |
17D2 1784 | khmer consonant sign coeng ngo |
17D2 1785 | khmer consonant sign coeng ca |
17D2 1786 | khmer consonant sign coeng cha |
17D2 1787 | khmer consonant sign coeng co |
17D2 1788 | khmer consonant sign coeng cho |
17D2 1789 | khmer consonant sign coeng nyo |
17D2 178A | khmer consonant sign coeng da |
17D2 178B | khmer consonant sign coeng ttha |
17D2 178C | khmer consonant sign coeng do |
17D2 178D | khmer consonant sign coeng ttho |
17D2 178E | khmer consonant sign coeng na |
17D2 178F | khmer consonant sign coeng ta |
17D2 1790 | khmer consonant sign coeng tha |
17D2 1791 | khmer consonant sign coeng to |
17D2 1792 | khmer consonant sign coeng tho |
17D2 1793 | khmer consonant sign coeng no |
17D2 1794 | khmer consonant sign coeng ba |
17D2 1795 | khmer consonant sign coeng pha |
17D2 1796 | khmer consonant sign coeng po |
17D2 1797 | khmer consonant sign coeng pho |
17D2 1798 | khmer consonant sign coeng mo |
17D2 1799 | khmer consonant sign coeng yo |
17D2 179A | khmer consonant sign coeng ro |
17D2 179B | khmer consonant sign coeng lo |
17D2 179C | khmer consonant sign coeng vo |
17D2 179D | khmer consonant sign coeng sha |
17D2 179E | khmer consonant sign coeng ssa |
17D2 179F | khmer consonant sign coeng sa |
17D2 17A0 | khmer consonant sign coeng ha |
17D2 17A2 | khmer consonant sign coeng qa |
17D2 17A7 | khmer vowel sign coeng qu |
17D2 17AB | khmer vowel sign coeng ry |
17D2 17AF | khmer vowel sign coeng qe |
The first of these four scripts, Tagalog, is no longer used, although the other three, Hanunóo, Buhid, and Tagbanwa, are living scripts of the Philippines. South Indian scripts of the Pallava dynasty made their way to the Philippines, although the exact route is uncertain. They may have been transported by way of the Kavi scripts of Western Java between the 10th and 14th centuries CE.
There are written accounts of the Tagalog script by Spanish missionaries, and documents in Tagalog dating from the mid-1500s. The first book in this script was printed in Manila in 1593. While the Tagalog script was used to write Tagalog, Bisaya, Ilocano, and other languages, it fell out of normal use by the mid-1700s; modern Tagalog language is now written in the Latin script.
The three living scripts, Hanunóo, Buhid, and Tagbanwa, are related to Tagalog, but may not be directly descended from it. The Hanunóo and the Buhid peoples live in Mindoro, while the Tagbanwa live in Palawan. Hanunóo enjoys the most use; it is widely used to write love poetry, a popular pastime among the Hanunóo. Tagbanwa is less used.
The Philippine scripts share features with the other Brahmi-derived scripts to which they are related.
Consonant Letters. Philippine scripts have consonants containing an inherent -a vowel, which may be modified by the addition of vowel signs or canceled (killed) by the use of a virama-type mark.
Independent Vowel Letters. Philippine scripts have null consonants which are used to write syllables that start with a vowel.
Dependent Vowel Signs. The vowel -i is written with a mark above the associated consonant, and the vowel -u with an identical mark below. The mark is known in Tagalog as kudlit “diacritic,” tuldik “accent,” or tildok “dot,” and ulitan “diacritic” in Tagbanwa. The Philippine scripts employ only the two vowel signs i and u, which are also used to stand for the vowels e and o respectively.
Virama. Though all languages normally written with the Philippine scripts have syllables ending in consonants, not all of the scripts have a mechanism for expressing the canceled -a. As a result, in those orthographies, the final consonants are unexpressed. Francisco Lopez introduced a cross-shaped virama in his 1620 catechism in the Ilocano language, but this innovation did not seem to find favor with native users, who seem to have considered the script adequate without it (they preferred kakapi to kakampi). A similar reform for the Hanunóo script seems to have been better received. The Hanunóo pamudpod was devised by Antoon Postma, who went to the Philippines from the Netherlands in the mid-1950s. In traditional orthography, si apu ba upada is, with the pamudpod, rendered more accurately as si aypud bay upadan; the Hanunóo pronunciation is si aypod bay upadan. The Tagalog virama and Hanunóo pamudpod cancel only the inherent -a. No conjunct consonants are employed in the Philippine scripts.
Directionality. The Philippine scripts are read from left to right in horizontal lines running from top to bottom. They may be written or carved either in that manner, or in vertical lines running from bottom to top, moving from left to right. In the latter case, the letters are written sideways so they may be read horizontally. This method of writing is probably due to the medium and writing implements used. Text is often scratched with a sharp instrument onto beaten strips of bamboo which are held pointing away from the body and worked from the proximal to distal ends, in columns from left to right.
Rendering. In Tagalog and Tagbanwa, the vowel signs simply rest over or under the consonants. In Hanunóo and Buhid, however, special ligatures are often formed as shown in the following tables.
Hanunóo |
Buhid |
Punctuation. Punctuation has been unified for the Philippine scripts. In the Hanunóo block, U+1735 PHILIPPINE SINGLE PUNCTUATION and U+1736 PHILIPPINE DOUBLE PUNCTUATION are encoded. Tagalog makes use only of the latter; Hanunóo, Buhid, and Tagbanwa make use of both of them.
Unicode 3.2 adds 59 new ideographs to the Compatibility Ideographs block. These new compatibility ideographs are found from U+FA30 to U+FA6A. They are included in the Unicode Standard to provide full round-trip compatibility with the ideographic repertoire of JIS X 0213:2000 and should not be used for any other purpose.
Katakana Phonetic Extensions: U+31F0..U+31FF
These extensions to the Katakana syllabary are all “small” variants. They are used in Japan for phonetic transcription of Ainu and other languages.
When Hangul compatibility jamo are transformed with a compatibility normalization form, NFKD or NFKC, the characters are converted to the corresponding conjoining jamo characters. Where the characters are intended to remain in separate syllables after such transformation, they may require separation from adjacent characters. This can be done by inserting any non-Korean character.
For example, the table below illustrates how two Hangul compatibility jamo can be separated in display, even after transforming with NFKD or NFKC.
Original | NFKD | NFKC | Display |
---|---|---|---|
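The effect described above can be observed with a short sketch using Python's standard `unicodedata` module. The choice of U+200B ZERO WIDTH SPACE as the separator is an assumption made for illustration; the text above only requires some non-Korean character.

```python
import unicodedata

KIYEOK = "\u3131"  # HANGUL LETTER KIYEOK (compatibility jamo)
A = "\u314F"       # HANGUL LETTER A (compatibility jamo)
ZWSP = "\u200B"    # ZERO WIDTH SPACE: an assumed separator, any non-Korean character would do


def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


samples = (("adjacent", KIYEOK + A), ("separated", KIYEOK + ZWSP + A))
for label, text in samples:
    for form in ("NFKD", "NFKC"):
        print(label, form, code_points(unicodedata.normalize(form, text)))
```

Without the separator, NFKC composes the two conjoining jamo into a single precomposed syllable; with it, the jamo stay apart in both forms.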
Like Arabic letters, Mongolian letters have various presentation forms depending on their positions in words. There are additional linguistic constraints that result in variations that must be employed in specific contexts, creating the need for several Mongolian-specific free variation selectors, which are encoded at U+180B, U+180C, and U+180D.
The table of standardized variants in the Unicode Character Database found at http://www.unicode.org/Public/3.2-Update/StandardizedVariants-3.2.0.html provides a description of the variant appearances corresponding to the use of appropriate variation selectors with all allowed base Mongolian characters. Only some presentation forms of the base Mongolian characters used with the Mongolian free variation selectors produce variant appearances. These combinations are exhaustively listed and described in the table. All combinations not listed in the table are unspecified and are reserved for future standardization; no conformant process may interpret them as standardized variants.
For more information, see Section 13.7, Variation Selectors, later in this document.
In addition to the symbols in these blocks, mathematical and scientific notation makes frequent use of arrows, punctuation characters, letterlike symbols, geometrical shapes and other miscellaneous and technical symbols. For additional information on all the mathematical operators and other symbols, see Proposed Draft Unicode Technical Report #25, “Unicode Support for Mathematics.”
Other symbols used in mathematical and scientific notation can be found in the Geometric Shapes block. For an extensive discussion of mathematical alphanumeric symbols, see Section 12.2, Letterlike Symbols in The Unicode Standard, Version 3.0. For additional information on all the mathematical operators and other symbols, see Proposed Draft Unicode Technical Report #25, “Unicode Support for Mathematics.”
The Unicode Standard defines a number of additional blocks to supplement the repertoire of mathematical operators and arrows. These additions are intended to extend the Unicode repertoire sufficiently to cover the needs of such applications as MathML, modern mathematical formula editing and presentation software, and symbolic algebra systems.
Standards. MathML, an XML application, is intended to support the full legacy collection of the ISO mathematical entity sets. Accordingly, the repertoire of mathematical symbols for the Unicode Standard has been supplemented by the full list of mathematical entity sets in ISO TR 9573-13, Public entity sets for mathematics and science. Additional repertoire was provided from the amalgamated collection of the STIX Project (Scientific and Technical Information Exchange). That collection includes, but is not limited to, symbols gleaned from mathematical publications by experts of the American Mathematical Society and symbol sets provided by Elsevier Publishing and by the American Physical Society.
Semantics. The same mathematical symbol may have different meanings in different subdisciplines or different contexts. The Unicode Standard only encodes a single character for a single symbolic form. For example, the “+” symbol normally denotes addition in a mathematical context, but might refer to concatenation in a computer science context dealing with strings, or incrementation, or have any number of other functions in given contexts. It is up to the application to distinguish such meanings according to the appropriate context. Where information is available about the usage (or usages) of particular symbols, it has been indicated in the character annotations in Chapter 14, Code Charts in The Unicode Standard, Version 3.0.
This block contains many additional symbols to supplement the collection of mathematical operators.
This block contains symbols used mostly as operators or delimiters in mathematical notation.
Mathematical Brackets. The mathematical white square brackets, angle brackets, and double angle brackets encoded at U+27E6..U+27EB are intended for ordinary mathematical use of these particular bracket types. They are unambiguously narrow, for use in mathematical and scientific notation, and should be distinguished from the corresponding wide forms of white square brackets, angle brackets, and double angle brackets used in CJK typography. (See the CJK Symbols and Punctuation block.) Note especially that the “bra” and “ket” angle brackets, U+2329 LEFT-POINTING ANGLE BRACKET and U+232A RIGHT-POINTING ANGLE BRACKET, are now deprecated for use with mathematics because of their canonical equivalence to CJK angle brackets, which is likely to result in unintended spacing problems if used in mathematical formulae.
This block contains miscellaneous symbols used for mathematical notation, including fences and other delimiters. Some of the symbols in this block may also be used as operators in some contexts.
Wiggly Fence. U+29D8 LEFT WIGGLY FENCE has a superficial similarity to U+FE34 PRESENTATION FORM FOR VERTICAL WAVY LOW LINE. The latter is a wiggly sidebar character, intended for legacy support as a style of underlining character in a vertical text layout context; it has a compatibility mapping to U+005F LOW LINE. This represents a very different usage from the standard use of fence characters in mathematical notation.
This block contains a small additional set of arrows to supplement the main set in the Arrows block.
Long Arrows. The long arrows encoded in the range U+27F5..U+27FF map to standard SGML entity sets supported by MathML. Long arrows represent distinct semantics from their short counterparts, rather than mere stylistic glyph differences. For example, the shorter forms of arrows are often used in connection with limits, whereas the longer ones are associated with mappings. The use of the long arrows is so common that they were assigned entity names in the ISOAMSA entity set, one of the suite of mathematical symbol entity sets covered by the Unicode Standard.
This block contains a large additional repertoire of arrows to round out the main set in the Arrows block.
Keytop Labels. [to precede “Crops and Quine Corners”] Where possible, keytop labels have been unified with other symbols of like appearance, for example U+21E7 UPWARDS WHITE ARROW to indicate the shift key. While symbols such as U+2318 PLACE OF INTEREST SIGN and U+2388 HELM SYMBOL are generic symbols that have been adapted to use on keytops, other symbols specifically follow ISO/IEC 9995-7.
Angle Brackets. [to follow “Crops and Quine Corners”] U+2329 LEFT-POINTING ANGLE BRACKET and U+232A RIGHT-POINTING ANGLE BRACKET have long been canonically equivalent to the CJK punctuation characters, U+3008 LEFT ANGLE BRACKET and U+3009 RIGHT ANGLE BRACKET, respectively. This canonical equivalence implies that the use of the latter (CJK) code points is preferred, and that U+2329 and U+232A are also “wide” characters. (See Unicode Standard Annex #11, “East Asian Width”, for the definition of the East Asian wide property.) Because of this fact, the use of U+2329 and U+232A is deprecated for mathematics and technical publication, where the wide property of the characters has the potential for interfering with proper formatting of mathematical formulae. Instead, use the angle brackets specifically provided for mathematics: U+27E8 MATHEMATICAL LEFT ANGLE BRACKET and U+27E9 MATHEMATICAL RIGHT ANGLE BRACKET. See Section 12.4, Mathematical Operators earlier in this document.
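The consequence of this canonical equivalence can be seen in a minimal sketch using Python's standard `unicodedata` module. The helper name `prefer_math_angle_brackets` is hypothetical and only illustrates the recommendation above; it is not part of any standard API.

```python
import unicodedata

# Map the deprecated angle brackets to the mathematical ones named above.
MATH_ANGLE = str.maketrans({"\u2329": "\u27E8", "\u232A": "\u27E9"})


def prefer_math_angle_brackets(text):
    """Hypothetical helper: replace U+2329/U+232A with U+27E8/U+27E9."""
    return text.translate(MATH_ANGLE)


bra_ket = "\u2329x\u232A"

# Canonical normalization silently turns U+2329/U+232A into the CJK brackets.
normalized = unicodedata.normalize("NFC", bra_ket)
print([f"U+{ord(ch):04X}" for ch in normalized])

# The mathematical brackets have no decomposition and survive normalization.
fixed = prefer_math_angle_brackets(bra_ket)
print([f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFC", fixed)])
```

The substitution has to happen before normalization: once U+2329 has become U+3008, it can no longer be told apart from a bracket that was intended as CJK punctuation.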
Symbol Pieces. [to follow “APL Functional Symbols”] The characters in the range U+239B..U+23B3, plus U+23B7, comprise a set of bracket and other symbol fragments for use in mathematical typesetting. These pieces originated in older font standards, but have been used in past mathematical processing as characters in their own right to make up extra-tall glyphs for enclosing multi-line mathematical formulae. Mathematical fences are ordinarily sized to the content that they enclose. However, in creating a large fence, the glyph is not scaled proportionally; in particular the displayed stem weights must remain compatible with the accompanying smaller characters. Thus, simple scaling of font outlines cannot be used to create tall brackets. Instead, a common technique is to build up the symbol from pieces. In particular, the characters U+239B LEFT PARENTHESIS UPPER HOOK through U+23B3 SUMMATION BOTTOM represent a set of glyph pieces for building up large versions of the fences (, ), [, ], {, and }, and of the large operators ∑ and ∫. These brace and operator pieces are compatibility characters. They should not be used in stored mathematical text, but are often used in the data stream created by display and print drivers.
The following table shows which pieces are intended to be used together to create specific symbols.
Use of Symbol Pieces
Symbol | 2-row | 3-row | 5-row |
---|---|---|---|
Summation | 23B2, 23B3 | | |
Integral | 2320, 2321 | 2320, 23AE, 2321 | 2320, 3×23AE, 2321 |
Left Parenthesis | 239B, 239D | 239B, 239C, 239D | 239B, 3×239C, 239D |
Right Parenthesis | 239E, 23A0 | 239E, 239F, 23A0 | 239E, 3×239F, 23A0 |
Left Bracket | 23A1, 23A3 | 23A1, 23A2, 23A3 | 23A1, 3×23A2, 23A3 |
Right Bracket | 23A4, 23A6 | 23A4, 23A5, 23A6 | 23A4, 3×23A5, 23A6 |
Left Brace | 23B0, 23B1 | 23A7, 23A8, 23A9 | 23A7, 23AA, 23A8, 23AA, 23A9 |
Right Brace | 23B1, 23B0 | 23AB, 23AC, 23AD | 23AB, 23AA, 23AC, 23AA, 23AD |
For example, an instance of U+239B can be positioned relative to instances of U+239C and U+239D to form an extra-tall (three or more line) left parenthesis. The center sections encoded here are meant to be used only with the top and bottom pieces encoded adjacent to them because the segments are usually graphically constructed within the fonts so that they match perfectly when positioned at the same x coordinates.
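The build-up technique for parentheses and square brackets can be sketched as follows. This is an illustration only, as a display driver might do it: the function name `tall_fence` and the dictionary layout are assumptions, and braces, integrals, and summations are left out because they need a middle piece or follow a different pattern.

```python
# (top piece, extension piece, bottom piece) for each fence, by character name:
# upper hook or corner, extension, lower hook or corner.
PIECES = {
    "left parenthesis": ("\u239B", "\u239C", "\u239D"),
    "right parenthesis": ("\u239E", "\u239F", "\u23A0"),
    "left bracket": ("\u23A1", "\u23A2", "\u23A3"),
    "right bracket": ("\u23A4", "\u23A5", "\u23A6"),
}


def tall_fence(name, rows):
    """Return the pieces, top to bottom, for a fence spanning `rows` lines."""
    if rows < 2:
        raise ValueError("a built-up fence needs at least two rows")
    top, extension, bottom = PIECES[name]
    # Only the matching extension is repeated between its own top and bottom.
    return [top] + [extension] * (rows - 2) + [bottom]


for piece in tall_fence("left parenthesis", 5):
    print(f"U+{ord(piece):04X}")
```

Each returned piece is meant to be placed on its own line at the same x coordinate, which is why the extension is never mixed with the top or bottom of a different fence.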
Vertical Square Brackets. The vertical square brackets, U+23B4 TOP SQUARE BRACKET and U+23B5 BOTTOM SQUARE BRACKET, are compatibility characters for legacy applications emulating certain terminals. They are intended for those terminal applications only, for limited use in vertically-oriented bracketed expressions. U+23B6 BOTTOM SQUARE BRACKET OVER TOP SQUARE BRACKET is used when a single character cell is both the end of one such expression and the start of another. These compatibility characters should not be confused with the general need for rotated glyphs for parentheses, brackets, braces, and quotation marks for vertically rendered CJK text. Such rotations should be handled by fonts and rendering software, rather than by separate encoding of each rotated glyph as a character. See further discussion in Section 6.1, General Punctuation in The Unicode Standard, Version 3.0.
Terminal Graphics Characters. In addition to the box-drawing characters in the Box Drawing block, a small number of additional vertical or horizontal line characters are encoded in the Miscellaneous Technical symbols block to complete the set of compatibility characters needed for applications which need to emulate various old terminals. The horizontal scan line characters, U+23BA HORIZONTAL SCAN LINE-1 through U+23BD HORIZONTAL SCAN LINE-9, in particular, represent characters that were encoded in character ROM for use with 9-line character graphic cells. Horizontal scan line characters are encoded for scan lines 1, 3, 7, and 9. The horizontal scan line character for scan line 5 is unified with U+2500 BOX DRAWINGS LIGHT HORIZONTAL.
Dental Symbols. The symbols from U+23BE to U+23CC form a set from JIS X 0213 for use in dental notation.
Standards. This block contains a large number of symbols from ISO/IEC 9995-7:1994, Information technology—Keyboard layouts for text and office systems—Part 7: Symbols used to represent functions.
Plastic Bottle Material Code System. The seven numbered logos encoded from U+2673 to U+2679 are from “The Plastic Bottle Material Code System,” introduced in 1988 by the Society of the Plastics Industry (SPI) (see http://www.socplas.org). This set consistently uses thin, two-dimensional curved arrows suitable for use in plastics molding. In actual use, the symbols often are combined with an abbreviation of the material class below the triangle. Such abbreviations are not universal; therefore, they are not present in the representative glyphs in Chapter 14, Code Charts in The Unicode Standard, Version 3.0.
Recycling Symbol for Generic Materials. An unnumbered plastic resin code symbol U+267A RECYCLING SYMBOL FOR GENERIC MATERIALS is not formally part of the SPI system, but is found in many fonts. Occasional use of this symbol as a generic materials code symbol can be found in the field, usually with a text legend below, but sometimes also surrounding (or overlaid by) other text or symbols. Sometimes, the UNIVERSAL RECYCLING SYMBOL is substituted for the generic symbol in this context.
Universal Recycling Symbol. Unicode encodes two common glyph variants of this symbol, U+2672 UNIVERSAL RECYCLING SYMBOL and U+267B BLACK UNIVERSAL RECYCLING SYMBOL. Both are used to indicate that the material is recyclable. The white form is the traditional version of the symbol, but the black form is sometimes substituted, presumably because the thin outlines of the white form do not always reproduce well.
Paper Recycling Symbols. The two paper recycling symbols U+267C RECYCLED PAPER SYMBOL and U+267D PARTIALLY-RECYCLED PAPER SYMBOL can be used to distinguish fully and partially recycled fiber content in paper products or packaging. They are usually accompanied by additional text.
The following text replaces the text on Dingbats on pages 305-306 of The Unicode Standard, Version 3.0:
The Dingbats are derived from a well-established set of glyphs, the ITC Zapf Dingbats series 100, which comprises the industry standard “Zapf Dingbat” font currently available in most laser printers. Other series of dingbat glyphs also exist, but are not encoded in the Unicode Standard because they are not widely implemented in existing hardware and software as character-encoded fonts. The order of the Dingbats block basically follows the PostScript encoding.
Unifications. Where a dingbat from the ITC Zapf Dingbats series 100 could be unified with a generic symbol widely used in other contexts, only the generic symbol was encoded. This accounts for the encoding gaps in the Dingbats block. Examples of such unifications include card suits, BLACK STAR, BLACK TELEPHONE, and BLACK RIGHT-POINTING INDEX (see “Miscellaneous Symbols”); BLACK CIRCLE and BLACK SQUARE (see “Geometric Shapes”); white encircled numbers 1 to 10 (see “Enclosed Alphanumerics”); and several generic arrows (see “Arrows”). Those four entries appear elsewhere in this section.
In other instances, glyphs from the ITC Zapf Dingbats series 100 have come to be recognized as having applicability as generic symbols, despite having originally been encoded in the Dingbats block. For example, the series of negative (black) circled numbers 1 to 10 are now treated as generic symbols for this sequence, the continuation of which can be found in “Enclosed Alphanumerics”. Other examples include U+2708 AIRPLANE and U+2709 ENVELOPE, which have definite semantics independent of the specific glyph shape, and which therefore should be considered generic symbols, rather than as symbols representing only the Zapf Dingbat glyph shapes.
For many of the remaining characters in the Dingbats block, their semantic value is primarily their shape; unlike characters that represent letters from a script, there is no well-established range of typeface variations for a dingbat that will retain its identity and therefore its semantics. It would be incorrect to arbitrarily replace U+279D TRIANGLE-HEADED RIGHTWARDS ARROW with any other right arrow dingbat or with any of the generic arrows from the Arrows block (U+2190..U+21FF). But exact shape retention for the glyphs is not always required in order to maintain the relevant distinctions. For example, ornamental characters such as U+2741 EIGHT PETALLED OUTLINE BLACK FLORETTE have been successfully implemented in font faces other than Zapf Dingbats with glyph shapes which are similar, but not identical to the ITC Zapf Dingbats series 100.
The following guidelines are provided for font developers wishing to support this block of characters. Characters showing large sets of contrastive glyph shapes in the Dingbats block, and in particular the various arrow shapes at U+2794..U+27BE, should have glyphs that are closely modeled on the ITC Zapf Dingbats series 100, which are shown as representative glyphs in the code charts. The same applies to the various stars, asterisks, and snowflakes, drop-shadowed squares, checkmarks, and x’s, many of which are ornamental, and have an elaborate name describing their glyph.
Where the above does not apply, or where dingbats have more generic applicability as a symbol, their glyphs do not need to match the representative glyphs in the code charts in every detail.
Ornamental Brackets. The 14 ornamental brackets encoded at U+2768..U+2775 are a late addition to the set of Zapf Dingbats encoded in the Unicode Standard. Although they have always been included in Zapf Dingbats fonts, they were unencoded in PostScript versions of the fonts on some platforms, and hence were omitted from the original set encoded in Unicode. They have been added for compatibility and consistency in handling of the cmaps for current versions of the fonts.
These mathematical variants are all produced with the addition of U+FE00 VARIATION SELECTOR-1 (VS1) to mathematical operator base characters. Only the valid, recognized combinations are listed in the table of standardized variants. All combinations not listed here are unspecified and are reserved for future standardization; no conformant process may interpret them as standardized variants.
In Unicode 3.2 the representative glyphs for U+2278 NEITHER LESS-THAN NOR GREATER-THAN and U+2279 NEITHER GREATER-THAN NOR LESS-THAN are changed from using a vertical cancellation to using a slanted cancellation. This change was made to match the long standing canonical decompositions for these characters, which use U+0338 COMBINING LONG SOLIDUS OVERLAY. Irrespective of this change to the representative glyphs, the symmetric forms using the vertical stroke are acceptable glyph variants. Using U+2278 or U+2279 with VS1 will request these variants explicitly, as will using U+2276 LESS-THAN OR GREATER-THAN or U+2277 GREATER-THAN OR LESS-THAN with U+20D2 COMBINING LONG VERTICAL LINE OVERLAY. Unless fonts are created with the intention to add support for both forms (via VS1 for the upright forms), there is no need to revise the glyphs in existing fonts; the glyphic range implied by using the base character code alone encompasses both shapes.
For more information, see Section 13.7, Variation Selectors, later in this document.
The combining grapheme joiner is used to indicate that adjacent characters belong to the same grapheme cluster. Grapheme clusters are sequences of one or more encoded characters that correspond to what users think of as characters. They include, but are not limited to, combining character sequences such as (g + °), digraphs such as Slovak “ch”, or sequences with letter modifiers such as kʷ. Grapheme cluster boundaries are important for collation, regular expressions, and counting “character” positions within text. The Unicode Standard provides a determination of where the default grapheme boundaries fall in a string of characters. This algorithm can be customized for specific locales.
Note: The rules for default grapheme cluster boundaries, default word boundaries and default sentence boundaries are in the process of being superseded by a new Unicode Technical Report #29, Text Boundaries.
There are circumstances where even the locale-specific determination of grapheme boundaries may need to be further tailored on a local basis. These include:
The character U+034F COMBINING GRAPHEME JOINER has been added to prevent inappropriate grapheme breaks. The properties of this character are specified so as to work well with current software for such processes as grapheme-cluster determination, line-break, and collation. In terms of grapheme determination it functions like the Indic viramas. Thus a sequence consisting of a character, the combining grapheme joiner, and a following character functions as a single grapheme.
The grapheme joiner prevents line breaking between adjacent characters; however, where the prevention of line breaking is the only desired effect, the word joiner should be used instead (see Unicode Standard Annex #14, “Line Breaking Properties”). In collation, the grapheme joiner should be ignored unless it specifically occurs within a tailored collation element mapping. Thus it is given a completely ignorable collation element in the default collation table, like NULL (see Unicode Technical Standard #10, “Unicode Collation Algorithm” and also ISO/IEC 14651). However, it can be entered into the tailoring rules for any given language, using the UCA and ISO/IEC 14651 tailoring capabilities.
For rendering, the grapheme joiner is an invisible combining character with canonical class of zero. It can bind adjacent characters into a base for combining marks in circumstances described in “Applications of Combining Marks” in Section 3.9, Special Character Properties (revision) in this document. For any specified repertoire, implementation support for this capability can be provided by means of ligature tables in the font, or by means of special placement rules (see http://partners.adobe.com/asn/developer/opentype/main.html). Some display engines may be able to supply runtime generative support. As with other combining marks, there is considerable latitude for display depending on the environment (such as the choice of font).
The combining grapheme joiner must not be confused with the zero width joiner, or the word joiner, which have very different functions. In particular, inserting a combining grapheme joiner between two characters has no effect on their ligation or cursive joining behavior.
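The properties described above can be checked with a small sketch using Python's standard `unicodedata` module. The helpers `join_as_grapheme` and `strip_cgj` are hypothetical names introduced for illustration; in particular, `strip_cgj` only approximates the "completely ignorable" treatment in default collation and is not a collation implementation.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER


def join_as_grapheme(first, second):
    """Hypothetical helper: mark two adjacent strings as one grapheme cluster."""
    return first + CGJ + second


def strip_cgj(text):
    """Drop the joiner, roughly as a comparison that ignores it would."""
    return text.replace(CGJ, "")


# An invisible combining mark (general category Mn) with canonical class zero.
print(unicodedata.name(CGJ), unicodedata.category(CGJ), unicodedata.combining(CGJ))

slovak_ch = join_as_grapheme("c", "h")
# Normalization leaves the joiner in place; ignoring it recovers the plain letters.
print(unicodedata.normalize("NFC", slovak_ch) == slovak_ch, strip_cgj(slovak_ch))
```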
In Unicode 3.1.1 and before, the codepoint U+FEFF serves two very different purposes:
If U+FEFF had only the semantic of a signature codepoint, it could be freely deleted from text without affecting the interpretation of the rest of the text. Carelessly appending files together, for example, can result in a signature codepoint in the middle of text. Unfortunately, U+FEFF also has significance as a character. As a ZWNBSP, it indicates that line breaks are not allowed between the adjoining characters. Thus U+FEFF impacts the interpretation of text, and cannot be freely deleted. The overloading of semantics for this codepoint has caused problems for programs and protocols.
The new character U+2060 WORD JOINER has the same semantics in all cases as U+FEFF, except that it cannot be used as a signature. That is, the function of the character is to indicate that the two adjacent characters should not be broken across lines. See the GL category in Unicode Standard Annex #14, “Line Breaking Properties”. In other contexts the character should be ignored.
Unicode 3.2 implementations should support this new character, but also support the ZWNBSP semantic of U+FEFF.
Note: Implementers are strongly encouraged to use the word joiner whenever word-joining semantics are intended.
The word joiner must not be confused with the zero width joiner or the combining grapheme joiner, which have very different functions. In particular, inserting a word joiner between two characters has no effect on their ligating or cursive joining behavior.
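One way an implementation might separate the two roles of U+FEFF is sketched below. The function name `separate_signature` is hypothetical, and the sketch assumes that any U+FEFF after the first position was intended as a zero width no-break space; that assumption does not hold for text produced by carelessly concatenating files, where an interior U+FEFF may be a stray signature.

```python
BOM = "\uFEFF"          # signature code point, also ZERO WIDTH NO-BREAK SPACE
WORD_JOINER = "\u2060"  # WORD JOINER, new in Unicode 3.2


def separate_signature(text):
    """Return (had_signature, body).

    A leading U+FEFF is treated as a signature and removed. Interior
    occurrences are ASSUMED to carry ZWNBSP semantics and are rewritten
    as U+2060, which cannot be mistaken for a signature.
    """
    had_signature = text.startswith(BOM)
    body = text[1:] if had_signature else text
    return had_signature, body.replace(BOM, WORD_JOINER)


had_signature, body = separate_signature("\uFEFFnon\uFEFFbreaking")
print(had_signature, [f"U+{ord(ch):04X}" for ch in body if not ch.isalpha()])
```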
It is the task of the rendering system to select a ligature (where ligatures are possible) as part of the task of creating the most pleasing line layout. Fonts that provide more ligatures give the rendering system more options.
However, defining the locations where ligatures are possible cannot be done by the rendering system, because there are many languages in which this depends not on simple letter pair context but on the meaning of the word in question.
ZWJ and ZWNJ are to be used for the latter task, marking the non-regular cases where ligatures are required or prohibited. This is different from selecting a degree of ligation for stylistic reasons. Such selection is best done with style markup. See Unicode Technical Report #20, “Unicode in XML and other Markup Languages” for more information.
Unicode characters can be represented by a wide variety of glyphs, as discussed in Chapter 2, General Structure in The Unicode Standard, Version 3.0. Occasionally the need arises in text processing to restrict or change the set of glyphs that are to be used to represent a character. Normally such changes are indicated by choice of font or style in rich-text documents. In special circumstances, such a variation from the normal range of appearance needs to be expressed side-by-side in the same document in plain-text contexts, where it is impossible or inconvenient to exchange formatted text. For example, in languages employing the Mongolian script, sometimes a specific variant range of glyphs is needed for a specific textual purpose for which the range of “generic” glyphs is considered inappropriate. The variation selectors are used when characters have essentially the same semantic.
Variation selectors provide a mechanism for specifying a restriction on the set of glyphs that are used to represent a particular character. They also provide a mechanism for specifying variants, such as for CJK Ideographs and Mongolian, that have essentially the same semantic but have substantially different ranges of glyphs. A variation sequence, which always consists of a base character followed by the variation selector, may be specified as part of the Unicode Standard. That sequence is referred to as a variant of the base character. The variation selector affects only the appearance of the base character,* and only in the variation sequences defined in this Standard. The variation selector is not used as a general code extension mechanism:
Only the variation sequences specifically defined in the Unicode Character Database in the file StandardizedVariants.html are sanctioned for standard use; in all other cases the variation selector cannot change the visual appearance of the preceding base character from what it would have had, in the absence of the variation selector.
The base character in a variation sequence is never a combining character or a decomposable character.* The variation selectors themselves are combining marks of combining class 0, and are default ignorable characters. Thus if the variation sequence is not supported, the variation selector should be invisible and ignored. As with all default ignorable characters, this does not preclude modes or environments where the variation selectors should be given visible appearance. For example, a “Show Hidden” mode could reveal the presence of such characters with specialized glyphs, or a particular environment could use or require a visual indication of a base character (such as a wavy underline) to show that it is part of a standardized variation sequence that cannot be supported by the current font.
The standardization or support of a particular variation sequence does not limit the set of glyphs that can be used to represent the base character alone. If a user requires a visual distinction between a character and a particular variant of that character, then fonts must be used to make that distinction. The existence of a variation sequence does not preclude the later encoding of a new character with a distinct semantic and a similar or overlapping range of glyphs.
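A process that wants to find selectors that fall outside the sanctioned list might proceed along the lines of the following sketch. The `SANCTIONED` set is a deliberately tiny stand-in containing only the two mathematical sequences named in this document; a real implementation would load the full list from StandardizedVariants.html. The function names are hypothetical, and the range U+FE00..U+FE0F for the general variation selectors is taken from the Variation Selectors block rather than from the text above.

```python
VS1 = "\uFE00"  # VARIATION SELECTOR-1

# Stand-in for the table of standardized variants (illustration only).
SANCTIONED = {("\u2278", VS1), ("\u2279", VS1)}

MONGOLIAN_FVS = "\u180B\u180C\u180D"


def is_variation_selector(ch):
    """True for the Mongolian free variation selectors and U+FE00..U+FE0F."""
    return ch in MONGOLIAN_FVS or "\uFE00" <= ch <= "\uFE0F"


def unsanctioned_sequences(text, sanctioned=SANCTIONED):
    """Yield (base, selector) pairs that are not in the sanctioned set."""
    for base, selector in zip(text, text[1:]):
        if is_variation_selector(selector) and (base, selector) not in sanctioned:
            yield base, selector


print(list(unsanctioned_sequences("a\u2278\uFE00b")))  # sanctioned, nothing reported
print(list(unsanctioned_sequences("a\uFE00")))         # not in the stand-in table
```

Reporting rather than deleting the unsanctioned selectors is a design choice here: such selectors must not change the appearance of the base character, but the text itself is left untouched.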
* Note: Just before publication, an inconsistency was discovered between the above principles and the standardization of the two variant sequences <2278, FE00> and <2279, FE00> because U+2278 and U+2279 are in fact decomposable characters. Those variant sequences denote glyph variants of these mathematical symbols with a vertical line instead of a slanted line as the diacritic to indicate the negation. The sequence <2278, FE00> is canonically equivalent to <2276, 0338, FE00>, and the sequence <2279, FE00> is canonically equivalent to <2277, 0338, FE00>. So that these equivalent sequences are given equivalent rendering treatment, the use of U+FE00 would have to be interpreted, exceptionally, as defining a variant appearance for the entire sequence.
Because a combining vertical line overlay, U+20D2 COMBINING LONG VERTICAL LINE OVERLAY, is also available in the Standard, an alternate way of explicitly indicating these particular variants already exists. That alternative mechanism is a safer and more stable way to indicate the distinction, as the inherent complications in allowing variation selectors to follow combining marks may require future corrective action to remove the exceptional variant sequences <2278, FE00> and <2279, FE00> from the table.
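The canonical equivalence described in the note can be checked with a short sketch. It uses Python's standard `unicodedata` module, which reflects whatever Unicode version ships with the interpreter rather than Unicode 3.2 specifically; the helper names `nfd` and `code_points` are illustrative and not part of the Standard.

```python
import unicodedata

VS1 = "\uFE00"  # VARIATION SELECTOR-1


def nfd(text: str) -> str:
    """Return the canonical decomposition (NFD) of text."""
    return unicodedata.normalize("NFD", text)


def code_points(text: str) -> list[str]:
    """Format each character of text as a four-digit hex code point."""
    return [f"{ord(ch):04X}" for ch in text]


# The two exceptional variant sequences from the note, in precomposed form.
precomposed = ["\u2278" + VS1, "\u2279" + VS1]
# The sequences the note says they are canonically equivalent to.
decomposed = ["\u2276\u0338" + VS1, "\u2277\u0338" + VS1]

# Variation selectors have combining class 0, so canonical reordering
# never moves them relative to the preceding U+0338.
print("combining class of U+FE00:", unicodedata.combining(VS1))

for pre, dec in zip(precomposed, decomposed):
    print(code_points(pre), "->", code_points(nfd(pre)), nfd(pre) == dec)

# The alternative mentioned above: the base relation followed by
# U+20D2 COMBINING LONG VERTICAL LINE OVERLAY, with no variation selector.
print(code_points("\u2276\u20D2"))
```

After decomposition the variation selector follows U+0338 rather than the base character, which is the complication the note describes.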
Add the following text to the end of Section 14.1, Character Names List on page 335, The Unicode Standard, Version 3.0:
The character names list contains a number of informative subheads which help divide up the list into smaller sublists of similar characters. For example, in the Miscellaneous Symbols block, U+2600..U+26FF, there are subheads for “Astrological symbols”, “Chess symbols”, and so on. Such subheads are editorial and informative, and should not be taken as providing any definitive, normative status information about characters in the sublists they mark, nor about any constraints on what characters could be encoded in the future at reserved code points within their ranges. The subheads are subject to change.
The following code charts contain the characters added in Unicode 3.2. They are shown together with the characters that were part of Unicode 3.1. New characters are shown on a yellow background in these code charts.
Code Charts Notice:
Annotations for many characters have been added or revised throughout the code charts. These are not mentioned explicitly in the list above. Please see http://www.unicode.org/charts for a list of all code charts.
This article contains errata rolled up since the publication of The Unicode Standard, Version 3.1. These errata are listed by date in the table below. For prior errata from Unicode 3.1, see the errata listed in Unicode Standard Annex #27: Unicode 3.1 (http://www.unicode.org/reports/tr27/#errata).
Date | Summary |
---|---|
2002 February 26 | Corrigendum #3: U+F951 Normalization posted. NOTE: This corrigendum is incorporated in, and superseded by, this document. |
2002 January 18 | In UAX #27: Unicode 3.1, in Article IV, Guidelines under the subsection Unassigned Code Points, “U+FFFC” should instead read “U+FFFB” in the following sentence: To allow a greater degree of compatibility across versions of the standard, the ranges of U+2060..U+206F, U+FFF0..U+FFFB, and U+E0000..U+E0FFF are reserved for format and control characters (General Category = Cf). |
2001 September 25 | The character U+0B83 TAMIL SIGN VISARGA is actually a stand-alone character, not a combining character. This character's General Category has been changed from “Mc” to “Lo” in accordance with this. The glyph on the left below shows the character in previous charts; the glyph on the right shows the character as it should appear (without a dotted circle). See http://www.unicode.org/charts/PDF/U32-0B80.pdf. |
2001 April 25 | On p. 500, in the Unicode names list in TUS 3.0, the glyph for U+2032 was omitted. It is shown correctly in the code chart on page 498 or see http://www.unicode.org/charts/PDF/U2000.pdf. |
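The two data-related errata in the table above can be inspected programmatically. The following is a minimal sketch using Python's standard `unicodedata` module, whose `ucd_3_2_0` object exposes a Unicode 3.2.0 view of the database. The helper `erratum_state` is a hypothetical name, and the corrected U+F951 mapping to U+964B is taken from Corrigendum #3 itself, since the table row does not spell it out.

```python
import unicodedata


def erratum_state(db):
    """Return the properties touched by the two data-related errata.

    db is either the unicodedata module or unicodedata.ucd_3_2_0;
    both offer the same category() and decomposition() functions.
    """
    return {
        "U+0B83 General Category": db.category("\u0B83"),
        "U+F951 decomposition": db.decomposition("\uF951"),
    }


# The default database reflects whatever Unicode version this Python ships.
print(unicodedata.unidata_version, erratum_state(unicodedata))

# ucd_3_2_0 is the standard library's Unicode 3.2.0 view of the database.
ucd32 = unicodedata.ucd_3_2_0
print(ucd32.unidata_version, erratum_state(ucd32))
```

If the errata are reflected in the data, U+0B83 should be reported as “Lo” rather than “Mc”, and the decomposition of U+F951 should be the corrected mapping.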
The main change to the Unicode Character Database for Unicode 3.2 is the extension of the data files to cover the character repertoire addition. This most importantly impacts UnicodeData.txt, LineBreak.txt, and EastAsianWidth.txt, each of which has been extended to cover all the newly encoded characters. Also, an updated informative NamesList.txt file is provided to cover the new repertoire.
Property and Property Value Aliases. The PropertyAliases and PropertyValueAliases files contain recommended UCD property identifiers and property value identifiers. These identifiers can be used for XML formats of UCD data, for regular-expression property tests, and for other programmatic textual descriptions of Unicode data. In comparing identifiers, case differences should not be significant, and the presence or absence of an underbar should be ignored. The identifiers in the PropertyAliases and PropertyValueAliases files are normative in the following sense:
Where the identifiers are used to refer to Unicode properties or property values, they can only be used in accordance with the Unicode Character Database semantics.
This does not prevent implementations from using other identifiers to refer to Unicode properties or property values. For example, there is nothing to prevent the use of French translations of the identifiers.
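The comparison rule for identifiers (case differences not significant, underbars ignored) can be sketched as a small normalization step. The function names `loose_key` and `same_identifier` are hypothetical and are not defined by the Unicode Character Database; the sketch implements only the two rules stated above.

```python
def loose_key(identifier: str) -> str:
    """Reduce a property or property value identifier to a comparison key.

    Implements only the two rules stated in the text: underbars are
    dropped and case is folded. The identifiers are ASCII, so lower()
    is sufficient.
    """
    return identifier.replace("_", "").lower()


def same_identifier(a: str, b: str) -> bool:
    """Return True if a and b name the same property under loose comparison."""
    return loose_key(a) == loose_key(b)


# The old and new spellings of a renamed property compare as equal.
print(same_identifier("Bidi_Mirrored", "BidiMirrored"))
print(same_identifier("Bidi_Mirrored", "bidi_mirrored"))
# Different identifiers remain distinct.
print(same_identifier("Bidi_Mirrored", "Full_Composition_Exclusion"))
```

Storing aliases in a lookup table keyed by `loose_key` lets a parser accept any of the equivalent spellings with a single dictionary lookup.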
Blocks. The normative blocks defined in Blocks.txt have been adjusted slightly, in accordance with Unicode Technical Committee decisions.
The block property values are listed in the Blocks datafile, and are not repeated in the PropertyValueAliases datafile. (Block property values should be used with caution; for more information see Unicode Technical Report #18, “Unicode Regular Expression Guidelines”, Annex A.)
The notes for SpecialCasing.txt have been updated, and the rules for casing involving dotted letters (i, j) have been reformulated more generically.
An updated Index.txt has been provided, to make it easier to locate the newly added characters, particularly for mathematics.
The following new property files have been added:
Other new properties include:
For more information on these new properties, see the relevant documentation in the Unicode Character Database.
Note: For consistency with the property naming conventions, the property BidiMirrored has been renamed to Bidi_Mirrored (see DerivedBinaryProperties.txt). Also the property Comp_Ex has been renamed to Full_Composition_Exclusion (see DerivedNormalizationProps.txt).
For cross-platform interoperability, the file names will be restricted to no more than 31 characters in length. Due to this change in policy, DerivedNormalizationProps.txt is the new file name for the file formerly known as DerivedNormalizationProperties.txt.
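The effect of the length restriction on the renamed file can be verified with a short check. The constant and function names below are illustrative assumptions; only the 31-character limit and the two file names come from the text.

```python
MAX_FILE_NAME_LENGTH = 31  # limit stated above, counting the extension


def fits_limit(file_name: str) -> bool:
    """Return True if file_name is within the 31-character restriction."""
    return len(file_name) <= MAX_FILE_NAME_LENGTH


for name in ("DerivedNormalizationProperties.txt",
             "DerivedNormalizationProps.txt"):
    print(f"{name}: {len(name)} characters, fits limit: {fits_limit(name)}")
```

The former name is 34 characters long and exceeds the limit, while the new name is 29 characters long.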
The documentation files for the Unicode Character Database have been updated to reflect the additions of new property files and new character properties to existing files, and the new file name length restriction.
ISO/IEC 10646 is a multi-part standard. Part 1, published as ISO/IEC 10646-1:2000(E), covers the Architecture and Basic Multilingual Plane. Part 2, published as ISO/IEC 10646-2:2001(E), covers the supplementary planes. Amendment 1 to Part 1 makes a few modifications to the architecture of 10646 and adds about a thousand characters to the BMP.
Unicode 3.2 contains all of the characters of Amendment 1, including the two characters of Amendment 1 that had already been added to Unicode 3.1. With the publication of Amendment 1 to ISO/IEC 10646-1:2000 and the Unicode Standard, Version 3.2, the two standards are fully synchronized.
The Unicode Consortium and ISO/IEC JTC1/SC2/WG2 are committed to maintaining the synchronization between the two standards.
Notable among the architectural changes to ISO/IEC 10646 approved in Amendment 1 are:
ISO/IEC 9573-13: International Organization for Standardization. Information technology—SGML support facilities—Techniques for using SGML—Part 13: Public entity sets for mathematics and science. [Geneva], 1991. (ISO/IEC TR 9573-13:1991).
ISO/IEC 9995-7: Information technology—Keyboard layouts for text and office systems—Part 7: Symbols used to represent functions. [Geneva], 1994. (ISO/IEC 9995-7:1994).
ISO/IEC 14651: International Organization for Standardization. Information technology—International string ordering and comparison—Method for comparing character strings and description of the common template tailorable ordering. [Geneva], 2001. (ISO/IEC 14651:2001).
JIS X 0213: Japanese Industrial Standards Committee. 7 bitto oyobi 8 bitto no 2 baito jouhou koukan you fugouka kakuchou kanji shuugou (7-bit and 8-bit double byte coded extended KANJI sets for information interchange). Tokyo, 2000. (JIS X 0213:2000).
Doctrina christiana: the first book printed in the Philippines, Manila 1593. A facsimile of the copy in the Lessing J. Rosenwald Collection...with an introductory essay by Edwin Wolf II. Washington, DC, Library of Congress, 1947.
Kuipers, Joel C., and Ray McDermott. “Insular Southeast Asian Scripts.” In The World’s Writing Systems. Edited by Peter T. Daniels and William Bright. New York, Oxford University Press, 1996. ISBN 0-19-507993-0.
Santos, Hector. The Living Scripts. Los Angeles: Sushi Dog Graphics, 1995. (Ancient Philippine Scripts Series; 2). User’s guide accompanying Computer Fonts, Living Scripts software.
Santos, Hector. Our Living Scripts. January 31, 1997. http://www.bibingka.com/dahon/living/living.htm Part of his A Philippine Leaf.
Santos, Hector. The Tagalog Script. Los Angeles: Sushi Dog Graphics, 1994. (Ancient Philippine Scripts Series; 1). User’s guide accompanying Tagalog Script Fonts software.
Santos, Hector. The Tagalog Script. October 26, 1996. http://www.bibingka.com/dahon/tagalog/tagalog.htm Part of his A Philippine Leaf.
STIPUB Consortium. STIX (Scientific and Technical Information Exchange) Project. http://www.ams.org/STIX/
The following summarizes modifications from the previous version of this document. Modifications to this document will be limited to repairing straightforward typographical and production errors. Updates in content will be carried out via a future version of the Unicode Standard, published in a separate document.
3 |
|
Copyright © 2001-2002 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.