Revision | 3.0 |
Authors | Lisa Moore (lisam@us.ibm.com) |
Date | 1999-11-21 |
This Version | http://www.unicode.org/unicode/reports/tr8/tr8-3 |
Previous Version | http://www.unicode.org/unicode/reports/tr8/tr8-2 |
Latest Version | http://www.unicode.org/unicode/reports/tr8 |
This report documents the Unicode Standard, Version 2.1.
This document contains informative material and normative specifications which have been considered and approved by the Unicode Technical Committee for publication as a Technical Report and as part of the Unicode Standard, Version 2.1. Any reference to version 2.1 of the Unicode Standard automatically includes this technical report. Please mail corrigenda and other comments to the author.
The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 2.0. See http://www.unicode.org/unicode/standard/versions for more information.
Version 2.1 of the Unicode Standard brings together two additions to the repertoire which are expected to be in wide use in a number of implementations, errata collected since the publication of Version 2.0, and a number of updates to the character properties database. The two newly added characters are the U+FFFC OBJECT REPLACEMENT CHARACTER and the U+20AC EURO SIGN. The object replacement character is already employed in multiple implementations, and the euro sign is expected to be widely used very soon as the European Monetary Union (EMU) proceeds to phase in its use as the EMU unit of currency. This modification of the Unicode Standard is made available so that implementers can proceed with their support plans knowing that their implementation of Unicode is a well-defined, conforming version. With the additions of Version 2.1, the Unicode Standard contains 38,887 characters from the world's scripts.
Additional characters and scripts have been accepted into the Unicode Standard since the publication of The Unicode Standard, Version 2.0. These are not included in Version 2.1 but are documented on the Unicode Web site at: http://www.unicode.org/unicode/alloc/Pipeline.html
Overall Unicode conformance criteria as described in Chapter 3 of Version 2.0 are unchanged. Specific aspects of the bidirectional algorithm have been modified in Version 2.1, Hangul syllable decompositions have been clarified, and certain normative character property values have been changed.
The U+FFFC OBJECT REPLACEMENT CHARACTER is used as an insertion point for objects located within a stream of text. All other information about the object is kept outside the character data stream. Internally it is a dummy character which acts as an anchor point for the object's formatting information. In addition to assuring correct placement of an object in a data stream, the object replacement character also allows the use of general stream-based algorithms for any textual aspects of embedded objects.
The object replacement character is classified as a Symbol, Other (So) and has a bidirectional category of Other Neutrals (ON).
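As an illustration of the anchor role described above, the following is a minimal sketch in C, assuming the text is held as an array of 16-bit code units. The function name find_object_anchors and its parameters are hypothetical and are not part of the standard. A caller could pair each returned index with an entry in its own out-of-band object table.

```c
#include <stddef.h>
#include <stdint.h>

#define OBJECT_REPLACEMENT_CHARACTER 0xFFFCu

/* Hypothetical helper: records the index of each U+FFFC found in a buffer
   of 16-bit code units. Returns the total number of anchors found; at most
   max_positions indices are written to positions. */
size_t find_object_anchors(const uint16_t *text, size_t length,
                           size_t *positions, size_t max_positions)
{
    size_t count = 0;
    for (size_t i = 0; i < length; i++) {
        if (text[i] == OBJECT_REPLACEMENT_CHARACTER) {
            if (count < max_positions) {
                positions[count] = i;
            }
            count++;
        }
    }
    return count;
}
```

Because the object data itself stays outside the stream, ordinary text algorithms can treat each anchor as a single character.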
Addition
p 7-523. Add to the standard the following character:
|
The new single currency for member countries of the European Monetary Union (EMU) is the euro. The euro character is encoded in the Unicode Standard as U+20AC EURO SIGN.
To avoid confusion, the historical character U+20A0 EURO-CURRENCY SIGN has been updated with an informative note and a cross reference to U+20AC EURO SIGN.
The euro character is classified as Symbol, Currency (Sc) and has a bidirectional category of European Number Terminator (ET).
Corrigendum
p 7-161. Currency symbols character names list Add the following informative note for character 20A0: "Historical character derived from Xerox Character Code Standard" Add the following cross reference for character 20A0: "20AC euro sign" |
Addition
p 7-161. Add to the standard the following character:
Add the following informative note for 20AC: "Currency sign for the European Monetary Union" Add the following cross reference for 20AC: "20A0 euro-currency sign" |
Additional Unicode characters have been designated as having the mathematical property. Typos in the Version 2.0 list of characters with the mathematical property have also been corrected.
Corrigenda
p 4-25. In the list following section 4.9 Change 20A6 to 2016. Change "20D2..20E1" to "20D0..20DC, 20E1". Add the following characters to the list:
|
Two characters have been removed from the alphabetics listing, U+02BC MODIFIER LETTER APOSTROPHE and U+055A ARMENIAN APOSTROPHE.
Corrigendum
p 4-14. Section 4.5 Letters Remove 02BC and 055A from the table of alphabetics. |
The status of Hangul Syllable decompositions has been clarified.
Corrigenda
p 3-7. D23 Change the first sentence to read: "canonical decomposition: the decomposition of a character which results from recursively applying the canonical mappings found in the names list of Section 7.1, Character Names List Entries and those described in Section 3.10 Combining Jamo Behavior until no characters can be further decomposed, and then reordering non-spacing marks according to Section 3.9, Canonical Ordering Behavior." p 3-11. Section 3.10 Combining Jamo Behavior Change the third bullet to: "determine the canonical decomposition of Hangul syllables" p 3-13. Item 1 Change the first sentence to: "Process C by composing the conjoining jamo wherever possible, according to the compatibility decomposition rules in Chapter 7, Code Charts." Change the fourth sentence to: "Raw keyboard data, on the other hand, may be in the form of a compatibility decomposition." p 3-13. Hangul Syllable Decomposition Change the first sentence to: "The following describes the reverse mapping - how to take Hangul syllable S and derive the canonical decomposition C." |
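The reverse mapping mentioned in the last corrigendum can be sketched arithmetically. The following C fragment is an illustration only: it assumes the commonly published Hangul constants (syllable base U+AC00, jamo bases U+1100, U+1161, U+11A7, and counts 19, 21, 28), which are not restated in this report, and the function name decompose_hangul is hypothetical.

```c
#include <stdint.h>

/* Constants assumed from the standard's description of Hangul syllables;
   they are not listed in this report. */
enum {
    S_BASE = 0xAC00, L_BASE = 0x1100, V_BASE = 0x1161, T_BASE = 0x11A7,
    L_COUNT = 19, V_COUNT = 21, T_COUNT = 28,
    N_COUNT = V_COUNT * T_COUNT,   /* 588 */
    S_COUNT = L_COUNT * N_COUNT    /* 11172 */
};

/* Writes the conjoining jamo for syllable s into out and returns how many
   were written (2 or 3). Returns 0 if s is not a precomposed syllable. */
int decompose_hangul(uint32_t s, uint16_t out[3])
{
    if (s < S_BASE || s >= (uint32_t)(S_BASE + S_COUNT)) {
        return 0;
    }
    uint32_t s_index = s - S_BASE;
    uint32_t t = T_BASE + s_index % T_COUNT;

    out[0] = (uint16_t)(L_BASE + s_index / N_COUNT);             /* leading consonant */
    out[1] = (uint16_t)(V_BASE + (s_index % N_COUNT) / T_COUNT); /* vowel */
    if (t != T_BASE) {
        out[2] = (uint16_t)t;                                    /* trailing consonant */
        return 3;
    }
    return 2;
}
```

For example, under these assumptions U+D4DB decomposes to U+1111, U+1171, U+11B6, and U+AC00 decomposes to U+1100, U+1161 with no trailing consonant.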
New distinctions have been made in the Unicode Character Database for use in identifiers. In addition, changes have been made to the text of the standard.
Corrigenda
p. 5-26, 27. Section 5.14 Identifiers Add 06DD and 06DE to <enclosing_char>. Add compatibility low lines FE33, FE34, FE4D..FE4F to <underscore>. Remove 0387 from <extender>. Remove <identifier_part> and its definition. Change the <identifier> syntactic rule to:
Add the following syntactic rules at the end of the list:
Following the syntactic rules add the following: "Identifiers are ultimately defined by a set of character categories from the Unicode Character Database. (The individual Terminal Classes described in the text do not have a one-to-one relationship with the character categories, but the resulting definitions of identifiers are intended to be the same.)
For an explicit list of the current coverage of each of these syntactic classes, see <identifier_start>, <identifier_extend>, and <ident_ignorable_char>." |
Since the Unicode Standard Version 2, many aspects of the bidirectional behavior algorithm have been clarified or modified, including the basic display algorithm, bidirectional character types, base levels, resolving weak and neutral types, and resolving implicit levels. These changes affect pages 3-14 through 3-23 of the standard. Additionally, a few characters have been assigned new bidirectional type properties.
The description of the scope of the algorithm within a block has been clarified, and a pointer to further information on the handling of CR and LF has been added.
Corrigendum
p 3-16. At the end of the paragraph before the first bullet, add: "The algorithm only reorders text within a block; characters on one side of a block separator have no effect on characters on the other side. (Also, see Section 4.3, Directionality on the handling of CR, LF, and CRLF)" |
The following (together with a change to Reordering Resolved Levels) clarifies how to implement the last paragraph of page 3-16.
Corrigenda
p 3-17. Before Table 3-5, add: "Combining marks are given the type of the preceding letter." p 4-11. After "where there are gaps.", add: "Combining marks are given the type of the preceding letter, and are not called out in this table either." |
Several of the rules were corrected to say embedding direction rather than global direction. The first term is more explicitly defined.
Corrigendum
p 3-18. Before "Explicit Levels and Directions", insert: "The direction of the current embedding level (for a character in question) is called the embedding direction. It is L if the embedding level is even, and R if the embedding level is odd." |
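The definition of embedding direction reduces to a parity test on the embedding level. A minimal C sketch follows; the type and function names are hypothetical.

```c
/* Hypothetical names; L and R correspond to the directions in the text. */
typedef enum { DIR_L, DIR_R } EmbeddingDirection;

/* Even embedding levels are left-to-right, odd levels are right-to-left. */
EmbeddingDirection embedding_direction(unsigned level)
{
    return (level % 2 == 0) ? DIR_L : DIR_R;
}
```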
T6 incorrectly removed implicit and explicit directional formatting codes. The original purpose of T6 was to allow the use of styles or style sheets instead of embedding or override codes (see p. 3-22). T6 has been eliminated, and N4 has been changed instead (see below).
Corrigendum
p 3-19. T6 Delete T6. |
P1 has been clarified to state that it applies to single characters, and P2 more explicitly shows how to resolve a sequence of European terminators.
Corrigendum
p 3-19. P1 Change to "P1. A single European separator between two European numbers changes to an European number. A single common separator between two numbers of the same type changes to that type."
p 3-19. P2 Change to "P2. A sequence of European terminators adjacent to European numbers changes to all European numbers.
ET, ET, EN → EN, EN, EN
EN, ET, ET → EN, EN, EN
AN, ET, EN → AN, EN, EN"
p 3-19. P3 Add example at end. "ET, AN → N, AN" |
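Rule P2 as reworded above can be sketched as a single pass over the resolved types of a run of text. The following C fragment is an illustration under stated assumptions: the BidiType enumeration and the function name resolve_terminators are hypothetical, and only P2 is implemented (P1 and P3 are left to the caller).

```c
#include <stddef.h>

/* Hypothetical subset of bidirectional types, enough to illustrate P2. */
typedef enum { BIDI_L, BIDI_R, BIDI_EN, BIDI_ES, BIDI_ET,
               BIDI_AN, BIDI_CS, BIDI_N } BidiType;

/* P2 sketch: every maximal run of ET that touches an EN on either side
   becomes EN. Runs that touch no EN are left unchanged. */
void resolve_terminators(BidiType *types, size_t length)
{
    size_t i = 0;
    while (i < length) {
        if (types[i] != BIDI_ET) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < length && types[i] == BIDI_ET) {
            i++;
        }
        /* The run occupies [start, i). */
        int en_before = start > 0 && types[start - 1] == BIDI_EN;
        int en_after = i < length && types[i] == BIDI_EN;
        if (en_before || en_after) {
            for (size_t j = start; j < i; j++) {
                types[j] = BIDI_EN;
            }
        }
    }
}
```

Applied to the sequences ET, ET, EN and EN, ET, ET, this yields EN, EN, EN in both cases; applied to ET, AN it changes nothing, leaving that case to P3.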
The wording in N2 has been modified to use the embedding direction instead of the global direction, and the confusing term "letter" has been changed to "character" which makes it clear that strong R punctuation should be included.
Corrigenda
p 3-19. N2 Replace "global" by "embedding". p 3-20. N3 Change "letter" to "character" everywhere. |
Since N4 describes the behavior of embedding codes, it has been moved to a more appropriate place in the algorithm. It replaces T6 and now describes the behavior of override codes as well.
Corrigenda
p 3-19, 20. Move N4 to where T6 was. Change the number to T6, and change the wording and examples to: "T6. In the following rules, an embedding or override code and its matching PDF act as if they were strong characters of the appropriate type. All unmatched PDFs are ignored. If two embeddings with the same level are adjacent, then the PDF terminating the first embedding and the code initiating the next embedding are ignored.
LRO ... PDF → L ... L
LRE ... PDF → L ... L
RLO ... PDF → R ... R
RLE ... PDF → R ... R
RLE ... PDF, RLO ... PDF → RLE ..., ... PDF" |
I1 and I2 have been modified to ensure that implementers will use the embedding direction instead of the base direction. Also, although Table 3-7 refers to Sequence Type, the wording was not clear that the rules applied to sequences. This is important in the case of EN.
Corrigenda
p 3-20, 21. I1 Replace "global" by "embedding". Replace "Numeric text (EN) goes up two levels unless preceded by left-to-right text." by: "A sequence of one or more numeric types (EN) goes up two levels unless immediately preceded by left-to-right text." Change the example from "(L) EN" to "(L) EN...EN" |
L1 incorrectly implied that there could be more than one block separator. This has been corrected and more explanation is provided.
Corrigenda
p 3-20. L1 Add to the end of the paragraph before L1: "The process of breaking a paragraph into one or more lines that fit within particular bounds is outside the scope of the bidirectional algorithm. Where character shaping is involved, it can be somewhat more complicated (see pages 6-22 through 6-32). Logically there are the following steps:
Change in L1, "trailing white space (including block separators)" to "any trailing white space characters (including those of type B, S, and WS)". Add after L1, "(Note: since a Block separator breaks lines, there will be at most one per line.)" Before "Bidirectional Conformance", add: "Combining marks applied to a right-to-left base character will at this point precede their base character. See Section 5.12 Rendering Non-Spacing Marks for an illustration of this. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character will need to be reversed." |
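The reversal of marks and base characters described in the last addition can be sketched as a post-processing pass over the visually ordered text. The following C fragment is only an illustration: the VisualChar structure and the function name marks_after_base are hypothetical, and it assumes that each mark carries the same odd embedding level as its base character.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record for one character after reordering. */
typedef struct {
    uint32_t code_point;
    unsigned level;    /* resolved embedding level */
    int is_mark;       /* nonzero for a combining mark */
} VisualChar;

/* In right-to-left runs the marks precede their base after reordering.
   Reverse each span "marks..., base" so that the base comes first. */
void marks_after_base(VisualChar *text, size_t length)
{
    size_t i = 0;
    while (i < length) {
        if (!(text[i].level & 1u) || !text[i].is_mark) {
            i++;
            continue;
        }
        size_t start = i;
        while (i < length && (text[i].level & 1u) && text[i].is_mark) {
            i++;
        }
        if (i < length && (text[i].level & 1u)) {
            /* text[i] is the base character: reverse [start, i]. */
            size_t lo = start, hi = i;
            while (lo < hi) {
                VisualChar tmp = text[lo];
                text[lo] = text[hi];
                text[hi] = tmp;
                lo++;
                hi--;
            }
            i++;
        }
    }
}
```

A rendering engine that already expects marks before their right-to-left base would skip this pass.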
Certain characters have new bidirectional property definitions. To improve the display of e-mail addresses and URLs, the directional types of U+0026 AMPERSAND and U+0040 COMMERCIAL AT have been changed from left-to-right to other neutral. The directional type of U+002E FULL STOP has been changed from EUROPEAN NUMBER SEPARATOR to COMMON NUMBER SEPARATOR to improve the display of decimal numbers; U+2007 FIGURE SPACE has also been changed from EUROPEAN NUMBER SEPARATOR to COMMON NUMBER SEPARATOR for consistency.
Corrigenda
p 4-11. Table 4.4 Bidirectional Character Types Remove the table entry "Miscellaneous U+0026, U+0040" from the strong left-to-right category. Remove the table entries "Full Stop (Period) U+002E" and "Figure Space U+2007" from the European Number Separator category. p 4-12. Table 4.4 Bidirectional Character Types Add the table entries "Full Stop (Period) U+002E" and "Figure Space U+2007" to the Common Number Separator category. |
The following corrigenda clarify the semantics of different apostrophes, and correct problems in the mapping tables from Windows and Macintosh code pages.
Corrigendum
p 6-3. Add at the end of Loose versus Precise Semantics: "For historical reasons, U+0027 is a particularly overloaded character. In ASCII it is used to represent a punctuation mark (such as right single quotation mark, left single quotation mark, apostrophe punctuation, vertical line, or prime) or a modifier letter (such as apostrophe modifier or acute accent.) (Punctuation marks generally break words; modifier letters generally are considered part of a word.) In many systems it is always represented as a straight vertical line and can never represent a curly apostrophe or right quotation mark. In the case of an apostrophe,
In implementation, however, you cannot assume that users' text always adheres to the distinction between these characters. The text may come from different sources, including mapping from other character sets that do not have this distinction between letter apostrophe and punctuation apostrophe/right single quotation mark. In that case, all of them will generally be represented by U+2019. Where you are parsing text where such distinctions are important, you will still need to look at the context around the characters to help disambiguate the relevant semantics." |
Corrigendum
p 7-7. Change character 0027 informative notes, second bullet to: "preferred character for apostrophe is either 02BC MODIFIER LETTER APOSTROPHE or 2019 RIGHT SINGLE QUOTATION MARK (which also represents a punctuation apostrophe)." |
Corrigendum
p 7-37. Change character 02BC informative notes, third bullet to: "this is the preferred character for letter apostrophe." |
Corrigendum
p 7-155. Change character 2019 informative notes, first bullet to: "this is the preferred character for quotation mark and punctuation apostrophe." |
The following are typographic errors in the text of the standard.
Corrigenda
pp 7-50..7-55. Change the page header to "0400...Cyrillic...04FF". pp 7-66..7-70. Change the page header to "0600...Arabic...06FF". |
A number of glyphs have been corrected. The corrections are given here and can be found on the Unicode Web site at:
http://www.unicode.org/unicode/uni2errata/UnicodeTypos.html
Additional glyph corrections will be posted to this site as available.
Corrigenda | ||
05F1 | HEBREW LIGATURE YIDDISH VAV YOD | |
2603 | SNOWMAN | |
3085 | HIRAGANA LETTER SMALL YU | |
FA0E | CJK Compatibility Ideograph | |
FA0F | CJK Compatibility Ideograph | |
FA10 | CJK Compatibility Ideograph | |
FA11 | CJK Compatibility Ideograph | |
FA12 | CJK Compatibility Ideograph | |
FA13 | CJK Compatibility Ideograph | |
FA14 | CJK Compatibility Ideograph | |
FA15 | CJK Compatibility Ideograph | |
FA16 | CJK Compatibility Ideograph | |
FA17 | CJK Compatibility Ideograph | |
FA18 | CJK Compatibility Ideograph | |
FA19 | CJK Compatibility Ideograph | |
FA1A | CJK Compatibility Ideograph | |
FA1B | CJK Compatibility Ideograph | |
FA1C | CJK Compatibility Ideograph | |
FA1D | CJK Compatibility Ideograph | |
FA1E | CJK Compatibility Ideograph | |
FA1F | CJK Compatibility Ideograph | |
FA20 | CJK Compatibility Ideograph | |
FA21 | CJK Compatibility Ideograph | |
FA22 | CJK Compatibility Ideograph | |
FA23 | CJK Compatibility Ideograph | |
FA24 | CJK Compatibility Ideograph | |
FA25 | CJK Compatibility Ideograph | |
FA26 | CJK Compatibility Ideograph | |
FA27 | CJK Compatibility Ideograph | |
FA28 | CJK Compatibility Ideograph | |
FA29 | CJK Compatibility Ideograph | |
FA2A | CJK Compatibility Ideograph | |
FA2B | CJK Compatibility Ideograph | |
FA2C | CJK Compatibility Ideograph | |
FA2D | CJK Compatibility Ideograph |
The UTF-7 specification was unclear on one point, which led to an error in the sample code for converting from UCS-2 to UTF-7. The problem occurs when U+002D HYPHEN-MINUS follows a character that must be encoded. Because ASCII 0x2D is the terminating character for an encoded sequence, two 0x2D characters must be output in order to preserve the U+002D when converting back to Unicode.
RFC 2152 has been published with a revised version of the UTF-7 specification. The file included on the CD-ROM has been updated with this fix.
p A-5. The correction is in the code near the bottom of the page. The new text is highlighted.
if (!needshift)
{
    /* Write the explicit shift out character if
       1) The caller has requested that we always do it, or
       2) The directly encoded character is in the base64 set, or
       3) The directly encoded character is SHIFT_OUT.
    */
    if (verbose || ((!done) && (invbase64[r] >= 0 || r == SHIFT_OUT)))
    {
        TARGETCHECK;
        *target++ = SHIFT_OUT;
    }
    shifted = 0;
}
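The condition tested in the corrected code can be isolated as a small predicate. The following C sketch is not the sample code itself: the function name needs_explicit_terminator is hypothetical, and it assumes the next character is a directly encoded ASCII character.

```c
#include <string.h>

/* Returns nonzero when an explicit '-' must be written to close a base64
   run before the directly encoded ASCII character next. This is the case
   when next could be mistaken for base64 data, or when next is itself
   the terminating '-'. */
int needs_explicit_terminator(char next)
{
    static const char base64_chars[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz"
        "0123456789+/";

    if (next == '-') {
        return 1;
    }
    /* strchr would match the string terminator, so exclude NUL first. */
    return next != '\0' && strchr(base64_chars, next) != NULL;
}
```

For instance, the text "Hi Mom -☺-!" is encoded as "Hi Mom -+Jjo--!": the first '-' after "Jjo" terminates the encoded sequence and the second preserves the U+002D.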
In addition to including the properties for the object replacement character and the euro sign, the Unicode Technical Committee has approved changes to the Unicode Character Database to reconcile problems found in an analysis of the character categories, and to make new distinctions in the database for use in identifiers. The property changes reflect the following:
Encoding of U+20AC EURO SIGN and U+FFFC OBJECT REPLACEMENT CHARACTER
Removing space, white space and delimitation as characteristics of U+FEFF
Narrowing the concept of white space to avoid miscellaneous ignorable Unicode controls and the Unicode NULL.
Mandated changes in directional properties, expanded to compatibility forms for consistency
The details are given in the following table:
Space | Remove FEFF |
White space | Remove 0000, 200C..200F,202A..202E, 206A..206F, FEFF |
Punctuation | Add 00B7 |
Delimiter | Remove FEFF |
Currency Symbol | Add 20AC |
Bidi: Left-to-Right | Remove 0026, 0040, FE60, FE6B, FF06, FF20 |
Bidi: Eur Num Term | Add 20AC |
Bidi: Eur Num Sep | Remove 002E, 2007, FE52, FF0E |
Bidi: Common Sep | Add 002E, 2007, FE52, FF0E |
Bidi: Other Neutrals | Add 0026, 0040, FE60, FE6B, FF06, FF20 |
Unassigned Code Value | Remove 20AC, FFFC |
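An implementation that patches its own Version 2.0 tables, rather than loading the updated database files, could encode the bidirectional and white space rows of the table above roughly as follows. This is a sketch only; the enumeration and function names are hypothetical, and the authoritative data is the Unicode Character Database itself.

```c
#include <stdint.h>

/* Hypothetical result type: the bidirectional category a code point
   receives in Version 2.1, or CHANGE_NONE if the table lists no change. */
typedef enum {
    CHANGE_NONE,
    CHANGE_TO_ET,   /* European Number Terminator */
    CHANGE_TO_CS,   /* Common Number Separator */
    CHANGE_TO_ON    /* Other Neutrals */
} BidiChange;

BidiChange bidi_change_in_2_1(uint32_t cp)
{
    switch (cp) {
    case 0x20AC:                            /* newly encoded euro sign */
        return CHANGE_TO_ET;
    case 0x002E: case 0x2007: case 0xFE52: case 0xFF0E:
        return CHANGE_TO_CS;                /* formerly Eur Num Sep */
    case 0x0026: case 0x0040: case 0xFE60:
    case 0xFE6B: case 0xFF06: case 0xFF20:
        return CHANGE_TO_ON;                /* formerly Left-to-Right */
    default:
        return CHANGE_NONE;
    }
}

/* Nonzero for code points the table removes from the White space property. */
int removed_from_white_space(uint32_t cp)
{
    return cp == 0x0000
        || (cp >= 0x200C && cp <= 0x200F)
        || (cp >= 0x202A && cp <= 0x202E)
        || (cp >= 0x206A && cp <= 0x206F)
        || cp == 0xFEFF;
}
```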
This new information is reflected in the newest version of the Unicode Character Database and the additional properties files in the Unicode 2.1 Update directory on the unicode.org ftp site:
ftp://ftp.unicode.org/Public/2.1-Update/
The 2.1 files in the update directory supersede the three 2.0 files on the CD-ROM, which is distributed with The Unicode Standard, Version 2.0, and which are also available at:
ftp://ftp.unicode.org/Public/UNIDATA
PROPS2.TXT (superseded by PropList-2.1.1.txt)
UNIDAT2.TXT (superseded by UnicodeData-2.1.1.txt)
README2.TXT (superseded by ReadMe-2.1.1.txt)
A diff file cataloging the changes in the Unicode Character Database file is also available:
ftp://ftp.unicode.org/Public/2.1-Update/diff2014v211.txt
Formatting corrections were made.
Correction of typographical and glyph errors as follows:
1. Typo in section 3.4 Identifier Errata, third line describing compatibility low lines corrected to read FE33, not FF33.
2. Glyph for U+FA0E in section 3.8 corrected.
3. Under 3.4 Identifier Errata, in the small unlined table towards the bottom, under "Coverage," second entry, changed "Enclosing mark" to "Spacing combining mark."
Internal hyperlinks added at beginning of document.
Correction of two typographical errors as follows:
1. In the section 3.9 "UTF-7 Sample Code Correction", in the sentence, " The problem occurs when U+200D HYPHEN-MINUS follows a character that must be encoded." "U+200D" corrected to read "U+002D".
2. In section 3.6 the third corrigendum, "p 7-37. Change character 02BC informative notes, first bullet to:" "first" corrected to read "third."
Copyright © 1998-1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.