L2/04-405
Re | TR29 Corrections |
From: | Mark Davis |
Date: | 2004-11-11 |
Action 99-A57 was erroneously marked as done, so the changes that it encompassed did not make it into the posted proposed update of UAX 29. The following are the extracted portions of the UAX that need to be changed in order to carry out 99-A57. In addition, generating the property files as per the UTC decision revealed cases where the properties were not orthogonal as defined, so their definitions needed to be adjusted.
Note that 99-A57 was created before the Katakana_or_hiragana script value was withdrawn, so the action had to be reinterpreted in that light.
This needs to be incorporated into a new public review of the UAX for Unicode 4.1.
Table 2. Default Word Boundaries
Format | General_Category = Format (Cf) and not U+200C ZERO WIDTH NON-JOINER (ZWNJ) and not U+200D ZERO WIDTH JOINER (ZWJ) |
Katakana | Script = Katakana, or any of the following: U+3031 (〱) VERTICAL KANA REPEAT MARK U+3032 (〲) VERTICAL KANA REPEAT WITH VOICED SOUND MARK U+3033 (〳) VERTICAL KANA REPEAT MARK UPPER HALF U+3034 (〴) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF U+3035 (〵) VERTICAL KANA REPEAT MARK LOWER HALF U+309B (゛) KATAKANA-HIRAGANA VOICED SOUND MARK U+309C (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK U+30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN U+30FC (ー) KATAKANA-HIRAGANA PROLONGED SOUND MARK U+FF70 (ｰ) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK U+FF9E (ﾞ) HALFWIDTH KATAKANA VOICED SOUND MARK U+FF9F (ﾟ) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK |
ALetter | Alphabetic = true, or U+00A0 ( ) NO-BREAK SPACE (NBSP), or U+05F3 (׳) HEBREW PUNCTUATION GERESH and not Ideographic = true and not Katakana = true and not Script = Thai and not Script = Lao and not Script = Hiragana and not Grapheme_Extend = true |
MidLetter | Any of the following: U+0027 (') APOSTROPHE U+00B7 (·) MIDDLE DOT U+05F4 (״) HEBREW PUNCTUATION GERSHAYIM U+2019 (’) RIGHT SINGLE QUOTATION MARK (curly apostrophe) U+2027 (‧) HYPHENATION POINT U+003A (:) COLON (used in Swedish) |
MidNumLet | Any of the following: U+002E (.) FULL STOP (period) U+003A (:) COLON (used in Swedish) |
MidNum | Line_Break = Infix_Numeric and not MidNumLet = true and not U+003A (:) COLON |
Numeric | Line_Break = Numeric |
ExtendNumLet | General_Category = Connector_Punctuation (Pc) and not U+30FB KATAKANA MIDDLE DOT and not U+FF65 HALFWIDTH KATAKANA MIDDLE DOT |
Any | Any character (includes all of the above) |
Assign each code point with line break property values of CB, SA, SG, and XX to one of the above boundary property values depending on criteria outside the scope of this algorithm. Characters with other line break properties are assigned values directly according to the above table. |
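Two rows of Table 2, Format and ExtendNumLet, depend only on General_Category plus explicit exclusions, so they can be sketched with Python's standard `unicodedata` module. This is an illustration under stated assumptions, not a reference implementation: `partial_word_break` is a hypothetical helper name, the remaining rows need Script and Line_Break data that the standard library does not expose, and the results follow whatever Unicode version the interpreter bundles rather than the data discussed here.

```python
import unicodedata

# Explicit exclusions copied from the Format and ExtendNumLet rows of Table 2.
ZWNJ = 0x200C
ZWJ = 0x200D
KATAKANA_MIDDLE_DOTS = {0x30FB, 0xFF65}


def partial_word_break(cp):
    """Return 'Format', 'ExtendNumLet', or None for rows not covered here.

    Hypothetical helper: only the two rows that need nothing beyond
    General_Category are modeled.
    """
    gc = unicodedata.category(chr(cp))
    if gc == "Cf" and cp not in (ZWNJ, ZWJ):
        return "Format"
    if gc == "Pc" and cp not in KATAKANA_MIDDLE_DOTS:
        return "ExtendNumLet"
    return None


if __name__ == "__main__":
    for cp in (0x00AD, 0x200D, 0x005F, 0x30FB, 0x0041):
        print(f"U+{cp:04X}", partial_word_break(cp))
```

Because the two conditions test different General_Category values, no code point can satisfy both, which is the kind of orthogonality the property definitions are meant to have.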
(0) | ||
Break at the start and end of text. |
|||
sot | ÷ | (1) | |
÷ | eot | (2) | |
Treat a grapheme cluster as if it were a single character: the first character of the cluster. |
|||
GC | → | FC | (3) |
Ignore trailing Format characters. That is, ignore Format characters in all subsequent rules (except the last rule). |
|||
X Format* | → | X | (4) |
Do not break between most letters. |
|||
ALetter | × | ALetter | (5) |
Do not break letters across certain punctuation. |
|||
ALetter | × | (MidLetter | MidNumLet) ALetter | (6) |
ALetter (MidLetter | MidNumLet) | × | ALetter | (7) |
Do not break within sequences of digits, or digits adjacent to letters ('3a', or 'A3'). |
|||
Numeric | × | Numeric | (8) |
ALetter | × | Numeric | (9) |
Numeric | × | ALetter | (10) |
Do not break within sequences like '3.2' or '3,456.789'. |
|||
Numeric (MidNum | MidNumLet) | × | Numeric | (11) |
Numeric | × | (MidNum | MidNumLet) Numeric | (12) |
Do not break between Katakana. |
|||
Katakana | × | Katakana | (13) |
Do not break from extenders. |
|||
(ALetter | Numeric | Katakana | ExtendNumLet) | × | ExtendNumLet | (13a) |
ExtendNumLet | × | (ALetter | Numeric | Katakana) | (13b) |
Otherwise, break everywhere (including around ideographs). |
|||
Any | ÷ | Any | (14) |
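The rules above can be sketched as a small function over a sequence of property values. This is a minimal sketch under several assumptions: the input already holds one Table 2 value per grapheme cluster (so rule 3 has been applied), `word_boundaries` and the label `"Other"` (for a character matching no named row) are hypothetical names, rule 4 is read as the rewrite X Format* → X so that no boundary is reported inside such a run, the parenthetical about the last rule is not modeled, and returning an empty list for empty text is a choice made here. Since rules 5 through 13b only forbid breaks and rule 14 breaks everywhere else, they can be evaluated as a single disjunction.

```python
def word_boundaries(props):
    """Return boundary offsets for a list of Table 2 property values."""
    n = len(props)
    if n == 0:
        return []  # assumption of this sketch: no text, nothing reported

    # Rule 4: X Format* -> X. Keep only the X of each run, with its offset.
    kept = []
    for i, p in enumerate(props):
        if p == "Format" and kept:
            continue
        kept.append((i, p))
    vals = [p for _, p in kept]

    def at(k):
        return vals[k] if 0 <= k < len(vals) else None

    mid_letter = {"MidLetter", "MidNumLet"}
    mid_num = {"MidNum", "MidNumLet"}
    ext_left = {"ALetter", "Numeric", "Katakana", "ExtendNumLet"}
    ext_right = {"ALetter", "Numeric", "Katakana"}

    def no_break(k):
        """True if a rule 5..13b forbids a break between vals[k-1] and vals[k]."""
        a, b = vals[k - 1], vals[k]
        prev, nxt = at(k - 2), at(k + 1)
        return (
            (a == "ALetter" and b == "ALetter")                            # (5)
            or (a == "ALetter" and b in mid_letter and nxt == "ALetter")   # (6)
            or (prev == "ALetter" and a in mid_letter and b == "ALetter")  # (7)
            or (a == "Numeric" and b == "Numeric")                         # (8)
            or (a == "ALetter" and b == "Numeric")                         # (9)
            or (a == "Numeric" and b == "ALetter")                         # (10)
            or (prev == "Numeric" and a in mid_num and b == "Numeric")     # (11)
            or (a == "Numeric" and b in mid_num and nxt == "Numeric")      # (12)
            or (a == "Katakana" and b == "Katakana")                       # (13)
            or (a in ext_left and b == "ExtendNumLet")                     # (13a)
            or (a == "ExtendNumLet" and b in ext_right)                    # (13b)
        )

    out = [0]                          # (1) break at start of text
    for k in range(1, len(vals)):
        if not no_break(k):            # (14) otherwise break
            out.append(kept[k][0])
    out.append(n)                      # (2) break at end of text
    return out


if __name__ == "__main__":
    # "can't": three letters, an apostrophe (MidLetter), one letter.
    print(word_boundaries(["ALetter"] * 3 + ["MidLetter", "ALetter"]))
    # "3.2 a": digit, full stop (MidNumLet), digit, space, letter.
    print(word_boundaries(["Numeric", "MidNumLet", "Numeric", "Other", "ALetter"]))
```

Working on property values rather than raw characters keeps the sketch independent of any particular property data file.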
Table 3. Default Sentence Boundaries
Sep | Any of the following characters: U+000A LINE FEED (LF) U+000D CARRIAGE RETURN (CR) U+0085 NEXT LINE (NEL) U+2028 LINE SEPARATOR (LS) U+2029 PARAGRAPH SEPARATOR (PS) |
Format | General_Category = Format (Cf) and not U+200C ZERO WIDTH NON-JOINER (ZWNJ) and not U+200D ZERO WIDTH JOINER (ZWJ) |
Sp | White_Space = true and not Sep = true and not U+00A0 ( ) NO-BREAK SPACE (NBSP) |
Lower | Lowercase = true and not Grapheme_Extend = true |
Upper | General_Category = Titlecase_Letter (Lt), or Uppercase = true |
OLetter | Alphabetic = true, or U+00A0 ( ) NO-BREAK SPACE (NBSP), or U+05F3 (׳) HEBREW PUNCTUATION GERESH and not Lower = true and not Upper = true and not Grapheme_Extend = true |
Numeric | Line_Break = Numeric (NU) |
ATerm | Any of the following characters: U+002E (.) FULL STOP |
STerm | STerm = true and not ATerm = true |
Close | General_Category = Open_Punctuation (Ps), or General_Category = Close_Punctuation (Pe), or Line_Break = Quotation (QU) and not U+05F3 (׳) HEBREW PUNCTUATION GERESH and not ATerm = true and not STerm = true |
Any | Any character (includes all of the above) |
Note: the extra condition in STerm should really be repaired by removing ATerm from the definition of STerm in the PropList file.
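The effect of the extra condition can be illustrated with a small set computation. This is a sketch only: `PROPLIST_STERM_SAMPLE` is a deliberately partial, hand-picked sample of code points carrying the STerm property (not the full list from the property file), and `sentence_term_value` is a hypothetical helper name.

```python
# Partial, illustrative sample of code points with STerm = true
# (assumption: not the complete list from the property file).
PROPLIST_STERM_SAMPLE = {0x0021, 0x002E, 0x003F}

# ATerm row of Table 3: U+002E FULL STOP.
ATERM = {0x002E}

# STerm row of Table 3: "STerm = true and not ATerm = true".
STERM_VALUE = PROPLIST_STERM_SAMPLE - ATERM


def sentence_term_value(cp):
    """Return 'ATerm', 'STerm', or None for a code point (sample data only)."""
    if cp in ATERM:
        return "ATerm"
    if cp in STERM_VALUE:
        return "STerm"
    return None


if __name__ == "__main__":
    for cp in (0x002E, 0x0021, 0x003F, 0x0041):
        print(f"U+{cp:04X}", sentence_term_value(cp))
```

If U+002E were removed from the property file's STerm list, the subtraction would become a no-op and the table row could drop the "and not ATerm = true" clause.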