UAX #29: Text Boundaries

L2/04-405

Re:	TR29 Corrections
From:	Mark Davis
Date:	2004-11-11

Action 99-A57 was erroneously marked as done, and thus the changes that it encompassed did not make it into the posted proposed update of UAX 29. The following are the extracted portions of the UAX that need to be changed in order to make the changes in 99-A57. In addition, generating the property files as per the UTC decision revealed cases where the properties were not orthogonal as defined, so their definitions needed to be adjusted.

Note that 99-A57 was created before the Katakana_or_hiragana script value was withdrawn, so the action had to be reinterpreted in that light.

This needs to be incorporated into a new public review of the UAX for Unicode 4.1.

Table 2. Default Word Boundaries

Boundary Property Values
Format	General_Category = Format (Cf) and not U+200C ZERO WIDTH NON-JOINER (ZWNJ) and not U+200D ZERO WIDTH JOINER (ZWJ)
Katakana	Script = KATAKANA, or any of the following: U+3031 (〱) VERTICAL KANA REPEAT MARK; U+3032 (〲) VERTICAL KANA REPEAT WITH VOICED SOUND MARK; U+3033 (〳) VERTICAL KANA REPEAT MARK UPPER HALF; U+3034 (〴) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF; U+3035 (〵) VERTICAL KANA REPEAT MARK LOWER HALF; U+309B (゛) KATAKANA-HIRAGANA VOICED SOUND MARK; U+309C (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK; U+30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN; U+30FC (ー) KATAKANA-HIRAGANA PROLONGED SOUND MARK; U+FF70 (ｰ) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK; U+FF9E (ﾞ) HALFWIDTH KATAKANA VOICED SOUND MARK; U+FF9F (ﾟ) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK
ALetter	Alphabetic = true, or U+00A0 ( ) NO-BREAK SPACE (NBSP), or U+05F3 (׳) HEBREW PUNCTUATION GERESH and not Ideographic = true and not Katakana = true and not Script = Thai and not Script = Lao and not Script = Hiragana and not Grapheme_Extend = true
MidLetter	Any of the following: U+0027 (') APOSTROPHE; U+00B7 (·) MIDDLE DOT; U+05F4 (״) HEBREW PUNCTUATION GERSHAYIM; U+2019 (’) RIGHT SINGLE QUOTATION MARK (curly apostrophe); U+2027 (‧) HYPHENATION POINT; U+003A (:) COLON (used in Swedish)
MidNumLet	Any of the following: U+002E (.) FULL STOP (period); U+003A (:) COLON (used in Swedish)
MidNum	Line_Break = Infix_Numeric and not MidNumLet = true and not U+003A (:) COLON
Numeric	Line_Break = Numeric
ExtendNumLet	General_Category = Connector_Punctuation and not U+30FB KATAKANA MIDDLE DOT and not U+FF65 HALFWIDTH KATAKANA MIDDLE DOT
Any	Any character (includes all of the above)
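
The rows above that are defined purely by General_Category or by explicit code point lists can be sketched in Python. This is only a partial illustration, not the property file generator: the function name `partial_word_property` is hypothetical, and the rows that depend on Script, Alphabetic, Ideographic, or Line_Break (ALetter, MidNum, Numeric, and the Script part of Katakana) are left out because Python's standard `unicodedata` module does not expose those properties. The result for General_Category tests also depends on the Unicode version bundled with the Python build.

```python
import unicodedata

# Explicit code points copied from the Katakana row (in addition to Script = Katakana).
KATAKANA_EXTRA = {0x3031, 0x3032, 0x3033, 0x3034, 0x3035, 0x309B, 0x309C,
                  0x30A0, 0x30FC, 0xFF70, 0xFF9E, 0xFF9F}
# Explicit code points copied from the MidLetter and MidNumLet rows.
MID_LETTER = {0x0027, 0x00B7, 0x05F4, 0x2019, 0x2027, 0x003A}
MID_NUM_LET = {0x002E, 0x003A}
ZWNJ, ZWJ = 0x200C, 0x200D
KATAKANA_MIDDLE_DOTS = {0x30FB, 0xFF65}


def partial_word_property(ch):
    """Return the word boundary property of ch for the rows covered here, else None.

    None means "not decided by this sketch", not "Any".
    """
    cp = ord(ch)
    cat = unicodedata.category(ch)
    if cat == "Cf" and cp not in (ZWNJ, ZWJ):
        return "Format"
    if cp in KATAKANA_EXTRA:
        return "Katakana"
    # U+003A is listed under both MidLetter and MidNumLet in the table;
    # this sketch simply checks MidLetter first.
    if cp in MID_LETTER:
        return "MidLetter"
    if cp in MID_NUM_LET:
        return "MidNumLet"
    if cat == "Pc" and cp not in KATAKANA_MIDDLE_DOTS:
        return "ExtendNumLet"
    return None
```

For example, `partial_word_property("_")` gives `"ExtendNumLet"`, while U+200D ZERO WIDTH JOINER is excluded from Format and falls through to `None`.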

Boundary Rules
Assign each code point with line break property values of CB, SA, SG, and XX to one of the above boundary property values depending on criteria outside the scope of this algorithm. Characters with other line break properties are assigned values directly according to the above table.			(0)
Break at the start and end of text.
sot	÷		(1)
	÷	eot	(2)
Treat a grapheme cluster as if it were a single character: the first character of the cluster.
GC	→	FC	(3)
Ignore trailing Format characters. That is, ignore Format characters in all subsequent rules (except the last rule).
X Format*	→	X	(4)
Do not break between most letters.
ALetter	×	ALetter	(5)
Do not break letters across certain punctuation.
ALetter	×	(MidLetter | MidNumLet) ALetter	(6)
ALetter (MidLetter | MidNumLet)	×	ALetter	(7)
Do not break within sequences of digits, or digits adjacent to letters ('3a', or 'A3').
Numeric	×	Numeric	(8)
ALetter	×	Numeric	(9)
Numeric	×	ALetter	(10)
Do not break within sequences like '3.2' or '3,456.789'.
Numeric (MidNum | MidNumLet)	×	Numeric	(11)
Numeric	×	(MidNum | MidNumLet) Numeric	(12)
Do not break between Katakana.
Katakana	×	Katakana	(13)
Do not break from extenders.
(ALetter | Numeric | Katakana | ExtendNumLet)	×	ExtendNumLet	(13a)
ExtendNumLet	×	(ALetter | Numeric | Katakana)	(13b)
Otherwise, break everywhere (including around ideographs).
Any	÷	Any	(14)
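
As a rough illustration of how rules 1, 2, and 5 through 14 interact, the following Python sketch works on a list of boundary property values rather than on characters. It assumes that rule 0 has already assigned a value to every code point and that rules 3 and 4 have already been applied, so each list element stands for one grapheme cluster with trailing Format characters removed. The names `no_break` and `word_breaks` are hypothetical, and returning an empty list for empty text is a choice made for this sketch only.

```python
LETTER_MID = {"MidLetter", "MidNumLet"}
NUM_MID = {"MidNum", "MidNumLet"}
EXTENDABLE = {"ALetter", "Numeric", "Katakana", "ExtendNumLet"}


def no_break(props, i):
    """True if one of rules 5 to 13b forbids a break between props[i-1] and props[i]."""
    prev, cur = props[i - 1], props[i]
    before = props[i - 2] if i >= 2 else None
    after = props[i + 1] if i + 1 < len(props) else None
    if prev == "ALetter" and cur == "ALetter":                           # rule 5
        return True
    if prev == "ALetter" and cur in LETTER_MID and after == "ALetter":   # rule 6
        return True
    if before == "ALetter" and prev in LETTER_MID and cur == "ALetter":  # rule 7
        return True
    if prev == "Numeric" and cur == "Numeric":                           # rule 8
        return True
    if prev == "ALetter" and cur == "Numeric":                           # rule 9
        return True
    if prev == "Numeric" and cur == "ALetter":                           # rule 10
        return True
    if before == "Numeric" and prev in NUM_MID and cur == "Numeric":     # rule 11
        return True
    if prev == "Numeric" and cur in NUM_MID and after == "Numeric":      # rule 12
        return True
    if prev == "Katakana" and cur == "Katakana":                         # rule 13
        return True
    if prev in EXTENDABLE and cur == "ExtendNumLet":                     # rule 13a
        return True
    if prev == "ExtendNumLet" and cur in {"ALetter", "Numeric", "Katakana"}:  # rule 13b
        return True
    return False


def word_breaks(props):
    """Return the break positions (indices into props) under rules 1, 2, and 5 to 14."""
    if not props:
        return []
    breaks = [0]                       # rule 1: break at start of text
    for i in range(1, len(props)):
        if not no_break(props, i):     # rule 14: otherwise break
            breaks.append(i)
    breaks.append(len(props))          # rule 2: break at end of text
    return breaks
```

With the property sequence for "can't" (ALetter, ALetter, ALetter, MidLetter, ALetter), rules 5, 6, and 7 suppress every interior break, so only the start and end positions remain. A trailing MidNumLet with no following letter, as in "a.", is not protected by rule 6 and is split off.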

Table 3. Default Sentence Boundaries

Boundary Property Values
Sep	Any of the following characters: U+000A LINE FEED (LF); U+000D CARRIAGE RETURN (CR); U+0085 NEXT LINE (NEL); U+2028 LINE SEPARATOR (LS); U+2029 PARAGRAPH SEPARATOR (PS)
Format	General_Category = Format (Cf) and not U+200C ZERO WIDTH NON-JOINER (ZWNJ) and not U+200D ZERO WIDTH JOINER (ZWJ)
Sp	White_Space = true and not Sep = true and not U+00A0 ( ) NO-BREAK SPACE (NBSP)
Lower	Lowercase = true and not Grapheme_Extend = true
Upper	General_Category = Titlecase_Letter (Lt), or Uppercase = true
OLetter	Alphabetic = true, or U+00A0 ( ) NO-BREAK SPACE (NBSP), or U+05F3 (׳) HEBREW PUNCTUATION GERESH and not Lower = true and not Upper = true and not Grapheme_Extend = true
Numeric	Line_Break = Numeric (NU)
ATerm	Any of the following characters: U+002E (.) FULL STOP
STerm	STerm = true and not ATerm = true
Close	General_Category = Open_Punctuation (Ps), or General_Category = Close_Punctuation (Pe), or Line_Break = Quotation (QU) and not U+05F3 (׳) HEBREW PUNCTUATION GERESH and not ATerm = true and not STerm = true
Any	Any character (includes all of the above)

Note: The extra condition in STerm ("and not ATerm = true") should really be made unnecessary by removing ATerm from the definition of STerm in the PropList file.
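
The orthogonality requirement behind that extra condition can be checked mechanically. The sketch below is an assumption-laden illustration, not the tool used to generate the property files: `overlaps` is a hypothetical helper, and `RAW_STERM` is a small illustrative subset standing in for the STerm property as listed in the PropList file, which is assumed here to include U+002E FULL STOP.

```python
def overlaps(properties):
    """Return {(name_a, name_b): shared code points} for every overlapping pair."""
    names = sorted(properties)
    found = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = properties[a] & properties[b]
            if shared:
                found[(a, b)] = shared
    return found


ATERM = {0x002E}                      # from the ATerm row of Table 3
RAW_STERM = {0x0021, 0x002E, 0x003F}  # illustrative subset only (assumption)
STERM = RAW_STERM - ATERM             # "STerm = true and not ATerm = true"
```

Calling `overlaps({"ATerm": ATERM, "STerm": RAW_STERM})` reports U+002E as shared, while the same call with `STERM` returns an empty dictionary, which is the effect the "and not ATerm = true" condition is meant to have.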