Status: Background documentation for PRI #308
Last updated: September 29, 2015
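The Proposed Update UAX #29 for Unicode 9.0 (see PRI #306) includes proposed changes in the text segmentation behavior of U+202F NARROW NO-BREAK SPACE (NNBSP) to eliminate certain undesired word boundaries, particularly for Mongolian. This change in word segmentation can be accomplished in more than one way. Two possible options are listed below:
1. Add U+202F explicitly to the derivation of Word_Break=ExtendNumLet in UAX #29, leaving its General_Category unchanged.
2. Change the General_Category of U+202F from Zs to Pc, so that it falls under Word_Break=ExtendNumLet through the existing derivation.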
Both of these options result in the desired word segmentation behavior by changing the Word_Break value of U+202F from its current value, WB=XX (Other), to WB=EX (ExtendNumLet), thereby inhibiting word boundaries around it, as required in Mongolian. The first approach is more conservative, in that it only impacts the Word_Break property value. The second approach is more elegant, as it requires no ad hoc addition to the derivation of ExtendNumLet and no change to the text of UAX #29. However, the second approach could conceivably affect implementations other than Mongolian word segmentation, because General_Category is used for many purposes besides segmentation.
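The difference between the two approaches can be sketched in a few lines of Python. This is only an illustration: the function names and the `PROPOSED_GC` override table are hypothetical, and `unicodedata` reports whatever Unicode version the running Python bundles, not the Unicode 9.0 proposal itself.

```python
import unicodedata

NNBSP = "\u202f"

def is_extend_num_let_option1(ch):
    """First approach: keep gc as is and add U+202F to the derivation by hand."""
    return unicodedata.category(ch) == "Pc" or ch == NNBSP

def is_extend_num_let_option2(ch, gc_overrides):
    """Second approach: the derivation stays 'gc=Pc'; only the gc data changes.

    gc_overrides is a hypothetical stand-in for a modified UnicodeData.txt.
    """
    gc = gc_overrides.get(ch, unicodedata.category(ch))
    return gc == "Pc"

# Hypothetical override modelling the proposed gc change for U+202F.
PROPOSED_GC = {NNBSP: "Pc"}

print(is_extend_num_let_option1(NNBSP))               # explicit special case
print(is_extend_num_let_option2(NNBSP, PROPOSED_GC))  # falls out of gc=Pc
```

Under the first approach the special case lives in the derivation; under the second it lives in the General_Category data, which is why any other consumer of gc would also see the change.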
Note that neither approach would affect line breaking behavior; this change is intended only to modify default word segmentation behavior.
The only other widely noted use for U+202F NNBSP is for representation of the thin non-breaking space (espace fine insécable) regularly seen next to certain punctuation marks in French-style typography. However, the word segmentation change for U+202F should have no impact in that context, as ExtendNumLet is explicitly for preventing breaks between letters, and does not prevent the identification of word boundaries next to punctuation marks.
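Both cases, Mongolian and French, can be illustrated with a toy segmenter. The sketch below is not the UAX #29 algorithm: it approximates Word_Break classes with `str.isalpha()` and `str.isdigit()`, implements only simplified forms of rules WB5, WB8 through WB10, WB13a and WB13b, and takes the ExtendNumLet set as a parameter so the current and proposed behavior can be compared. All function names are hypothetical.

```python
NNBSP = "\u202f"

def word_break_class(ch, extend_num_let):
    """Very rough stand-in for the Word_Break property (illustration only)."""
    if ch in extend_num_let:
        return "ExtendNumLet"
    if ch.isalpha():
        return "ALetter"
    if ch.isdigit():
        return "Numeric"
    return "Other"

def no_break(left, right):
    """Return True if no word boundary is allowed between the two classes."""
    alnum = ("ALetter", "Numeric")
    if left in alnum and right in alnum:
        return True   # simplified WB5, WB8, WB9, WB10
    if left in alnum + ("ExtendNumLet",) and right == "ExtendNumLet":
        return True   # WB13a
    if left == "ExtendNumLet" and right in alnum:
        return True   # WB13b
    return False      # otherwise break

def segment(text, extend_num_let):
    """Split text into segments at every position where a break is allowed."""
    if not text:
        return []
    pieces = [text[0]]
    for prev, cur in zip(text, text[1:]):
        if no_break(word_break_class(prev, extend_num_let),
                    word_break_class(cur, extend_num_let)):
            pieces[-1] += cur
        else:
            pieces.append(cur)
    return pieces

current = {"_"}           # NNBSP treated as Other
proposed = {"_", NNBSP}   # NNBSP treated as ExtendNumLet

# Mongolian stem + NNBSP + suffix, written with escapes for portability.
mongolian = "\u182e\u1823\u1829\u182d\u1823\u182f" + NNBSP + "\u1824\u1828"
french = "Quoi" + NNBSP + "?"

print(len(segment(mongolian, current)), len(segment(mongolian, proposed)))
print(segment(french, proposed))
```

In this simplified model the Mongolian stem and suffix stay in one segment once NNBSP is treated as ExtendNumLet, while the French example still has a boundary immediately before the question mark.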
The following two tables summarize the current relevant property values for NNBSP and the proposed changes in those values. For convenience in comparison, the related character NBSP is also listed in the tables, along with the nominally associated ZWNBSP and its preferred alternative, WORD JOINER (WJ), so as to show the contrast in properties both before and after the proposed changes for NNBSP.
Current property values:

Code | Abbr | gc | lb | WB | SB | WSpace | bc |
---|---|---|---|---|---|---|---|
202F | NNBSP | Zs | GL | XX | Sp | Y | CS |
00A0 | NBSP | Zs | GL | XX | Sp | Y | CS |
2060 | WJ | Cf | WJ | Format | Format | N | BN |
FEFF | ZWNBSP | Cf | WJ | Format | Format | N | BN |

Proposed property values (the gc change for NNBSP applies only under the second option):

Code | Abbr | gc | lb | WB | SB | WSpace | bc |
---|---|---|---|---|---|---|---|
202F | NNBSP | Pc | GL | EX | Sp | Y | CS |
00A0 | NBSP | Zs | GL | XX | Sp | Y | CS |
2060 | WJ | Cf | WJ | Format | Format | N | BN |
FEFF | ZWNBSP | Cf | WJ | Format | Format | N | BN |
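
The gc and bc columns for these four characters can be spot-checked with Python's standard `unicodedata` module. This is a minimal sketch: the module exposes General_Category and Bidi_Class but not Word_Break, Sentence_Break, or Line_Break, and its answers reflect the Unicode version bundled with the running Python rather than either table specifically. The helper name `gc_and_bc` is hypothetical.

```python
import unicodedata

CHARS = {
    "NNBSP": "\u202f",
    "NBSP": "\u00a0",
    "WJ": "\u2060",
    "ZWNBSP": "\ufeff",
}

def gc_and_bc(ch):
    """Return (General_Category, Bidi_Class) as reported by unicodedata."""
    return unicodedata.category(ch), unicodedata.bidirectional(ch)

print("Unicode data version:", unicodedata.unidata_version)
for abbr, ch in CHARS.items():
    gc, bc = gc_and_bc(ch)
    print(f"{ord(ch):04X} {abbr:<7} gc={gc} bc={bc}")
```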
No change is proposed to the White_Space property value of U+202F, even though gc=Pc is typically used for visible, connecting punctuation marks. Under the gc=Pc option, U+202F would become a special case of a connecting punctuation mark with no visible glyph: in other words, a small visible gap which nevertheless does not break words. If the White_Space property value were to be modified to White_Space=N, in an effort to keep the set relationship between White_Space=Y and gc=Zs more consistent, then the Sentence_Break classification of U+202F might also need to be updated, to prevent any anomalous formation of a sentence boundary at U+202F internal to a word segment. No Sentence_Break updates are needed for that purpose if the changes for U+202F are constrained to one of the two options noted above.
The UTC would appreciate feedback on these options, including information about any implementations which might be adversely affected, in particular by changing the General_Category of U+202F.