L2/06-353
From:
Mark Davis
Date:
2006-10-24
Subject:
ZWJ/NJ in identifiers
Eric and I had an action regarding ZWJ/NJ.
Here is a strawman document for the meeting.
Normally format characters are excluded from identifiers, because their usage
allows two apparently identical strings to represent different underlying
strings. However, for historical reasons, certain format characters are used to
mark visible distinctions that carry important semantic differences in certain
languages. Identifier systems that attempt to provide more natural
representations of terms, such as geographic names, company names, and so on,
should consider allowing these characters, but only in the following contexts.
In addition, any text that matches one of the regular expressions below must
consist of characters from a single script (after ignoring Common and Inherited
Script characters).
ZWNJ in the following contexts:
- At a position in a string where it breaks a cursive connection between
adjacent characters. That is, in a context based on Arabic shaping that
matches the following regular expression:
- /$L $T? ZWNJ $T? $R/
where:
- $T = [:Joining_Type=Transparent:]
- $R = [[:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:]]
- $L = [[:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:]]
- Example: Farsi <Noon, Alef, Meem, Heh, Alef, Farsi Yeh>. Without a ZWNJ,
the sequence means "names"; with a ZWNJ between Heh and Alef, it means
"a letter".
- In a conjunct context, that is, a sequence of the form
- /$L $M* $V ZWNJ $M* $L/
where:
- $L = [:General_Category=Letter:]
- $M = [:General_Category=Mark:]
- $V = [:Canonical_Combining_Class=Virama:]
- Example: in Malayalam, we recommend the use of ZWJ and ZWNJ to make
distinctions involving cillu forms. (See p. 337 of TUS 5.0.) The status
changes once the cillu forms are separately encoded in 5.1.
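The cursive-connection context for ZWNJ can be sketched in a few lines of
Python. This is only an illustration under stated assumptions: the
JOINING_TYPE table is a hypothetical excerpt covering just the letters of the
Farsi example (a real implementation would load Joining_Type from
ArabicShaping.txt), Transparent is approximated by General_Category Mn or Me,
and the function name is invented here. The sketch takes the reading under
which the Farsi example matches: the character before the ZWNJ is Dual_Joining
or Left_Joining, and the character after it is Dual_Joining or Right_Joining.
The single-script requirement is not checked.

```python
import unicodedata

ZWNJ = "\u200C"

# Hypothetical excerpt of Joining_Type data; only the example letters are listed.
# "D" = Dual_Joining, "R" = Right_Joining.
JOINING_TYPE = {
    "\u0646": "D",  # ARABIC LETTER NOON
    "\u0627": "R",  # ARABIC LETTER ALEF
    "\u0645": "D",  # ARABIC LETTER MEEM
    "\u0647": "D",  # ARABIC LETTER HEH
    "\u06CC": "D",  # ARABIC LETTER FARSI YEH
}


def joining_type(ch):
    """Return an approximate Joining_Type for ch (assumption, not the full data)."""
    if ch in JOINING_TYPE:
        return JOINING_TYPE[ch]
    # Approximation: nonspacing and enclosing marks are treated as Transparent.
    if unicodedata.category(ch) in ("Mn", "Me"):
        return "T"
    return "U"  # everything else is treated as non-joining


def zwnj_breaks_cursive_connection(s, i):
    """True if s[i] is a ZWNJ sitting between two characters that would join."""
    if s[i] != ZWNJ:
        return False
    left = i - 1
    if left >= 0 and joining_type(s[left]) == "T":
        left -= 1  # skip at most one Transparent character ($T?)
    right = i + 1
    if right < len(s) and joining_type(s[right]) == "T":
        right += 1  # skip at most one Transparent character ($T?)
    return (
        left >= 0
        and joining_type(s[left]) in ("D", "L")
        and right < len(s)
        and joining_type(s[right]) in ("D", "R")
    )


# <Noon, Alef, Meem, Heh, ZWNJ, Alef, Farsi Yeh>: the ZWNJ is at index 4.
example = "\u0646\u0627\u0645\u0647\u200C\u0627\u06CC"
print(zwnj_breaks_cursive_connection(example, 4))
```

A ZWNJ placed after Alef, which does not connect to the following letter in
any case, is rejected by the same check, as is a ZWNJ between Latin letters.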
ZWJ in the following contexts:
- In a conjunct context, that is, a sequence of the form
- /$L $M* $V ZWJ $M* $L/
where:
- $L = [:General_Category=Letter:]
- $M = [:General_Category=Mark:]
- $V = [:Canonical_Combining_Class=Virama:]
- Example: Devanagari RA + VIRAMA + ZWJ + KA
- Example: Sinhala 'ශ්රී ලංකා'
(the country 'Sri Lanka'), which uses both a space character and a
ZWJ. Removing the space gives
'ශ්රීලංකා' which is still readable, but removing the
ZWJ completely modifies the
appearance of the 'Sri' cluster and gives the following text: 'ශ්රී
ලංකා'.
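The conjunct contexts given for ZWNJ and ZWJ differ only in the joiner, so a
single check can cover both. The following Python sketch uses only the
standard unicodedata module; the function names are invented for
illustration, Virama is taken to be Canonical_Combining_Class 9, and the
single-script requirement is not checked.

```python
import unicodedata

ZWNJ = "\u200C"
ZWJ = "\u200D"
VIRAMA_CCC = 9  # numeric Canonical_Combining_Class value for Virama


def is_letter(ch):
    return unicodedata.category(ch).startswith("L")


def is_mark(ch):
    return unicodedata.category(ch).startswith("M")


def is_virama(ch):
    return unicodedata.combining(ch) == VIRAMA_CCC


def joiner_in_conjunct_context(s, i):
    """True if s[i] is ZWJ or ZWNJ in the context $L $M* $V <joiner> $M* $L."""
    if s[i] not in (ZWJ, ZWNJ):
        return False
    # $V: the character immediately before the joiner must be a virama.
    if i == 0 or not is_virama(s[i - 1]):
        return False
    # $L $M*: walk left over any marks, then require a letter.
    left = i - 2
    while left >= 0 and is_mark(s[left]):
        left -= 1
    if left < 0 or not is_letter(s[left]):
        return False
    # $M* $L: walk right over any marks, then require a letter.
    right = i + 1
    while right < len(s) and is_mark(s[right]):
        right += 1
    return right < len(s) and is_letter(s[right])


# Devanagari RA + VIRAMA + ZWJ + KA: the ZWJ is at index 2.
print(joiner_in_conjunct_context("\u0930\u094D\u200D\u0915", 2))
```

Because a virama is itself a mark, walking left over marks before requiring a
letter accepts the same strings as the regular expression while needing only
one pass in each direction.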
Because these characters are rare, checking these contexts does not have any
appreciable performance implications. Note that while it would be possible to
make the contexts listed above somewhat narrower, in practice there is no
advantage in doing so, and the contexts as stated are computationally simpler.