L2/06-024

Subject: ZWJ/ZWNJ in Identifiers
Source: Mark Davis
Date: 2006/01/26

ZWJ/ZWNJ in Identifiers

Michel Suignard wrote about the issue of ZWJ and ZWNJ in Internationalized Domain Names. The following is an email trail discussing the issue, which is relevant not only for IDN but for general identifiers (UAX #31).

Michel:

> In the past few weeks, I have been doing some research on idn.idn, looking for what would be the strings for the root level ccTLDs in their native writing.
>
> Doing this, I found a troubling issue. Basically, for two countries, Sri Lanka and Myanmar, you need to use the ZWJ character (200D Zero Width Joiner) to display the country name correctly. But because ZWJ is prohibited by Nameprep, you can't display the native name of these two countries correctly in a domain name. Removing the ZWJ completely from the name alters the rendering into something that is not even close to the intended rendering.
>
> A good example is shown in http://en.wikipedia.org/wiki/Sri_Lanka where the Sinhala image (in the top right corner) has it right while the inline text representation has it wrong (because it does not include the ZWJ in the 'Sri' cluster).
>
> A similar case exists for Myanmar, although evidence is harder to produce, as Unicode examples of how to write it are rare (in fact we had to construct it based on the visual appearance and the Unicode Standard information on Myanmar) and Unicode-compliant fonts are even rarer.
>
> Similar cases are likely to exist for ZWNJ (200C), which also creates significantly different visual rendering, although none seems to affect country names. ZWJ and ZWNJ are used for some scripts from South and South East Asia which use the Virama model to modify the rendering of 'dead' consonants.
>
> It looks like the ZWJ/ZWNJ processing in Nameprep/Stringprep could require further study.

Mark:

> It definitely raises the issue when you can't spell Srilanka in IDN.

ZWJ and ZWNJ, and other default-ignorable characters, are disallowed from identifiers, and in the results of StringPrep, precisely for security reasons. You don't want characters that are normally invisible to be making a difference in identifiers. This was discussed at some length in the UTC.

However, I talked over the issue with some people on the ICU team, and one possible approach we could take to this issue is to add an identifier profile to accommodate it, in one or both of #31 or #39:

http://www.unicode.org/reports/tr31/tr31-6.html
http://www.unicode.org/reports/tr39/tr39-1.html

This profile would retain ZWJ or ZWNJ (or possibly other characters) in very specific cases, those where it is known to mark a semantic difference (and have a visual display). And these contexts would have to be machine-testable. For example, the profile might contain a list like:

Retain ZWNJ in the following contexts:
1. before = [:ccc=virama:]
2. ...

We could then continue to recommend a particular identifier profile for IDN, one that encompassed this. I really don't think we want to modify the standard definition of identifiers, since that is, by design, aimed at mimicking the normal programmer usage for identifiers of the grammar:

id ::= *

However, as a profile it works, and is something we could incorporate. And StringPrep (http://ietf.org/rfc/rfc3454.txt) already contains a clause of some complexity for BIDI, so that wouldn't be a stretch there:

1) The characters in section 5.8 MUST be prohibited.
2) If a string contains any RandALCat character, the string MUST NOT contain any LCat character.
3) If a string contains any RandALCat character, a RandALCat character MUST be the first character of the string, and a RandALCat character MUST be the last character of the string.

If we were to do this, we would need to identify precisely those characters that were at issue, and precisely the contexts where they needed to be retained.
We really want these limited to *only* where there is both a visual difference and an important semantic difference (such as the existence of a minimal pair of different words that are identical other than these characters).

Michel, if this seems reasonable, perhaps you could ask Peter and some of the other MS experts to come up with a list. The only other case I know of is something similar in Farsi, where characters need to break, and it has a semantic difference. We can then prepare a paper for the UTC.

Here is some background info.

A. List of characters currently deleted (note: not prohibited, but deleted and thus ignored) in StringPrep. Note that these are limited to U3.2.

3.1 Commonly mapped to nothing

The following characters are simply deleted from the input (that is, they are mapped to nothing) because their presence or absence in protocol identifiers should not make two strings different. They are listed in Table B.1.

Some characters are only useful in line-based text, and are otherwise invisible and ignored.

00AD; SOFT HYPHEN
1806; MONGOLIAN TODO SOFT HYPHEN
200B; ZERO WIDTH SPACE
2060; WORD JOINER
FEFF; ZERO WIDTH NO-BREAK SPACE

Some characters affect glyph choice and glyph placement, but do not bear semantics.

034F; COMBINING GRAPHEME JOINER
180B; MONGOLIAN FREE VARIATION SELECTOR ONE
180C; MONGOLIAN FREE VARIATION SELECTOR TWO
180D; MONGOLIAN FREE VARIATION SELECTOR THREE
200C; ZERO WIDTH NON-JOINER
200D; ZERO WIDTH JOINER
FE00; VARIATION SELECTOR-1
...
FE0F; VARIATION SELECTOR-16

B. List of characters prohibited in StringPrep

5.2 Control characters

Control characters (or characters with control function) cannot be seen and can cause unpredictable results when displayed. Note that the list below is split into two tables in appendix C: Table C.2.1 contains the ASCII code points, while Table C.2.2 contains the non-ASCII code points. Most profiles of this document that want to prohibit control characters will want to include both tables.

0000-001F; [CONTROL CHARACTERS]
007F; DELETE
0080-009F; [CONTROL CHARACTERS]
06DD; ARABIC END OF AYAH
070F; SYRIAC ABBREVIATION MARK
180E; MONGOLIAN VOWEL SEPARATOR
200C; ZERO WIDTH NON-JOINER
200D; ZERO WIDTH JOINER
2028; LINE SEPARATOR
2029; PARAGRAPH SEPARATOR
2060; WORD JOINER
2061; FUNCTION APPLICATION
2062; INVISIBLE TIMES
2063; INVISIBLE SEPARATOR
206A-206F; [CONTROL CHARACTERS]
FEFF; ZERO WIDTH NO-BREAK SPACE
FFF9-FFFC; [CONTROL CHARACTERS]
1D173-1D17A; [MUSICAL CONTROL CHARACTERS]

C. Invisible characters

Here is a list of "interesting" characters for comparison, formed by taking default-ignorables and subtracting.

[[:defaultignorablecodepoint:] - [:cc:] - [:cs:] - [:cn:] - [:noncharactercodepoint:] - [:Deprecated:] - [:Bidi_Control:] - [:Block=Tags:] - [:Block=Musical_Symbols:] - [:Block=Variation_Selectors:] - [:Block=Variation_Selectors_Supplement:]]

00AD SOFT HYPHEN
034F COMBINING GRAPHEME JOINER
0600 ARABIC NUMBER SIGN
0601 ARABIC SIGN SANAH
0602 ARABIC FOOTNOTE MARKER
0603 ARABIC SIGN SAFHA
06DD ARABIC END OF AYAH
070F SYRIAC ABBREVIATION MARK
115F HANGUL CHOSEONG FILLER
1160 HANGUL JUNGSEONG FILLER
17B4 KHMER VOWEL INHERENT AQ
17B5 KHMER VOWEL INHERENT AA
180B MONGOLIAN FREE VARIATION SELECTOR ONE
180C MONGOLIAN FREE VARIATION SELECTOR TWO
180D MONGOLIAN FREE VARIATION SELECTOR THREE
200B ZERO WIDTH SPACE
200C ZERO WIDTH NON-JOINER
200D ZERO WIDTH JOINER
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK
202A LEFT-TO-RIGHT EMBEDDING
202B RIGHT-TO-LEFT EMBEDDING
202C POP DIRECTIONAL FORMATTING
202D LEFT-TO-RIGHT OVERRIDE
202E RIGHT-TO-LEFT OVERRIDE
2060 WORD JOINER
2061 FUNCTION APPLICATION
2062 INVISIBLE TIMES
2063 INVISIBLE SEPARATOR
3164 HANGUL FILLER
FEFF ZERO WIDTH NO-BREAK SPACE
FFA0 HALFWIDTH HANGUL FILLER

Michel:

> Before I sent the original message, I had some chat with Peter and we also explored a similar idea.
> In short, it looks like a big can of worms, because the exclusion rules are not that simple to write and even worse, can depend upon the layout engine and the font features. It is worth investigating, but it won't happen that fast as it would require finding a common denominator among all layout/font that is deemed essential to preserve visually without creating additional visual confusability.
>
> As often, the devil is in the details. But I agree that introducing such a concept in either #31 or #39 is a good idea.
>
> The problem of course is that it won't solve anything for current IDNA where it is now excluded.

Mark:

>> In short, it looks like a big can of worms, because the exclusion rules are not that simple to write and even worse, can depend upon the layout engine and the font features.
>
> That should not be required. We should only concentrate on cases where there is a true semantic difference, and ideally a visual difference. So it should not depend on layout engine or font features. Perhaps I should just make up a paper on the basis of what I wrote, and leave room for the discussion of the issues.
>
>> The problem of course is that it won't solve anything for current IDNA where it is now excluded.
>
> While it won't solve anything for the current IDNA, it should be further evidence of the need to upgrade. BTW, as originally designed, ZWJ and ZWNJ were really only for *exceptional* cases, and only for rendering (not semantic) differences. I think over time we have unfortunately drifted away from that, but it would help if you would explain why it is that the word Srilanka needs the ZWJ or ZWNJ; why doesn't the normal rendering of the sequence of letters work?

Michel:

> My understanding is that Sri Lanka is written as:
> \x0dc1\x0dca\x200d\x0dbb\x0dd3\x0020\x0dbd\x0d82\x0d9a\x0dcf
>
> 'Sri' uses a special form of the consonant conjunct 'shr'.
> I am really not a South Asian script expert, so I have to take Peter's and others' words in that respect.
>
> For Myanmar, we came up with:
> \x1019\x1039\x101B\x1014\x1039\x200C\x1019\x102C
>
> (The ZWNJ makes the 2nd virama visible)
>
> If we need ZWJ/ZWNJ to display two of the South Asian country names, it seems to me that the original mandate for usage of ZWJ/ZWNJ in that region has failed miserably.
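
As an illustration of the two behaviours discussed in this thread, the following Python sketch may help. It first applies the StringPrep "mapped to nothing" step (Table B.1, through the standard-library stringprep module) to the two code point sequences Michel quotes, and then applies a context rule of the kind Mark suggests, retaining ZWJ/ZWNJ only directly after a character whose canonical combining class is 9 (virama). The function names and the exact rule are assumptions made for illustration; they are not taken from any published profile.

```python
import stringprep
import unicodedata

ZWNJ = "\u200c"
ZWJ = "\u200d"
JOINERS = {ZWNJ, ZWJ}

# Code point sequences as quoted in the email trail.
SRI_LANKA = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0020\u0dbd\u0d82\u0d9a\u0dcf"
MYANMAR = "\u1019\u1039\u101b\u1014\u1039\u200c\u1019\u102c"

VIRAMA_CCC = 9  # canonical combining class shared by virama characters


def map_to_nothing(text: str) -> str:
    """Delete every character listed in StringPrep Table B.1."""
    return "".join(ch for ch in text if not stringprep.in_table_b1(ch))


def map_with_joiner_contexts(text: str) -> str:
    """Hypothetical profile: like map_to_nothing, but keep ZWJ/ZWNJ
    when the previously retained character is a virama (ccc=9)."""
    out = []
    for ch in text:
        if ch in JOINERS:
            # Retain the joiner only in the allowed context.
            if out and unicodedata.combining(out[-1]) == VIRAMA_CCC:
                out.append(ch)
            continue
        if stringprep.in_table_b1(ch):
            continue  # other Table B.1 characters are still deleted
        out.append(ch)
    return "".join(out)


def code_points(text: str) -> str:
    """Render a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    for name, value in (("Sri Lanka", SRI_LANKA), ("Myanmar", MYANMAR)):
        print(name)
        print("  input:          ", code_points(value))
        print("  Table B.1 only: ", code_points(map_to_nothing(value)))
        print("  with contexts:  ", code_points(map_with_joiner_contexts(value)))
```

In both quoted sequences the joiner directly follows a virama (0DCA in the Sinhala string, 1039 in the Myanmar string), so the context rule keeps it, while a joiner placed between two ordinary letters is still deleted. Whether a single "after virama" context would be sufficient, or too permissive, is exactly the open question raised in the thread.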