L2/06-024

Subject: ZWJ/ZWNJ in Identifiers
Source: Mark Davis
Date: 2006/01/26

ZWJ/ZWNJ in Identifiers

Michel Suignard wrote about the issue of ZWJ and ZWNJ in Internationalized  
Domain Names. The following is an email trail discussing the issue, which  
is relevant not only for IDN but for general identifiers (UAX #31).

Michel:

> In the past few weeks, I have been doing some research on idn.idn,
looking for what would be the strings for the root level ccTLDs in their  
native writing.
> Doing this, I found a troubling issue. Basically, for two countries, Sri
Lanka and Myanmar, you need to use the ZWJ character (200D Zero Width
Joiner) to display the country name correctly. But because ZWJ is
prohibited by Nameprep, you can't correctly display the native name of
these two countries in a domain name. Removing the ZWJ completely from the
name alters the rendering into something that is not even close to the
intended rendering.
>
> A good example is shown in http://en.wikipedia.org/wiki/Sri_Lanka where  
the Sinhala image (in the top right corner) has it right while the inline  
text representation has it wrong (because it does not include the ZWJ in  
the 'Sri' cluster).
>
> A similar case exists for Myanmar, although evidence is harder to
produce, as Unicode examples of how to write it are rare (in fact we had to
construct it from its visual appearance and the Unicode Standard's
information on Myanmar) and Unicode-compliant fonts are even rarer.
>
> Similar cases are likely to exist for ZWNJ (200C) which also creates  
significantly different visual rendering, although none seems to affect  
country names. ZWJ and ZWNJ are used for some scripts from South and South  
East Asia which use the Virama model to modify the rendering of 'dead'  
consonants.
> It looks like the ZWJ/ZWNJ processing in Nameprep/Stringprep could  
require further study.



Mark:


> It definitely raises the issue when you can't spell Srilanka in IDN. ZWJ
> and ZWNJ, and other default-ignorable characters, are disallowed from
> identifiers, and in the results of StringPrep, precisely for security
> reasons. You don't want characters that are normally invisible to be
> making a difference in identifiers. This was discussed at some length in
> the UTC.
>
> However, I talked over the issue with some people on the ICU team, and
> one possible approach we could take to this issue is to add an identifier
> profile to accommodate it, in one or both of #31 or #39:
>
> http://www.unicode.org/reports/tr31/tr31-6.html
> http://www.unicode.org/reports/tr39/tr39-1.html
>
> This profile would retain ZWJ or ZWNJ (or possibly other characters) in
> very specific cases, those where it is known to mark a semantic
> difference (and have a visual display). And these contexts would have to
> be machine-testable. For example, the profile might contain a list like:
>
> Retain ZWNJ in the following contexts:
> 1. before = [:ccc=virama:]
> 2. ...
>
> We could then continue to recommend a particular identifier profile for
> IDN, one that encompassed this. I really don't think we want to modify
> the standard definition of identifiers, since that is -- by design --
> aimed at mimicking the normal programmer usage for identifiers of the
> grammar:
>
> id ::= <start> <continue>*
>
> However, as a profile it works, and is something we could incorporate.
> And StringPrep (http://ietf.org/rfc/rfc3454.txt) already contains a
> clause of some complexity for BIDI, so that wouldn't be a stretch there:
>
>   1) The characters in section 5.8 MUST be prohibited.
>
>   2) If a string contains any RandALCat character, the string MUST NOT
>      contain any LCat character.
>
>   3) If a string contains any RandALCat character, a RandALCat
>      character MUST be the first character of the string, and a
>      RandALCat character MUST be the last character of the string.
>
> If we were to do this, we would need to identify precisely those
> characters that were at issue, and precisely the contexts where they
> needed to be retained. We really want these limited to *only* where there
> is both a visual difference and an important semantic difference (such as
> the existence of a minimal pair of different words that are identical
> other than these characters).
>
> Michel, if this seems reasonable, perhaps you could ask Peter and some of
> the other MS experts to come up with a list. The only other case I know
> of is something similar in Farsi, where characters need to break -- and
> it has a semantic difference. We can then prepare a paper for the UTC.
>
> Here is some background info.
>
> A. List of characters currently deleted (note: not prohibited, but
> deleted and thus ignored) in StringPrep. Note that these are limited to
> U3.2.
>
> 3.1 Commonly mapped to nothing
>
>   The following characters are simply deleted from the input (that is,
>   they are mapped to nothing) because their presence or absence in
>   protocol identifiers should not make two strings different. They are
>   listed in Table B.1.
>
>   Some characters are only useful in line-based text, and are otherwise
>   invisible and ignored.
>
>   00AD; SOFT HYPHEN
>   1806; MONGOLIAN TODO SOFT HYPHEN
>   200B; ZERO WIDTH SPACE
>   2060; WORD JOINER
>   FEFF; ZERO WIDTH NO-BREAK SPACE
>
>   Some characters affect glyph choice and glyph placement, but do not
>   bear semantics.
>
>   034F; COMBINING GRAPHEME JOINER
>   180B; MONGOLIAN FREE VARIATION SELECTOR ONE
>   180C; MONGOLIAN FREE VARIATION SELECTOR TWO
>   180D; MONGOLIAN FREE VARIATION SELECTOR THREE
>   200C; ZERO WIDTH NON-JOINER
>   200D; ZERO WIDTH JOINER
>   FE00; VARIATION SELECTOR-1


>   ...
>   FE0F; VARIATION SELECTOR-16
>
> B. List of characters prohibited in StringPrep
>
> 5.2 Control characters
>
>   Control characters (or characters with control function) cannot be
>   seen and can cause unpredictable results when displayed. Note that
>   the list below is split into two tables in appendix C: Table C.2.1
>   contains the ASCII code points, while Table C.2.2 contains the
>   non-ASCII code points. Most profiles of this document that want to
>   prohibit control characters will want to include both tables.
>
>   0000-001F; [CONTROL CHARACTERS]
>   007F; DELETE
>   0080-009F; [CONTROL CHARACTERS]
>   06DD; ARABIC END OF AYAH
>   070F; SYRIAC ABBREVIATION MARK
>   180E; MONGOLIAN VOWEL SEPARATOR
>   200C; ZERO WIDTH NON-JOINER
>   200D; ZERO WIDTH JOINER
>   2028; LINE SEPARATOR
>   2029; PARAGRAPH SEPARATOR
>   2060; WORD JOINER
>   2061; FUNCTION APPLICATION
>   2062; INVISIBLE TIMES
>   2063; INVISIBLE SEPARATOR
>   206A-206F; [CONTROL CHARACTERS]
>   FEFF; ZERO WIDTH NO-BREAK SPACE
>   FFF9-FFFC; [CONTROL CHARACTERS]
>   1D173-1D17A; [MUSICAL CONTROL CHARACTERS]
>
> C. Invisible characters
>
> Here is a list of "interesting" characters for comparison, formed by
> taking default-ignorables and subtracting.
>
> [[:defaultignorablecodepoint:] - [:cc:] - [:cs:] - [:cn:]
>  - [:noncharactercodepoint:] - [:Deprecated:] - [:Bidi_Control:]
>  - [:Block=Tags:] - [:Block=Musical_Symbols:]
>  - [:Block=Variation_Selectors:]
>  - [:Block=Variation_Selectors_Supplement:]]
>
>   00AD SOFT HYPHEN
>   034F COMBINING GRAPHEME JOINER
>   0600 ARABIC NUMBER SIGN
>   0601 ARABIC SIGN SANAH
>   0602 ARABIC FOOTNOTE MARKER
>   0603 ARABIC SIGN SAFHA
>   06DD ARABIC END OF AYAH
>   070F SYRIAC ABBREVIATION MARK
>   115F HANGUL CHOSEONG FILLER
>   1160 HANGUL JUNGSEONG FILLER
>   17B4 KHMER VOWEL INHERENT AQ
>   17B5 KHMER VOWEL INHERENT AA
>   180B MONGOLIAN FREE VARIATION SELECTOR ONE
>   180C MONGOLIAN FREE VARIATION SELECTOR TWO
>   180D MONGOLIAN FREE VARIATION SELECTOR THREE
>   200B ZERO WIDTH SPACE
>   200C ZERO WIDTH NON-JOINER
>   200D ZERO WIDTH JOINER
>   200E LEFT-TO-RIGHT MARK
>   200F RIGHT-TO-LEFT MARK
>   202A LEFT-TO-RIGHT EMBEDDING
>   202B RIGHT-TO-LEFT EMBEDDING
>   202C POP DIRECTIONAL FORMATTING
>   202D LEFT-TO-RIGHT OVERRIDE
>   202E RIGHT-TO-LEFT OVERRIDE
>   2060 WORD JOINER
>   2061 FUNCTION APPLICATION
>   2062 INVISIBLE TIMES
>   2063 INVISIBLE SEPARATOR
>   3164 HANGUL FILLER
>   FEFF ZERO WIDTH NO-BREAK SPACE
>   FFA0 HALFWIDTH HANGUL FILLER
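
The machine-testable context Mark describes ("before = [:ccc=virama:]") could be sketched roughly as follows. This is only an illustration, not a proposed profile: it implements just the first listed context, it assumes the same rule is applied to ZWJ as well as ZWNJ, and the function name retain_joiners_in_context is made up for this example. It relies on Python's unicodedata.combining(), which returns the canonical combining class (9 for virama characters).

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
VIRAMA_CCC = 9   # canonical combining class shared by virama characters


def retain_joiners_in_context(text: str) -> str:
    """Keep ZWJ/ZWNJ only when the preceding character has ccc=virama.

    Hypothetical helper: joiners in any other position are deleted,
    which is what StringPrep currently does with all of them.
    """
    kept = []
    for index, ch in enumerate(text):
        if ch in (ZWJ, ZWNJ):
            previous = text[index - 1] if index > 0 else ""
            if not previous or unicodedata.combining(previous) != VIRAMA_CCC:
                continue  # outside the allowed context: drop the joiner
        kept.append(ch)
    return "".join(kept)


# U+0DCA SINHALA SIGN AL-LAKUNA has ccc=9, so the ZWJ after it survives.
sinhala = "\u0dc1\u0dca\u200d\u0dbb\u0dd3"
assert retain_joiners_in_context(sinhala) == sinhala

# Between two Latin letters the joiner is not in an allowed context.
assert retain_joiners_in_context("pay\u200dpal") == "paypal"
```

A real profile would need the precise list of characters and contexts that Mark asks for, so the single rule here should be read as a placeholder.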


Michel:


> Before I sent the original message, I had a chat with Peter and we
> also explored a similar idea. In short, it looks like a big can of
> worms, because the exclusion rules are not that simple to write and even
> worse, can depend upon the layout engine and the font features. It is
> worth investigating, but it won't happen that fast as it would require
> finding a common denominator among all layout engines and fonts that is
> deemed essential to preserve visually without creating additional visual
> confusability.
>
> As is often the case, the devil is in the details. But I agree that
> introducing such a concept in either #31 or #39 is a good idea.
>
> The problem of course is that it won't solve anything for current IDNA
> where it is now excluded.


Mark:


>> In short, it looks like a big can of worms, because the exclusion rules  
are not that simple to write and even worse, can depend upon the layout  
engine and the font features.
>
>
> That should not be required. We should only concentrate on cases where  
there is a true semantic difference, and ideally a visual difference. So it  
should not depend on layout engine or font features. Perhaps I should just  
make up a paper on the basis of what I wrote, and leave room for the  
discussion of the issues.
>
>> The problem of course is that it won't solve anything for current IDNA  
where it is now excluded.
>
>
> While it won't solve anything for the current IDNA, it should be further  
evidence of the need to upgrade. BTW, as originally designed, ZWJ and ZWNJ  
were really only for *exceptional* cases, and only for rendering (not  
semantic) differences. I think over time we have unfortunately drifted away  
from that, but it would help if you would explain why it is that the word  
Srilanka needs the ZWJ or ZWNJ; why doesn't the normal rendering of the
sequence of letters work?


Michel:


> My understanding is that Sri Lanka is written as:
> \x0dc1\x0dca\x200d\x0dbb\x0dd3\x0020\x0dbd\x0d82\x0d9a\x0dcf
>
> 'Sri' uses a special form of the consonant conjunct 'shr'. I am really
> not a South Asian script expert so I have to take Peter's and others'
> word on that.
>
> For Myanmar, we came up with:
> \x1019\x1039\x101B\x1014\x1039\x200C\x1019\x102C
>
> (The ZWNJ makes the 2nd virama visible)
>
> If we need ZWJ/ZWNJ to display two of the South Asian country names it
> seems to me that the original mandate for usage of ZWJ/ZWNJ in that
> region has failed miserably.
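
To make the effect on these two strings concrete, the following sketch applies the "commonly mapped to nothing" step (RFC 3454, Table B.1, as listed in Mark's background section A) to the code point sequences Michel gives. The strings are taken as written in his message and are not independently verified spellings; the helper names map_to_nothing and code_points are made up for this illustration, and the rest of Nameprep (case folding, normalization, prohibition checks) is deliberately left out.

```python
# Code points that StringPrep (RFC 3454, Table B.1) maps to nothing.
MAPPED_TO_NOTHING = {
    0x00AD, 0x034F, 0x1806, 0x180B, 0x180C, 0x180D,
    0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,
} | set(range(0xFE00, 0xFE10))  # VARIATION SELECTOR-1 .. -16


def map_to_nothing(text: str) -> str:
    """Delete every character listed in Table B.1 (hypothetical helper)."""
    return "".join(ch for ch in text if ord(ch) not in MAPPED_TO_NOTHING)


def code_points(text: str) -> str:
    """Render a string as space-separated U+XXXX values for inspection."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


# Sequences exactly as given in Michel's message.
SRI_LANKA = "\u0dc1\u0dca\u200d\u0dbb\u0dd3\u0020\u0dbd\u0d82\u0d9a\u0dcf"
MYANMAR = "\u1019\u1039\u101b\u1014\u1039\u200c\u1019\u102c"

for name, original in (("Sri Lanka", SRI_LANKA), ("Myanmar", MYANMAR)):
    stripped = map_to_nothing(original)
    print(name, "before:", code_points(original))
    print(name, "after: ", code_points(stripped))
    # The joiner is silently lost, so the two spellings become identical.
    assert stripped != original
    assert len(stripped) == len(original) - 1
```

The before and after sequences differ only in the missing U+200D or U+200C, which is the loss Michel describes: the stored identifier can no longer request the intended rendering.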