Revision: 1.18, Mon Apr 30 19:51:04 2007 UTC (2 years, 6 months ago) by mdavis
Branch: MAIN
CVS Tags: HEAD
Changes since 1.17: +1 -1 lines
Updated with information from Paul Nelson on Khmer

<html>

<head>
<meta http-equiv="Content-Language" content="en-us">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="stylesheet" type="text/css" href="http://www.unicode.org/webscripts/standard_styles.css">
<title>Allowing Joiner Characters in Identifiers</title>
</head>

<body style="margin:2em">

<h1>Public Review Issue #96</h1>
<h2>Allowing Special Characters in Identifiers</h2>
	<table border="0" id="table3" cellspacing="0" cellpadding="3">
		<tr>
			<th nowrap valign="top"><i>Revision 3</i></th>
			<th nowrap valign="top"><i>04-19-2007</i></th>
			<td align="left" valign="top"><i>Significantly tightened the requirements for ZWJ and ZWNJ by reducing the number of possible scripts, and 
			simplifying the sequences. Also added the equivalent characters needed for Mongolian.</i></td>
		</tr>
		</table>
<p><i>This PRI affects the use of special characters (ZWJ, ZWNJ and Mongolian separators) in identifiers. It may be 
relevant in a variety of contexts, including such areas as internationalized domain 
names for Arabic, Persian, Sinhala, Khmer, and Malayalam. If you believe that there are any other languages requiring the use of special characters, please respond 
as directed on <a href="http://www.unicode.org/review/">http://unicode.org/review</a> and include the PRI number and Revision Number in your message.</i></p>
<p>The use of format characters in identifiers is problematic
because the formatting effects they represent are normally just stylistic or otherwise out of scope for identifiers. To make matters worse, format characters can be 
misapplied so that users create strings that look the same but actually contain different characters, which can create security problems 
(see
<a href="http://unicode.org/reports/tr36/">
UTR #36: Unicode Security Considerations</a>).</p>

<p>For these reasons format characters are normally excluded from
Unicode identifiers. However, visible distinctions created by certain
format characters (particularly the <i>joiner controls</i>) are necessary and
carry meaning in certain languages. A blanket exclusion of format
characters makes it impossible to create identifiers based on
certain words or phrases in those languages. Identifier 
systems that attempt to provide more natural representations of terms, such as 
geographic names, company names, and so on should consider allowing these 
characters, but limited to particular contexts where they are necessary.</p>


<p>The goal for such a restriction of format characters to particular
contexts is to</p>

<p><ol type="a">
<li>allow the use of these characters where required in normal text</li>
<li>exclude as many cases as possible where no visible distinction results</li>
<li>be simple enough to be easily implemented with standard mechanisms
such as regular expressions</li>
</ol>
</p>

<p>Normal usage, as meant here, does not include technical usage such as
mathematical expressions or pedagogical use (e.g., illustration of
half-forms or joining forms in isolation).</p>


<h2>Proposal</h2>
<p>Allow joiner controls (U+200C <font size="2">ZERO WIDTH NON-JOINER</font>
[ZWNJ] and U+200D <font size="2">ZERO WIDTH JOINER</font> [ZWJ]) and Mongolian separators (U+202F <font size="2">NARROW NO-BREAK SPACE</font> [NNBSP] and U+180B 
.. U+180D <i>Mongolian free variation selectors</i>) in the Unicode recommendations for identifiers, but only in the very limited contexts specified 
below.</p>
<p><b>Script Restriction.</b> In each of the following cases, the specified sequence must consist only of characters from a 
single script (after ignoring <i>Common</i> and <i>Inherited</i> script characters). </p>
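<p>A minimal sketch of the script restriction, in Python. The function toy_script is a hypothetical stand-in for the Unicode Script property: it approximates scripts by block ranges for a handful of characters only, and a real implementation would take the property values from the Unicode Character Database.</p>

```python
# Scripts that are ignored when applying the single-script restriction.
IGNORED = {"Common", "Inherited"}


def toy_script(ch):
    """Hypothetical stand-in for the Unicode Script property.

    Approximates scripts by block ranges for a few characters only.
    """
    cp = ord(ch)
    if cp in (0x200C, 0x200D):
        return "Inherited"  # ZWNJ, ZWJ
    if 0x0D00 <= cp <= 0x0D7F:
        return "Malayalam"
    if 0x0D80 <= cp <= 0x0DFF:
        return "Sinhala"
    if ch.isascii() and ch.isalpha():
        return "Latin"
    return "Common"


def is_single_script(seq, script_of=toy_script):
    """True if seq uses at most one script besides Common and Inherited."""
    scripts = {script_of(ch) for ch in seq} - IGNORED
    return len(scripts) <= 1
```

<p>Passing the script lookup as a parameter keeps the check independent of where the property data comes from.</p>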


<p><b>Performance. </b>Parsing identifiers can be a performance-sensitive task. However, these characters are quite 
rare in practice, so the regular expressions (or equivalent processing) would need 
to be invoked only rarely. These tests should therefore not add any significant performance cost overall.</p>
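<p>One way to keep the common case cheap is sketched below in Python: the context rules run only at positions where one of the special characters actually occurs. The names special_positions, contexts_ok and the context_check callback are hypothetical and serve only to illustrate the idea.</p>

```python
# The special characters covered by this proposal:
# ZWNJ, ZWJ, NNBSP, and U+180B .. U+180D.
SPECIAL = frozenset("\u200C\u200D\u202F\u180B\u180C\u180D")


def special_positions(identifier):
    """Return the indexes of special characters (empty for most identifiers)."""
    return [i for i, ch in enumerate(identifier) if ch in SPECIAL]


def contexts_ok(identifier, context_check):
    """Call context_check(identifier, i) only where a special character occurs.

    context_check is a hypothetical callback implementing the context rules.
    """
    return all(context_check(identifier, i) for i in special_positions(identifier))
```

<p>For an identifier without any of these characters, the callback is never invoked and the only cost is a single scan of the string.</p>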


<p>The characters and their contexts are given by the following:<br><br><b>A. 
ZWNJ in the following contexts:</b></p>
<ol>
	<li><b>Breaking a cursive connection. </b>That is, in the context based on 
	the Arabic Shaping property, consisting of:<ul>
		<li>A Left-Joining character, followed by zero or more Transparent 
		characters, followed by a ZWNJ, followed by zero or more Transparent 
		characters, followed by a Right-Joining character</li>
		<li>This corresponds to the following regular expression (in Perl-style syntax): <b>/$L <font color="#666666">$T*</font> ZWNJ <font color="#666666">$T*</font> $R/</b><br>where:<ul>
			<li>$T = [:Joining_Type=Transparent:]</li>
			<li>$R = [[:Joining_Type=Dual_Joining:][:Joining_Type=Right_Joining:]]
			</li>
			<li>$L = [[:Joining_Type=Dual_Joining:][:Joining_Type=Left_Joining:]]<br>&nbsp;</li>
		</ul></li>
		<li><span style="font-weight: bold">Example:</span> Farsi &lt;Noon, Alef, 
		Meem, Heh, Alef, Farsi Yeh&gt;. Without a ZWNJ, it translates to &quot;names&quot;; 
		with a ZWNJ between Heh and Alef, it means &quot;a letter&quot;. Figure 1 illustrates this.<br>
		<center>
			<table style="BORDER-COLLAPSE: collapse; vert-align: top" border="0" id="table2">
				<tr>
					<td>
						<p align="center"><b><br><font size="5">Figure 1.</font></b></td>
				</tr>
				<tr>
					<td><img border="0" src="pr-96.farsi.gif"></td>
				</tr>
			</table>			
			<br>
		</center></li>
	</ul></li>
	<li><b>In a conjunct context. </b>That is, a sequence of the form:<ul>
		<li>A Letter, followed by a 
		Virama, followed by a ZWNJ, followed by a Letter,<br>
		where the Letters and Virama are all in the <i>Malayalam</i> script, or they are all in the <i>Khmer</i> script<ul>
		<li><i>Issue: is the Khmer inclusion required semantically?</i></li>
	</ul></li>
		<li>This corresponds to the following regular expression (in Perl-style syntax): <b>/$L $V ZWNJ $L/</b><br>where:<ul>
			<li>$L = [:General_Category=Letter:]</li>
			<li>$V = [:Canonical_Combining_Class<wbr>=Virama:]</li>
		</ul></li>
		<li><span style="font-weight: bold">Example:</span> In Khmer, U+17A2 U+200C<i>(ZWNJ)</i> U+17CA U+17B7 U+17A2 U+17BB U+17CA U+17C7 [<span lang="KHM"><font size="5">អ៊ិអុ៊ះ</font></span>] 
		is a case where the first TRIISAP needs to be escaped, but the second does not (as there is a below-base vowel).</li>
		<li><span style="font-weight: bold">Example: </span>The Malayalam word for <i>eyewitness</i>. The form without the ZWNJ is incorrect in this case.<div align="center">
			<table style="BORDER-COLLAPSE: collapse; vert-align: top" border="0" id="table4">
				<tr>
					<td>
						<p align="center"><b><br><font size="5">Figure 2.</font></b></td>
				</tr>
				<tr>
					<td>
		<img border="0" src="pr-96.malayalam.gif" width="529" height="177"></td>
				</tr>
			</table>			
			</div>
		</li>
	</ul></li>
</ol>
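<p>The cursive-connection context in A.1 can be sketched in Python as follows. Python's standard library does not expose the Joining_Type property, so the character sets below are a small illustrative subset (assumed to match the values in ArabicShaping.txt) that is just large enough for the Farsi example. The function name is hypothetical.</p>

```python
import re

ZWNJ = "\u200C"

# Illustrative subset of Joining_Type values, not the full property.
DUAL_JOINING = "\u0645\u0646\u0647\u06CC"  # MEEM, NOON, HEH, FARSI YEH
RIGHT_JOINING = "\u0627"                   # ALEF
TRANSPARENT = "\u064E\u064F\u0650"         # FATHA, DAMMA, KASRA

L = f"[{DUAL_JOINING}]"                 # $L (this subset has no Left_Joining characters)
R = f"[{DUAL_JOINING}{RIGHT_JOINING}]"  # $R
T = f"[{TRANSPARENT}]"                  # $T

BEFORE = re.compile(f"{L}{T}*\\Z")  # $L $T* ending just before the ZWNJ
AFTER = re.compile(f"{T}*{R}")      # $T* $R starting just after the ZWNJ


def zwnj_breaks_cursive_connection(text, i):
    """True if text[i] is a ZWNJ in the context /$L $T* ZWNJ $T* $R/."""
    if text[i] != ZWNJ:
        return False
    return bool(BEFORE.search(text[:i])) and bool(AFTER.match(text[i + 1:]))


# Noon, Alef, Meem, Heh, ZWNJ, Alef, Farsi Yeh ("a letter")
word = "\u0646\u0627\u0645\u0647\u200C\u0627\u06CC"
print(zwnj_breaks_cursive_connection(word, 4))
```

<p>Splitting the pattern into a part before and a part after the ZWNJ lets each occurrence be tested on its own, even when one character serves as $R for one ZWNJ and as $L for the next.</p>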
<b><span id="st" name="st" class="st">B. ZWJ</span> in the following context:</b><ol>
	<li><b>In a conjunct context. </b>That is, a sequence of the form:<ul>
		<li>A Letter, followed by a Virama, followed by a ZWJ,<br>
		where the Letter and Virama are both in the Sinhala script</li>
		<li>This corresponds to the following regular expression (in Perl-style syntax): <b>/$L $V <span id="st" name="st" class="st">ZWJ</span>/</b><br>
		where:<ul>
			<li>$L = [:General_Category=Letter:]</li>
			<li>$V = [:Canonical_Combining_Class<wbr>=Virama:]</li>
		</ul></li>
		<li><span style="font-weight: bold">Example: </span>The Sinhala name of 
		the country &#39;Sri Lanka&#39;, shown in Figure 3A,
		uses both a space character and a ZWJ. Removing the space gives the text 
		in Figure 3B, which is still readable, but removing the ZWJ completely
		changes the appearance of the &#39;Sri&#39; cluster and gives the text in 
		Figure 3C.
		<center>
				<table style="BORDER-COLLAPSE: collapse; vert-align: top" border="0">
					<tr><td>
				<p align="center"><b><br>
				<font size="5">Figure 3.</font></b></td></tr>
					<tr><td><img border="0" src="pr-96.sinhala.gif"></td></tr>
				</table>
		</center>
		</li>
	</ul>
	</li>
</ol>
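<p>The conjunct contexts in A.2 and B.1 can be sketched together in Python. The standard unicodedata module provides General_Category and Canonical_Combining_Class (Virama is class 9) but not the Script property, so the script is approximated here by the character-name prefix. That approximation and the function names are assumptions made for this illustration.</p>

```python
import unicodedata

ZWNJ, ZWJ = "\u200C", "\u200D"


def is_letter(ch, script):
    """General_Category=Letter, with the script approximated by the character name."""
    return (unicodedata.category(ch).startswith("L")
            and unicodedata.name(ch, "").startswith(script + " "))


def is_virama(ch, script):
    """Canonical_Combining_Class=Virama (class 9), same script approximation."""
    return (unicodedata.combining(ch) == 9
            and unicodedata.name(ch, "").startswith(script + " "))


def joiner_in_conjunct_context(text, i):
    """True if the joiner at text[i] matches /$L $V ZWNJ $L/ or /$L $V ZWJ/."""
    if i < 2:
        return False
    letter, virama, joiner = text[i - 2], text[i - 1], text[i]
    if joiner == ZWJ:
        return is_letter(letter, "SINHALA") and is_virama(virama, "SINHALA")
    if joiner == ZWNJ and i + 1 < len(text):
        return any(
            is_letter(letter, s) and is_virama(virama, s) and is_letter(text[i + 1], s)
            for s in ("MALAYALAM", "KHMER")
        )
    return False
```

<p>Because each script is tested as a whole, a Letter and Virama from different scripts never satisfy the context, which also enforces the script restriction for these sequences.</p>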

<b><span id="st0" name="st" class="st">C.</span> Mongolian Separators (NNBSP or a Mongolian free variation selector [FVS]) in the following context:</b><ol>
	<li><b>Between Mongolian Letters. </b>That is, a sequence of the form:<ul>
		<li>A Mongolian Letter, followed by NNBSP or an FVS (U+180B .. U+180D), followed by a Mongolian Letter.</li>
		<li>This corresponds to the following regular expression (in Perl-style syntax): <b>/$ML $MS $ML/</b><br>
		where:<ul>
			<li>$ML = [[:General_Category=Letter:]&amp;[:Script=Mongolian:]]</li>
			<li>$MS = [\u202F \u180B \u180C \u180D]</li>
		</ul></li>
		<li><span style="font-weight: bold">Example:</span> See pages 454-455 of <i>The Unicode Standard, Version 5.0.</i></li>
	</ul>
	</li>
</ol>
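<p>The Mongolian context /$ML $MS $ML/ can be sketched in Python as follows. As the standard library has no Script property, Script=Mongolian is approximated by the character-name prefix. That approximation and the function names are assumptions made for this illustration.</p>

```python
import unicodedata

# $MS: NNBSP and the three Mongolian free variation selectors.
SEPARATORS = frozenset("\u202F\u180B\u180C\u180D")


def is_mongolian_letter(ch):
    """$ML: General_Category=Letter, Script=Mongolian approximated by name."""
    return (unicodedata.category(ch).startswith("L")
            and unicodedata.name(ch, "").startswith("MONGOLIAN "))


def separator_between_mongolian_letters(text, i):
    """True if text[i] is in $MS and sits in the context /$ML $MS $ML/."""
    return (text[i] in SEPARATORS
            and 0 < i < len(text) - 1
            and is_mongolian_letter(text[i - 1])
            and is_mongolian_letter(text[i + 1]))
```

<p>A separator at the start or end of an identifier, or next to a non-Mongolian letter, is rejected by this check.</p>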
<hr>
<h2>

<font size="4"><span style="font-weight: bold;">Comparison Cases</span></font></h2>
<p>The above description restricts the usage of Joiner and Nonjoiner quite substantially compared with Revision 1 of this Public Review Issue. This restriction is based on a review of the cases where these 
characters would be required for semantic differences relevant to identifiers. The other cases of Joiner or Nonjoiner usage specified in the Unicode 
Standard were not considered to be required for identifiers. They are listed here for comparison, so that reviewers can check 
whether there are good reasons for including any of them in the above list. </p>
<h3><span style="font-weight: bold;">Non-Semantic </span></h3>
<p>Cases that do not carry semantic differences (or at least differences which are not sufficient to be required in identifiers for modern languages):</p>
<ol>
	<li>Devanagari, Half-forms as in Tables 9-2 and 9-4, pp 309, 311 </li>
	<li>Bengali, Figures 9-10 and 9-11, p 314; and RA + JOINER + VIRAMA + YA, p 316</li>
	<li>Gurmukhi Table 9-10, p 320</li>
	<li>Kannada, p 334</li>
	<li>Myanmar, p 380</li>
	<li>Buginese, Figure 11-5, p 398</li>
	<li>Phags-Pa, Table 10-2 (since Phags-Pa is a historic script, it is not suitable for general purpose identifiers).</li>
	<li><span class="q">Sinhala, use of ZWJ in front of the virama to form touching consonants, &quot;used in classical and Buddhist texts&quot;.</span></li>
</ol>
<h3>
<span style="font-weight: bold;">Superseded</span></h3>
<p>Sequences that have been superseded in usage by other characters, or should be in the near future (the characters having 
already been approved by the Unicode Consortium, and slated for Unicode 5.1): </p>
<ol>
	<li>Devanagari, RA + VIRAMA + ZWJ</li>
	<li>Bengali, TA + VIRAMA + ZWJ</li>
	<li>Myanmar, LETTER + VIRAMA + ZWNJ (see <a href="http://www.unicode.org/notes/tn11/">UTN #11</a>)</li>
	<li>Malayalam, LETTER + VIRAMA + ZWJ</li>
</ol>

</body>

</html>
