Re: Long-Encoded Restricted Characters in High Frequency Modern Use

From: Mark Davis ☕️ <mark_at_macchiato.com>
Date: Sat, 31 May 2014 21:27:55 +0200

Mark <https://google.com/+MarkDavis>

 *— Il meglio è l’inimico del bene —*

On Fri, May 30, 2014 at 12:39 AM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> I am a little confused by the call for a review of UTS #39, Unicode
> Security Mechanisms (PRI #273). Are we being requested to
> report long-encoded 'restricted' characters in high frequency modern
> use? 'Restricted' refers to the classification in
> xidmodifications.txt.
>

First, the "restricted" classification is not meant for everyday use, but
specifically for programming identifiers and similar sorts of
identifiers. Moreover, UTS #39 sets up a framework, but the conformance
requirements are only that any modification is declared.

http://www.unicode.org/reports/tr39/proposed.html#C1

You may know this all, but just to be sure.


>
> One linked pair of long-encoded restricted characters in high frequency
> use is U+0E33 THAI CHARACTER SARA AM and U+0EB3 LAO VOWEL SIGN AM,
> which occurs in the extremely common Thai and Lao words for 'water' or
> 'liquid in general' น้ำ ນ້ຳ whose NFKC decompositions are the
> nonsensical forms น้ํา ນ້ໍາ, but may be faked by the linguistically
> incorrect นํ้า ນໍ້າ. In Thai the encodings are <U+0E19 THAI CHARACTER
> NO NU, U+0E49 THAI CHARACTER MAI THO, U+0E33 THAI CHARACTER SARA AM>,
> <U+0E19, U+0E49, U+0E4D THAI CHARACTER NIKHAHIT, U+0E32 THAI CHARACTER
> SARA AA> and <U+0E19, U+0E4D, U+0E49, U+0E32>.

The structure of the data is based on the use of NFKC characters in
identifiers. SARA AM and its Lao equivalent are not NFKC characters, so
they are categorized as such and would need to be represented by their
NFKC forms. The process is described in
http://www.unicode.org/reports/tr39/proposed.html#IDMOD_Data_Collection
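
As a rough illustration of the NFKC point, here is a minimal Python sketch
that uses only the standard unicodedata module. The helper names (nfkc,
code_points) and the constants are made up for this example and are not part
of UTS #39 or its data files; the results depend on the Unicode version that
ships with your Python.

```python
import unicodedata

def nfkc(text):
    """Return the NFKC normalization of text."""
    return unicodedata.normalize("NFKC", text)

def code_points(text):
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

# Hypothetical names for the words discussed in the quoted message.
THAI_WATER = "\u0E19\u0E49\u0E33"        # NO NU, MAI THO, SARA AM
LAO_WATER = "\u0E99\u0EC9\u0EB3"         # LAO NO, TONE MAI THO, VOWEL SIGN AM
THAI_FAKED = "\u0E19\u0E4D\u0E49\u0E32"  # NO NU, NIKHAHIT, MAI THO, SARA AA

for label, word in [("Thai", THAI_WATER),
                    ("Lao", LAO_WATER),
                    ("Thai (faked)", THAI_FAKED)]:
    # SARA AM and LAO VOWEL SIGN AM have compatibility decompositions,
    # so NFKC replaces them; the faked spelling is already in NFKC form.
    print(f"{label}: {code_points(word)} -> {code_points(nfkc(word))}")
```

The first two words change under NFKC (the AM vowel is split into the
nasalization sign plus AA, placed after the tone mark), while the faked
spelling is left unchanged.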

You can see the categorization (for 6.3) for a whole script with a link
like:

http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=thai}

(It only works for 6.3 right now, but these items haven't changed recently.)

> Now, U+0E4D THAI
> CHARACTER NIKHAHIT is classified as 'allowed; recommended', although
> its main use is in writing Pali, which would suggest that it should be
> 'restricted; historic' or 'restricted; limited-use'.

For that, it would be best to submit via
http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a
feedback form at http://www.unicode.org/reporting.html, just to be sure.

> The situation is
> not so clear for Lao
> - U+0ECD LAO NIGGAHITA is a fairly common vowel in the Lao language.
>

Based on your information, the following appear (at least to me) to be
caused by typos in the xidmodifications source files; they are all
marked as 'technical'.

http://unicode.org/cldr/utility/list-unicodeset.jsp?g=identifier-restriction&a=\p{sc=khmer}

Again, best to submit this like above (via
http://www.unicode.org/reports/tr39/proposed.html#Feedback, AND file a
feedback form at http://www.unicode.org/reporting.html).

> To me, a truly bizarre set of 'restricted' characters is U+17CB KHMER
> SIGN BANTOC to U+17D0 KHMER SIGN SAMYOK SANNYA, which are categorised as
> 'restricted; technical'. They are all in use in the Khmer language.
>
> U+17CB KHMER SIGN BANTOC is required for the main methods of writing
> the Khmer vowels /a/ and /ɑ/.
>
> U+17CC KHMER SIGN ROBAT is a repha, but I would be surprised to learn
> that it has recently become little-used. It is, however, readily
> confused with U+17CD KHMER SIGN TOANDAKHIAT, a 'pure killer' whose main
> modern use is to show that a consonant is silent, rather like the Thai
> letter U+0E4C THAI CHARACTER THANTHAKHAT. (The names are the same.)
> The confusion arises because Sanskrit -rCa was pronounced /-r/ in
> Khmer, and final /r/ recently became silent in Khmer, so the effect of
> the Sanskrit /r/ is now to silence the final consonant.
>
> While U+17CE KHMER SIGN KAKABAT and U+17CF KHMER SIGN AHSDA may not be
> common, they are still in modern use.
>
> Although U+17D0 KHMER SIGN SAMYOK SANNYA may have declined in
> frequency, it has not dropped out of use and is still a common enough
> way of writing the vowel /a/.
>
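
For anyone who wants to cross-check the code points and names in the Khmer
range discussed above, a small Python sketch along these lines should do. The
function name describe_range is invented for this example; the character
names come from the unicodedata module bundled with Python.

```python
import unicodedata

def describe_range(first, last):
    """List 'U+XXXX NAME' for each code point from first to last inclusive."""
    return [f"U+{cp:04X} {unicodedata.name(chr(cp))}"
            for cp in range(first, last + 1)]

# U+17CB KHMER SIGN BANTOC through U+17D0 KHMER SIGN SAMYOK SANNYA
for line in describe_range(0x17CB, 0x17D0):
    print(line)
```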

> Richard.
>
> _______________________________________________
> Unicode mailing list
> Unicode_at_unicode.org
> http://unicode.org/mailman/listinfo/unicode
>

Received on Sat May 31 2014 - 14:29:40 CDT
