Re: Can NFKC turn valid UAX 31 identifiers into non-identifiers?

From: Joan Montané via Unicode <unicode_at_unicode.org>
Date: Thu, 7 Jun 2018 13:32:13 +0200

2018-06-04 21:49 GMT+02:00 Manish Goregaokar via Unicode <
unicode_at_unicode.org>:

> Hi,
>
> The Rust community is considering
> <https://github.com/rust-lang/rfcs/pull/2457> adding non-ascii
> identifiers, which follow UAX #31 <http://www.unicode.org/reports/tr31/>
> (XID_Start XID_Continue*, with tweaks). The proposal also asks for
> identifiers to be treated as equivalent under NFKC.
>
> Are there any cases where this will lead to inconsistencies? I.e. can the
> NFKC of a valid UAX 31 ident be invalid UAX 31?
>

Yes, such a case exists, for instance in the Latin script, for the Catalan language.

* Ŀ, LATIN CAPITAL LETTER L WITH MIDDLE DOT (U+013F), decomposes under NFKC to
LATIN CAPITAL LETTER L (U+004C) MIDDLE DOT (U+00B7): <L,·>
* ŀ, LATIN SMALL LETTER L WITH MIDDLE DOT (U+0140), decomposes under NFKC to
LATIN SMALL LETTER L (U+006C) MIDDLE DOT (U+00B7): <l,·>
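
The two decompositions can be checked with a small sketch like the one
below. It assumes Python and its standard unicodedata module; the helper
name nfkc_codepoints is made up for illustration. It only shows what NFKC
produces. Whether the result is still accepted as an identifier depends on
whether the identifier profile in use admits U+00B7.

```python
import unicodedata

def nfkc_codepoints(text):
    """Return the code points of the NFKC form of `text` as 'U+XXXX' strings.

    Hypothetical helper, for illustration only.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return [f"U+{ord(ch):04X}" for ch in normalized]

# Print each precomposed letter next to the sequence NFKC turns it into.
for ch in ("\u013F", "\u0140"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), "->", nfkc_codepoints(ch))
```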

Ŀ and ŀ are (were) used in Catalan to encode the geminate L [1]
when it is (was) encoded using only 2 characters. The preferred (and commonly
used) encoding is currently the 3-character one: <L,·,L>. So some adjustments
are needed if you want to support Catalan identifiers [2].
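
As a rough illustration of why this matters for NFKC-equivalent
identifiers, the sketch below compares the 2-character and 3-character
spellings of the same Catalan word. It assumes Python's unicodedata module;
the function name same_under_nfkc and the sample word are my own choices,
not taken from the RFC.

```python
import unicodedata

def same_under_nfkc(a, b):
    """True if `a` and `b` have the same NFKC form (hypothetical helper)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

two_char = "co\u0140lecci\u00F3"      # geminate L written as <ŀ, l>
three_char = "col\u00B7lecci\u00F3"   # geminate L written as <l, ·, l>

# The raw strings differ, but their NFKC forms can be compared directly.
print(two_char == three_char)
print(same_under_nfkc(two_char, three_char))
```

So under NFKC both spellings collapse to the one containing U+00B7, which
means the middle dot has to be usable inside an identifier for either
spelling to work.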

Yours,
Joan Montané

[1] https://en.wikipedia.org/wiki/Interpunct#Catalan
[2] http://www.unicode.org/reports/tr31/#Specific_Character_Adjustments
Received on Thu Jun 07 2018 - 06:32:37 CDT
