Re: Discontiguous Collation Grapheme Clusters

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 28 May 2012 05:07:28 +0200

UTS #18 is really a mess about collation clusters. But remember that
collation elements are specific to each language for which they are
defined (including the "root" locale, which acts as a pseudo-language
serving only as a default for all languages that have no specific
collation rules for many characters, and which is defined ONLY from
the *core* character properties in the UCD, including the canonical
equivalences).
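
To make that distinction concrete, here is a small, untested ICU4J
sketch (the class name and the Swedish example are mine, not anything
from the standard) showing how a language tailoring overrides the root
order:

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.util.ULocale;

    public class TailoringDemo {
        public static void main(String[] args) {
            // Root collation: driven only by the default (CLDR root) data.
            Collator root = Collator.getInstance(ULocale.ROOT);
            // Swedish tailoring: language-specific rules on top of the root.
            Collator swedish = Collator.getInstance(new ULocale("sv"));

            // In the root order "ö" is just a variant of "o", so it sorts
            // before "z"; the Swedish tailoring moves it after "z".
            System.out.println(root.compare("\u00F6l", "zebra"));    // expected < 0
            System.out.println(swedish.compare("\u00F6l", "zebra")); // expected > 0
        }
    }

The exact values depend on the CLDR data version in use, but the point
stands: only the root ordering is derived purely from the UCD;
everything else is a tailoring.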

I do think that all the inconsistencies that arise between source
strings in NFC, NFD, or FCD should be avoided: they should ALWAYS
yield canonically equivalent collation elements. Simplifications that
break those canonical equivalences purely for performance reasons are
not justified: there are ways to keep most optimizations without
breaking the equivalences, even if this slows processing slightly for
cases that are rare in practice.
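
ICU already exposes a switch for exactly this trade-off: a collator can
be asked to decompose its input fully instead of assuming it is FCD, at
a small cost. A rough, untested ICU4J sketch (the class name and sample
strings are mine):

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.util.ULocale;

    public class NormalizationSwitch {
        public static void main(String[] args) {
            Collator c = Collator.getInstance(ULocale.ENGLISH);
            // Request full canonical decomposition of the input instead of
            // assuming it is already FCD, so that canonically equivalent
            // spellings cannot drift apart.
            c.setDecomposition(Collator.CANONICAL_DECOMPOSITION);

            String precomposed = "\u0F40\u0F75";        // KA + vowel sign UU
            String decomposed  = "\u0F40\u0F71\u0F74";  // KA + AA + U (its NFD)
            // Canonically equivalent inputs should now compare equal.
            System.out.println(c.compare(precomposed, decomposed)); // expected 0
        }
    }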

Let's return to the core definition of the UCA, which is specified as
if all source strings were first converted to NFD. Any implementation
that tries to avoid the conversion to NFD has to make sure that it
still returns collation elements that are canonically equivalent (even
if their encoding differs).
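
That guarantee is easy to test mechanically. A small, untested ICU4J
sketch (class and method names are mine) that compares the sort key of
a string with the sort key of its NFD form, using two of the Tibetan
strings from the quoted message below:

    import com.ibm.icu.text.Collator;
    import com.ibm.icu.text.Normalizer2;
    import com.ibm.icu.util.ULocale;

    public class NfdEquivalenceCheck {
        // True if the collator gives the string and its NFD form the same
        // sort key, i.e. the canonical-equivalence guarantee holds for it.
        static boolean stableUnderNFD(Collator c, String s) {
            Normalizer2 nfd = Normalizer2.getNFDInstance();
            byte[] asIs  = c.getCollationKey(s).toByteArray();
            byte[] inNfd = c.getCollationKey(nfd.normalize(s)).toByteArray();
            return java.util.Arrays.equals(asIs, inNfd);
        }

        public static void main(String[] args) {
            Collator c = Collator.getInstance(ULocale.ENGLISH);
            // Two of the Tibetan test strings from the quoted message.
            String[] samples = {
                "\u0F40\u0F71\u0F7A\u0F74\u0F0B",
                "\u0F40\u0FB2\u0F75\u0F0B",
            };
            for (String s : samples) {
                // Prints "true" for each string if the implementation
                // honours canonical equivalence.
                System.out.println(stableUnderNFD(c, s));
            }
        }
    }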

Otherwise the UCA will remain a broken algorithm that was prematurely
promoted to a standard. For now it is just a "best effort" algorithm,
which should not be considered a UTS.

All the current issues are to be treated as implementation bugs that
do NOT conform to the Unicode standard, and they must be corrected
there. Full stop! This means that the work should be done in ICU, but
the core definition of the UCA should *not* be simplified in such a
way that it becomes impossible for a working collation algorithm to be
a Unicode-conforming process: those simplifications (which are
self-contradictory in many cases) should be removed; they are not
necessary, and they create confusion as well as interoperability
issues between implementations.

2012/5/28 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> I'm currently reviewing the definition of the Unicode
> Collation Algorithm (as opposed to just trying to comply with it),
> and I came across the concept of collation grapheme clusters, defined in
> UTS#18 'Unicode Regular Expressions'.
>
> For what types of strings are they supposed to be defined?  Any?  NFC?
> NFD?  FCD?  ASCII?
>
> In the English locale (CLDR), what collation clusters does the Tibetan
> script NFC & NFD string <U+0F40 KA, U+0F71 AA, U+0F7A E, U+0F74 U,
> U+0F0B TIBETAN MARK INTERSYLLABIC TSHEG> consist of?  If I assume that
> the variable weight setting of IgnoreSP does apply, I end up with the
> 2 clusters <U+0F40>, <U+0F71, U+0F7A, U+0F74, U+0F0B> if I apply the
> definition given in UTS#10 Version 6.1.0 Section 6.9.1.  If I apply the
> sample code given in UTS#18 Revision 13 Annex B iteratively, I get
> the 4 clusters <U+0F40>, <U+0F71>, <U+0F7A>, <U+0F74, U+0F0B>.  The
> collation look-ups contributing to the collation of the string are for
> <U+0F40>, <U+0F71, U+0F74>, <U+0F7A>, <U+0F0B>.
>
> If I apply the algorithms to the canonically equivalent <U+0F40,
> U+0F71, U+0F74, U+0F7A, U+0F0B>, both definitions yield the 3 clusters
> <U+0F40>, <U+0F71, U+0F74>, <U+0F7A, U+0F0B>, which, apart from TSHEG
> not being in a collation cluster of its own, makes sense.
>
> If I apply the algorithms to the FCD string <U+0F40 KA, U+0FB2
> SUBJOINED-RA, U+0F75 UU, U+0F0B> in the English locale (CLDR based on
> UCA Version 6.1.0), I don't know what to expect from a *compliant*
> implementation, as collation elements should be formed from U+0FB2 and
> *part* of U+0F75.  If I turn to the textual definition in UTS#18 ('A
> collation character is the longest sequence of characters that maps to
> a sequence of one or more collation elements where the first collation
> element has a primary weight and subsequent elements do not, and no
> completely ignorable characters are included.'), I get 3 clusters,
> <U+0F40>, <U+0FB2>, <U+0F75, U+0F0B>, which is reasonable
> linguistically.
>
> If I apply the algorithms to the canonically equivalent NFC & NFD
> string <U+0F40, U+0FB2, U+0F71, U+0F74, U+0F0B>, I currently get 3
> collation clusters <U+0F40>, <U+0FB2, U+0F71>, <U+0F74, U+0F0B>.
> However, the second cluster has two collation elements, both with
> primary weights, so by the textual specification I get 3
> collation clusters <U+0F40>, <U+0FB2>, <U+0F71, U+0F74, U+0F0B>, which
> is reasonable linguistically, but is only a reasonable result because
> the contraction <U+0FB2,U+0F71> (not yet in DUCET) is artificial.
>
> The textual definition does not explain how to handle completely
> ignorable characters and also appears to be unable to find a collation
> cluster in <U+2122 TRADE MARK SIGN>, which yields two collation
> elements with primary weights.  Are there two clusters here, one for
> the 'T' and one for the 'M'?
>
> So, what collation clusters are these strings composed of?  Does anyone
> have a software implementation that yields them?
>
> The strings were:
>
> 0F40 0F71 0F7A 0F74 0F0B
> 0F40 0F71 0F74 0F7A 0F0B
> 0F40 0FB2 0F75 0F0B
> 0F40 0FB2 0F71 0F74 0F0B
> 2122
>
> Another little gem is that when the Hebrew accent 'METEG' is coded
> between the consonants and the vowel (as in the second word of Exodus
> 20:4 in the Leningrad codex), one gets one collation cluster for the
> consonant, one for the METEG, one for the CGJ, and the lonely vowel is
> shunted off into a collation cluster with the next vowel.  (See
> http://scripts.sil.org/cms/scripts/page.php?item_id=Meteg_intheBHS if
> you don't have the BHS to hand.)
>
> Bemusedly,
>
> Richard.
>