Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form? from Richard Wordingham on 2013-02-07 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 7 Feb 2013 22:13:35 +0000

On Wed, 6 Feb 2013 10:18:33 +0100
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2013/2/5 Richard Wordingham <richard.wordingham_at_ntlworld.com>:

> > Try doing UCA collation with <U+0302 COMBINING CIRCUMFLEX ACCENT,
> > U+0067 LATIN SMALL LETTER G> being a collation element (with
> > arbitrary collation elements) without doing normalisation.
>
> <0302, 0067> is defective, and its normalisation is still <0302,
> 0067>, it is NOT canonically equivalent to <0067, 0302>
>
> I was not speaking about arbitrary collation elements containing
> defective sequences, is it a real case?

This wasn't, but the mediaeval use of tilde to abbreviate a nasal
consonant comes tantalisingly close. The CLDR collation has entries for
<U+0C82 KANNADA SIGN ANUSVARA, U+0C95 KANNADA LETTER KA> (a
defective string) and other combinations making anusvara almost
equivalent to the homorganic nasal. The European analogue would be to
make <U+0303 COMBINING TILDE, U+0076 LATIN SMALL LETTER V> sort almost
the same as <U+006E LATIN SMALL LETTER N, U+0076>, and then a repeated
sequence of instances of U+1E7D LATIN SMALL LETTER V WITH TILDE would
require canonical decomposition to collate in accordance with the rules.
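
A minimal sketch of that last point, in Python with the standard
unicodedata module. The contraction <U+0303, U+0076> is the hypothetical
European analogue described above, not an entry taken from CLDR, and
contraction_offsets is a made-up helper that only does a plain substring
search; it is not a UCA implementation.

```python
import unicodedata

# Hypothetical tailoring: treat <U+0303 COMBINING TILDE, U+0076 v> as one
# collation element (a contraction). This is an assumption for illustration,
# not an entry taken from CLDR.
CONTRACTION = "\u0303\u0076"

def contraction_offsets(text: str) -> list[int]:
    """Return every offset at which CONTRACTION starts in text."""
    offsets = []
    start = text.find(CONTRACTION)
    while start != -1:
        offsets.append(start)
        start = text.find(CONTRACTION, start + 1)
    return offsets

# Two instances of U+1E7D LATIN SMALL LETTER V WITH TILDE.
precomposed = "\u1e7d\u1e7d"
decomposed = unicodedata.normalize("NFD", precomposed)  # v, tilde, v, tilde

raw_hits = contraction_offsets(precomposed)  # no match in precomposed text
nfd_hits = contraction_offsets(decomposed)   # match spans the two letters
print(raw_hits, nfd_hits)
```

In the precomposed string the tilde never appears as a separate code
point, so the contraction cannot be found; after NFD it shows up
straddling the boundary between the two original characters.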

I've already mentioned Burmese as having defective sequences containing
letters (category L). There is a third language in CLDR having such
sequences, but these collating elements are only to support mistypings
of U+0E33 THAI CHARACTER SARA AM.

Richard.
Received on Thu Feb 07 2013 - 16:21:59 CST
