Re: Text in composed normalized form is king, right? Does anyone generate text in decomposed normalized form?

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sun, 10 Feb 2013 21:13:36 +0000

On Sun, 10 Feb 2013 12:21:05 +0100
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2013/2/7 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> > You said, on 5 February,
> >
> > "A process can be FULLY conforming by preserving the canonical
> > equivalence and treating ALL strings that are canonically
> > equivalent, without having to normalize them in any recommended
> > form, or performing any reordering in its backing store, or it can
> > choose to normalize to any other form that is convenient for that
> > process (so it could be NFC or NFD, or something else)"
> >
> > There's no qualification there disqualifying collation at the
> > secondary level from being a 'process' which may or may not be
> > conforming.
>
> Citing this email, the restriction to primary level was included
> before this sentence, and implied.

The first mention of any restriction to the primary level was in the
paragraph *following* the one I quoted above.

> You just did not quote it along
> with this. Be careful about taking sentences out of their contexts,
> when the whole thread started by speaking about primary level only for
> basic searches.
>
> OK there are some pathological cases but they are really constructed
> and not made for modern languages (except a few Indic ones as you
> noted), but none of them concerns the Latin script (your <TILDE+V>
> example collating like <N> is not an effective true example, it is
> fully constructed and not found in the CLDR).

'Pathological' = not amenable to naïve processing.

Tibetan isn't in the CLDR yet, and several scripts have no
representative language in it yet, even though the default collation
is inappropriate for their major languages. I also note that there is
as yet no Sanskrit collation! In short, CLDR is far from complete so
far as collation is concerned.

The example was <TILDE+v> collating like <nv>.

> If you just consider the initial question, having to decompose letters
> to "recompose" them in defective ways just to create rare single
> collation elements remains a very borderline case for applications
> like browsers that just perform plain-text search at primary level on
> a web page. Even if the implementation really uses a full
> decomposition, I doubt it even has any implemented tailoring that
> would recognize those defective collation elements

You're now making me wonder if Danish <U+0061 LATIN SMALL LETTER A,
U+00E1 LATIN SMALL LETTER A WITH ACUTE> and <U+00E1, U+0061> would get
the correct primary matching! Note that the acute accent serves as
a punctuation mark in Danish.
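
A quick way to check is to ask a collator for a primary-strength
comparison. Below is a minimal Python sketch, assuming the PyICU
bindings are installed and that ICU carries the CLDR Danish tailoring;
if the acute accent really contributes no primary weight, the two
orderings should compare equal:

    from icu import Collator, Locale

    # Danish collator, limited to primary strength (base letters only).
    coll = Collator.createInstance(Locale("da"))
    coll.setStrength(Collator.PRIMARY)

    s1 = "a\u00E1"   # <U+0061, U+00E1>
    s2 = "\u00E1a"   # <U+00E1, U+0061>

    # 0 means "equal at the requested strength"; the acute should be
    # ignorable at the primary level, so 0 is what one would expect.
    print(coll.compare(s1, s2))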

'Defective' collation elements should not be a problem if one can
force decomposition. What seems odd to me is the UCA rule that, for
Danish, the string "aar", composed of collating elements "aa" and "r",
should have a match in "baaar", which consists of collating elements
"b", "aa", "a" and "r", in that order.

There are two problems that NFD addresses - merger of base character
and mark in one character, and the order of combining marks.

For primary collation, merger becomes a problem whenever characters
need to be split between collating elements. In Danish "aaa" is a
problem because one has to choose between collating element sequences
"aa" and "a" on one hand and "a" and "aa" on the other. The issue
becomes clearer when one replaces "aa" with "å", which is only
distinguished at the tertiary level. "aå" is a challenge for
formally correct Danish collation if one does not decompose the
characters. One can solve this particular problem at a formal level by
adding more collating elements; in general, however, the problem cannot
be solved by adding only finitely many more collating elements.
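
As another hedged sketch with PyICU, one can ask whether the tailoring
really keeps "aa" and "å" together below the tertiary level; the
expected results assume the CLDR Danish rules behave as described
above:

    from icu import Collator, Locale

    da = Collator.createInstance(Locale("da"))

    da.setStrength(Collator.SECONDARY)
    print(da.compare("aa", "\u00E5"))   # expected 0: no primary or secondary difference

    da.setStrength(Collator.TERTIARY)
    print(da.compare("aa", "\u00E5"))   # expected non-zero: a tertiary difference only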

Order is a problem when one has collating elements composed of multiple
characters of different non-zero canonical combining classes. In
practice this could be solved by adding more collating elements, but
in theory the number of combinations to be considered could be
unbounded. The UCA defines the interpretation in terms of the NFD
form, and occasionally it is necessary to reduce strings to NFD form to
determine this interpretation. Only having to consider primary weights
can reduce this problem, but it does not always remove it.
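
The reordering half of the problem can be seen with nothing more than
Python's standard unicodedata module: two canonically equivalent
orderings of marks with different combining classes only become
identical code point sequences once both are put into NFD.

    import unicodedata

    # a + COMBINING DOT BELOW (ccc 220) + COMBINING ACUTE ACCENT (ccc 230),
    # and the same two marks typed in the opposite order.
    s1 = "a\u0323\u0301"
    s2 = "a\u0301\u0323"

    print(s1 == s2)                          # False: different code point order
    print(unicodedata.normalize("NFD", s1)
          == unicodedata.normalize("NFD", s2))   # True: canonical reordering

    print(unicodedata.combining("\u0323"),   # 220
          unicodedata.combining("\u0301"))   # 230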

Richard.