Re: IPA and sorting

From: Martin J. Dürst (mduerst@ifi.unizh.ch)
Date: Wed Sep 24 1997 - 08:52:51 EDT


On Tue, 23 Sep 1997, Kenneth Whistler wrote:

> Michael Everson has suggested:
> >
> > In the Standard there are letters, used with the IPA like LATIN SMALL
> > LETTER ALPHA which sorts with LATIN SMALL LETTER A -- but the current
> > mappings to IPA also use GREEK SMALL LETTER BETA as a basic constituent of
> > the IPA.
> >
> > This will cause havoc in sorting -- and one does sort IPA text, in
> > glossaries etc. -- because two scripts are intermixed.

> The problem with this, as for many other "clone a character to make
> the processing for XXX easier" proposals, is that it has a downside--
> how to keep the two different characters straight once they are cloned.

> A preferable solution is to define IPA collation distinctly from
> the default collation for either Latin or Greek. That would allow
> it to be defined more correctly for IPA specifically. This is really
> no different from the special collation overrides required to get
> correct collation for French, Swedish, Japanese, or whatever.
> The default collation rules are just that: default. They don't
> have to be perfect for everything, and in fact cannot be.
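
To make the tailoring idea in the quoted text concrete, here is a
minimal sketch in Python. It is not the Unicode collation algorithm
and not anyone's actual proposal. The table IPA_TAILORING and the
functions default_key and ipa_key are hypothetical names, and the
placement of beta directly after b is an assumption chosen only for
illustration.

```python
# Hypothetical tailoring table: each IPA letter is placed just after
# a Latin base letter. The placements are assumptions, not a standard.
IPA_TAILORING = {
    "\u0251": ("a", 1),  # LATIN SMALL LETTER ALPHA, sorts right after a
    "\u03b2": ("b", 1),  # GREEK SMALL LETTER BETA, sorts right after b
}


def default_key(word):
    """Untailored key: plain code point order, standing in for a default."""
    return tuple((ord(ch), 0) for ch in word)


def ipa_key(word):
    """IPA-specific key: tailored letters sort next to their base letter."""
    key = []
    for ch in word:
        base, offset = IPA_TAILORING.get(ch, (ch, 0))
        key.append((ord(base), offset))
    return tuple(key)


words = ["ba", "\u03b2a", "ca", "\u0251b", "ab"]
by_default = sorted(words, key=default_key)
by_ipa = sorted(words, key=ipa_key)
```

With code point order, the alpha and beta words land after all the
plain Latin ones. With the tailored key they interleave with a and b,
and no character had to be duplicated to get there.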

I think the problem may lie one layer higher. One may want to
sort IPA together with Latin, or as a separate block. This
question usually doesn't arise for e.g. French and Swedish,
i.e. they are sorted together, by whatever rules the viewer
wants. We then get the problem that some characters can be in
more than one block. But I recently met a case where
I realized that we might already have that problem. As
an example, ZWNJ is used in Thai and Khmer to indicate
word breaks. For words and phrases in dictionaries, it is
relevant and has to sort before the other letters. For
Arabic, however, I guess it's irrelevant, because it only
affects presentation.

This means that sorting algorithms of a certain level
of sophistication would have to base block decisions
on strings of characters and not on individual codepoints.
For ZWNJ and Thai/Arabic, that shouldn't be too difficult.
For IPA and Latin, it may still be possible, although there
may be cases where an easy distinction between an almost-
Latin-looking IPA string and a Latin string with some
"exotic" additions for a specific language may not be
possible.
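
A rough Python sketch of such a string-level decision, using the
ZWNJ case described above. The names dominant_script and sort_key
are hypothetical. Guessing the script from Unicode character names
is a crude heuristic assumed here only for illustration, and code
point order stands in for real collation weights.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
SCRIPTS = ("THAI", "KHMER", "ARABIC", "LATIN", "GREEK")


def dominant_script(text):
    """Guess the script of a whole string from its character names."""
    counts = {}
    for ch in text:
        name = unicodedata.name(ch, "")
        for script in SCRIPTS:
            if name.startswith(script):
                counts[script] = counts.get(script, 0) + 1
                break
    if not counts:
        return None
    return max(counts, key=counts.get)


def sort_key(text):
    """Weight ZWNJ by the surrounding string, not by the code point alone."""
    significant = dominant_script(text) in ("THAI", "KHMER")
    key = []
    for ch in text:
        if ch == ZWNJ:
            if significant:
                key.append(0)  # sorts before every letter
            # otherwise ignorable: contributes nothing to the key
        else:
            key.append(ord(ch))
    return tuple(key)


thai = ["\u0e01\u0e02", "\u0e01\u200c\u0e02", "\u0e01\u0e01"]
thai_sorted = sorted(thai, key=sort_key)
```

The same code point thus gets a weight in a Thai string and none in
an Arabic one. The hard part is the one noted above: for IPA versus
Latin, a counting heuristic like dominant_script would often be
unable to tell the two apart.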

So I think that we should rather think about broadening
our sorting model than just duplicating more codepoints.
That some of them are already duplicated may not be
optimal. But in the IPA section, I only saw epsilon;
gamma there looks different from the Greek gamma.
I didn't find Latin alpha; it may be somewhere else,
for a proper language and not (only) for IPA.

Regards, Martin.


