RE: Sorting words in latin based languages

From: Addison Phillips (AddisonP@simultrans.com)
Date: Sat Jan 09 1999 - 14:55:29 EST


Even English is not exempt from non-ASCII sorting issues.

Consider the word résumé or the loan phrase vis-à-vis. Both are correct in
English with the accents.

There is also the problem of non-alpha marks (punctuation for example).

Sorting is a highly complex, regionally and culturally affected problem in
ANY language (of course Esperanto, as a created language, may avoid aspects
of this problem). We've been busy here this week writing a French sorting
algorithm for an embedded application, so I am amused at the
synchronicity of this: in case I had forgotten, proper, linguistically aware
sorting still incurs a (non-trivial) performance penalty simply because the
rules are more complex than you might suspect... ;-)

Thanks,

Addison

-----Original Message-----
From: John Cowan [mailto:cowan@locke.ccil.org]
Sent: Saturday, January 9, 1999 10:50
To: Unicode List
Subject: Re: Sorting words in latin based languages

William Overington scripsit:

> When wishing to sort Esperanto words alphabetically by software a problem
> thus arises that does not occur with the sorting of English words where the
> numerical order of the numerical values of the code elements is the same
> order as the order of the characters in the alphabet.

English has the same problems: only one-case A-Z (or a-z) sorts correctly
automatically based on the code values. Sorting is inherently
language-specific: in German, A WITH DIAERESIS sorts within a,
whereas in Swedish it sorts after z.
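
To illustrate the point, here is a minimal Python sketch comparing raw
code-value order with two language-specific orderings. The functions
german_key and swedish_key are hypothetical helpers written for this example;
they are hand-made simplifications, not real collation tables, and they only
cover the sample words.

```python
import unicodedata

words = ["Zebra", "apple", "Äpfel", "zoo", "Apfel", "banana"]

# Plain sorted() compares code point values: uppercase A-Z first,
# then lowercase a-z, then U+00C4 (A WITH DIAERESIS) after everything.
by_code_value = sorted(words)

def german_key(word):
    # Assumed simplification of German dictionary order: a-diaeresis is
    # treated as plain a, case is ignored, and the original spelling is
    # used only to break ties.
    folded = unicodedata.normalize("NFC", word).lower().replace("ä", "a")
    return (folded, word)

def swedish_key(word):
    # Assumed simplification of Swedish order: a-ring, a-diaeresis and
    # o-diaeresis are separate letters placed after z.
    # Raises ValueError for characters outside this small alphabet.
    alphabet = "abcdefghijklmnopqrstuvwxyzåäö"
    folded = unicodedata.normalize("NFC", word).lower()
    return [alphabet.index(ch) for ch in folded]

by_german = sorted(words, key=german_key)
by_swedish = sorted(words, key=swedish_key)

print(by_code_value)  # ['Apfel', 'Zebra', 'apple', 'banana', 'zoo', 'Äpfel']
print(by_german)      # ['Apfel', 'Äpfel', 'apple', 'banana', 'Zebra', 'zoo']
print(by_swedish)     # ['Apfel', 'apple', 'banana', 'Zebra', 'zoo', 'Äpfel']
```

The same six words come out in three different orders, and only the last two
would look right to a reader of the respective language.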

> A solution to the problem is for an Esperanto word processor which needs to
> sort words into alphabetical order and to have help files with alphabetical
> lists of topics could conveniently use its own internal code using codes
> from the private use area.
>
> I am using for experimental purposes A is U+e001, B is U+e002, C is U+e003,
> C circumflex is U+e004, D is U+e005 and so on to Z is U+e01c and also a is
> U+e021, b is U+e022, c is U+e023, c circumflex is U+e024, d is U+e025 and so
> on to z is U+e03c.

Of course you are free to do this internally, as long as you translate
to and from Unicode at the boundaries of your application.
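
A rough Python sketch of what "translate at the boundaries" could look like,
using the private-use assignments William describes (A at U+E001, a at
U+E021, through Z at U+E01C and z at U+E03C). The names to_internal,
from_internal and sort_esperanto are invented for this illustration, and the
sketch assumes precomposed (NFC) input.

```python
# Esperanto alphabet in dictionary order (28 letters):
# a b c c-circumflex d e f g g-circumflex h h-circumflex i j j-circumflex
# k l m n o p r s s-circumflex t u u-breve v z
LOWER = "abc\u0109defg\u011dh\u0125ij\u0135klmnoprs\u015dtu\u016dvz"
UPPER = LOWER.upper()

# Build the private-use mapping: uppercase from U+E001, lowercase from U+E021.
TO_INTERNAL = {}
for offset, (upper, lower) in enumerate(zip(UPPER, LOWER)):
    TO_INTERNAL[upper] = chr(0xE001 + offset)
    TO_INTERNAL[lower] = chr(0xE021 + offset)
FROM_INTERNAL = {code: letter for letter, code in TO_INTERNAL.items()}

def to_internal(text):
    # Characters outside the Esperanto alphabet pass through unchanged.
    return "".join(TO_INTERNAL.get(ch, ch) for ch in text)

def from_internal(text):
    # Translate back to standard Unicode before text leaves the application.
    return "".join(FROM_INTERNAL.get(ch, ch) for ch in text)

def sort_esperanto(words):
    # Sort on the internal codes, but return ordinary Unicode strings.
    internal = sorted(to_internal(word) for word in words)
    return [from_internal(word) for word in internal]

words = ["\u0109u", "dek", "cent", "zorgi", "\u015dafo", "salo"]
print(sort_esperanto(words))
```

Note that this scheme inherits the case problem mentioned earlier: every
uppercase letter still sorts before every lowercase one, because the two
ranges are disjoint.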

> I wonder if I may put forward a possible solution to the problem for
> discussion? Would it be desirable for the unicode standard to include a
> collection of code elements that are not part of the user space and that
> consists of the basic latin alphabet and all of the latin based accented
> characters placed in alphabetical order with each ordinary letter followed
> by all of the accented versions of that letter in the order that they appear
> in the unicode standard.

This is in effect what Unicode Technical Report #10 is about:
it assigns a 32-bit number to each Unicode character which can be used
to sort it. (This is an oversimplification.)
The resulting tables need to be tailored for language-specific issues
like Swedish A WITH DIAERESIS or Traditional Spanish "ch" (sorts as a
single letter after "c" and before "d").
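
As a very rough illustration of the table-plus-tailoring idea, the Python
sketch below builds sort keys from an ordered list of collation elements and
then tailors that list for Traditional Spanish by inserting "ch" after "c".
This is not the algorithm from the Technical Report: make_key is a
hypothetical helper, the weights are simply list positions, and there is only
one comparison level.

```python
def make_key(elements):
    # elements: collation elements in the desired order; an element may be
    # longer than one character (a contraction such as "ch").
    weight = {element: rank for rank, element in enumerate(elements)}
    longest = max(len(element) for element in elements)

    def key(word):
        word = word.lower()
        weights, i = [], 0
        while i < len(word):
            # Try the longest candidate first so "ch" wins over "c" + "h".
            for size in range(longest, 0, -1):
                chunk = word[i:i + size]
                if chunk in weight:
                    weights.append(weight[chunk])
                    i += len(chunk)
                    break
            else:
                # Characters missing from the table sort after all entries.
                weights.append(len(weight) + ord(word[i]))
                i += 1
        return weights

    return key

base = list("abcdefghijklmnopqrstuvwxyz")
default_key = make_key(base)

# Tailoring: "ch" becomes its own element between "c" and "d".
traditional_spanish = base[:3] + ["ch"] + base[3:]
spanish_key = make_key(traditional_spanish)

words = ["dama", "chico", "cuna", "cama"]
print(sorted(words, key=default_key))  # ['cama', 'chico', 'cuna', 'dama']
print(sorted(words, key=spanish_key))  # ['cama', 'cuna', 'chico', 'dama']
```

The default table and the tailored one share all the code; only the ordered
list of elements differs, which is roughly the appeal of a tailorable table.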

--
John Cowan					cowan@ccil.org
		e'osai ko sarji la lojban.


