Re: Sorting words in latin based languages

From: John Cowan (cowan@locke.ccil.org)
Date: Sat Jan 09 1999 - 14:34:53 EST


William Overington scripsit:

> When wishing to sort Esperanto words alphabetically by software a problem
> thus arises that does not occur with the sorting of English words where the
> numerical order of the numerical values of the code elements is the same
> order as the order of the characters in the alphabet.

English has the same problem: only single-case A-Z (or a-z) sorts correctly
on code values alone; mixed-case text does not. Sorting is inherently
language-specific: in German, A WITH DIAERESIS sorts within a,
whereas in Swedish it sorts after z.
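
To make the point concrete, here is a minimal sketch in Python. The helper
names (german_key, swedish_key) and the word lists are my own illustrations,
and both keys are deliberately simplified: the German key just ignores
diacritics, and the Swedish key ranks letters by a hard-coded alphabet.
Neither is a complete collation for its language.

```python
import unicodedata

# Plain code-point order: every uppercase letter precedes every lowercase one.
english = ["zebra", "Apple", "apple", "Zebra"]
code_point_order = sorted(english)  # ['Apple', 'Zebra', 'apple', 'zebra']


def german_key(word):
    """Simplified German dictionary-style key: a-diaeresis sorts with a."""
    decomposed = unicodedata.normalize("NFD", word.casefold())
    # Drop combining marks so the base letter decides the order.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original word is only a tie-breaker between equal base strings.
    return (base, word)


# Simplified Swedish alphabet: a-ring, a-diaeresis, o-diaeresis come after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"
SWEDISH_RANK = {ch: i for i, ch in enumerate(SWEDISH_ALPHABET)}


def swedish_key(word):
    """Simplified Swedish key: rank each letter by its alphabet position."""
    text = unicodedata.normalize("NFC", word.casefold())
    # Characters outside the alphabet sort after it, by code point.
    return [SWEDISH_RANK.get(ch, len(SWEDISH_ALPHABET) + ord(ch)) for ch in text]


words = ["zon", "\u00e4rt", "arm", "bok"]
german_order = sorted(words, key=german_key)
swedish_order = sorted(words, key=swedish_key)
```

With these keys the same four words come out with the a-diaeresis word
second under the German rule and last under the Swedish rule.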

> A solution to the problem is for an Esperanto word processor, which needs to
> sort words into alphabetical order and to have help files with alphabetical
> lists of topics, to use its own internal code using codes
> from the private use area.
>
> I am using for experimental purposes A is U+e001, B is U+e002, C is U+e003,
> C circumflex is U+e004, D is U+e005 and so on to Z is U+e01c and also a is
> U+e021, b is U+e022, c is U+e023, c circumflex is U+e024, d is U+e025 and so
> on to z is U+e03c.

Of course you are free to do this internally, as long as you translate
to and from Unicode at the boundaries of your application.
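
A sketch of what that boundary translation might look like in Python, using
the private-use assignments quoted above (U+E001 to U+E01C for the 28 capitals,
U+E021 to U+E03C for the small letters). The function names to_internal and
from_internal are hypothetical, and the handling of characters outside the
Esperanto alphabet (passed through unchanged) is my assumption, not part of
the quoted scheme.

```python
import unicodedata

# The 28-letter Esperanto alphabet, in alphabetical order.
UPPER = "ABC\u0108DEFG\u011cH\u0124IJ\u0134KLMNOPRS\u015cTU\u016cVZ"
LOWER = "abc\u0109defg\u011dh\u0125ij\u0135klmnoprs\u015dtu\u016dvz"

# Unicode code point -> private-use code point, per the quoted assignments.
TO_INTERNAL = {ord(ch): 0xE001 + i for i, ch in enumerate(UPPER)}
TO_INTERNAL.update({ord(ch): 0xE021 + i for i, ch in enumerate(LOWER)})
FROM_INTERNAL = {internal: external for external, internal in TO_INTERNAL.items()}


def to_internal(text):
    """Translate incoming Unicode text to the internal private-use codes."""
    # NFC first, so that c + combining circumflex becomes one code point.
    # Characters outside the alphabet (q, w, x, y, spaces) pass through as-is.
    return unicodedata.normalize("NFC", text).translate(TO_INTERNAL)


def from_internal(text):
    """Translate internal codes back to standard Unicode before output."""
    return text.translate(FROM_INTERNAL)


words = ["\u0109evalo", "dento", "cirklo", "citrono"]

# Plain code-point order puts c-circumflex (U+0109) after d.
naive_order = sorted(words)

# Sorting on the internal form gives alphabetical order; only standard
# Unicode strings ever leave the application.
esperanto_order = sorted(words, key=to_internal)
```

Note that under these assignments all capitals still sort before all small
letters, exactly as with A-Z and a-z in ASCII.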

> I wonder if I may put forward a possible solution to the problem for
> discussion? Would it be desirable for the unicode standard to include a
> collection of code elements that are not part of the user space and that
> consists of the basic latin alphabet and all of the latin based accented
> characters placed in alphabetical order with each ordinary letter followed
> by all of the accented versions of that letter in the order that they appear
> in the unicode standard.

This is in effect what Unicode Technical Report #10 is about:
it assigns a 32-bit number to each Unicode character which can be used
to sort it. (This is an oversimplification.)
The resulting tables need to be tailored for language-specific issues
such as Swedish A WITH DIAERESIS or Traditional Spanish "ch" (which sorts
as a single letter after "c" and before "d").
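
The idea of a weight table plus tailoring can be sketched in a few lines of
Python. This is a toy, not the algorithm or the weights from the Technical
Report: it has a single level of weights, covers only a-z, and the names
build_table and sort_key are hypothetical. It only shows how a tailored
table can treat "ch" as one sorting unit.

```python
import string


def build_table(elements):
    """Assign increasing weights to sorting units in the order given."""
    return {element: weight for weight, element in enumerate(elements, start=1)}


# Untailored order: plain a-z.
default_order = list(string.ascii_lowercase)

# Tailoring for Traditional Spanish: "ch" is a unit of its own,
# placed after "c" and before "d".
spanish_order = list(default_order)
spanish_order.insert(spanish_order.index("d"), "ch")

DEFAULT_TABLE = build_table(default_order)
SPANISH_TABLE = build_table(spanish_order)


def sort_key(word, table):
    """Turn a word into a list of weights, matching the longest unit first."""
    longest = max(len(element) for element in table)
    text = word.casefold()
    key = []
    i = 0
    while i < len(text):
        for size in range(longest, 0, -1):
            chunk = text[i:i + size]
            if chunk in table:
                key.append(table[chunk])
                i += len(chunk)
                break
        else:
            # Unknown character: sort after every table entry, by code point.
            key.append(len(table) + 1 + ord(text[i]))
            i += 1
    return key


words = ["cuna", "chico", "dama", "casa"]
default_sorted = sorted(words, key=lambda w: sort_key(w, DEFAULT_TABLE))
spanish_sorted = sorted(words, key=lambda w: sort_key(w, SPANISH_TABLE))
```

With the default table "chico" falls between "casa" and "cuna"; with the
tailored table it moves after "cuna" and before "dama".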

-- 
John Cowan					cowan@ccil.org
		e'osai ko sarji la lojban.


