Sorting words in Latin-based languages

From: William Overington (inventor@ngo.globalnet.co.uk)
Date: Fri Jan 08 1999 - 06:55:02 EST


The alphabet used by the Esperanto language has 28 letters, namely, in
order, a, b, c, c circumflex, d, e, f, g, g circumflex, h, h circumflex, i,
j, j circumflex, k, l, m, n, o, p, r, s, s circumflex, t, u, u breve, v, z.

All of these characters can be encoded in Unicode so as to produce
displayable text.

However, the numerical order of the code values is not the same as the
order of the characters in the alphabet.

A problem thus arises when software sorts Esperanto words alphabetically.
It does not arise when sorting English words, where, within a single case,
the numerical order of the code values is the same as the order of the
characters in the alphabet.
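
As a minimal illustration of the mismatch, here is a short Python sketch.
The word list is made up for the example; the only thing it relies on is
that Python compares strings by Unicode code point.

```python
# Esperanto words, chosen only for illustration.
words = ["ĉevalo", "dato", "cirklo", "zorgo", "ŝipo", "suno"]

# Python's default string comparison is by Unicode code point.
by_code_point = sorted(words)

# ĉ is U+0109 and ŝ is U+015D, both above z (U+007A), so the words
# beginning with those letters land after "zorgo".
print(by_code_point)
# ['cirklo', 'dato', 'suno', 'zorgo', 'ĉevalo', 'ŝipo']
```

In the Esperanto alphabet the expected order would instead be cirklo,
ĉevalo, dato, suno, ŝipo, zorgo.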

One solution is for an Esperanto word processor, which needs to sort words
into alphabetical order and to provide help files with alphabetical lists
of topics, to use its own internal code, with code values taken from the
private use area.

For experimental purposes I am using A as U+E001, B as U+E002, C as U+E003,
C circumflex as U+E004, D as U+E005 and so on up to Z as U+E01C, and also a
as U+E021, b as U+E022, c as U+E023, c circumflex as U+E024, d as U+E025 and
so on up to z as U+E03C.
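
The experimental coding above can be sketched in Python roughly as follows.
This is only an illustration under stated assumptions: the names TO_PUA,
FROM_PUA, encode, decode and sort_esperanto are invented for the example,
input is assumed to use (or be normalizable to) precomposed letters, and
characters outside the Esperanto alphabet are simply passed through
unchanged.

```python
import unicodedata

# The 28 letters in alphabetical order, normalized so that each accented
# letter is a single precomposed code point.
UPPER = unicodedata.normalize("NFC", "ABCĈDEFGĜHĤIJĴKLMNOPRSŜTUŬVZ")
LOWER = unicodedata.normalize("NFC", "abcĉdefgĝhĥijĵklmnoprsŝtuŭvz")

# A..Z -> U+E001..U+E01C, a..z -> U+E021..U+E03C (the coding in the text).
TO_PUA = {ch: chr(0xE001 + i) for i, ch in enumerate(UPPER)}
TO_PUA.update({ch: chr(0xE021 + i) for i, ch in enumerate(LOWER)})
FROM_PUA = {pua: ch for ch, pua in TO_PUA.items()}


def encode(text):
    """Convert ordinary Unicode text to the internal private-use coding."""
    text = unicodedata.normalize("NFC", text)
    # Characters not in the table (digits, spaces, q, w, x, y) are kept.
    return "".join(TO_PUA.get(ch, ch) for ch in text)


def decode(text):
    """Convert the internal coding back to ordinary Unicode text."""
    return "".join(FROM_PUA.get(ch, ch) for ch in text)


def sort_esperanto(words):
    """Sort by internal code value, then convert back for display."""
    return [decode(w) for w in sorted(encode(w) for w in words)]


print(sort_esperanto(["ĉevalo", "dato", "cirklo", "zorgo", "ŝipo", "suno"]))
# ['cirklo', 'ĉevalo', 'dato', 'suno', 'ŝipo', 'zorgo']
```

With this coding a plain comparison of code values gives the alphabetical
order, at the cost of converting text on the way in and on the way out.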

It occurs to me that a standardized coding for Esperanto, be it this one or
some other, would be a very useful facility for software that involves
Esperanto. The document "The Unicode Standard: A Technical Introduction"
includes the following paragraph.

Quote

A range of code values are reserved as user space. These code values have no
universal meaning, and may be used for characters specific to a program or
by a group of users for their own purposes. For example, a group of
choreographers may design a set of characters for dance notation and encode
the characters using code values in user space. A set of page-layout
programs may use the same code values as control codes to position text on
the page. The main point of user space is that the Unicode Standard assigns
no meaning to these code values, and reserves them as user space, promising
never to assign them meaning in the future.

End of quote.

What is the policy if the group of choreographers wish to publish their set
of characters so that it becomes a standard, either a formal standard or a
de facto standard?

What is the policy if a group of people interested in Esperanto wish to
publish their set of characters so that it becomes a standard, either a
formal standard or a de facto standard?

The two cases are not identical, because the Esperanto language can already
be encoded using Unicode characters.

It may also be the case that alphabetical sorting of words would be a useful
facility in each of the many Latin-based languages that use one or more
accented characters. This could then lead to a large number of private code
tables, one for each language, each widely used as a sort of standard for
word processing in that language.

May I put forward a possible solution to the problem for discussion? Would
it be desirable for the Unicode Standard to include a collection of code
elements, not part of the user space, consisting of the basic Latin
alphabet and all of the Latin-based accented characters placed in
alphabetical order, with each ordinary letter followed by all of the
accented versions of that letter in the order in which they appear in the
Unicode Standard? Authors of software using Latin-based characters could
then use that encoding within the software if they wished. Whether files
should be saved to disc using the internal code, or whether text should be
converted back to the present codings of the characters, would be a matter
for discussion.

Naturally, a decision would need to be made as to where to put the codes for
numerals, punctuation and brackets. They could perhaps be placed in the
same positions relative to the uppercase and lowercase letters and to each
other as they occupy in present Unicode, so that sorting lists involving
numerals and punctuation would produce the same result for English
regardless of whether the sorting was done with the basic ASCII codes or
with such a comprehensive table of code elements.
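
The ordering suggested above, each ordinary letter followed by its accented
versions in the order of their existing code values, can be sketched in
Python as follows. Several things here are assumptions made only for the
sketch: it covers lowercase letters only, it looks only at the range U+00C0
to U+017F, it treats a character as an "accented version" of a letter when
its canonical decomposition begins with that letter (so letters such as ø
or đ, which have no such decomposition, are left out), and it uses position
in the table as a sort key rather than as a new encoding. The names
accented_variants, build_table, RANK and sort_key are invented for the
example.

```python
import unicodedata


def accented_variants(base, first=0x00C0, last=0x017F):
    """Accented forms of `base` in the given range, in code-value order."""
    variants = []
    for cp in range(first, last + 1):
        ch = chr(cp)
        decomposed = unicodedata.normalize("NFD", ch)
        # Keep characters that decompose into `base` plus combining marks.
        if len(decomposed) > 1 and decomposed[0] == base:
            variants.append(ch)
    return variants


def build_table(letters="abcdefghijklmnopqrstuvwxyz"):
    """Each plain letter followed immediately by its accented versions."""
    table = []
    for base in letters:
        table.append(base)
        table.extend(accented_variants(base))
    return table


TABLE = build_table()
RANK = {ch: i for i, ch in enumerate(TABLE)}


def sort_key(word):
    # Characters missing from the table sort after everything in it.
    return [RANK.get(ch, len(TABLE) + ord(ch)) for ch in word]


words = ["ĉevalo", "dato", "cirklo", "zorgo", "ŝipo", "suno"]
print(sorted(words, key=sort_key))
# ['cirklo', 'ĉevalo', 'dato', 'suno', 'ŝipo', 'zorgo']
```

One table of this kind serves Esperanto and other languages at once, though
it cannot reproduce alphabets in which an accented letter is sorted
somewhere other than directly after its base letter.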

William Overington
8 January 1999


