From: Jim Allan (jallan@smrtytrek.com)
Date: Tue May 20 2003 - 01:42:26 EDT
Ken Whistler posted:
> The difficulty for _ae_, which many people who opine about this
> issue tend to overlook, is that the Unicode Standard also
> includes, from Nordic standards, a number of accented _ae_
> characters as precomposed characters. These make the table
> considerably more complicated if the default treatment for
> _ae_ is to weight it as an <a,e> sequence, since you then
> have to figure out what to do with the accented forms, for which
> you have just drained the base character weighting.
>
> In any case, inconsistent as it is for these two characters,
> the allkeys.txt table was constructed as it is for a reason,
> (or several reasons, actually),
> and I'm disinclined to suggest that its handling of _ae_
> and _oe_ should be restructured, since that ripples out to
> cause further destabilization of tailorings based on the
> current values in the table.
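To make the quoted point concrete, here is a small sketch using Python's
standard unicodedata module. It shows that the precomposed accented _ae_
characters decompose canonically to the conjoined _ae_ plus a combining
mark, never to <a,e>, while the conjoined _ae_ itself has no
decomposition. The helper name canonical_decomposition is made up for
this illustration.

```python
import unicodedata

def canonical_decomposition(ch):
    """Return the NFD form of one character as a list of 'U+XXXX' strings."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]

AE = "\u00E6"         # conjoined ae
AE_ACUTE = "\u01FD"   # conjoined ae with acute
AE_MACRON = "\u01E3"  # conjoined ae with macron

for ch in (AE, AE_ACUTE, AE_MACRON):
    # The accented forms yield U+00E6 followed by a combining mark.
    print(f"U+{ord(ch):04X}", "->", canonical_decomposition(ch))
```

So a table that weights the conjoined _ae_ as <a,e> still has to decide
separately how the accented forms, whose base is U+00E6, are weighted.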
The webpage http://www.hum.ku.dk/ami/handbook/chapter3.html presents
examples of various Old Norse letters/ligatures, many with diacritics,
not yet implemented in Unicode, though a proposal is being drawn up (see
http://www.hit.uib.no/mufi).
Among these characters is the conjoined _oe_ character with both a
single acute and a double acute.
There are also other two-character ligature-type combinations with
diacritics centered over the total combined character.
Though not counted as letters in any alphabet, except for the conjoined
_ae_ and conjoined _oe_, they carry diacritics in a manner which
indicates those who penned them considered them single characters.
The easiest answer might be to count all these characters as full
letters *of a kind*, a *kind* of letter not usually recognized as part
of an alphabet, and not decomposable: first because Unicode desires no
more decomposable characters, second because I don't think anyone wants
to have to add rules for determining whether a combining acute accent
following (for example) _aa_ ought to fall between the two letters or
over the last one depending on whether the two characters are ligated.
The characters, like the conjoined _oe_ in current Unicode, might still
be identified as ligatures in their official names.
As for the default sorting of these characters ...
It would seem that there are three different kinds of Latin base letters
of ligature form that can carry diacritics:
1. Base letters that are almost never thought to be anything but a
single letter: _w_ and some linguistic character combinations. These
should sort according to some regular alphabetical sequencing (which for
the linguistic characters is a somewhat arbitrary but reasonable
arrangement first devised in Unicode itself).
2. Base letters like the conjoined _ae_, conjoined _oe_ and U+0223 (in
origin an _ou_ ligature) which in some environments are considered
simple ligatures rather than letters and in some are not, which are
sometimes counted as part of the alphabet by their users and sometimes
are not, and which should sort accordingly. This means that whatever
Unicode decides for the default, tailoring will often be necessary for
individual use.
3. Base letters found in Old Norse which are never counted as part of an
alphabet and sort as though broken into their parts(?). I suppose some
such sort order as _aa_, _aá_, _áa_, conjoined _aa_, conjoined _a´a_
might make sense(?). I think ß might belong to this class.
Something will eventually have to be defined to take account of this. :-0
I hope there are not also cases of conjoined letters where a diacritic
may sometimes be applied to one of the parts of the conjoined letters,
and sometimes to the letter as a whole.
Currently conjoined _ae_ is assigned to class 1 and conjoined _oe_ is
assigned to class 3, allowing class 2 to be omitted.
Class 3 is the difficult one.
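As a rough illustration of the difference between sorting a conjoined
letter as a letter in its own right and sorting it as though broken into
its parts, here is a toy sort key in Python. It is only a sketch under
stated assumptions, not the Unicode Collation Algorithm: the function
name sort_key, the LIGATURE_PARTS table and the three-level key are all
invented for the example, and raw code point order stands in for a real
alphabet.

```python
import unicodedata

# Hypothetical expansion table for this sketch: conjoined ae and oe only.
LIGATURE_PARTS = {"\u00E6": "ae", "\u0153": "oe"}

def sort_key(word, ligatures_as_letters):
    """Toy three-level key: base letters, then ligation, then accents.

    ligatures_as_letters=True  keeps a conjoined letter as one letter
                               (ordered by code point, an assumption).
    ligatures_as_letters=False expands it into its parts and marks the
                               parts as ligated, so the conjoined form
                               sorts after the plain two-letter sequence.
    """
    letters = []  # each entry: [base code, ligated flag, accent codes]
    for ch in unicodedata.normalize("NFD", word.lower()):
        if unicodedata.combining(ch) and letters:
            # Attach a combining mark to the preceding base letter.
            letters[-1][2].append(ord(ch))
        elif ch in LIGATURE_PARTS and not ligatures_as_letters:
            for part in LIGATURE_PARTS[ch]:
                letters.append([ord(part), 1, []])
        else:
            letters.append([ord(ch), 0, []])
    primary = [entry[0] for entry in letters]
    ligated = [entry[1] for entry in letters]
    accents = [tuple(entry[2]) for entry in letters]
    return (primary, ligated, accents)

# Plain ae, ae with acute on either part, conjoined ae, conjoined ae + acute.
samples = ["\u01FD", "ae", "\u00E1e", "\u00E6", "a\u00E9"]
expanded = sorted(samples, key=lambda w: sort_key(w, False))
print(expanded)
```

With expansion turned on, the five samples come out in the order plain
_ae_, _aé_, _áe_, conjoined _ae_, conjoined _ae_ with acute, which
parallels the order suggested above for _aa_. With
ligatures_as_letters=True the conjoined forms instead sort after _z_ in
this toy, simply because of their code points.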
Jim Allan.