From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Fri Jun 09 2006 - 10:12:47 CDT
Theodore H. Smith wrote on Monday, June 05, 2006 at 5:43 PM
and I replied the same day, but the reply seems to have vanished, so I'm 
reposting.
> 3) Each unique glyph, has one and only sequence of codepoints in NFD. This 
> is a very good thing! Because it makes processing Unicode start  to 
> resemble sanity :) To reorder the combiners whose order doesn't  mater, we 
> just use their combining class number!
Not quite true, alas, but it's mostly true.  Most of the exceptions within a
script are where different characters have the same glyph, such as the
letter C and the Roman numeral for 100.  There are a few cases in Indic
scripts where normalisation stability prevents the solution of canonical
equivalence being applied, and there are some irremediable cases.
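Both halves of this are easy to check.  Here is a minimal sketch, assuming Python 3 and its standard unicodedata module (my choice of tool, nothing to do with your code); the sample strings are my own.  It shows NFD reordering two marks by combining class, and NFD leaving the letter C and the Roman numeral for 100 as distinct characters.

```python
import unicodedata

def nfd(s):
    """Return the canonical decomposition (NFD) of s."""
    return unicodedata.normalize("NFD", s)

# Two marks with different combining classes: the order in which they were
# typed does not matter, because NFD sorts them by combining class.
dot_above_first = "q\u0307\u0323"   # U+0307 is class 230, U+0323 is class 220
dot_below_first = "q\u0323\u0307"
same_after_nfd = nfd(dot_above_first) == nfd(dot_below_first)

# Same glyph, different characters: NFD keeps these apart.
latin_c = "\u0043"      # LATIN CAPITAL LETTER C
roman_100 = "\u216D"    # ROMAN NUMERAL ONE HUNDRED
distinct_after_nfd = nfd(latin_c) != nfd(roman_100)

print(same_after_nfd, distinct_after_nfd)
```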
> I should have read the entire  SpecialCasing.txt file manually to see what 
> it says before hoping my  code will generate the right results from using 
> it :)
Have you read the TUS discussion of casing?  It starts at Section 3.13.  It's
a bit uneven - the standard has clearly developed over time.
> I'll fix my code to handle that funny iota-subscript character,  probably 
> by using some kind of NFD code.
> Your uppercasing and underlining example makes me think. Is it true  that 
> this "combiner uppercasing to a non-combiner", character, the  iota 
> subscript, can cause many problems all over Unicode, by it's  very unusual 
> behaviour?
I'm not aware of any problems apart from casing.  However, I think you've
just spotted another casing problem with it!  See below.
> You mentioned that indic vowels will also  uppercase into non-combiners.
I don't think I did - Indic scripts don't have case.  The point with Indic
vowels is that some decompose into two combining class 0 components, so not
every decomposition consists of a combining class 0 character followed by at
most one character of non-zero combining class.  There are also two Tibetan
combining class zero vowels that decompose into two characters of non-zero
combining class.
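To make that concrete, here is a small sketch, again assuming Python 3 and its unicodedata module.  The helper name nfd_classes and the two sample characters are my own picks for illustration: a Tamil vowel sign for the Indic case and one Tibetan vowel sign for the other.

```python
import unicodedata

def nfd_classes(ch):
    """Return (code point, combining class) pairs for the NFD form of ch."""
    return [(f"U+{ord(c):04X}", unicodedata.combining(c))
            for c in unicodedata.normalize("NFD", ch)]

# TAMIL VOWEL SIGN O: both components have combining class 0.
tamil_o = nfd_classes("\u0BCA")

# TIBETAN VOWEL SIGN II: class 0 itself, but both components are non-zero.
tibetan_ii_class = unicodedata.combining("\u0F73")
tibetan_ii = nfd_classes("\u0F73")

print(tamil_o)
print(tibetan_ii_class, tibetan_ii)
```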
I give some examples of Greek text below, but be warned that they may not
render properly.  I've seen quite a variety of renderings as I've prepared
this posting.
> By the way, does:  Α̽Ι  (U+0391, U+033D, U+0399), lowercase to  α̽ι 
> (U+03B1, U+033D, U+03B9)? Or to ᾳ̽ (U+03B1, U+033D, U+0345)?
Casing operations are not reversible.  U+FB00 LATIN SMALL LIGATURE FF upper
cases to <U+0046, U+0046>, which lower cases to <U+0066, U+0066>.
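A quick sketch of that round trip, assuming a Python 3 interpreter whose str.upper() and str.lower() apply the full case mappings (the variable names are mine):

```python
ligature = "\uFB00"          # LATIN SMALL LIGATURE FF
upper = ligature.upper()     # full case mapping expands to two letters
round_trip = upper.lower()   # lowercasing does not restore the ligature

print([f"U+{ord(c):04X}" for c in upper])
print([f"U+{ord(c):04X}" for c in round_trip])
print(round_trip == ligature)
```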
By the rules, Α̽Ι lower cases to <U+03B1, U+033D, U+03B9>, which is not
unreasonable.  But your question raises a real issue.  Greek for Hades is
ᾍδης <U+0391, U+0314, U+0301, U+0345, U+03B4, U+03B7, U+03C2> or ᾅδης
<U+03B1, U+0314, U+0301, U+0345, U+03B4, U+03B7, U+03C2>.  Either uppercases
to ἍΙΔΗΣ
<U+0391, U+0314, U+0301, U+0399, U+0394, U+0397, U+03A3>, which in turn
lower cases by the rules to ἅιδης <U+03B1, U+0314, U+0301, U+03B9, U+03B4,
U+03B7, U+03C2>.  Note the special rule to give the correct form of small
sigma!  However, the placement of the breathing and initial accent is
grammatically incorrect!  The only possible spellings with the accents
before the delta are ᾅδης and αἵδης <U+03B1, U+03B9, U+0314, U+0301,
U+03B4, U+03B7, U+03C2>.  They represent different pronunciations.  (There's
a third, attested possibility if you introduce a diaeresis.)  Note that
αἵδης would uppercase to ΑἽΔΗΣ <U+0391, U+0399, U+0314, U+0301, U+0394,
U+0397, U+03A3> - or at least, it does by Unicode rules.  I believe it also
does in Liddell and Scott, but when a capital vowel follows another vowel,
the accents appear to the right of the second vowel in that dictionary.  (This
rendering behaviour is not mentioned in TUS Section 7.2.  It even happens
with a diaeresis, as in ἈΪ́Ω <U+0391, U+0313, U+0399, U+0308, U+0301,
U+03A9>, in which the diaeresis and acute appear between the iota and the
omega.)  Would any Grecians care to comment?
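For anyone who wants to reproduce the code point sequences, here is a minimal sketch, assuming Python 3 and that its str.upper() and str.lower() follow the default (locale-independent) Unicode case mappings, including the final sigma rule.  The helper cps and the variable names are mine.

```python
def cps(s):
    """Format a string as a comma-separated list of U+XXXX code points."""
    return ", ".join(f"U+{ord(c):04X}" for c in s)

# Lowercase Hades with iota subscript, written out in NFD order.
hades = "\u03B1\u0314\u0301\u0345\u03B4\u03B7\u03C2"
upper = hades.upper()         # the iota subscript becomes a capital iota
lower_again = upper.lower()   # the iota stays a full letter after the accents

print(cps(upper))
print(cps(lower_again))
print(lower_again == hades)   # the round trip does not restore the original

# The sequence from the question: alpha, U+033D, capital iota.
question = "\u0391\u033D\u0399"
print(cps(question.lower()))
```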
It looks as though the lowercasing rules ought to be changed!  However,
there are stability issues, so it may have to be restricted by locale, e.g.
limited to all known locales rather than being independent of locale.
Richard.