From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Tue Jun 13 2006 - 17:18:59 CDT
Theodore H. Smith wrote on Tuesday, June 13, 2006 at 2:39 PM
> On 4 Jun 2006, at 22:19, Richard Wordingham wrote:
>> But http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt states that
>> the upper case form of U+1FA6 is <U+1F6E, U+0399>. But
>> <U+1F6E, U+0399> ~ <U+03A9, U+0313, U+0342, U+0399>, which is not
>> canonically equivalent to <U+03A9, U+0399, U+0313, U+0342>. That
>> is what is wrong.
> For what it's worth, even my "NFD that fails NormalizationTests.txt" code
> (that I wrote over the weekend), now can handle this :)
> <U+03C9, U+0345, U+0313, U+0342> (ᾦ) now will uppercase to
> <U+03A9, U+0313, U+0342, U+0399> (ὮΙ), using my new UTF-8 uppercaser :)
> Actually, now that I understand a little more of what's going on, I
> can see that you did throw me a bit of a screw-ball here ;)
What do you get for <U+03C9, U+0345, U+0301, U+0302, U+0307, U+0308>?
And for the googly - <U+03C9, U+0345, U+0301, U+0302, U+0307, U+0308,
U+0F73>?
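
To make the two test cases concrete, here is a minimal Python sketch using
only the standard unicodedata module. It is not either of the
implementations discussed in this thread, and the names (show, screwball,
googly) are just mine for illustration. The point of the googly is that
U+0F73 has combining class 0 itself, but decomposes to U+0F71 and U+0F72
(classes 129 and 130), which then sort ahead of the class 230 and 240 marks.

```python
import unicodedata

def show(s):
    """Format a string as a list of code points, e.g. 'U+03C9 U+0345'."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

# The two test sequences from the message above.
screwball = "\u03C9\u0345\u0301\u0302\u0307\u0308"
googly = screwball + "\u0F73"

# NFD: decompose, then reorder marks by canonical combining class.
nfd_screwball = unicodedata.normalize("NFD", screwball)
nfd_googly = unicodedata.normalize("NFD", googly)

# Character-by-character uppercasing turns U+0345 (class 240) into
# U+0399 (a starter), so the result depends on whether the input was
# normalized before uppercasing.
upper_then_nfd = unicodedata.normalize("NFD", screwball.upper())
nfd_then_upper = unicodedata.normalize("NFD", nfd_screwball.upper())

print("NFD(screwball):", show(nfd_screwball))
print("NFD(googly):   ", show(nfd_googly))
print("upper, then NFD:", show(upper_then_nfd))
print("NFD, then upper:", show(nfd_then_upper))
```

The last two lines differ, even though both inputs are canonically
equivalent, which is the problem under discussion.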
> You were entirely correct: my code did not uppercase properly unless
> it could handle denormalised characters, due to funny characters
> which change from combiners to non-combiners during uppercasing.
> My code basically works like this:
<Snip>
> 2) Unicode-blind stage, this does the uppercasing/lowercasing/NFD stuff.
> It's all byte-aware! Well, more specifically, it is "variable length
> string unit aware". But the "string units" are composed of bytes, not
> shorts or longs.
Is this single pass, or multi-pass? I think it has to be multi-pass. And,
to transform to NFD, it needs, for Unicode 4.1.0, 55,903 codepoint swaps to
be stored in the data table.
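
For comparison, here is a rough sketch of NFD when the canonical combining
class is available per character. It works on code points through Python's
unicodedata module, not on UTF-8 string units, so it is not a model of your
code; decompose and nfd_sketch are hypothetical names of mine. With the
class to hand, each mark can be slotted into place as it arrives, so one
left-to-right pass with a short local back-up suffices. (The sketch ignores
Hangul syllables, which decompose algorithmically.)

```python
import unicodedata

def decompose(ch):
    """Recursively apply canonical decomposition mappings to one character.

    Compatibility mappings (tagged like '<compat>') are left alone.
    Hangul syllables are NOT handled: unicodedata.decomposition()
    returns '' for them because their decomposition is algorithmic.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping or mapping.startswith("<"):
        return [ch]
    out = []
    for hex_cp in mapping.split():
        out.extend(decompose(chr(int(hex_cp, 16))))
    return out

def nfd_sketch(text):
    """Decompose and canonically reorder in a single left-to-right pass."""
    out = []
    for ch in text:
        for c in decompose(ch):
            ccc = unicodedata.combining(c)
            i = len(out)
            if ccc:
                # Back up past marks with a strictly higher class; stop at
                # a starter (class 0) or a mark of equal or lower class,
                # which keeps the reordering stable.
                while i > 0 and unicodedata.combining(out[i - 1]) > ccc:
                    i -= 1
            out.insert(i, c)
    return "".join(out)

sample = "\u03C9\u0345\u0301\u0302\u0307\u0308\u0F73"
print(nfd_sketch(sample) == unicodedata.normalize("NFD", sample))
```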
> Does this prove that you can correctly process UTF-8 natively, on a
> per-character basis, without intermediate conversion to codepoints or
> UTF-32?
The YPOGEGRAMMENI issue was not as bad as I first thought. And I owe you an
apology, for it appears that your implementation actually was correct!
Sorry. What you have now is merely linguistically better, rather than more
correct. :-(
I never thought it couldn't be done. However, I believe you are having to
resort to multiple passes because you don't store canonical combining class.
(Obviously, you could store that using a UTF-8 based trie. My code, written
for understanding rather than speed, effectively uses a trie with letters
from different alphabets: a 17-character alphabet (i.e. plane), a
512-character alphabet (half-block within plane), and a 128-character
alphabet (byte within the block).)
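
A minimal sketch of that three-level lookup, assuming nested Python
dictionaries and a table filled from the unicodedata module. The names
split_code_point and CccTrie are made up for this illustration, and the
real code is organised differently. The three alphabets multiply out to
17 x 512 x 128 = 1,114,112, i.e. the whole code space up to U+10FFFF.

```python
import unicodedata

def split_code_point(cp):
    """Split a code point into (plane, half-block, offset) trie letters."""
    plane = cp >> 16                  # 17 possible values: 0..16
    half_block = (cp >> 7) & 0x1FF    # 512 possible values per plane
    offset = cp & 0x7F                # 128 possible values per half-block
    return plane, half_block, offset

class CccTrie:
    """Sparse three-level trie mapping code points to combining class."""

    def __init__(self):
        self.root = {}

    def set(self, cp, ccc):
        plane, half_block, offset = split_code_point(cp)
        self.root.setdefault(plane, {}).setdefault(half_block, {})[offset] = ccc

    def get(self, cp):
        plane, half_block, offset = split_code_point(cp)
        # Anything not stored is treated as a starter (class 0).
        return self.root.get(plane, {}).get(half_block, {}).get(offset, 0)

# Store only the non-zero classes, taken from Python's Unicode tables.
trie = CccTrie()
for cp in range(0x110000):
    ccc = unicodedata.combining(chr(cp))
    if ccc:
        trie.set(cp, ccc)

print(trie.get(0x0345), trie.get(0x0313), trie.get(0x03A9))
```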
Richard.