Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham ([email protected])
Date: Tue Jun 13 2006 - 17:18:59 CDT

Next message: Richard Wordingham: "Re: triple diacritic (sch with ligature tie in a German dialect writing document)"

Previous message: Cristian Secară: "Re: Yahoo groups support for Unicode"
Maybe in reply to: Richard Wordingham: "Re: UTF-8 can be used for more than it is given credit"
Next in thread: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"

Theodore H. Smith wrote on Tuesday, June 13, 2006 at 2:39 PM

> On 4 Jun 2006, at 22:19, Richard Wordingham wrote:

>> But http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt states that
>> the upper case form of U+1FA6 is <U+1F6E, U+0399>. But
>> <U+1F6E, U+0399> ~ <U+03A9, U+0313, U+0342, U+0399>, which is not
>> canonically equivalent to <U+03A9, U+0399, U+0313, U+0342>. That
>> is what is wrong.

> For what it's worth, even my "NFD that fails NormalizationTests.txt" code
> (that I wrote over the weekend), now can handle this :)

> <U+03C9, U+0345, U+0313, U+0342> (ᾦ) now will uppercase to
> <U+03A9, U+0313, U+0342, U+0399> (ὮΙ), using my new UTF-8
> uppercaser :)

> Actually, now that I understand a little more of what's going on, I
> can see that you did throw me a bit of a screw-ball here ;)

What do you get for <U+03C9, U+0345, U+0301, U+0302, U+0307, U+0308>?

And for the googly - <U+03C9, U+0345, U+0301, U+0302, U+0307, U+0308,
U+0F73>?
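
For anyone following along, here is a minimal sketch of why the second sequence is the nastier one. It uses Python's standard unicodedata module, not Theodore's code or mine, and the helper name cps is made up for the illustration. It only shows the canonical combining classes and the NFD forms of the two test strings; it says nothing about what an uppercaser ought to return.

```python
import unicodedata

def cps(s):
    """Format a string as a list of U+XXXX labels (hypothetical helper)."""
    return [f"U+{ord(c):04X}" for c in s]

plain = "\u03c9\u0345\u0301\u0302\u0307\u0308"
googly = plain + "\u0f73"

# Canonical combining class of every code point involved.
# U+0F73 itself has class 0, but it decomposes to U+0F71 (129) and
# U+0F72 (130), both of which sort ahead of the class-230 accents.
for c in googly + "\u0f71\u0f72":
    print(f"U+{ord(c):04X}", unicodedata.combining(c))

nfd_plain = unicodedata.normalize("NFD", plain)
nfd_googly = unicodedata.normalize("NFD", googly)

print(cps(nfd_plain))
# ['U+03C9', 'U+0301', 'U+0302', 'U+0307', 'U+0308', 'U+0345']
print(cps(nfd_googly))
# ['U+03C9', 'U+0F71', 'U+0F72', 'U+0301', 'U+0302', 'U+0307', 'U+0308', 'U+0345']
```

The point of the trailing U+0F73 is that a decomposition performed at the very end of the string injects marks that have to travel back past everything before them, so reordering cannot be finished until decomposition has been applied.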

> You were entirely correct: my code did not uppercase properly unless
> it could handle denormalised characters, due to funny characters
> which change from combiners to non-combiners during uppercasing.

> My code basically works like this:
<Snip>
> 2) Unicode-blind stage, this does the uppercasing/lowercasing/NFD stuff.
> It's all byte-aware! Well, more specifically, it is "variable length
> string unit aware". But the "string units" are composed of bytes, not
> shorts or longs.

Is this single pass, or multi-pass? I think it has to be multi-pass. And,
to transform to NFD, it needs, for Unicode 4.1.0, 55,903 codepoint swaps to
be stored in the data table.

> Does this prove that you can correctly process UTF-8 natively, on a
> per-character basis, without intermediate conversion to codepoints or
> UTF-32?

The YPOGEGRAMMENI issue was not as bad as I first thought. And I owe you an
apology, for it appears that your implementation actually was correct!
Sorry. What you have now is merely linguistically better, rather than more
correct. :-(
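
The discrepancy quoted at the top of this message can be reproduced with a short sketch. It assumes, as I understand CPython 3 to behave, that str.upper() applies the unconditional full mappings from SpecialCasing.txt one code point at a time and does no normalisation of its own. The variable names are only for illustration.

```python
import unicodedata

def nfd(s):
    return unicodedata.normalize("NFD", s)

precomposed = "\u1fa6"                    # omega + psili + perispomeni + ypogegrammeni
decomposed = "\u03c9\u0345\u0313\u0342"   # same letter, marks out of canonical order

# The two inputs are canonically equivalent.
assert nfd(precomposed) == nfd(decomposed)

upper_pre = precomposed.upper()   # SpecialCasing: <U+1F6E, U+0399>
upper_dec = decomposed.upper()    # per code point: U+0345 becomes the starter U+0399

print([f"U+{ord(c):04X}" for c in nfd(upper_pre)])
# ['U+03A9', 'U+0313', 'U+0342', 'U+0399']
print([f"U+{ord(c):04X}" for c in nfd(upper_dec)])
# ['U+03A9', 'U+0399', 'U+0313', 'U+0342']
print(nfd(upper_pre) == nfd(upper_dec))
# False
```

Because U+0399 has combining class 0, it blocks reordering, so the two uppercase results stay distinct under normalisation even though the lowercase inputs were equivalent.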

I never thought it couldn't be done. However, I believe you are having to
resort to multiple passes because you don't store canonical combining class.
(Obviously, you could store that using a UTF-8 based trie.) My code, written
for understanding rather than speed, effectively uses a trie with letters
from different alphabets: a 17-character alphabet (i.e. plane), a 512-character
alphabet (half-block within plane), and a 128-character alphabet
(byte within the block).
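
To make the 17/512/128 split concrete, here is a rough sketch in Python. It is not my actual code: the class name CccTrie, the function names, and the bit layout (plane = cp >> 16, half-block = next 9 bits, offset = low 7 bits) are assumptions chosen to match the description above, and the combining classes are simply copied from Python's unicodedata module. With the class stored, each run of combining marks can be put into canonical order with one stable sort. Note that this only reorders text that is already decomposed; it does not perform the decomposition step.

```python
import unicodedata

class CccTrie:
    """Three-level table keyed by (plane, half-block, offset). Hypothetical."""

    def __init__(self):
        self.planes = [None] * 17                 # 17-letter alphabet: plane

    def set(self, cp, value):
        plane, half_block, offset = cp >> 16, (cp >> 7) & 0x1FF, cp & 0x7F
        if self.planes[plane] is None:
            self.planes[plane] = [None] * 512     # 512 half-blocks per plane
        row = self.planes[plane]
        if row[half_block] is None:
            row[half_block] = bytearray(128)      # 128 code points per half-block
        row[half_block][offset] = value

    def get(self, cp):
        row = self.planes[cp >> 16]
        if row is None:
            return 0
        leaf = row[(cp >> 7) & 0x1FF]
        return 0 if leaf is None else leaf[cp & 0x7F]

def build_ccc_trie():
    """Fill the trie with every non-zero canonical combining class."""
    trie = CccTrie()
    for cp in range(0x110000):
        ccc = unicodedata.combining(chr(cp))
        if ccc:
            trie.set(cp, ccc)
    return trie

def canonical_reorder(s, trie):
    """Stable-sort each run of non-starters by combining class."""
    out, run = [], []
    for ch in s:
        ccc = trie.get(ord(ch))
        if ccc == 0:
            # A starter ends the current run of combining marks.
            out.extend(c for _, c in sorted(run, key=lambda p: p[0]))
            run = []
            out.append(ch)
        else:
            run.append((ccc, ch))
    out.extend(c for _, c in sorted(run, key=lambda p: p[0]))
    return "".join(out)

trie = build_ccc_trie()
reordered = canonical_reorder("\u03c9\u0345\u0313\u0342", trie)
print([f"U+{ord(c):04X}" for c in reordered])
# ['U+03C9', 'U+0313', 'U+0342', 'U+0345']
```

Only the half-blocks that actually contain combining marks get a 128-byte leaf, so most of the table stays empty.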

Richard.


This archive was generated by hypermail 2.1.5 : Tue Jun 13 2006 - 18:29:32 CDT