From: Theodore H. Smith (delete@elfdata.com)
Date: Sun Jun 04 2006 - 06:38:04 CDT
> Unnecessary. Just sketch the solutions.
>
>> Would that prove to you that you can do uppercasing and
>> lowercasing on UTF-8 without worrying about the codepoints?
>
> Here's a test case -
> U+1FA6 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND
> YPOGEGRAMMENI
>
> U+1FA6 decomposes to <U+03C9, U+0313, U+0342, U+0345> (combining
> classes 0, 230, 230 and 240 respectively).
My UTF-8 decomposer gives that result :)
Although it expressed the decomposition like this: ω ̓ ͂ ͅ
It uppercased the UTF-8 form to a UTF-8 sequence equivalent to
this: Ω ̓ ͂ Ι
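For reference, the raw UTF-8 bytes involved (my own tabulation from
the encoding rules, so worth checking against the code charts):

    // UTF-8 byte sequences for the characters under discussion.
    const unsigned char precomposed[] = { 0xE1, 0xBE, 0xA6 };  // U+1FA6
    const unsigned char decomposed[]  = {
        0xCF, 0x89,    // U+03C9 GREEK SMALL LETTER OMEGA
        0xCC, 0x93,    // U+0313 COMBINING COMMA ABOVE
        0xCD, 0x82,    // U+0342 COMBINING GREEK PERISPOMENI
        0xCD, 0x85 };  // U+0345 COMBINING GREEK YPOGEGRAMMENI
    const unsigned char uppercased[]  = {
        0xCE, 0xA9,    // U+03A9 GREEK CAPITAL LETTER OMEGA
        0xCC, 0x93,    // U+0313 COMBINING COMMA ABOVE
        0xCD, 0x82,    // U+0342 COMBINING GREEK PERISPOMENI
        0xCE, 0x99 };  // U+0399 GREEK CAPITAL LETTER IOTA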
> How do you, Theodore Smith, go about converting <U+03C9, U+0345,
> U+0313, U+0342> to upper case (and not title case)?
>
> The correct upper case form (see
> http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt ) has three
> canonically equivalent encodings:
> <U+1F6E GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI,
> U+0399 GREEK CAPITAL LETTER IOTA>
> <U+1F68, U+0342, U+0399>
> <U+03A9, U+0313, U+0342, U+0399>
>
> Aside: What is the correct upper case form of <U+03B1, U+033D, U+0345>
Mine gives: Α ̽ Ι
> and <U+03B1, U+0345, U+033D>?
Mine gives this: Α Ι ̽
> Is it truly <U+0391, U+033D, U+0399>? I suspect it depends on the
> semantics being applied to U+033D COMBINING X ABOVE.
>
> Conversion to normal form D sounds rather brute force. By my
> calculation, for Unicode 4.1 you have 55,903 pairs of characters to
> swap round, composed from the 384 characters not of combining class 0.
Yes... I don't do Normalisation on UTF-8 yet, because I still don't
understand Normalisation properly :)
> Normal Form C is even worse for brute force. Just to compose
> U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI you have to have
> 384-8 = 376 3-element substitutions, such as <U+03B1, U+033D,
> U+0345> to <U+1FB3, U+033D>, 376 * 376 = 141,376 4-element
> substitutions,... (It has been suggested that it is unreasonable
> to ask for sequences of more than 30 combining characters to be
> processed properly.)
If you could explain Normalisation to me in 2 paragraphs, maybe
I'll understand you better :)
So far my UTF-8 uppercaser/lowercaser is doing quite well, eh? And the
best thing is, it's Unicode-blind: it's only byte-aware.
I really should put this into a web-available form, because that
statement seems to put people's minds into a loop.
As for "Just sketch the solutions"... I did that already, in previous
emails. It requires a string based dictionary to do at all. Something
not too hard, as even stl's hash_map can do this on a char*.
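To make the hash_map idea concrete, here's a minimal sketch (not my
actual toolkit code; it uses std::unordered_map as a stand-in for the
old hash_map, and it hand-fills only the two mappings relevant to this
thread, whereas a real table would be generated from UnicodeData.txt
and SpecialCasing.txt):

    #include <algorithm>
    #include <cstddef>
    #include <string>
    #include <unordered_map>

    // Byte-sequence -> byte-sequence uppercase mappings (UTF-8 in, UTF-8 out).
    static const std::unordered_map<std::string, std::string> kUpper = {
        { "\xCF\x89", "\xCE\xA9" },  // U+03C9 omega         -> U+03A9 Omega
        { "\xCD\x85", "\xCE\x99" },  // U+0345 ypogegrammeni -> U+0399 Iota
    };

    // Naive longest-match substitution over a UTF-8 byte string.
    std::string ToUpperUtf8(const std::string& in, std::size_t maxKey = 4) {
        std::string out;
        std::size_t i = 0;
        while (i < in.size()) {
            std::size_t len = std::min(maxKey, in.size() - i);
            for (; len > 0; --len) {              // try the longest key first
                auto it = kUpper.find(in.substr(i, len));
                if (it != kUpper.end()) { out += it->second; i += len; break; }
            }
            if (len == 0) { out += in[i++]; }     // no mapping: copy the byte
        }
        return out;
    }

The substr() probing at every position is what makes this the slow
version.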
And doing it efficiently requires a trie-based string dictionary that
can detect the longest key at a given position within the string.
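Here's a sketch of that trie idea too (again, an illustration of the
idea rather than my toolkit's actual code): each node maps a byte to a
child, and a lookup walks the input remembering the deepest node that
ended a key, so the longest key starting at a position wins without
building any temporary substrings.

    #include <cstddef>
    #include <memory>
    #include <string>

    struct TrieNode {
        std::unique_ptr<TrieNode> child[256];  // indexed by the next byte
        std::string replacement;               // valid when isKey is true
        bool isKey = false;
    };

    void Insert(TrieNode* root, const std::string& key,
                const std::string& value) {
        TrieNode* n = root;
        for (unsigned char c : key) {
            if (!n->child[c]) n->child[c] = std::make_unique<TrieNode>();
            n = n->child[c].get();
        }
        n->isKey = true;
        n->replacement = value;
    }

    // Length of the longest key starting at position pos, or 0 if none;
    // on success, *value points at the stored replacement.
    std::size_t LongestMatch(const TrieNode* root, const std::string& s,
                             std::size_t pos, const std::string** value) {
        const TrieNode* n = root;
        std::size_t best = 0;
        for (std::size_t i = pos; i < s.size(); ++i) {
            n = n->child[static_cast<unsigned char>(s[i])].get();
            if (!n) break;
            if (n->isKey) { best = i - pos + 1; *value = &n->replacement; }
        }
        return best;
    }

The substitution loop then calls LongestMatch at each position,
appends the replacement when it returns non-zero, and copies a single
byte when it returns zero.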
Do you have a trie-based string dictionary that works on unsigned
chars? Do you have one with a complete and powerful API for
processing strings? If not, I can imagine why you haven't thought
this was possible yet; such tools aren't common.
I wrote the toolkit I'm using myself, and I've not seen anything
else like it.