Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Jun 04 2006 - 04:18:00 CDT

Next message: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"

Previous message: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"
In reply to: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit ( Re: UTF-7 - is it dead? )"
Next in thread: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"
Reply: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"
Maybe reply: James Kass: "Re: UTF-8 can be used for more than it is given credit"
Maybe reply: Richard Wordingham: "Re: UTF-8 can be used for more than it is given credit"
Maybe reply: Theodore H. Smith: "Re: UTF-8 can be used for more than it is given credit"

Theodore H. Smith wrote on Saturday, June 03, 2006 at 12:56 PM
>>> And the other point is that a character (aka unicode glyph)

>> This is a misusage of the term "glyph" here, I believe.

> Really?

Yes. In Unicode terms, 'glyph' refers to the shape of a character. Glyph
differences that are not semantically significant within a language are
generally not encoded for the sake of that language. Furthermore, as I
understand it, general stylistic features such as italics, which convey
information above the level of letters, are also not encoded: one generally
italicises words rather than letters. The overlap of symbols and letters
complicates matters.

There is a feeling that the Unicode character encoding standard is being
converted to a glyph encoding standard.

>> The semantics, which
>> you need to access tables for, inhere to the code points, so
>> you can't just treat a UTF-8 string as a bag o' bytes for
>> processing. <Counterargument snipped> (Except for trivial operations
>> like string copying,
>> string length for buffer size, and so on.)
>
> But I already said I have Unicode correct upper casing and lowercasing
> code on UTF-8.

> What if I compile my source code and put it on my server host, to do
> uppercasing and lowercasing of UTF-8? And then post the address here. I'm
> no web monkey, more of a desktop developer, but I can probably handle an
> uppercase and lowercase button and a text field :)

Unnecessary. Just sketch the solutions.

> Would that prove to you that you can do uppercasing and lowercasing on
> UTF-8 without worrying about the codepoints?

Here's a test case -
U+1FA6 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI

U+1FA6 decomposes to <U+03C9, U+0313, U+0342, U+0345> (combining classes 0,
230, 230 and 240 respectively).
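
The decomposition and combining classes can be checked with a few lines of
Python. This is only a sketch, assuming the standard unicodedata module, whose
tables follow whatever Unicode version ships with the interpreter:

```python
import unicodedata

# Full canonical decomposition (NFD) of U+1FA6.
decomposed = unicodedata.normalize("NFD", "\u1FA6")

# Code points and canonical combining classes of the decomposed sequence.
code_points = [f"U+{ord(ch):04X}" for ch in decomposed]
classes = [unicodedata.combining(ch) for ch in decomposed]

print(code_points)
print(classes)
```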

How do you, Theodore Smith, go about converting <U+03C9, U+0345, U+0313,
U+0342> to upper case (and not title case)?

The correct upper case form (see
http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt ) has three
canonically equivalent encodings:
<U+1F6E GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI, U+0399 GREEK
CAPITAL LETTER IOTA>
<U+1F68, U+0342, U+0399>
<U+03A9, U+0313, U+0342, U+0399>
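
To see why the order of the combining marks matters, here is a minimal Python
sketch. It assumes that str.upper() applies the unconditional full case
mappings and that unicodedata.normalize() performs canonical reordering; the
helper names nfd and upper_via_nfd are made up for the illustration, and this
is one possible approach rather than the only one.

```python
import unicodedata

def nfd(text):
    """Canonical decomposition, including canonical reordering of marks."""
    return unicodedata.normalize("NFD", text)

def upper_via_nfd(text):
    """Upper-case only after the combining marks are in canonical order."""
    return nfd(text).upper()

# omega, ypogegrammeni, psili, perispomeni (canonically equivalent to U+1FA6)
test_input = "\u03C9\u0345\u0313\u0342"

# The three encodings of the correct upper case form listed above.
expected = ["\u1F6E\u0399", "\u1F68\u0342\u0399", "\u03A9\u0313\u0342\u0399"]

# Mapping code point by code point in the given order turns U+0345 into
# U+0399 (a character of combining class 0) ahead of the two accents.
naive = test_input.upper()
careful = upper_via_nfd(test_input)

print([f"U+{ord(c):04X}" for c in naive])
print([f"U+{ord(c):04X}" for c in careful])
print(all(nfd(careful) == nfd(form) for form in expected))
print(nfd(naive) == nfd(expected[0]))
```

In the naive result the psili and perispomeni follow the capital iota, so they
attach to the iota instead of the omega.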

Aside: What is the correct upper case form of <U+03B1, U+033D, U+0345> and
<U+03B1, U+0345, U+033D>? Is it truly <U+0391, U+033D, U+0399>? I suspect
it depends on the semantics being applied to U+033D COMBINING X ABOVE.

Conversion to Normal Form D sounds rather brute force. By my calculation,
for Unicode 4.1 you have 55,903 pairs of characters to swap round, composed
from the 384 characters not of combining class 0.
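
For comparison, canonical reordering driven by the combining classes needs no
pair table. The sketch below is in Python and works on code points, so it
illustrates the rule rather than a byte-level UTF-8 method. The function names
are hypothetical, and count_swap_pairs assumes the pair figure counts ordered
pairs whose combining classes are out of order; its totals depend on the
Unicode version bundled with the interpreter and will not match the
Unicode 4.1 numbers.

```python
import unicodedata
from collections import Counter

def canonical_reorder(text):
    """Stable-sort each run of non-starters by canonical combining class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run of combining marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

def count_swap_pairs():
    """Return (non-starters, ordered pairs (a, b) with ccc(a) > ccc(b) > 0)."""
    per_class = Counter(
        unicodedata.combining(chr(cp)) for cp in range(0x110000)
    )
    del per_class[0]  # class 0 characters never move
    non_starters = sum(per_class.values())
    pairs = sum(
        per_class[hi] * per_class[lo]
        for hi in per_class
        for lo in per_class
        if hi > lo
    )
    return non_starters, pairs

print([f"U+{ord(c):04X}" for c in canonical_reorder("\u03C9\u0345\u0313\u0342")])
print(count_swap_pairs())
```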

Normal Form C is even worse for brute force. Just to compose U+1FB3 GREEK
SMALL LETTER ALPHA WITH YPOGEGRAMMENI you need 384-8 = 376 3-element
substitutions, such as <U+03B1, U+033D, U+0345> to <U+1FB3, U+033D>, 376 *
376 = 141,376 4-element substitutions, and so on. (It has been suggested
that it is unreasonable to ask for sequences of more than 30 combining
characters to be processed properly.)
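
The growth can be tabulated with a short Python sketch. The figure of 376
intervening marks is taken from the paragraph above and is not recomputed;
substitution_count is a hypothetical helper that simply raises it to the
number of marks sitting between the base letter and U+0345. The last lines
check the example substitution against unicodedata.normalize().

```python
import unicodedata

INTERVENING_MARKS = 384 - 8  # figure quoted above for Unicode 4.1

def substitution_count(elements):
    """Table entries of a given length: base + (elements - 2) marks + U+0345."""
    return INTERVENING_MARKS ** (elements - 2)

for elements in (3, 4, 5):
    print(elements, substitution_count(elements))

# The example substitution: the iota subscript composes across U+033D.
composed = unicodedata.normalize("NFC", "\u03B1\u033D\u0345")
print([f"U+{ord(c):04X}" for c in composed])
```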

Richard.

This archive was generated by hypermail 2.1.5 : Sun Jun 04 2006 - 04:29:43 CDT