Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Jun 04 2006 - 04:18:00 CDT

    Theodore H. Smith wrote on Saturday, June 03, 2006 at 12:56 PM
    >>> And the other point is that a character (aka unicode glyph)

    >> This is a misusage of the term "glyph" here, I believe.

    > Really?

    Yes. In Unicode terms, 'glyph' refers to the character shape, and glyph
    differences that carry no semantic significance within a language are
    generally not encoded for the sake of that language. Furthermore, as I
    understand it, general stylistic features such as italics, which convey
    information above the level of letters, are also not encoded. One generally
    italicises words, rather than letters. The overlap of symbols and letters
    complicates matters.

    There is a feeling that the Unicode character encoding standard is being
    converted to a glyph encoding standard.

    >> The semantics, which
    >> you need to access tables for, inhere to the code points, so
    >> you can't just treat a UTF-8 string as a bag o' bytes for
    >> processing. <Counterargument snipped> (Except for trivial operations
    >> like string copying,
    >> string length for buffer size, and so on.)
    >
    > But I already said I have Unicode correct upper casing and lowercasing
    > code on UTF-8.

    > What if I compile my source code and put it on my server host, to do
    > uppercasing and lowercasing of UTF-8? And then post the address here. I'm
    > no web monkey, more of a desktop developer, but I can probably handle an
    > uppercase and lowercase button and a text field :)

    Unnecessary. Just sketch the solutions.

    > Would that prove to you that you can do uppercasing and lowercasing on
    > UTF-8 without worrying about the codepoints?

    Here's a test case -
    U+1FA6 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND YPOGEGRAMMENI

    U+1FA6 decomposes to <U+03C9, U+0313, U+0342, U+0345> (combining classes 0,
    230, 230 and 240 respectively).
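
    The decomposition and combining classes quoted above can be checked
    mechanically. A minimal sketch, assuming Python's standard unicodedata
    module (the helper name describe is made up for illustration; the data
    comes from whatever Unicode version the interpreter bundles):

```python
import unicodedata

def describe(text):
    """Return (code point, canonical combining class) for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in text]

# Full canonical decomposition (NFD) of U+1FA6.
decomposed = unicodedata.normalize("NFD", "\u1FA6")

for code_point, ccc in describe(decomposed):
    print(code_point, ccc)
```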

    How do you, Theodore Smith, go about converting <U+03C9, U+0345, U+0313,
    U+0342> to upper case (and not title case)?

    The correct upper case form (see
    http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt ) has three
    canonically equivalent encodings:
    <U+1F6E GREEK CAPITAL LETTER OMEGA WITH PSILI AND PERISPOMENI, U+0399 GREEK
    CAPITAL LETTER IOTA>
    <U+1F68, U+0342, U+0399>
    <U+03A9, U+0313, U+0342, U+0399>
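
    To make the difficulty concrete, here is a hedged sketch in Python. It
    assumes that str.upper() applies the unconditional mappings from
    UnicodeData.txt and SpecialCasing.txt one code point at a time, and it
    compares results by NFD. The helper names nfd and upper_via_nfd are
    hypothetical. It is one possible approach, not a claim about how any
    particular UTF-8 library works.

```python
import unicodedata

def nfd(text):
    """Normalise to Normalization Form D for comparison purposes."""
    return unicodedata.normalize("NFD", text)

def upper_via_nfd(text):
    """Put the marks in canonical order first, then upper-case."""
    return nfd(text).upper()

# The test input: omega, ypogegrammeni, psili, perispomeni.
# The marks are deliberately not in canonical order.
test_input = "\u03C9\u0345\u0313\u0342"

# The three encodings of the correct upper case form.
forms = [
    "\u1F6E\u0399",
    "\u1F68\u0342\u0399",
    "\u03A9\u0313\u0342\u0399",
]

# All three forms should share one NFD representation.
all_equivalent = all(nfd(f) == nfd(forms[0]) for f in forms)

# Upper-casing code point by code point, without reordering, turns U+0345
# into the starter U+0399 in the wrong position.
naive_ok = nfd(test_input.upper()) == nfd(forms[0])

# Reordering first moves U+0345 to the end before it becomes U+0399.
ordered_ok = nfd(upper_via_nfd(test_input)) == nfd(forms[0])

print(all_equivalent, naive_ok, ordered_ok)
```

    The point of the sketch is that U+0345 has combining class 240 but its
    upper case, U+0399, has class 0, so the order of the marks has to be
    settled before casing, not after.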

    Aside: What is the correct upper case form of <U+03B1, U+033D, U+0345> and
    <U+03B1, U+0345, U+033D>? Is it truly <U+0391, U+033D, U+0399>? I suspect
    it depends on the semantics being applied to U+033D COMBINING X ABOVE.

    Conversion to normal form D sounds rather brute force. By my calculation,
    for Unicode 4.1 you have 55,903 pairs of characters to swap round, composed
    from the 384 characters not of combining class 0.
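
    For contrast with a table of pair swaps, the reordering step of normal
    form D can be driven by the combining class of each character alone. A
    minimal sketch, assuming Python's unicodedata module; the function name
    canonical_order is hypothetical, and the input is assumed to be already
    fully decomposed (the sketch reorders but does not decompose):

```python
import unicodedata

def canonical_order(text):
    """Stable-sort each run of non-starters by canonical combining class."""
    out = []
    run = []  # pending characters whose combining class is not 0
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # A starter ends the current run; sorted() is stable, so marks
            # with equal combining class keep their relative order.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

reordered = canonical_order("\u03C9\u0345\u0313\u0342")
print([f"U+{ord(ch):04X}" for ch in reordered])
```

    This needs one property lookup per character rather than one table
    entry per pair, at the cost of working on code points rather than on
    raw byte patterns.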

    Normal Form C is even worse for brute force. Just to compose U+1FB3 GREEK
    SMALL LETTER ALPHA WITH YPOGEGRAMMENI you have to have 384-8 = 376 3-element
    substitutions, such as <U+03B1, U+033D, U+0345> to <U+1FB3, U+033D>, 376 *
    376 = 141,376 4-element substitutions,... (It has been suggested that it is
    unreasonable to ask for sequences of more than 30 combining characters to be
    processed properly.)
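
    The arithmetic above, and the example substitution, can be reproduced
    with a short sketch. It assumes Python's unicodedata module for the
    composition and takes the figure 384-8 from the text as given for
    Unicode 4.1; the names INTERVENING and table_size are made up for
    illustration.

```python
import unicodedata

# The example substitution: alpha, combining x above, ypogegrammeni.
source = "\u03B1\u033D\u0345"
composed = unicodedata.normalize("NFC", source)
print([f"U+{ord(ch):04X}" for ch in composed])

# Number of candidate intervening characters, as quoted in the text.
INTERVENING = 384 - 8

def table_size(n_intervening):
    """Substitutions needed for exactly n intervening combining characters."""
    return INTERVENING ** n_intervening

for n in (1, 2):
    print(n + 2, "element substitutions:", table_size(n))
```

    Each further intervening character multiplies the count by 376 again,
    which is why a purely table-driven approach becomes impractical well
    before 30 combining characters.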

    Richard.


