Re: UTF-8 can be used for more than it is given credit

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sun Jun 04 2006 - 16:19:30 CDT


    Theodore H. Smith wrote on Sunday, June 04, 2006 at 7:05 PM
    > On 4 Jun 2006, at 16:59, Richard Wordingham wrote:
    >> Theodore H. Smith wrote on Sunday, June 04, 2006 at 12:38 PM

    > I don't understand NFD yet either :) All I know is that combining
    > characters may cause problems for search functions, and that some
    > re-ordering is necessary to fix this. But what kind of re-ordering
    > exactly I do not know.

    There are two reasons the need for something like NFD arises.

    The first is that Unicode does not specify the order of diacritic marks. What is
    felt to be the natural order in one language may not be the same in another
    language. Moreover, there is no guarantee that everyone will type multiple
    marks in the same order when they may be typed separately. Thus an 'a' with
    a circumflex above and a dot below may be entered in the order <U+0061,
    U+0302, U+0323> in Vietnamese, where the dot below is a tonemark. However,
    there is an interpretation of the ISO 11940:1998 transliteration scheme for
    Thai where the dot below would indicate that the vowel is implicit and the
    circumflex indicates the tone mark. Especially for someone used to the Thai
    typing rule of adding marks to a consonant from bottom to top, this would
    naturally be written as <U+0061, U+0323, U+0302>.

    The second reason is that, apparently contrary to the original conception of
    Unicode, combinations of base letter and diacritic may be encoded as single
    characters. (This practice of adding such codepoints to the standard has
    now been largely discontinued.) Thus 'a' circumflex may be encoded as either
    <U+0061, U+0302> or <U+00E2>.

    There are in fact five different ways of encoding small 'a' with a circumflex
    above and a dot below:

    <U+1EAD LATIN SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW>
    <U+00E2, U+0323>
    <U+0061, U+0302, U+0323>
    <U+1EA1 LATIN SMALL LETTER A WITH DOT BELOW, U+0302>
    <U+0061, U+0323, U+0302>
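    A quick check that these sequences are alternatives for the same text is
    to reduce each one to a common normalised form and compare the results.
    The sketch below assumes Python and its standard unicodedata module,
    neither of which is mentioned in this thread; the variable names are
    purely illustrative.

```python
import unicodedata

# The five spellings of 'a' with circumflex above and dot below listed above.
spellings = [
    "\u1ead",
    "\u00e2\u0323",
    "\u0061\u0302\u0323",
    "\u1ea1\u0302",
    "\u0061\u0323\u0302",
]

# Normalise every spelling to NFD and collect the distinct results.
nfd_forms = {unicodedata.normalize("NFD", s) for s in spellings}

# Distinct codepoint sequences going in, distinct normalised forms coming out.
print(len(set(spellings)), len(nfd_forms))
```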

    The process of determining whether these are the same is:

    1) Decompose, so we can see what we get.
    2) Declare that the order of diacritics only matters if they compete for
    position on paper, e.g. when diacritics stack vertically or are arranged in
    order from left to right. This test is difficult to computerise, so it is
    expressed as follows:
    (a) Assign each diacritic a number, the same number for diacritics that
    compete for position and different numbers for those that do not. This
    number is called the combining class.
    (b) Identify non-diacritics by assigning them combining class zero. One
    complication is that decomposition is not always into diacritic and
    non-diacritic - two-part Indic vowels usually also have decompositions,
    generally with both being of combining class zero.

    Sequences which then have no differences that matter are 'canonically
    equivalent'.
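    As a hedged illustration of how combining classes decide whether order
    matters, the following sketch looks up the classes of a few marks and
    compares reordered sequences. It assumes Python's unicodedata module (not
    something the thread itself uses), and canonically_equivalent is a helper
    name made up for this example; it compares the library's NFD output.

```python
import unicodedata

# Combining classes as recorded in the Unicode Character Database.
ccc_a = unicodedata.combining("\u0061")           # base letter
ccc_circumflex = unicodedata.combining("\u0302")  # drawn above
ccc_acute = unicodedata.combining("\u0301")       # drawn above
ccc_dot_below = unicodedata.combining("\u0323")   # drawn below

def canonically_equivalent(a, b):
    """Hypothetical helper: equal NFD forms means canonically equivalent."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# Circumflex and dot below are in different classes: their order is free.
different_classes = canonically_equivalent("a\u0302\u0323", "a\u0323\u0302")

# Circumflex and acute share a class: they compete, so their order matters.
same_class = canonically_equivalent("a\u0302\u0301", "a\u0301\u0302")

print(ccc_a, ccc_circumflex, ccc_acute, ccc_dot_below)
print(different_classes, same_class)
```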

    Finally, to reduce comparison of strings to comparison of codepoint
    sequences, we define a normalised form, i.e. select a representative element
    from each class of equivalent sequences. How do we do this? We define a
    sequence to be in Normalization Form D (D for 'decomposed') (NFD) if:
    (i) No character in it has a canonical decomposition;
    (ii) Every sequence of characters of non-zero combining class is in order of
    combining class.
    Every canonical equivalence class of codepoint sequences has exactly one
    member which is in NFD.

    For the example above, it is <U+0061, U+0323, U+0302> that is NFD.
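    Conditions (i) and (ii) can be sketched directly in code. What follows is
    a minimal sketch in Python, assuming the standard unicodedata module for
    the decomposition and combining class data; decompose and to_nfd_sketch
    are names invented here. It ignores Hangul syllables, whose decomposition
    is algorithmic, so it is an illustration rather than a full
    implementation.

```python
import unicodedata

def decompose(ch):
    """Recursively apply canonical decompositions to one character."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as "<compat>"; skip them.
    if not mapping or mapping.startswith("<"):
        return ch
    return "".join(decompose(chr(int(cp, 16))) for cp in mapping.split())

def to_nfd_sketch(text):
    """Decompose fully, then sort each run of marks by combining class."""
    chars = [c for ch in text for c in decompose(ch)]
    out, run = [], []
    for c in chars:
        if unicodedata.combining(c) == 0:
            # A class-zero character ends the current run of marks.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(c)
        else:
            run.append(c)
    # sorted() is stable, so marks of equal class keep their relative order.
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

samples = ["\u1ead", "\u00e2\u0323", "\u1ea1\u0302", "\u0061\u0302\u0323"]
results = [to_nfd_sketch(s) for s in samples]
print([[f"U+{ord(c):04X}" for c in r] for r in results])
```

    The stable sort matters: marks with the same combining class compete for
    position, so their relative order has to be preserved.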

    The objection to NFD is that it is the form which uses the most codepoints.
    Converting to NFC is the deterministic selection of a compact member of the
    equivalence class to act as its representative. (There are probably
    examples where it is not the most compact member, even without considering
    'composition exclusions'.) I don't think I have anything useful to add to
    the account of NFC I gave earlier.
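    To make the remark about compactness concrete, here is a small sketch
    that counts codepoints in the NFD and NFC forms, again assuming Python's
    unicodedata module. The second case, U+0958 DEVANAGARI LETTER QA, is an
    example chosen here (it is not from the thread) of a character subject
    to a composition exclusion, so its NFC form is longer than the single
    codepoint it is equivalent to.

```python
import unicodedata

# 'a' with circumflex and dot below: NFC is shorter than NFD.
a_nfd = unicodedata.normalize("NFD", "\u1ead")
a_nfc = unicodedata.normalize("NFC", "\u0061\u0302\u0323")
print(len(a_nfd), len(a_nfc))

# U+0958 is excluded from composition, so NFC leaves it decomposed
# as <U+0915, U+093C> even though a one-codepoint equivalent exists.
qa = "\u0958"
qa_nfc = unicodedata.normalize("NFC", qa)
print(len(qa), len(qa_nfc))
```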

    > That is to say that for f(x)=y, you can get different values of y for the
    > same value of x?

    Writing ~ for canonical equivalence, and = for identity of codepoint
    sequence, then for a Unicode-compliant transformation f, if a~b, then we
    require f(a)~f(b). Uppercasing and lowercasing are required (defined?) to
    be Unicode-compliant transformations. Inconveniently, even ignoring the
    issues of locales, simply applying the casing data in UnicodeData.txt does
    not result in a Unicode-compliant transformation. The problem is the
    behaviour of subscript iota. When text is converted to uppercase, the
    subscript iota becomes a full capital iota. Subscript iota has positive
    combining class; capital iota has combining class 0.

    > I do indeed get <U+03A9, U+0399, U+0313, U+0342> when trying to
    > uppercase <U+03C9, U+0345, U+0313, U+0342>. Why? What's wrong with the
    > result?

    U+03A9 has combining class zero.
    U+03C9 has combining class zero.
    U+0399 has combining class zero.
    U+0345 has combining class 240.
    U+0313 has combining class 230.
    U+0342 has combining class 230.

    <U+03C9, U+0345, U+0313, U+0342> is therefore canonically equivalent (~) to
    <U+03C9, U+0313, U+0342, U+0345>,
    ~ <U+1F60 GREEK SMALL LETTER OMEGA WITH PSILI, U+0342, U+0345>
    ~ <U+1F66 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI, U+0345>
    ~ <U+1FA6 GREEK SMALL LETTER OMEGA WITH PSILI AND PERISPOMENI AND
    YPOGEGRAMMENI>

    But http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt states that the
    upper case form of U+1FA6 is <U+1F6E, U+0399>, and
    <U+1F6E, U+0399> ~ <U+03A9, U+0313, U+0342, U+0399>, which is not
    canonically equivalent to <U+03A9, U+0399, U+0313, U+0342>. That is what is
    wrong.
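    The failure can be reproduced in a few lines. This is a sketch under
    stated assumptions: it uses Python's unicodedata module, and naive_upper
    and canonically_equivalent are hypothetical helpers written for this
    illustration. Calling upper() on one codepoint at a time stands in for
    "simply applying the casing data" to these particular characters.

```python
import unicodedata

def canonically_equivalent(a, b):
    """Hypothetical helper: equal NFD forms means canonically equivalent."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def naive_upper(text):
    """Uppercase one codepoint at a time, with no prior reordering."""
    return "".join(ch.upper() for ch in text)

typed = "\u03c9\u0345\u0313\u0342"      # iota subscript before the accents
reordered = "\u03c9\u0313\u0342\u0345"  # iota subscript moved to the end

# The chain of equivalent spellings given above.
chain = [typed, reordered, "\u1f60\u0342\u0345", "\u1f66\u0345", "\u1fa6"]
chain_ok = all(canonically_equivalent(typed, s) for s in chain)

upper_typed = naive_upper(typed)
upper_reordered = naive_upper(reordered)

# Equivalent inputs, but the capital iota (class 0) now blocks reordering.
print(chain_ok, canonically_equivalent(upper_typed, upper_reordered))
```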

    > If there is something wrong with the result, it could be that perhaps
    > with smarter input UTF-8 conversion data tables, I can get correct
    > result.

    http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt does tell you what
    you need to do - 'IMPORTANT-when capitalizing iota-subscript (0345), it MUST
    be in normalized form--moved to the end of any sequence of combining marks.'

    Essentially, before you apply simple capitalisation, you must ensure that
    any character containing U+0345 in its decomposition is final or is followed
    by a character of combining class zero. It is for this reason that U+0345
    has the highest canonical combining class (240), and is the only character
    in that class. One way of doing this - the one involving the least
    programming effort - is to convert to NFD and then capitalise.
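    A minimal sketch of that low-effort approach, assuming Python's
    unicodedata module; upper_via_nfd is a name invented for this example,
    and per-codepoint upper() again stands in for simple capitalisation:

```python
import unicodedata

def upper_via_nfd(text):
    """Convert to NFD first, so U+0345 follows the other marks, then uppercase."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch.upper() for ch in decomposed)

# Five canonically equivalent spellings of omega with psili, perispomeni
# and iota subscript.
inputs = [
    "\u03c9\u0345\u0313\u0342",
    "\u03c9\u0313\u0342\u0345",
    "\u1f60\u0342\u0345",
    "\u1f66\u0345",
    "\u1fa6",
]
outputs = {upper_via_nfd(s) for s in inputs}

# Every spelling now capitalises to the same codepoint sequence.
print([[f"U+{ord(c):04X}" for c in out] for out in outputs])
```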

    >> So your process is not Unicode-compliant, for, to use the standard
    >> citation form for Unicode codepoints, <U+0391, U+033D, U+0399> and
    >> <U+0391, U+0399, U+033D> are not canonically equivalent, whereas the
    >> inputs, <U+03B1, U+033D, U+0345> and <U+03B1, U+0345, U+033D>, are.

    This of course is another example of exactly the same problem.

    This contrived example demonstrates that capitalising text held in NFC
    gives consistent results only for normal Greek.
    The NFC of <U+03B1, U+033D, U+0345> is <U+1FB3 GREEK SMALL LETTER ALPHA WITH
    YPOGEGRAMMENI, U+033D>, and it would naively uppercase to <U+0391 GREEK
    CAPITAL LETTER ALPHA, U+0399, U+033D>, which is not equivalent to the naive
    upper case of the NFD form, <U+0391, U+033D, U+0399>. I raised this
    combination as an aside because it did not seem semantically correct. An
    even better example of the same thing is
    ᾔ̲δ̲η̲ (with combining underline under all letters). In NFC it is

    <U+1F94 GREEK SMALL LETTER ETA WITH PSILI AND OXIA AND YPOGEGRAMMENI,
    U+0332 COMBINING LOW LINE, U+03B4 GREEK SMALL LETTER DELTA, U+0332,
    U+03B7 GREEK SMALL LETTER ETA, U+0332>. That capitalises by the rules (or
    at least, if you first convert to NFD) to
    Ἤ̲ΙΔ̲Η̲ (with just three of the four letters underlined!)

    <U+1F2C GREEK CAPITAL LETTER ETA WITH PSILI AND OXIA,
    U+0332, U+0399 GREEK CAPITAL LETTER IOTA, U+0394 GREEK CAPITAL LETTER DELTA,
    U+0332,
    U+0397 GREEK CAPITAL LETTER ETA, U+0332>. Clearly underlining and
    uppercasing do not commute!
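    The codepoint listing above can be checked mechanically. The sketch
    below assumes Python's unicodedata module and follows the route of
    converting to NFD, uppercasing codepoint by codepoint, and recomposing
    to NFC; the variable names are illustrative only.

```python
import unicodedata

# The underlined word in NFC, as listed above.
word_nfc = "\u1f94\u0332\u03b4\u0332\u03b7\u0332"

# Decompose, capitalise each codepoint, then recompose.
decomposed = unicodedata.normalize("NFD", word_nfc)
upper = "".join(ch.upper() for ch in decomposed)
upper_nfc = unicodedata.normalize("NFC", upper)

# Count base letters (combining class 0) and underlines in the result.
letters = sum(1 for ch in upper_nfc if unicodedata.combining(ch) == 0)
underlines = upper_nfc.count("\u0332")

print([f"U+{ord(ch):04X}" for ch in upper_nfc])
print(letters, underlines)
```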

    Richard.


