Re: Application that displays CJK text in Normalization Form D

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Mon Nov 15 2010 - 17:02:53 CST

    On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
    >> FA47 is a "compatibility character", and would have a compatibility mapping.
    > Faulty syllogism.

    Formally correct, but only because of something of a design flaw in
    Unicode. When the type of mapping was decided on, people didn't fully
    expect that NFC might become widely used or enforced, making these
    distinctions disappear wherever text is normalized in a distributed
    architecture.
    > FA47 is a CJK Compatibility character, which means it was encoded
    > for compatibility purposes -- in this case to cover the round-trip
    > mapping needed for JIS X 0213.
    >
    > However, it has a *canonical* decomposition mapping to U+6F22.

    And that, of course, destroys the desired "round-trip" behavior if
    normalization is inadvertently applied while the data are encoded in
    Unicode. Hence the need to recreate a solution to the issue of variant
    forms with a different mechanism, the ideographic variation sequence
    (and corresponding database).
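
To make the contrast concrete, here is a minimal Python sketch using the standard `unicodedata` module. It checks whether a string survives all four normalization forms unchanged. The variation sequence below (U+6F22 followed by U+E0100) is only an illustrative pairing; whether that exact sequence is registered in the Ideographic Variation Database is an assumption that this code does not check.

```python
import unicodedata

COMPAT = "\uFA47"    # CJK COMPATIBILITY IDEOGRAPH-FA47
UNIFIED = "\u6F22"   # the unified ideograph it canonically decomposes to

# Hypothetical variation sequence: base ideograph + VARIATION SELECTOR-17.
# Whether this exact pair is registered in the IVD is not checked here.
IVS = UNIFIED + "\U000E0100"

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def survives(text: str) -> bool:
    """Return True if text is unchanged by every normalization form."""
    return all(unicodedata.normalize(form, text) == text for form in FORMS)


print(survives(COMPAT))  # the compatibility ideograph is replaced
print(survives(IVS))     # the variation selector is left in place
```

The variation selector has no decomposition mapping, so normalization leaves the sequence alone, which is why it can carry the glyph distinction that the compatibility ideograph loses.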

    > The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.
    >
    > Easily verified, for example, by checking the FA47 entry in
    > NormalizationTest.txt in the UCD.
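
A check along these lines can be sketched in Python with the standard `unicodedata` module. The helper names are hypothetical, and the sample line is written in the `source;NFC;NFD;NFKC;NFKD;` field layout of NormalizationTest.txt rather than copied from the file, so treat it as an assumption and compare it against the UCD copy before relying on it.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def decomposition_kind(ch: str) -> str:
    """Classify a character's decomposition mapping as reported by unicodedata."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    # Compatibility mappings carry a tag such as <compat>; canonical ones do not.
    return "compatibility" if mapping.startswith("<") else "canonical"


def check_test_line(line: str) -> bool:
    """Check one line laid out as source;NFC;NFD;NFKC;NFKD; # comment."""
    fields = line.split("#", 1)[0].strip().split(";")[:5]
    columns = ["".join(chr(int(cp, 16)) for cp in f.split()) for f in fields]
    source, expected = columns[0], columns[1:]
    return all(
        unicodedata.normalize(form, source) == want
        for form, want in zip(FORMS, expected)
    )


# Assumed sample line in the NormalizationTest.txt field layout.
SAMPLE = "FA47;6F22;6F22;6F22;6F22;"

print(decomposition_kind("\uFA47"))
print(check_test_line(SAMPLE))
```

The first helper shows the distinction Ken is drawing: the character is named a compatibility ideograph, but its mapping carries no compatibility tag.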

    While correct, it remains a bit of a gotcha. Especially now that
    Unicode has charts that go to great lengths to show the different
    glyphs for these characters, I would suggest adding a note to the
    charts that makes clear that these distinctions are *removed* any time
    the text is normalized, which, in a distributed architecture, may
    happen at any time.

    A./
    > --Ken
    >
    >>> When I type ... (U+FA47) into BabelPad, highlight it, and then
    >>> click the button labeled "Normalize to NFC", the character
    >>> becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard
    >>> in this case? ...
    >
    >


