Re: Application that displays CJK text in Normalization Form D

From: Asmus Freytag ([email protected])
Date: Mon Nov 15 2010 - 17:02:53 CST

Next message: Kent Karlsson: "Re: Application that displays CJK text in Normalization Form D"

Previous message: Doug Ewell: "RE: Application that displays CJK text in Normalization Form D"
In reply to: Kenneth Whistler: "RE: Application that displays CJK text in Normalization Form D"
Next in thread: Doug Ewell: "RE: Application that displays CJK text in Normalization Form D"

On 11/15/2010 2:24 PM, Kenneth Whistler wrote:
>> FA47 is a "compatibility character", and would have a compatibility mapping.
> Faulty syllogism.

That is the formally correct answer, but only because of something of a
design flaw in Unicode. When the type of mapping was decided on, people
didn't fully expect that NFC might become widely used or enforced, making
these distinctions disappear wherever text is normalized in a distributed
architecture.

> FA47 is a CJK Compatibility character, which means it was encoded
> for compatibility purposes -- in this case to cover the round-trip
> mapping needed for JIS X 0213.
>
> However, it has a *canonical* decomposition mapping to U+6F22.

And that, of course, destroys the desired "round-trip" behavior if it is
inadvertently applied while the data are encoded in Unicode. Hence the
need to recreate a solution to the issue of variant forms with a
different mechanism, the ideographic variation sequence (and
corresponding database).
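
To make the contrast concrete, here is a minimal sketch, assuming Python's
standard unicodedata module. The sequence used (U+6F22 followed by
VARIATION SELECTOR-17, U+E0100) is only an illustrative choice on my part and
has not been checked against the Ideographic Variation Database; the point is
merely that normalization leaves a variation sequence alone, while it replaces
the compatibility ideograph.

```python
import unicodedata

compat = "\uFA47"    # CJK COMPATIBILITY IDEOGRAPH-FA47
unified = "\u6F22"   # the unified ideograph it canonically maps to

# Illustrative sequence only (an assumption, not looked up in the IVD):
# base ideograph followed by VARIATION SELECTOR-17.
ivs = unified + "\U000E0100"

forms = ("NFC", "NFD", "NFKC", "NFKD")

# Does each string come back unchanged from every normalization form?
compat_survives = all(unicodedata.normalize(f, compat) == compat for f in forms)
ivs_survives = all(unicodedata.normalize(f, ivs) == ivs for f in forms)

print("compatibility ideograph survives:", compat_survives)
print("variation sequence survives:", ivs_survives)
```

Whether a given sequence selects a particular glyph still depends on the
registered collection and on font support, which this sketch does not address.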

> The behavior in BabelPad is correct: U+6F22 is the NFC form of U+FA47.
>
> Easily verified, for example, by checking the FA47 entry in
> NormalizationTest.txt in the UCD.

While correct, it remains a bit of a gotcha. Especially now that Unicode
has charts that go to great lengths to show the different glyphs for these
characters, I would suggest adding a note to the charts that makes clear
that these distinctions are *removed* any time the text is normalized,
which, in a distributed architecture, may happen at any time.
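
For anyone who wants to see the gotcha without opening the UCD files, a small
sketch, again assuming Python's standard unicodedata module (the variable
names are mine): a decomposition mapping without a leading <tag> is canonical,
and that is why NFC, and not only NFKC, replaces U+FA47.

```python
import unicodedata

ch = "\uFA47"  # CJK COMPATIBILITY IDEOGRAPH-FA47

# Compatibility mappings are prefixed with a tag such as "<compat>";
# a mapping with no tag is a canonical decomposition.
decomp = unicodedata.decomposition(ch)
is_canonical = bool(decomp) and not decomp.startswith("<")

nfc = unicodedata.normalize("NFC", ch)

print("decomposition:", decomp)
print("canonical:", is_canonical)
print("NFC result: U+%04X" % ord(nfc))
```

The same check applies to any other CJK compatibility ideograph; only the
value of `ch` changes.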

A./
> --Ken
>
>>> When I type ... (U+FA47) into BabelPad, highlight it, and then
>>> click the button labeled "Normalize to NFC", the character
>>> becomes ... (U+6F22). Does BabelPad not conform to the Unicode Standard
>>> in this case? ...
>
>


This archive was generated by hypermail 2.1.5 : Mon Nov 15 2010 - 17:05:20 CST