From: Doug Ewell (dewell@adelphia.net)
Date: Sun Apr 27 2003 - 14:16:58 EDT
Jane Liu <xjliu_ca at yahoo dot com> wrote:
> I am using IBM ICU V1.8 for some testing on Windows 2000 and XP, I
> found when I process some CJK characters, ICU by default will
> normalize it. For example, U+FA19(神) will be replaced by U+795E
> (神). However, if I save that two characters into a file on Windows
> 2000 and XP by using Notepad and select "Unicode" as the encoding, I
> don't see Notepad would do such normalization/replacement. Also, on
> Windows file system, I can also use that two characters in the
> file/folder name, and no normalization seems to be done by the OS
> either ...
At first I was going to reply, somewhat smugly, that U+FA19 was in the
CJK *Compatibility* Ideographs block, and of course operating systems
and other processes are not required (or even encouraged) to substitute
compatibility-equivalent characters automatically.
But upon checking the UCD, I found that U+FA19 and U+795E are in fact
*canonical* equivalents, not compatibility equivalents, despite the name
of the block.
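For anyone who wants to repeat that check, here is a minimal sketch. It assumes Python's standard unicodedata module as a stand-in for reading UnicodeData.txt directly, and it is not ICU. The variable names are mine. In the UCD decomposition field, a bare code point with no <tag> prefix is a canonical mapping, while a prefix such as <compat> or <font> marks a compatibility mapping.

```python
import unicodedata

compat = "\uFA19"   # CJK COMPATIBILITY IDEOGRAPH-FA19
unified = "\u795E"  # the unified ideograph it maps to

# Decomposition field from the UCD, as exposed by Python.
# No "<tag>" prefix means the mapping is canonical, not compatibility.
decomposition = unicodedata.decomposition(compat)
is_canonical = bool(decomposition) and not decomposition.startswith("<")

# Because the mapping is canonical, even the canonical normalization
# forms (NFC and NFD) replace the compatibility ideograph.
nfc = unicodedata.normalize("NFC", compat)
nfd = unicodedata.normalize("NFD", compat)

print(decomposition, is_canonical, nfc == unified, nfd == unified)
```

The results depend on the version of the Unicode data bundled with the Python build, so treat this as a way to inspect the data rather than as a statement about what ICU does internally.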
U+FA19 falls into the category of "ideographs from various regional and
industry standards [that] were encoded in this block, primarily to
achieve round-trip conversion compatibility" (TUS 3.0, p. 267). In the
code charts, U+FA19 is listed as one of "The IBM 32 compatibility
additions" (p. 803). So the intent in encoding these characters seems
to have been to support round-trip conversion with an existing standard,
and it occurs to me that for that reason, an operating system might need
to maintain the distinction between the two.
Conformance requirement C9 says, "A process shall not assume that the
interpretations of two canonical-equivalent character sequences are
distinct" (p. 38). To me, the word "sequences" is a hint that the UTC
may have been thinking more of combining sequences (like "a" plus
diaeresis) than Han equivalents. The text immediately following C9
says, "There are practical circumstances under which implementations may
reasonably distinguish them." One could easily conclude that an
operating system's need to preserve round-trip capability is one of
these circumstances.
So:
> Can anyone please shed some lights on:
>
> 1. Why Windows doesn't do normalization,
Because it isn't required to, and there may be a compelling reason in
this case not to.
> and is there any ways to ask Windows to do it?
No.
> 2. If Windows never do normalization, how should I balance this in my
> Windows based application since I am using the ICU. I don't think
> simply turn off the normalization process in the ICU would be a good
> idea though, however, if I keep to use ICU to normalize everything in
> my application, then I will possible run into some troubles when
> dealing with the Windows system ...
If one system (ICU) performs this normalization or not, depending on
user preference, and another system (Windows) never performs it, and
the results from the two systems must match, then it seems logical to
turn off normalization in ICU.
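To make the mismatch concrete, here is a small sketch, again assuming Python's unicodedata rather than ICU. The helper names canonical_key and same_name are hypothetical and not part of any library. It shows why a normalized name no longer matches the name Windows stored, and one possible variation on the advice above: leave the stored string untouched and normalize only a temporary key when two names have to be compared.

```python
import unicodedata

def canonical_key(name: str) -> str:
    """Hypothetical helper: NFC form used only for comparison, never stored."""
    return unicodedata.normalize("NFC", name)

def same_name(a: str, b: str) -> bool:
    """Treat canonically equivalent names as the same name."""
    return canonical_key(a) == canonical_key(b)

stored = "\uFA19.txt"  # name exactly as the file system holds it
query = "\u795E.txt"   # same name after a normalizing process touched it

raw_equal = stored == query        # code point comparison: they differ
equiv = same_name(stored, query)   # canonical comparison: they match

print(raw_equal, equiv)
```

The stored string is never rewritten, so the round-trip distinction survives, but whether this is preferable to simply disabling normalization depends on the application.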
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/