From: Doug Ewell (dewell@adelphia.net)
Date: Mon Aug 23 2004 - 13:51:32 CDT
Subject: Problem with accented characters

William Tay wrote:
> Can anyone explain why an accented character is sometimes represented
> as a base character plus its accent? For example, the utf-8
> representation for é is 65 CC 81, which is the utf-8 representation
> for e and the accent, instead of C3 A9? I find that this is how MacOS
> X represents accented characters.
The two characters U+0065 and U+0301 (é) are canonically equivalent to
the single character U+00E9 (é). That is, the two-character combining
sequence is supposed to be considered equivalent to the single
precomposed character. Apparently MacOS X, or at least one application
running under it, does use the combining sequence.
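To see the difference at the byte level, here is a minimal sketch of a UTF-8 decoder in C. The function name utf8_decode is my own invention for illustration, not part of any standard library. It shows that 65 CC 81 decodes to two code points (U+0065, U+0301) while C3 A9 decodes to one (U+00E9), so a plain byte comparison will treat the two spellings as different strings.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one UTF-8 sequence starting at s, with at
 * most n bytes available. Stores the code point in *cp and returns the
 * number of bytes consumed, or 0 on malformed or truncated input.
 * Overlong forms, surrogates, and values above U+10FFFF are rejected. */
size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t c, min;

    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;

    if (s[0] < 0x80) {                      /* 0xxxxxxx: ASCII */
        *cp = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {     /* 110xxxxx: 2 bytes */
        len = 2; c = s[0] & 0x1F; min = 0x80;
    } else if ((s[0] & 0xF0) == 0xE0) {     /* 1110xxxx: 3 bytes */
        len = 3; c = s[0] & 0x0F; min = 0x800;
    } else if ((s[0] & 0xF8) == 0xF0) {     /* 11110xxx: 4 bytes */
        len = 4; c = s[0] & 0x07; min = 0x10000;
    } else {
        return 0;                           /* stray continuation byte */
    }

    if (n < len)
        return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)          /* must be 10xxxxxx */
            return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }
    if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        return 0;

    *cp = c;
    return len;
}
```

Calling it on the bytes 65 CC 81 consumes one byte and then two; calling it on C3 A9 consumes two bytes in a single step.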
> How can a C application that receives such utf-8 encoded characters
> handle them correctly? Appreciate your comments.
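It must understand normalization: convert every incoming string to a
single normalization form (such as NFC or NFD) before comparing or
storing it, so that canonically equivalent sequences end up identical.
See The Unicode Standard (TUS) 4.0, section 5.6 for more information.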
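As a rough illustration only, the idea of composing toward the precomposed form can be sketched in C for one tiny case. Everything here is an assumption made for the example: the function name compose_acute, the ten-entry table, and the choice to work on an array of already-decoded code points. It is not a real normalizer. Full normalization needs the complete Unicode decomposition and composition data plus canonical reordering of combining marks, so a real application would more likely use an existing library such as ICU.

```c
#include <stddef.h>
#include <stdint.h>

#define COMBINING_ACUTE 0x0301u

/* Toy table (illustrative, far from complete): base letter and the
 * precomposed letter that is canonically equivalent to base + U+0301. */
static const struct {
    uint32_t base;
    uint32_t composed;
} acute_pairs[] = {
    {0x0041, 0x00C1}, {0x0045, 0x00C9}, {0x0049, 0x00CD},
    {0x004F, 0x00D3}, {0x0055, 0x00DA},
    {0x0061, 0x00E1}, {0x0065, 0x00E9}, {0x0069, 0x00ED},
    {0x006F, 0x00F3}, {0x0075, 0x00FA},
};

/* Return the precomposed form of base + U+0301, or 0 if not in the table. */
static uint32_t lookup_acute(uint32_t base)
{
    size_t i;
    for (i = 0; i < sizeof acute_pairs / sizeof acute_pairs[0]; i++) {
        if (acute_pairs[i].base == base)
            return acute_pairs[i].composed;
    }
    return 0;
}

/* Hypothetical helper: rewrite cps[0..n) in place, replacing each
 * "base letter immediately followed by U+0301" pair found in the toy
 * table with its precomposed code point. Returns the new length.
 * Anything the table does not cover is copied through unchanged. */
size_t compose_acute(uint32_t *cps, size_t n)
{
    size_t in, out = 0;

    for (in = 0; in < n; in++) {
        uint32_t c = cps[in];
        if (in + 1 < n && cps[in + 1] == COMBINING_ACUTE) {
            uint32_t composed = lookup_acute(c);
            if (composed != 0) {
                c = composed;
                in++;               /* skip the combining accent */
            }
        }
        cps[out++] = c;
    }
    return out;
}
```

After this step, "e" followed by U+0301 and a received U+00E9 both become the single code point U+00E9, so they compare equal. Sequences outside the table (other accents, or several combining marks on one base) pass through untouched, which is exactly where a complete implementation is needed.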
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/