From: Mike Ayers (mayers@celequest.com)
Date: Thu Apr 06 2006 - 15:43:20 CST
Tay, William wrote:
> Hi,
>
> I have a C/C++ UNIX application that uses standard UTF-8 as the internal
> text encoding. If it receives a UTF-8 encoded decomposed accented
> character, i.e. base character + accent, from a MacOS X application, it
> would need to be able to detect that the character was decomposed, and
> then compose it prior to further processing. Is there any Solaris/UNIX
> utility or functions that can help my application do the detection and
> character composition?
You should take a look at ICU at http://icu.sourceforge.net. It does
what you need and a lot of things you may not have thought of yet.
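To make the detect-then-compose step concrete, here is a minimal sketch. It uses Python's standard `unicodedata` module rather than ICU, only so that the example is self-contained. In a C/C++ application, ICU's normalization API would fill the same role. The helper names `is_composed`, `compose` and `compose_utf8` are made up for illustration, and "composed" is assumed to mean Unicode Normalization Form C (NFC).

```python
import unicodedata


def is_composed(text: str) -> bool:
    """Return True if text is already in NFC, i.e. nothing is left to compose."""
    return unicodedata.normalize("NFC", text) == text


def compose(text: str) -> str:
    """Return the NFC (composed) form of text."""
    return unicodedata.normalize("NFC", text)


def compose_utf8(raw: bytes) -> bytes:
    """Decode UTF-8 bytes, compose to NFC, and re-encode as UTF-8."""
    return compose(raw.decode("utf-8")).encode("utf-8")


if __name__ == "__main__":
    decomposed = "e\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
    print(is_composed(decomposed))           # detection
    print(compose(decomposed) == "\u00e9")   # composition to U+00E9
```

Comparing the input against its own NFC form is the simplest possible check. A real normalizer library usually offers a cheaper quick-check that avoids building the normalized copy when the text is already composed.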
> Now, the application from which the decomposed accented character
> originated may query my application so that the character is returned to
> it. If my application has already composed the character, won't it be a
> problem for the querying application, since it expects to receive the
> character in its decomposed format?
If the applications treat Unicode strings as binary data, as some
applications (most notably many file systems) do, then you may want to
preserve the original value alongside the normalized form. This
approximately doubles storage requirements. Alternatively, you could
keep only the original and normalize as needed, if that is
computationally affordable.
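The preserve-plus-normalize idea can be sketched roughly as follows. This is again Python with the standard `unicodedata` module, not a definitive design. `PreservingStore` and its methods are hypothetical names, and NFC is assumed as the internal normalized form. Lookups go through the normalized key, while queries get back exactly the bytes that were received.

```python
import unicodedata


class PreservingStore:
    """Hypothetical store: indexes by NFC form, returns the original bytes."""

    def __init__(self):
        # Maps NFC-normalized text -> UTF-8 bytes exactly as first received.
        self._original = {}

    def put(self, raw: bytes) -> str:
        """Store the raw UTF-8 bytes and return the NFC key used internally."""
        key = unicodedata.normalize("NFC", raw.decode("utf-8"))
        self._original[key] = raw
        return key

    def get_original(self, text: str) -> bytes:
        """Look up by any equivalent spelling; return the bytes as received."""
        key = unicodedata.normalize("NFC", text)
        return self._original[key]


if __name__ == "__main__":
    store = PreservingStore()
    store.put(b"cafe\xcc\x81")  # decomposed: 'e' + U+0301 in UTF-8
    # A composed query still finds the entry and gets the decomposed bytes back.
    print(store.get_original("caf\u00e9"))
```

The normalize-as-needed alternative would keep only the raw bytes and call the normalizer at comparison time, trading the extra storage for repeated computation.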
> Can accented characters be decomposed in other encodings, e.g. ISO
> 8859-1, as well?
>
> Btw, what common applications/operating systems generate decomposed
> accented characters?
I don't know, but I think the preserve+normalize strategy should
eliminate these concerns.
HTH,
/|/|ike