From: Markus Scherer
Date: Fri Mar 14 2003 - 12:29:31 EST
Let's try this:
ICU has C header files with macros for code point handling in UTF-8/16 strings. See the utf8.h and
utf16.h headers (together with utf.h) in ICU's source tree at source/common/unicode/.
There is also a utf32.h header, but that is empty now. I redesigned the set of macros last year to
simplify and improve them a bit.
Specifically, see below.
(Note that the UTF-8 macros [except for the "unsafe" ones] handle the complicated cases in functions
that are called from inside the macros. See source/common/utf_impl.c . Safe UTF-8 handling requires
a lot of error checks.)
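To give an idea of what those error checks look like, here is a minimal sketch of a checked UTF-8 decoder. It is written from scratch for illustration and is not the code in utf_impl.c; the function name decode_utf8_checked and the convention of returning -1 and skipping one byte on error are assumptions of this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not ICU code): decode one code point starting at
 * s[*pi], where *pi < length. On success, returns the code point and moves
 * *pi past the sequence. On ill-formed input, returns -1 and moves *pi
 * forward by one byte. */
int32_t decode_utf8_checked(const uint8_t *s, size_t *pi, size_t length) {
    size_t i = *pi;
    uint8_t lead = s[i++];
    int32_t c = 0, min = 0;
    int trail = 0, ok = 1;

    if (lead < 0x80) {                          /* ASCII, single byte */
        c = lead;
    } else if (lead >= 0xC2 && lead <= 0xDF) {  /* C0, C1 would be overlong */
        c = lead & 0x1F; trail = 1; min = 0x80;
    } else if (lead >= 0xE0 && lead <= 0xEF) {
        c = lead & 0x0F; trail = 2; min = 0x800;
    } else if (lead >= 0xF0 && lead <= 0xF4) {  /* F5..FF can never occur */
        c = lead & 0x07; trail = 3; min = 0x10000;
    } else {
        ok = 0;                                 /* stray trail byte or bad lead */
    }

    for (; ok && trail > 0; --trail) {
        if (i < length && (s[i] & 0xC0) == 0x80) {
            c = (c << 6) | (s[i++] & 0x3F);
        } else {
            ok = 0;                             /* truncated or missing trail byte */
        }
    }

    /* Reject overlong forms, values above 0x10ffff, and surrogates. */
    if (ok && (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))) {
        ok = 0;
    }
    if (!ok) {
        *pi += 1;
        return -1;
    }
    *pi = i;
    return c;
}
```

Even this small version needs four separate checks (lead byte range, trail byte presence, shortest form, valid code point range), which is why the safe macros delegate to functions.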
askq1 askq1 wrote:
> I want c/c++ code that will give me UTF8 byte sequence representing a
> given code-point, UTF16 16 bits sequence representing a given
> code-point, UTF32 32 bits sequence representing a given code-point.
> e.g.
> UTF8_Sequence CodePointToUTF8(Unichar codePoint)
Use U8_APPEND().
To read a code point from UTF-8, use U8_NEXT()
or U8_GET() etc.
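If you want to see the bit layout that such an append operation has to produce, the following is a minimal hand-written sketch. It does not use ICU and is not ICU's implementation; the name code_point_to_utf8 and the return convention (number of bytes written, 0 for an invalid input) are assumptions made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU): write the UTF-8 form of c into out,
 * which must have room for 4 bytes. Returns the number of bytes written,
 * or 0 if c is a surrogate or lies above 0x10ffff. */
size_t code_point_to_utf8(uint32_t c, uint8_t out[4]) {
    if (c <= 0x7F) {
        out[0] = (uint8_t)c;                          /* 0xxxxxxx */
        return 1;
    } else if (c <= 0x7FF) {
        out[0] = (uint8_t)(0xC0 | (c >> 6));          /* 110xxxxx */
        out[1] = (uint8_t)(0x80 | (c & 0x3F));        /* 10xxxxxx */
        return 2;
    } else if (c <= 0xFFFF) {
        if (c >= 0xD800 && c <= 0xDFFF) {
            return 0;                                 /* surrogate code point */
        }
        out[0] = (uint8_t)(0xE0 | (c >> 12));         /* 1110xxxx */
        out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        out[2] = (uint8_t)(0x80 | (c & 0x3F));
        return 3;
    } else if (c <= 0x10FFFF) {
        out[0] = (uint8_t)(0xF0 | (c >> 18));         /* 11110xxx */
        out[1] = (uint8_t)(0x80 | ((c >> 12) & 0x3F));
        out[2] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
        out[3] = (uint8_t)(0x80 | (c & 0x3F));
        return 4;
    }
    return 0;                                         /* above 0x10ffff */
}
```

For example, U+20AC comes out as the three bytes E2 82 AC, and U+1F600 as the four bytes F0 9F 98 80.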
> UTF16_Sequence CodePointToUTF16(Unichar codePoint)
Use U16_APPEND().
To read a code point from UTF-16, use U16_NEXT()
or U16_GET() etc.
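The arithmetic behind UTF-16 is short enough to show in full. The sketch below is hand-written for illustration and does not use ICU; the names code_point_to_utf16 and utf16_next are hypothetical, and the choice to return an unpaired surrogate unchanged when reading is an assumption of this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ICU): write the UTF-16 form of c into out.
 * Returns the number of 16-bit units written (1 or 2), or 0 if c is a
 * surrogate code point or lies above 0x10ffff. */
size_t code_point_to_utf16(uint32_t c, uint16_t out[2]) {
    if (c <= 0xFFFF) {
        if (c >= 0xD800 && c <= 0xDFFF) {
            return 0;
        }
        out[0] = (uint16_t)c;                      /* BMP: one unit */
        return 1;
    } else if (c <= 0x10FFFF) {
        c -= 0x10000;                              /* 20 bits remain */
        out[0] = (uint16_t)(0xD800 | (c >> 10));   /* lead surrogate: top 10 bits */
        out[1] = (uint16_t)(0xDC00 | (c & 0x3FF)); /* trail surrogate: low 10 bits */
        return 2;
    }
    return 0;
}

/* Hypothetical helper: read one code point at s[*pi] (*pi < length) and
 * advance *pi. An unpaired surrogate is returned as-is. */
uint32_t utf16_next(const uint16_t *s, size_t *pi, size_t length) {
    uint32_t c = s[(*pi)++];
    if (c >= 0xD800 && c <= 0xDBFF && *pi < length) {
        uint32_t trail = s[*pi];
        if (trail >= 0xDC00 && trail <= 0xDFFF) {
            ++*pi;
            c = 0x10000 + ((c - 0xD800) << 10) + (trail - 0xDC00);
        }
    }
    return c;
}
```

For example, U+1F600 becomes the surrogate pair D83D DE00, and reading that pair back yields 0x1F600 again.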
> UCS2_Sequence CodePointToUCS2(Unichar codePoint)
For UCS-2, the best strategy (in my opinion) is to treat it exactly the same as UTF-16. Most people
mean UTF-16 when they talk about UCS-2 or generally about "16-bit Unicode".
If you do want to distinguish them anyway, then this is trivial:
if(0<=codePoint && codePoint<=0xffff) {
    cast codePoint to 16-bit type and emit;
} else {
    error: codePoint cannot be represented in UCS-2;
}
Similarly, UTF-32 is trivial as well - it just stores each code point value in a 32-bit integer
unit. Unicode code points are values 0..0x10ffff.
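Both of these trivial cases fit in a few lines of C. This is only a sketch under the ranges stated above; the function names and the 1/0 return convention are made up for this example, and neither function rejects surrogate code points.

```c
#include <stdint.h>

/* Hypothetical helper: store c as a single UCS-2 unit.
 * Returns 1 on success, 0 if c is above 0xffff and cannot be represented.
 * c is unsigned, so the lower bound 0 <= c always holds. */
int code_point_to_ucs2(uint32_t c, uint16_t *out) {
    if (c <= 0xFFFF) {
        *out = (uint16_t)c;
        return 1;
    }
    return 0;
}

/* Hypothetical helper: store c as a single UTF-32 unit.
 * Returns 1 on success, 0 if c is outside 0..0x10ffff. */
int code_point_to_utf32(uint32_t c, uint32_t *out) {
    if (c <= 0x10FFFF) {
        *out = c;
        return 1;
    }
    return 0;
}
```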
See also
I hope this helps - best regards,
-- Opinions expressed here may not reflect my company's positions unless otherwise noted.