From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Sep 22 2006 - 17:09:37 CDT
> Not quite. Unsigned int is only guaranteed a range of 0 to 0xffff and
> therefore it can't normalise the string <U+FAD5> - the normalised form is
> <U+25249> in all four normalisations.
It *can*, if you abstract your type definitions correctly.
> Of course, unsigned int is good
> enough to hold UTF-16 code *units*, which might just be what Mike meant.
> (I.e., the type supports UTF-16, but not UTF-32.)
It is perfectly fine for UTF-32, if you do this correctly. For
example:
typedef unsigned short UShort16;   /* assumes a 16-bit short */
typedef unsigned int UInt32;       /* assumes a 32-bit int; see below */

typedef UShort16 utf16char;        /* one UTF-16 code unit */
typedef UInt32 utf32char;          /* one UTF-32 code unit */
Put those definitions in a fundamental header file, and use "utf32char"
everywhere you mean a UTF-32 code unit and "utf16char" everywhere
you mean a UTF-16 code unit, instead of writing "unsigned int"
anywhere in the code.
At that point, you can safely port your entire code to *any*
platform, with at most one compiler-specific #ifdef in your
fundamental header file.
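For example, that single #ifdef might look like this (a sketch only;
testing UINT_MAX via <limits.h> is one portable way to write it):

#include <limits.h>

#if UINT_MAX >= 0xFFFFFFFF
typedef unsigned int UInt32;    /* the usual case: int is 32 bits */
#else
typedef unsigned long UInt32;   /* fallback for a 16-bit-int platform */
#endif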
> Of course, you may be able to create Unicode string constants - it all
> depends what data structure is used. FFFF-terminated arrays would work,
> e.g.
>
> static const unsigned int remark[] = {
>     LATIN_L, LATIN_o, LATIN_o, LATIN_k, EXCLAMATION_MARK, 0xffff};
For C/C++ programmers, it is, of course, much easier to go with
NULL-terminated arrays, since all your 16-bit and 32-bit string
processing can then be cloned almost exactly from the logic of
your 8-bit string processing routines.
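For instance, a UTF-32 strlen() clone is nearly a character-for-character
copy of the 8-bit original (the function name and the local typedef
here are just a sketch):

#include <stddef.h>

typedef unsigned int utf32char;   /* stand-in for the header typedef */

size_t utf32_strlen(const utf32char *s)
{
    const utf32char *p = s;
    while (*p)              /* scan to the NULL terminator, as with char */
        ++p;
    return (size_t)(p - s);
}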
Using a noncharacter as a string terminator isn't worth the
trouble, because it makes your Unicode strings less portable
to other people's libraries. And if you need to handle arbitrary
buffers of Unicode character data, including embedded NULLs
and noncharacters, then you are better off tracking the buffer
length separately anyway.
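For example (the struct and field names here are illustrative only):

#include <stddef.h>

typedef unsigned int utf32char;   /* stand-in for the header typedef */

typedef struct {
    utf32char *data;    /* may contain embedded NULLs and noncharacters */
    size_t     length;  /* code unit count, tracked explicitly */
} UTF32Buffer;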
--Ken