Re: Parsing Unicode strings

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 28 2008 - 17:13:50 CDT

    Peter Johansson asked:

    > Is the Unicode-encoded character string self-descriptive?

    No. It isn't a complex object. A Unicode string is simply
    an array of Unicode code units in one of the 3 encoding
    forms: UTF-8, UTF-16, or UTF-32.

    > That is, do
    > I need a priori knowledge that it is encoded as, for example, UTF-8
    > rather than UTF-32?

    Well, yes, but the a priori knowledge you need is of
    the *data type* of the code units you are dealing with.

    For any well-behaved API involving Unicode strings, you
    are only going to be dealing with one encoding form,
    because the data types are different. In terms of
    C data types, UTF-8 strings are unsigned char*, UTF-16 strings
    are unsigned short* (assuming short is 16 bits) and UTF-32
    strings are unsigned long* (assuming long is 32 bits).
    It is a fundamental programming error, when using
    APIs that use arrays or pointers for strings, to
    mix data types of this sort through the same API.
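
    As a rough sketch of that point: the typedef and function
    names below are hypothetical, not taken from any library,
    and the fixed-width types from <stdint.h> stand in for the
    "assuming short is 16 bits" caveat. Each encoding form gets
    its own code unit type, so passing a UTF-16 string to a
    UTF-8 function draws a compiler diagnostic instead of
    silently misreading the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names: one code unit type per encoding form. */
typedef uint8_t  utf8_unit;   /* UTF-8:  8-bit code units  */
typedef uint16_t utf16_unit;  /* UTF-16: 16-bit code units */
typedef uint32_t utf32_unit;  /* UTF-32: 32-bit code units */

/* Each function counts code units (not characters) up to,
   but not including, a zero terminator. */
size_t utf8_length(const utf8_unit *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

size_t utf16_length(const utf16_unit *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}

size_t utf32_length(const utf32_unit *s)
{
    assert(s != NULL);
    size_t n = 0;
    while (s[n] != 0) {
        n++;
    }
    return n;
}
```

    The three functions are deliberately identical apart from
    the pointer type: the type, not the contents of the
    buffer, is what tells the API which encoding form it is
    handling.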

    Of course, you may also be dealing with an object-oriented
    language that defines Unicode String objects on top of
    the fundamental string definitions in the Unicode
    Standard. Such object definitions are outside the scope
    of the Unicode Standard, and their definitions would depend
    on the language you are using. In most cases they would
    simply standardize their usage on a single Unicode
    encoding form. So, for example, a Java String always uses UTF-16.

    > Or, by examining the first byte (or first few
    > bytes) can I determine the encoding?

    If you just get handed a buffer full of "stuff", and
    that "stuff" is claimed to be Unicode, but you don't
    know what encoding form or byte order, then there are
    good heuristics that can tell the various encoding forms
    apart reliably, based on rather small stretches of
    typical Unicode data.
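
    One simple signal of this kind is a leading byte order
    mark (BOM). The sketch below checks only for that, and
    only as an illustration: the enum and function names are
    hypothetical, many buffers carry no BOM at all, and a
    real detector would go on to examine the data itself.
    It also assumes that FF FE 00 00 means UTF-32
    little-endian, although the same bytes could be a UTF-16
    little-endian BOM followed by U+0000.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type for this sketch. */
typedef enum {
    ENC_UNKNOWN,   /* no BOM found; says nothing about the data */
    ENC_UTF8,
    ENC_UTF16_BE,
    ENC_UTF16_LE,
    ENC_UTF32_BE,
    ENC_UTF32_LE
} bom_encoding;

/* Inspect the first bytes of buf for a byte order mark. */
bom_encoding detect_bom(const unsigned char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    /* UTF-32 is tested before UTF-16 because the UTF-32 LE
       BOM (FF FE 00 00) begins with the UTF-16 LE BOM (FF FE). */
    if (len >= 4 && buf[0] == 0x00 && buf[1] == 0x00 &&
        buf[2] == 0xFE && buf[3] == 0xFF) {
        return ENC_UTF32_BE;
    }
    if (len >= 4 && buf[0] == 0xFF && buf[1] == 0xFE &&
        buf[2] == 0x00 && buf[3] == 0x00) {
        return ENC_UTF32_LE;
    }
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB &&
        buf[2] == 0xBF) {
        return ENC_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        return ENC_UTF16_BE;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        return ENC_UTF16_LE;
    }
    return ENC_UNKNOWN;
}
```

    A result of ENC_UNKNOWN means only that no BOM was
    present, not that the buffer fails to be Unicode.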

    But in ordinary programming you should never have to deal
    with such situations -- that is really an edge case for
    specialized applications, such as those that attempt
    to unscramble the encoding of mislabelled web pages.

    --Ken

    >
    > I didn't see anything on this topic in the FAQ.


