Re: Parsing Unicode strings

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 28 2008 - 17:13:50 CDT

Next message: Doug Ewell: "Re: Parsing Unicode strings"

Previous message: Asmus Freytag: "Re: Parsing Unicode strings"
Maybe in reply to: Peter Johansson: "Parsing Unicode strings"
Next in thread: Peter Zilahy Ingerman, PhD: "A font question"
Reply: Peter Zilahy Ingerman, PhD: "A font question"

Peter Johansson asked:

> Is the Unicode-encoded character string self-descriptive?

No. It isn't a complex object. A Unicode string is simply
an array of Unicode code units in one of the three encoding
forms: UTF-8, UTF-16, or UTF-32.
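
To make "an array of code units" concrete, here is a minimal
sketch in C showing the same two-character text, U+00E9
followed by U+1D11E, in each encoding form. The array names,
the CODE_UNITS macro, and the check function are hypothetical
and chosen only for illustration; the fixed-width types from
<stdint.h> are an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The text U+00E9 U+1D11E in each of the three encoding forms.
   Nothing in the arrays themselves says which form they are in;
   only the element type (and the programmer) knows. */
const uint8_t  text_utf8[]  = { 0xC3, 0xA9, 0xF0, 0x9D, 0x84, 0x9E };
const uint16_t text_utf16[] = { 0x00E9, 0xD834, 0xDD1E };
const uint32_t text_utf32[] = { 0x000000E9, 0x0001D11E };

/* Number of code units (array elements), not bytes or characters. */
#define CODE_UNITS(a) (sizeof(a) / sizeof((a)[0]))

/* Same text, different code unit counts per encoding form. */
void check_code_unit_counts(void) {
    assert(CODE_UNITS(text_utf8)  == 6);  /* 2 + 4 code units */
    assert(CODE_UNITS(text_utf16) == 3);  /* 1 + 2 (surrogate pair) */
    assert(CODE_UNITS(text_utf32) == 2);  /* 1 + 1 */
}
```

All three arrays hold the same two characters; what differs is
the width of the code unit and how many of them each character
takes.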

> That is, do
> I need a priori knowledge that it is encoded as, for example, UTF-8
> rather than UTF-32?

Well, yes, but the a priori knowledge you need is of
the *data type* of the code units you are dealing with.

For any well-behaved API involving Unicode strings, you
are only going to be dealing with one encoding form,
because the data types are different. In terms of
C data types, UTF-8 strings are unsigned char*, UTF-16 strings
are unsigned short* (assuming short is 16 bits) and UTF-32
strings are unsigned long* (assuming long is 32 bits).
It is a fundamental programming error, when using
APIs that use arrays or pointers for strings, to
mix data types of this sort through the same API.
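
As a rough illustration of one data type per encoding form,
the sketch below gives each form its own code unit type and
its own function. It uses the fixed-width types from
<stdint.h> instead of unsigned short and unsigned long, which
sidesteps the size assumptions mentioned above. The typedef
and function names are hypothetical, not part of any
particular library, and the counting functions assume
well-formed input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One code unit type per encoding form (hypothetical names). */
typedef uint8_t  utf8_unit;
typedef uint16_t utf16_unit;
typedef uint32_t utf32_unit;

/* Count code points in well-formed UTF-8: every byte that is not
   a continuation byte (10xxxxxx) starts a new code point. */
size_t utf8_count_code_points(const utf8_unit *s, size_t n) {
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) count++;
    }
    return count;
}

/* Count code points in UTF-16: a high surrogate followed by a low
   surrogate is one code point; every other code unit counts as one. */
size_t utf16_count_code_points(const utf16_unit *s, size_t n) {
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        count++;
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < n &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i++;  /* skip the low surrogate of the pair */
        }
    }
    return count;
}

/* In UTF-32 each code unit is exactly one code point. */
size_t utf32_count_code_points(const utf32_unit *s, size_t n) {
    assert(s != NULL || n == 0);
    return n;
}
```

With this arrangement, handing a utf8_unit pointer to
utf16_count_code_points is an incompatible pointer type that a
C compiler will diagnose, so the mix-up is caught before the
program runs.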

Of course, you may also be dealing with an object-oriented
language that defines Unicode String objects on top of
the fundamental string definitions in the Unicode
Standard. Such object definitions are outside the scope
of the Unicode Standard, and their definitions would depend
on the language you are using. In most cases they would
simply standardize their usage on a single Unicode
encoding form. So, for example, a Java String always uses UTF-16.

> Or, by examining the first byte (or first few
> bytes) can I determine the encoding?

If you just get handed a buffer full of "stuff", and
that "stuff" is claimed to be Unicode, but you don't
know what encoding form or byte order, then there are
good heuristics that can tell the various encoding forms
apart reliably, based on rather small stretches of
typical Unicode data.

But in ordinary programming you should never have to deal
with such situations -- that is really an edge case for
specialized applications, such as the ones that attempt
to unscramble the encoding of mislabelled web pages.
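
For the curious, the simplest checks of this kind can be
sketched in a few lines of C. This is only an illustration
under stated assumptions, not the heuristics any particular
tool uses: it looks for a byte order mark first, and failing
that, guesses from the pattern of zero bytes, which only works
when the text begins with a character in the ASCII range. The
enum and function names are hypothetical. Real detectors
examine much more of the data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    ENC_UNKNOWN, ENC_UTF8,
    ENC_UTF16_BE, ENC_UTF16_LE,
    ENC_UTF32_BE, ENC_UTF32_LE
} enc_guess;

/* Guess from a leading byte order mark (U+FEFF), if present.
   UTF-32 is tested first because the UTF-32LE mark FF FE 00 00
   begins with the UTF-16LE mark FF FE. (UTF-16LE text that starts
   with a BOM followed by U+0000 would be misread; this sketch
   accepts that ambiguity.) */
enc_guess guess_by_bom(const uint8_t *b, size_t n) {
    assert(b != NULL || n == 0);
    if (n >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF)
        return ENC_UTF32_BE;
    if (n >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00)
        return ENC_UTF32_LE;
    if (n >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
        return ENC_UTF8;
    if (n >= 2 && b[0] == 0xFE && b[1] == 0xFF)
        return ENC_UTF16_BE;
    if (n >= 2 && b[0] == 0xFF && b[1] == 0xFE)
        return ENC_UTF16_LE;
    return ENC_UNKNOWN;
}

/* Fallback without a BOM: assumes the first character is in the
   ASCII range, so the zero bytes around it reveal the code unit
   width and the byte order. Returns ENC_UNKNOWN otherwise. */
enc_guess guess_by_zero_pattern(const uint8_t *b, size_t n) {
    assert(b != NULL || n == 0);
    if (n >= 4) {
        if (b[0] == 0 && b[1] == 0 && b[2] == 0 && b[3] != 0) return ENC_UTF32_BE;
        if (b[0] != 0 && b[1] == 0 && b[2] == 0 && b[3] == 0) return ENC_UTF32_LE;
    }
    if (n >= 2) {
        if (b[0] == 0 && b[1] != 0) return ENC_UTF16_BE;
        if (b[0] != 0 && b[1] == 0) return ENC_UTF16_LE;
    }
    return ENC_UNKNOWN;
}
```

A caller would typically try guess_by_bom first and fall back
to guess_by_zero_pattern only when it returns ENC_UNKNOWN. A
result of ENC_UNKNOWN from both does not rule out UTF-8, which
has no zero bytes in ordinary text and often carries no BOM.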

--Ken

>
> I didn't see anything on this topic in the FAQ.

This archive was generated by hypermail 2.1.5 : Wed May 28 2008 - 17:16:34 CDT