Re: Parsing Unicode strings

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed May 28 2008 - 16:56:06 CDT

Next message: Kenneth Whistler: "Re: Parsing Unicode strings"

Previous message: Petite Abeille: "Re: Parsing Unicode strings"
In reply to: Peter Johansson: "Parsing Unicode strings"
Next in thread: Kenneth Whistler: "Re: Parsing Unicode strings"

On 5/28/2008 1:49 PM, Peter Johansson wrote:
> Is the Unicode-encoded character string self-descriptive? That is, do
> I need /a priori/ knowledge that it is encoded as, for example, UTF-8
> rather than UTF-32? Or, by examining the first byte (or first few
> bytes) can I determine the encoding?
UTF-32 will have every 4th byte null (0x00). Always, and no matter what
the text contains, because no code point exceeds U+10FFFF. LE and BE
differ only in whether these null bytes lead or trail in each group of
four bytes. (It's the most significant byte that's null.)

In essence, that makes UTF-32 self-describing for anything more than two
characters. Your example didn't mention UTF-16, so if the only other
alternative is UTF-8, the null bytes are a very definite signature.
(UTF-16 text limited to ASCII/Latin-1 characters has every other byte
null, so it would also pass the every-fourth-byte test.)
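
A minimal sketch of that null-byte test in Python, assuming the input is a
whole number of four-byte groups and that UTF-16 has already been ruled
out. The function name utf32_null_pattern is made up for illustration:

```python
def utf32_null_pattern(data: bytes):
    """Guess UTF-32 byte order from where the always-null byte sits.

    Returns "UTF-32BE", "UTF-32LE", or None if neither pattern holds.
    """
    if len(data) < 4 or len(data) % 4 != 0:
        return None
    # In UTF-32BE the first byte of each group is the most significant one.
    lead_null = all(b == 0 for b in data[0::4])
    # In UTF-32LE it is the last byte of each group.
    trail_null = all(b == 0 for b in data[3::4])
    if lead_null and not trail_null:
        return "UTF-32BE"
    if trail_null and not lead_null:
        return "UTF-32LE"
    return None
```

Note that "abcd" encoded as UTF-16LE also passes this test (it is reported
as UTF-32LE), which is exactly the caveat in the parenthesis above.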

For text on the BMP, every alternate pair of bytes is null in UTF-32,
which is something you don't get for UTF-16 unless you allow the
document to contain null-terminated strings containing single
characters.
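
To illustrate the BMP case, here is a rough Python check that the upper
two bytes of every four-byte group are null. The name upper_half_is_null
and the big_endian flag are assumptions for this sketch only:

```python
def upper_half_is_null(data: bytes, big_endian: bool = True) -> bool:
    """True if the two most significant bytes of every 4-byte group are 0."""
    if len(data) < 4 or len(data) % 4 != 0:
        return False
    # The upper half comes first in big-endian order, last in little-endian.
    offset = 0 if big_endian else 2
    return all(
        data[i + offset] == 0 and data[i + offset + 1] == 0
        for i in range(0, len(data), 4)
    )
```

Ordinary UTF-16 text fails this check, but UTF-16LE for "a\0b\0" (single
characters, each followed by a null terminator) passes it with
big_endian=False, which is the exception mentioned above.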

UTF-32 text that's off the BMP could look like UTF-16 in which every
other character is a control code. That reading becomes progressively
less likely as the text gets longer (and even so, currently only 01, 02,
0E, 0F and 10 would correspond to assigned or private use UTF-32
characters, and these are not the most frequently used control codes).
So, checking not only the most significant byte, but also the next byte
in the putative UTF-32 text, would establish quickly whether it's UTF-32
or rather UTF-16.
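
A sketch of that two-byte check in Python. It only tests that the top
byte is null and that the next byte is a valid plane number (0x00 to
0x10); it does not look at which planes are actually assigned. The name
plausible_utf32 is hypothetical:

```python
def plausible_utf32(data: bytes, big_endian: bool = True) -> bool:
    """True if every 4-byte group could be a UTF-32 code unit."""
    if len(data) < 4 or len(data) % 4 != 0:
        return False
    for i in range(0, len(data), 4):
        group = data[i:i + 4]
        if big_endian:
            top, plane = group[0], group[1]
        else:
            top, plane = group[3], group[2]
        # Code points stop at U+10FFFF: top byte 0x00, plane byte <= 0x10.
        if top != 0 or plane > 0x10:
            return False
    return True
```

ASCII text in UTF-16 fails immediately, since the plane byte would be a
letter such as 0x61. The residual ambiguity is the one described above:
the UTF-16BE bytes for U+0001 followed by U+D11E are identical to the
UTF-32BE bytes for U+1D11E, so a short sample like that cannot be told
apart by this test alone.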

In short, discriminating among the UTFs (if other encodings are ruled
out) is a rather definite proposition. The one exception is UTF-16 BE vs
LE, because it's easy to construct cases where text in one looks like
odd, but legal, text in the other. Hence the use of a BOM.
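
A small Python illustration of the BE vs LE problem, using the standard
"utf-16-be" and "utf-16-le" codecs. The helper name decode_both is made
up, and decoding can raise UnicodeDecodeError for input containing
unpaired surrogates:

```python
def decode_both(data: bytes):
    """Decode the same bytes as UTF-16BE and as UTF-16LE."""
    return data.decode("utf-16-be"), data.decode("utf-16-le")


sample = "hi".encode("utf-16-be")      # bytes 00 68 00 69
as_be, as_le = decode_both(sample)
# as_be is "hi"; as_le is U+6800 U+6900, two legal CJK ideographs.
```

Both readings are well-formed, so nothing in the bytes themselves says
which byte order was intended.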

Where other encodings could be present you get the complex issue of
encoding recognition, and that's where adding a BOM really helps both to
establish the encoding as Unicode and to declare the encoding scheme.
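
For reference, a minimal BOM check in Python using the constants from the
standard codecs module. The function name sniff_bom is an assumption, and
the ordering matters: the UTF-32LE BOM (FF FE 00 00) begins with the
UTF-16LE BOM (FF FE), so the four-byte marks are tried first.

```python
import codecs

# Longest BOMs first, so FF FE 00 00 is not mistaken for UTF-16LE.
_BOMS = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
]


def sniff_bom(data: bytes):
    """Return the encoding scheme named by a leading BOM, or None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

One corner case remains: FF FE 00 00 could in principle be a UTF-16LE BOM
followed by U+0000, which this sketch reports as UTF-32LE.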
>
> I didn't see anything on this topic in the FAQ.
>
> Regards,
>
> Peter Johansson
>
> Congruent Software, Inc.
> 98 Colorado Avenue
> Berkeley, CA 94707
>
> (510) 527-3926
> (510) 527-3856 FAX
>
> PJohansson@ACM.org
>

This archive was generated by hypermail 2.1.5 : Wed May 28 2008 - 16:58:21 CDT