Re: "UNICODE BOMBER STRIKES AGAIN"

From: Mark Davis (mark@macchiato.com)
Date: Wed Apr 24 2002 - 14:41:19 EDT


Unfortunately, the language in C3.1 is a bit archaic; it is referring
specifically to the "UTF-16" encoding scheme. If you know you are
working with UTF-16, and you have no other information, then you do
have to use big-endian.

If, however, you only know that it is one of UTF-16BE, UTF-16LE, or
UTF-16 (plain), then there are more choices.

Similarly, if you know that the text is limited to one of UTF-32LE or
UTF-16LE, then you actually know that the text must be little-endian.

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Yves Arrouye" <yves@realnames.com>
To: "'Mark Davis'" <mark@macchiato.com>; "Doug Ewell"
<dewell@adelphia.net>; <unicode@unicode.org>
Cc: "Kenneth Whistler" <kenw@sybase.com>; <texin@progress.com>
Sent: Wednesday, April 24, 2002 10:39
Subject: RE: "UNICODE BOMBER STRIKES AGAIN"

> You can determine that that particular text is not legal UTF-32*,
> since there would be illegal code points in any of the three forms.
> IF you exclude null code points, again heuristically, that also
> excludes UTF-8, and almost all non-Unicode encodings. That leaves
> UTF-16, 16BE, 16LE as the only remaining possibilities. So look at
> those:
>
> 1. In UTF-16LE, the text is perfectly legal "Ken".
> 2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".
>
> Thus there are two legal interpretations of the text, if the only
> thing you know is that it is untagged. IF you have some additional
> information, such as that it could not be UTF-16LE, then you can
> limit it further.
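
[The elimination described in the quoted text can be sketched in
Python. This is only an illustration, not anyone's actual detector: the
byte sequence 4B 00 65 00 6E 00 is inferred from the two readings given
above, and the helper names try_decode and candidates are made up for
the example. Python's "utf-16-le" and "utf-16-be" codecs are used
directly, because its plain "utf-16" codec does not default to
big-endian when there is no BOM.]

```python
# Bytes inferred from the thread: "Ken" in UTF-16LE, "䬀攀渀" in UTF-16BE.
DATA = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

# Encoding schemes to test, named as Python codecs.
SCHEMES = ("utf-32-le", "utf-32-be", "utf-8", "utf-16-le", "utf-16-be")


def try_decode(raw, codec):
    """Return the decoded text, or None if the bytes are illegal in codec."""
    try:
        return raw.decode(codec)
    except UnicodeDecodeError:
        return None


def candidates(raw):
    """Keep every scheme that decodes legally and yields no null code points.

    Excluding nulls is the heuristic step from the quoted text; it is not
    a conformance requirement.
    """
    result = {}
    for codec in SCHEMES:
        text = try_decode(raw, codec)
        if text is None or "\x00" in text:
            continue
        result[codec] = text
    return result


if __name__ == "__main__":
    for codec, text in candidates(DATA).items():
        print(codec, repr(text))
```

[Under these assumptions only the two UTF-16 byte orders survive, which
matches the two legal interpretations listed above.]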

Actually, I also think that without any external information about the
encoding except that it is some UTF-16, it *has to* be interpreted as
being most significant byte first. I agree that it could be either
UTF-16LE or UTF-16BE/UTF-16, but in the absence of any other
information, at this point in time, it is ruled by the text of 3.1 C3
of TUS 3.0 and the reader has no choice but to declare it UTF-16.

Now what about auto-detection in relation to this conformance clause?
Readers that first try to be smart by auto-detecting encodings could of
course pick any of these as the 'auto-detected' one. Does that violate
3.1 C3's interpretation of bytes? I would say that as long as the
auto-detector is seen as a separate process/step, one can get away with
it, since by the time you look at the bytes to process the data, their
encoding has been set by the auto-detector.
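
[A rough Python sketch of that separation, under stated assumptions:
the function names are hypothetical, and the zero-byte counting
heuristic is just one simple way an auto-detector might guess. The
reader for plain "UTF-16" honours a BOM and otherwise assumes
big-endian; the detector runs as a distinct earlier step and, when it
commits to a guess, the bytes are then read as explicitly UTF-16LE or
UTF-16BE.]

```python
import codecs


def read_utf16(raw):
    """Read bytes declared as plain UTF-16: use a BOM if present,
    otherwise interpret them most significant byte first."""
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    return raw.decode("utf-16-be")


def guess_byte_order(raw):
    """Hypothetical detector: zero bytes at odd offsets suggest
    little-endian text, zero bytes at even offsets suggest big-endian.
    Returns a codec name, or None when there is no clear signal."""
    even_zeros = raw[0::2].count(0)
    odd_zeros = raw[1::2].count(0)
    if odd_zeros > even_zeros:
        return "utf-16-le"
    if even_zeros > odd_zeros:
        return "utf-16-be"
    return None


def read_text(raw, autodetect=False):
    """Step 1 (optional): let the detector fix the encoding scheme.
    Step 2: decode according to whatever scheme is now known."""
    scheme = guess_byte_order(raw) if autodetect else None
    if scheme is not None:
        return raw.decode(scheme)
    return read_utf16(raw)


if __name__ == "__main__":
    data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])
    print(repr(read_text(data)))                   # no detector
    print(repr(read_text(data, autodetect=True)))  # detector runs first
```

[Without the detector the sample bytes come out as the big-endian
reading; with it they come out as "Ken".]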

YA



This archive was generated by hypermail 2.1.2 : Wed Apr 24 2002 - 15:30:55 EDT