Re: "UNICODE BOMBER STRIKES AGAIN"

From: Mark Davis (mark@macchiato.com)
Date: Tue Apr 23 2002 - 18:44:10 EDT


You say:

> Lemme see, that's 0x4B 0x00 0x65 0x00 0x6E 0x00.
>
> There's no BOM, and no external tagging as "UTF-16LE," and since this is
> the Internet, we don't know the endianness of the originating machine.
>
> So, based on last week's discussion between Ken, Mark Davis, and me, I
> am *required* to interpret this sequence as U+4B00 U+6500 U+6E00, or
> 䬀攀渀.

That's not quite true. If there is NO external tagging at all, then
there are 7 possible Unicode encodings: UTF-8, UTF-16, UTF-16BE,
UTF-16LE, UTF-32, UTF-32LE, and UTF-32BE (plus a raft of non-Unicode
encodings).

You can determine that that particular text is not legal UTF-32* (that
is, UTF-32, UTF-32LE, or UTF-32BE), since there would be illegal code
points in any of the three forms. IF you exclude null code points,
again heuristically, that also excludes UTF-8 and almost all
non-Unicode encodings. That leaves UTF-16, UTF-16BE, and UTF-16LE as
the only remaining possibilities. So look at those:

1. In UTF-16LE, the text is perfectly legal "Ken".
2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".

Thus there are two legal interpretations of the text, if the only
thing you know is that it is untagged. IF you have some additional
information, such as that it could not be UTF-16LE, then you can limit
it further.
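
As a rough illustration of that elimination, here is a minimal Python
sketch. The function name candidates() and the "reject any text
containing U+0000" rule are my own illustrative choices, not anything
from the standard. Note that unlabeled "UTF-16" without a BOM is read
as big-endian, so it is covered by the utf-16-be entry here; Python's
own "utf-16" codec falls back to the machine's byte order when there
is no BOM, which is why it is not used.

```python
# The six bytes from Doug's message: K 00 e 00 n 00
data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

# Explicit-endian codecs only; BOM-less "UTF-16" is treated as big-endian.
CODECS = ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be")


def candidates(raw):
    """Return {codec: text} for every codec that decodes raw legally
    and (heuristic assumption) yields no null code points."""
    results = {}
    for codec in CODECS:
        try:
            text = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # illegal in this encoding, e.g. both UTF-32 forms
        if "\x00" in text:
            continue  # heuristic: exclude null code points (drops UTF-8)
        results[codec] = text
    return results


for codec, text in candidates(data).items():
    print(codec, repr(text))
```

Only utf-16-le ("Ken") and utf-16-be ("䬀攀渀") survive, which are the
two legal interpretations listed above.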

"last week's discussion" may have been a bit misleading. Here is some
commentary I had that was not sent to the list.

> Now suppose you want to serialize that data to bytes.
> There are four valid serialization options:
>
> 1. <12 34 00 61 D8 00 DF 00>
> 2. <34 12 61 00 00 D8 00 DF>
> 3. <FE FF 12 34 00 61 D8 00 DF 00>
> 4. <FF FE 34 12 61 00 00 D8 00 DF>
>
> A. If you emit (1), you can legally label it UTF-16BE or UTF-16.
> B. If you emit (2), you can legally label it UTF-16LE.
> C. If you emit (3) or (4), you can legally label it UTF-16.
>
> If you depart from the recommendations of (A), (B), and (C), then
> you have mislabeled your serialized data, and are not in compliance
> with the standard.
>
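
[The four serializations listed above can be reproduced with a short
Python sketch. The helper serialize() and the LEGAL_LABELS table are
hypothetical names of mine; the table just restates (A), (B), and (C).]

```python
import struct

# The UTF-16 code unit sequence being serialized: <1234 0061 D800 DF00>
CODE_UNITS = [0x1234, 0x0061, 0xD800, 0xDF00]
BOM = 0xFEFF


def serialize(units, big_endian, with_bom):
    """Pack 16-bit code units into bytes, optionally led by a BOM."""
    seq = ([BOM] if with_bom else []) + list(units)
    order = ">" if big_endian else "<"
    return struct.pack(f"{order}{len(seq)}H", *seq)


# Options 1-4 in the order given in the quoted text.
options = {
    1: serialize(CODE_UNITS, big_endian=True, with_bom=False),
    2: serialize(CODE_UNITS, big_endian=False, with_bom=False),
    3: serialize(CODE_UNITS, big_endian=True, with_bom=True),
    4: serialize(CODE_UNITS, big_endian=False, with_bom=True),
}

# Labels each option may legally carry, per (A), (B), and (C).
LEGAL_LABELS = {
    1: ("UTF-16BE", "UTF-16"),
    2: ("UTF-16LE",),
    3: ("UTF-16",),
    4: ("UTF-16",),
}

for number, raw in options.items():
    print(number, raw.hex(" ").upper(), "/".join(LEGAL_LABELS[number]))
```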
> Now let's turn things around. Suppose you received serialized Unicode
> data in the absence of a higher-level protocol (i.e., you don't have a
> valid label or other context to depend on for specifying byte order).

[Add: Let UTF-32* stand for the serializations UTF-32, UTF-32LE, or
UTF-32BE]
>
> A. If you receive (1), it is illegal as UTF-8 or UTF-32, and could
> only be interpreted as the UTF-16 code unit sequence:
> <1234 0061 D800 DF00>. You *assume* big-endian.

[Since we are talking about a case where someone left off the label,
it could *also* be UTF-16LE corresponding to the code sequence <3412
6100 00D8 00DF>, so]

A. If you receive (1), it is illegal as UTF-8 or UTF-32*, and could
only be either:
(a) UTF-16 or UTF-16BE, resulting in the UTF-16 code unit sequence:
<1234 0061 D800 DF00>, or
(b) UTF-16LE, resulting in the UTF-16 code unit sequence: <3412 6100
00D8 00DF>.

> B. If you receive (2), it is illegal as UTF-8 or UTF-32, and could
> only be interpreted as the UTF-16 code unit sequence:
> <3412 6100 00D8 00DF>. You *assume* big-endian.

B. If you receive (2), it is illegal as UTF-8 or UTF-32*, and could
only be either:
(a) UTF-16 or UTF-16BE, resulting in the UTF-16 code unit sequence:
<3412 6100 00D8 00DF>, or
(b) UTF-16LE, resulting in the UTF-16 code unit sequence: <1234 0061
D800 DF00>

>
> C. If you receive (3), it is illegal as UTF-8 or UTF-32, and could
> only be interpreted as the UTF-16 code unit sequence:
> <1234 0061 D800 DF00>. You *deduce* big-endian from the BOM.

C. If you receive (3), it is illegal as UTF-8, UTF-16LE, or UTF-32*,
and could only be either:

(a) UTF-16, resulting in the UTF-16 code unit sequence: <1234 0061
D800 DF00>, or
(b) UTF-16BE, resulting in the UTF-16 code unit sequence: <FEFF 1234
0061 D800 DF00>

>
> D. If you receive (4), it is illegal as UTF-8 or UTF-32, and could
> only be interpreted as the UTF-16 code unit sequence:
> <1234 0061 D800 DF00>. You *deduce* little-endian from the BOM.

D. If you receive (4), it is illegal as UTF-8, UTF-16BE, or UTF-32*,
and could only be either:

(a) UTF-16, resulting in the UTF-16 code unit sequence: <1234 0061
D800 DF00>, or
(b) UTF-16LE, resulting in the UTF-16 code unit sequence: <FEFF 1234
0061 D800 DF00>
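
The receiving-side cases can be checked mechanically. Below is a
minimal Python sketch; code_units() and is_legal() are hypothetical
helpers of mine, and the legality test is deliberately simplified: it
rejects only U+FFFE (treated as illegal, as in the cases above) and
unpaired surrogates. Under the "UTF-16" label a leading BOM is consumed
and selects the byte order, with big-endian assumed when there is none;
under "UTF-16BE" and "UTF-16LE" a leading FEFF stays in the data as an
ordinary code unit.

```python
import struct


def is_legal(units):
    """Simplified check: no U+FFFE and no unpaired surrogates."""
    i = 0
    while i < len(units):
        u = units[i]
        if u == 0xFFFE:
            return False
        if 0xD800 <= u <= 0xDBFF:  # high surrogate needs a low one next
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return False
            i += 2
            continue
        if 0xDC00 <= u <= 0xDFFF:  # low surrogate with no preceding high
            return False
        i += 1
    return True


def code_units(raw, label):
    """Return the UTF-16 code units of raw under label, or None if illegal."""
    if len(raw) % 2:
        return None
    n = len(raw) // 2
    big = list(struct.unpack(f">{n}H", raw))
    little = list(struct.unpack(f"<{n}H", raw))
    if label == "UTF-16BE":
        units = big
    elif label == "UTF-16LE":
        units = little
    elif label == "UTF-16":
        if big[:1] == [0xFEFF]:
            units = big[1:]       # BOM says big-endian; strip it
        elif little[:1] == [0xFEFF]:
            units = little[1:]    # BOM says little-endian; strip it
        else:
            units = big           # no BOM: assume big-endian
    else:
        raise ValueError(f"unknown label: {label}")
    return units if is_legal(units) else None


received = bytes.fromhex("fffe3412610000d800df")  # serialization (4)
for label in ("UTF-16", "UTF-16BE", "UTF-16LE"):
    units = code_units(received, label)
    shown = "illegal" if units is None else " ".join(f"{u:04X}" for u in units)
    print(label, shown)
```

For serialization (4) this gives <1234 0061 D800 DF00> under UTF-16,
rejects UTF-16BE because of the leading FFFE, and gives <FEFF 1234 0061
D800 DF00> under UTF-16LE.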

—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: <unicode@unicode.org>
Cc: "Kenneth Whistler" <kenw@sybase.com>; <texin@progress.com>
Sent: Monday, April 22, 2002 20:49
Subject: Re: "UNICODE BOMBER STRIKES AGAIN"

> Kenneth Whistler <kenw@sybase.com> wrote:
>
> > -- K '\0' e '\0' n '\0'
>
> Lemme see, that's 0x4B 0x00 0x65 0x00 0x6E 0x00.
>
> There's no BOM, and no external tagging as "UTF-16LE," and since this is
> the Internet, we don't know the endianness of the originating machine.
>
> So, based on last week's discussion between Ken, Mark Davis, and me, I
> am *required* to interpret this sequence as U+4B00 U+6500 U+6E00, or
> 䬀攀渀.
>
> I'll try, but it won't be easy.
>
> -Doug Ewell
> Fullerton, California