Re: Variations of UTF-16 (was: Re: "UNICODE BOMBER STRIKES AGAIN")

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Apr 24 2002 - 12:00:17 EDT


Mark Davis <mark@macchiato.com> wrote:

>> I must not *call* the sequence "UTF-16," since that term is officially
>> reserved for BOM-marked text which can be either little- or big-endian,
>> or BOMless text which must be big-endian.
>
> Yes, assuming the "BUT" clause applies to (b). That is, the untagged
> byte sequence
>
> 0x4B 0x00 0x65 0x00 0x6E 0x00
>
> could be
> (a) U+4B00 U+6500 U+6E00 ("䬀攀渀"): "UTF-16BE" or "UTF-16"
> (b) U+004B U+0065 U+006E ("Ken"): "UTF-16LE"
> (c) U+004B U+0000 U+0065 U+0000 U+006E U+0000
> ("K<null>e<null>n<null>"): ASCII, UTF-8, CP-1252, etc.
> (d) ...: EBCDIC

Yes, that's what I meant to say.
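
To make the ambiguity concrete, here is a minimal Python sketch that decodes
the same six bytes under several of the candidate encodings listed above. The
names `data` and `interpretations` are hypothetical, chosen only for this
illustration, and the EBCDIC reading is left out.

```python
# The six untagged bytes from the example: 4B 00 65 00 6E 00.
data = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

def interpretations(raw):
    """Decode the same bytes under several candidate encodings.

    Every decode below succeeds for this input, which is the point:
    nothing in the bytes themselves says which reading was intended.
    """
    candidates = ["utf-16-be", "utf-16-le", "ascii", "utf-8", "cp1252"]
    return {name: raw.decode(name) for name in candidates}

for name, text in interpretations(data).items():
    # repr() makes the embedded NUL characters visible.
    print(f"{name:10} {text!r}")
```

Only the "utf-16-le" entry comes out as "Ken"; the others are equally valid
decodings of the bytes, just not what the sender meant.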

> Not really arguing, just exploring the issues. But one key is that if
> you are in an environment where untagged data is being exchanged (a
> bad idea, anyway),

But not all mechanisms for exchanging data allow tagging. (Bumper
sticker: "UNTAGGED TEXT HAPPENS")

Here's what caused me to exhume this discussion. Ken made a joke:

> -- K '\0' e '\0' n '\0'

(which I enjoyed) in response to the "UNICODE BOMBER STRIKES AGAIN"
satire about "blank squares" infiltrating otherwise good text. This
representation of "Ken" in untagged, little-endian UTF-16,
misinterpreted as a sequence of 8-bit characters, corresponds to Mark's
example (c) above. It *is* a misinterpretation, right? You're not
really supposed to read this sequence of six bytes as K '\0' e '\0' n
'\0'. That was the whole joke.

And in fact, there is only one "correct" interpretation in this example
(that is, only one interpretation that matches the sender's intent), and
that is U+004B U+0065 U+006E. I contend that U+4B00 U+6500 U+6E00,
whether it makes sense semantically in Chinese or not, is just as
incorrect in this context as an ASCII, EBCDIC, FIELDATA, or BOCU-1
reading.

Note that everything I said before about this example is true:

- there is no BOM
- there is no external tagging as "UTF-16LE" (or anything else)
- we don't know the native byte orientation of the sender's machine

There's a lot of text like this out there, not all of which is intended
as jokes or even illustrations. The Unix and Linux world is very
opposed to the use of BOM in plain-text files, and if they feel that way
about UTF-8 they probably feel the same about UTF-16.

Note also that heuristics in an example like this can be deceiving. A
famous heuristic that applies to this example is to notice that every
other byte, starting with the second, is 0, and therefore treat the text
as UTF-16LE. On the other hand, one could take the big-endian
interpretation (U+4B00 U+6500 U+6E00), notice that all of these
characters are CJK ideographs, and use that to deduce (incorrectly) that
the text should be UTF-16BE. What if the bytes were reversed?
('\0' K '\0' e '\0' n) The latter heuristic would suggest that the text
should be UTF-16LE. Heuristics are not perfect, but sometimes they're
all we've got.
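
The two heuristics can be sketched in Python to show how they disagree on
the same input. This is only an illustration under simplifying assumptions:
the function names are hypothetical, the zero-byte test assumes mostly
Latin-range text, and "CJK ideograph" is narrowed to the two BMP ranges
U+3400..U+4DBF and U+4E00..U+9FFF.

```python
def zero_byte_guess(raw):
    """Guess byte order from where the zero bytes fall."""
    even_zeros = sum(1 for b in raw[0::2] if b == 0)
    odd_zeros = sum(1 for b in raw[1::2] if b == 0)
    if odd_zeros > even_zeros:
        return "utf-16-le"  # low byte first, zero high byte second
    if even_zeros > odd_zeros:
        return "utf-16-be"  # zero high byte first
    return None             # no signal either way

def is_cjk(ch):
    """True for the two BMP ideograph ranges assumed in this sketch."""
    cp = ord(ch)
    return 0x3400 <= cp <= 0x4DBF or 0x4E00 <= cp <= 0x9FFF

def cjk_guess(raw):
    """Pick the byte order under which every character is an ideograph."""
    hits = []
    for name in ("utf-16-be", "utf-16-le"):
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue
        if text and all(is_cjk(ch) for ch in text):
            hits.append(name)
    # Only answer when exactly one byte order qualifies.
    return hits[0] if len(hits) == 1 else None

ken_le = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])
ken_reversed = bytes([0x00, 0x4B, 0x00, 0x65, 0x00, 0x6E])

for sample in (ken_le, ken_reversed):
    print(sample.hex(" "), zero_byte_guess(sample), cjk_guess(sample))
```

For both samples the two functions return opposite answers, and in both
cases it is the zero-byte guess that matches the sender's intent.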

So Ken's joke is encoded in BOMless, little-endian,
non-externally-tagged UTF-16. It's a perfectly legal Unicode
representation, but we can't call it "UTF-16" because, without a BOM,
that term implies big-endian. This sounds legalistic, sort of like the
warnings on the
Unicode Web site about the correct use of the word "Unicode." But at
least I think I understand the issues a little better, and so the
exploration effort paid off.
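
The labeling rule discussed in this thread (BOM-marked text may be either
byte order; BOMless text labeled plain "UTF-16" must be read as big-endian)
can be sketched as follows. The function name `decode_as_utf16_label` is
hypothetical, and this is a reading of the rule as stated here, not a
reference implementation.

```python
import codecs

def decode_as_utf16_label(raw):
    """Decode bytes that carry the plain "UTF-16" label.

    A leading BOM selects the byte order and is dropped from the result;
    with no BOM, the text is read as big-endian.
    """
    if raw.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return raw[2:].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return raw[2:].decode("utf-16-be")
    return raw.decode("utf-16-be")            # BOMless: big-endian

ken_le = bytes([0x4B, 0x00, 0x65, 0x00, 0x6E, 0x00])

# BOMless little-endian bytes, read under the "UTF-16" label, are not "Ken".
print(repr(decode_as_utf16_label(ken_le)))

# The same bytes behind a little-endian BOM are.
print(repr(decode_as_utf16_label(codecs.BOM_UTF16_LE + ken_le)))
```

This is why the BOMless little-endian sequence has to be called "UTF-16LE"
(or tagged some other way) for a receiver following the rule to recover it.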

-Doug Ewell
 Fullerton, California


