Re: BOMbs (was Re: Private Use Surrogate Pairs)

From: Doug Ewell (dewell@adelphia.net)
Date: Sat May 11 2002 - 19:38:35 EDT


i18nGuy Tex Texin <tex at i18nguy dot com> wrote:

> Although it can help prevent that confusion, for it to be a *good
> reason*, it first has to be shown (or believed) that not only is there
> a need for an indicator of endian-ness, but there is also a need for a
> (weak) encoding indicator.
>
> Second, it has to be shown (or believed) that the indicator should be
> this particular value 00 00 FE FF and not another one that doesn't
> offer this potential confusion to begin with.
>
> I can buy endian-ness. I am not sold on (weak) encoding signatures.

These are good observations. I wasn't part of the decision-making
process (then or now), but until Ken or Asmus or Mark comes up with a
more authoritative response, here is how I see this issue.

The decision to encode some sort of byte-order mark probably occurred
early in the design of Unicode. Remember what things were like 12 years
ago, when this decision was likely made:

1. Plain text files were very common, much more common than fancy text,
but they were generally not marked with respect to character encoding
(except, I suppose, in the ISO 2022 world). This caused problems when
files were interchanged among MS-DOS, Windows 2.x (or Unix or other
8859-1-ish systems), and the HP world with its "Roman-8" CCS. (I
definitely remember the heuristics involved in auto-detecting CP437
vs. CP1252.)

2. Endianness was already known to be an issue, particularly between
the Intel (PC) and Motorola (Mac) worlds. Considering the speed of
hardware at the time, conversion between big-endian and little-endian
was widely regarded as a performance bottleneck (even though processors
generally offered a SWAB-style machine instruction; a small sketch of
such a swap follows this list). Holy wars developed over the "correct"
byte order.

3. Software written with integer data in mind often used the value -1
as a "sentinel" to signify the end of normal data. This practice was
common not only in ASCII-based systems, but in EBCDIC as well (where the
term EO (Eight Ones) was used).

4. There was widespread understanding that 16-bit Unicode was being
introduced to an overwhelmingly 8-bit world. Unicode text data was in
danger of being misinterpreted as 8-bit data, or data of the opposite
byte order.
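
Since items 2 and 3 above turn on byte-level details, here is a small
C sketch of my own (nothing from the actual systems of the day; the
function name is just for illustration) showing what a SWAB-style swap
does to UTF-16 code units, and why a 16-bit -1 sentinel is the same
bit pattern as 0xFFFF:

    #include <stdio.h>
    #include <stdint.h>
    #include <stddef.h>

    /* Swap the bytes of each 16-bit code unit; this is the work a
     * SWAB-style instruction (or two shifts and an OR) did in one
     * step. */
    static void swap_utf16(uint16_t *units, size_t count)
    {
        for (size_t i = 0; i < count; i++)
            units[i] = (uint16_t)((units[i] << 8) | (units[i] >> 8));
    }

    int main(void)
    {
        uint16_t text[] = { 0xFEFF, 0x0048, 0x0069 };  /* BOM, 'H', 'i' */

        swap_utf16(text, 3);
        printf("%04X %04X %04X\n", text[0], text[1], text[2]);
        /* prints "FFFE 4800 6900": a byte-swapped BOM reads as
         * 0xFFFE, which is exactly how a reader recognizes the
         * "other" byte order */

        /* Item 3's point: a 16-bit -1 sentinel has the same bit
         * pattern as 0xFFFF, so 0xFFFF could never have doubled
         * as a signature character. */
        printf("%04X\n", (uint16_t)-1);                /* "FFFF" */
        return 0;
    }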

There was a sense that it was necessary to introduce a character value
that would function not only as a byte order mark, but also as what Tex
calls a "weak encoding indicator," because for some time it would
continue to be necessary to distinguish Unicode from non-Unicode data.
(Today, with most Unicode data in 8-bit-friendly UTF-8, we see that this
need has not gone away.) Again, it was *not* common at the time for
text data to be supplemented with out-of-band encoding information.
SGML, HTML, XML, etc. provide great mechanisms for this today, but in
1990 they either did not exist or were not in common use for ordinary
text.

0xFFFF could not be used as a signature because of the prevalent use
of -1 as a sentinel value. And in any case, if indication of byte order
was a goal, then clearly no value of the form U+xxyy could be used where
xx = yy. 0xFE and 0xFF were found to be particularly infrequent (in
either order) at the beginning of contemporary text files.
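
To make the byte-order argument concrete, here is a minimal sketch of
how a reader might guess UTF-16 byte order from the first two bytes
(the helper name is hypothetical, purely for illustration). The test
depends entirely on the two bytes of the signature being different,
which is exactly why no U+xxyy with xx = yy could have worked:

    #include <stdio.h>
    #include <stddef.h>

    /* Returns 1 for big-endian, 0 for little-endian, -1 if there
     * is no signature at the start of the buffer. */
    static int utf16_byte_order(const unsigned char *buf, size_t len)
    {
        if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF)
            return 1;   /* U+FEFF stored big-endian */
        if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE)
            return 0;   /* U+FEFF stored little-endian */
        return -1;      /* no BOM; fall back to heuristics/metadata */
    }

    int main(void)
    {
        const unsigned char be[] = { 0xFE, 0xFF, 0x00, 0x48 };  /* "H" */
        const unsigned char le[] = { 0xFF, 0xFE, 0x48, 0x00 };  /* "H" */

        printf("%d %d\n", utf16_byte_order(be, sizeof be),
                          utf16_byte_order(le, sizeof le));  /* "1 0" */
        return 0;
    }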

If you are going to define a code point U+xxyy as a byte order mark, it
makes sense to reserve U+yyxx as a noncharacter (modern terminology).
This approach introduces less "potential confusion" than any other
alternative. Defining U+FEFF as the BOM and U+FFFE as the noncharacter,
instead of the other way around, permitted the two noncharacter values
U+FFFE and U+FFFF to be contiguous, which seems more elegant somehow
than if they were separated by a 256-character row.

Later, when "Unicode" came to mean not only UTF-16 but also UTF-7,
UTF-8, UTF-32, SCSU, BOCU, ACE, etc., the "encoding indicator" function
of the BOM expanded, so that it distinguished UTF-16 not only from
non-Unicode charsets, but also from UTF-8, UTF-32, etc. The "potential
confusion" only occurs here when deciding between little-endian UTF-16
and UTF-32, and when allowing for the possibility of U+0000 in ordinary
text (quite an unlikely scenario, IMHO).
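
Here is a rough sketch of such a signature check, just to make the
ambiguity visible. It is not taken from any real detector (the
function name is mine), and real ones apply more context than this:

    #include <stdio.h>
    #include <stddef.h>

    /* Classify a byte stream by its leading signature.  Note the
     * one genuine ambiguity: FF FE 00 00 is both the UTF-32LE
     * signature and the UTF-16LE signature followed by U+0000. */
    static const char *sniff_signature(const unsigned char *b, size_t len)
    {
        if (len >= 4 && b[0] == 0x00 && b[1] == 0x00 &&
                        b[2] == 0xFE && b[3] == 0xFF)
            return "UTF-32BE";
        if (len >= 4 && b[0] == 0xFF && b[1] == 0xFE &&
                        b[2] == 0x00 && b[3] == 0x00)
            return "UTF-32LE (or UTF-16LE starting with U+0000)";
        if (len >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF)
            return "UTF-8";
        if (len >= 2 && b[0] == 0xFE && b[1] == 0xFF)
            return "UTF-16BE";
        if (len >= 2 && b[0] == 0xFF && b[1] == 0xFE)
            return "UTF-16LE";
        return "no signature";
    }

    int main(void)
    {
        /* As UTF-32LE: BOM, then 'A'.
         * As UTF-16LE: BOM, then U+0000, 'A', U+0000. */
        const unsigned char bytes[] = { 0xFF, 0xFE, 0x00, 0x00,
                                        0x41, 0x00, 0x00, 0x00 };

        puts(sniff_signature(bytes, sizeof bytes));
        return 0;
    }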

The other source of confusion, of course, has to do with U+FEFF being
given a second role as zero-width no-break space (and having its name
changed from BYTE ORDER MARK), and even then the confusion exists only
in the equally unlikely scenario that a ZWNBSP is assumed to be valid
at the start of a text stream (where it doesn't have the requisite two
adjacent characters between which to prevent breaking). In any event,
we are now up to Unicode 3.2, where U+2060 WORD JOINER is poised to
take over this second role from U+FEFF, thus eliminating that source
of confusion.

In summary, I think "the need for a (weak) encoding indicator" had
already been shown (or believed), and the choice of U+FEFF was made with
that evidence or belief already in hand.

I would definitely appreciate any assistance from the Unicode pioneers
if I got any of these facts or assumptions wrong.

-Doug Ewell
 Fullerton, California


