RE: UTF-8 'BOM'

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 20 2005 - 15:46:27 CST

    Geoffrey continued:

    > My comments need clarification. I meant that Microsoft insisted
    > on writing datafiles containing Unicode data in little-endian
    > format. I have no concerns about other data, indeed I was
    > writing binary data in little-endian format in the 80s myself.

    It wasn't just Microsoft. It was anyone writing significant
    applications running on Intel architecture circa 1990. It
    was WordPerfect, too -- and as Rick pointed out, the BOM
    as a concept came from WordPerfect, *not* from Microsoft.

    I myself worked at a cross-platform, networked application
    company, Metaphor, at the time. Its OS and all its applications
    ran on both Intel- and Motorola-based hardware. Because it *was*
    cross-platform and networked, it chose a big-endian format for
    all server storage of data, and just put up with the byte-swapping
    when reading such data into workstations running on the Intel
    hardware.

    But if you were an application working in a networked but
    Intel-only architecture context, then writing data (including
    Unicode characters) in native format made perfect sense. What
    would not have made any sense would be for an application that
    otherwise filed its integral data in little-endian form to make
    an exception for Unicode plain text and file *that* data only in
    big-endian form. That would have made for big messes indeed.
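
    For concreteness, here is a rough C sketch (mine, purely
    illustrative, not anything from actual 1990 code) of the two
    filing conventions -- one routine writes a 16-bit code unit in
    whatever order the CPU uses natively, the other forces the
    big-endian "canonical" order on output:

        #include <stdio.h>
        #include <stdint.h>

        /* File a 16-bit code unit in the machine's native byte order:
           on Intel hardware the least significant byte lands first. */
        static void write_native(FILE *f, uint16_t unit)
        {
            fwrite(&unit, sizeof unit, 1, f);
        }

        /* File the same unit forced into big-endian order, regardless
           of the architecture the program happens to be running on. */
        static void write_bigendian(FILE *f, uint16_t unit)
        {
            unsigned char bytes[2];
            bytes[0] = (unsigned char)(unit >> 8);   /* most significant byte first */
            bytes[1] = (unsigned char)(unit & 0xFF);
            fwrite(bytes, 1, 2, f);
        }

    An Intel-only application that used the first routine for all of
    its other 16-bit data but the second one just for Unicode text
    would be maintaining two filing conventions side by side -- the
    kind of mess I mean.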

    >
    > But back when TUS 1.0 came out, I read the bit about Public
    > Interchange to mean anything outside the confines of a program's
    > core memory.

    That is incorrect.

    > Whether it was on the wire or written to a file
    > that some other program could read, it should be in big-endian
    > format.

    And that is an incomplete interpretation of the intent at
    the time.

    > I cheered at this because I had lots of experience
    > with inter-platform data exchange and so such a statement
    > meant that there would be one fewer worry in dealing with
    > multi-byte codepoint representations. And then down the road
    > when I heard that Microsoft didn't do it that way I lamented.

    But as I indicated, for Microsoft to have attempted to isolate
    Unicode plain text data and treat its byte order differently from
    other 16-bit integral data types would have just made for a huge
    mess. They did the only logical thing they could have, in my
    opinion.

    >
    > Now it could be my interpretation of The Right Way back then
    > was faulty, but I know that my colleagues at the time came
    > to the same interpretation of TUS 1.0. I have fuzzy memories
    > that a wider circle of people also read it the way I just
    > described but I won't lay claim to it. If our learned elders
    > care to step forward to confirm or deny my interpretation I
    > would be appreciative. Whatever the case I won't beat the
    > poor horse anymore.

    O.k., for those who don't have access to Unicode 1.0, here is
    what the standard said at the time, in October, 1991. Most of
    this reflects decisions taken by the Unicode Working Group
    in 1990, even prior to the incorporation of the Unicode
    Consortium in January, 1991.

    ============ quote Unicode 1.0, pp. 22-23 ======================

    Unicode code points are 16-bit quantities. Machine architectures
    differ in the ordering of whether the most significant byte or
    the least significant byte comes first. These are known as
    "big-endian" and "little-endian" orders, respectively.

    The Unicode standard does not specify any order of bytes or bits
    inside the 16-bit sequence of a Unicode code point. However, in
    Public Interchange and in the absence of any information to the
    contrary provided by a higher protocol, a conformant process
    may assume that Unicode character sequences it receives are in
    the order of the most significant byte first.

    NOTE: The majority of all Interchange occurs with processes
    running on the same or a similar configuration. This makes
    intra-domain Interchange of Unicode text in the domain-specific
    byte order fully conformant, and limits the role of the
    canonical byte order to Interchange of Unicode text across
    domain, or where the nature of the originating domain is
    unknown. Processes may prefix data with U+FEFF BYTE ORDER MARK,
    and a receiving process may interpret that character as
    verification that the text arrived with the byte order expected
    by the receiving process. Alternatively, on receiving U+FFFE,
    the receiving process may recover text data by attempting to
    re-read it in byte-swapped order.

    ================================================================
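
    In C terms, the receiving-process logic that NOTE describes
    amounts to something like the following sketch (the function
    names are mine, purely illustrative):

        #include <stddef.h>
        #include <stdint.h>

        /* Recover text that arrived in the opposite byte order by
           swapping the two bytes of every 16-bit code unit. */
        static void byteswap_units(uint16_t *units, size_t count)
        {
            for (size_t i = 0; i < count; i++)
                units[i] = (uint16_t)((units[i] >> 8) | (units[i] << 8));
        }

        /* Examine the first code unit: U+FEFF confirms the expected
           byte order, U+FFFE means the data should be re-read in
           byte-swapped order. Returns a pointer past the BOM so the
           caller sees plain text. */
        static const uint16_t *interpret_incoming(uint16_t *units, size_t count)
        {
            if (count == 0)
                return units;
            if (units[0] == 0xFFFE)      /* byte-swapped BOM */
                byteswap_units(units, count);
            if (units[0] == 0xFEFF)      /* BOM in expected order: skip it */
                return units + 1;
            return units;                /* no BOM: rely on a higher protocol */
        }

    After a swap, the first unit reads as U+FEFF, so the second test
    strips the mark in that case as well.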

    There is nothing here about limiting little-endian Unicode to
    a "program's core memory". This text was written in full knowledge
    that implementers were planning to interchange Unicode text in
    little-endian format on networks -- in fact the sentence about "Interchange
    of Unicode text in the domain-specific byte order" being fully
    conformant was specifically put in to guarantee to the Microsofts,
    IBMs, and WordPerfects of the world that what they were going to
    be doing was recognized and allowed by the standard.

    The big-endian order was referred to as "canonical" only to
    indicate that in the absence of any *other* information, you
    should assume that was what you were dealing with. And the BOM
    convention was invented as a way of signalling byte order in
    contexts where you might not have any other reliable information
    disambiguating it. More text from Unicode 1.0:

    ============ quote Unicode 1.0, p. 123 =========================

    U+FEFF. This Unicode special character is defined to be a signal
    of correct byte-order polarity. An application may use this signal
    character to explicitly enable the "big-endian" or "little-endian"
    byte order to be determined in Unicode text which may exist in
    either byte order (for example in networks which mix Intel and
    Motorola or RISC CPU architectures for data storage). U+FEFF is
    the "correct" or legal order; finding a value U+FFFE is a signal
    that text of the "incorrect" byte order for an interpreting process
    has been encountered.

    =================================================================

    O.k., that should be pretty clear about the original intent of
    the standard being agnostic regarding which byte order was to
    be used, *and* being clear that either order could and would be
    found in use in networks interchanging Unicode text.

    This text was not created by happenstance. It reflected the
    consensus decision by the committee at the time that both orders
    would be in use *and* that the standard must not disallow either
    order as being conformant in interchange, providing that the
    recipient of the data understood what order it was dealing with
    and so could correctly interpret the data as Unicode characters.

    People can lament this reality until the cows come home, and wonder
    about an alternative universe in which all Unicode data was always
    in big-endian order in all contexts. But Unicode wasn't created
    in a perfect information processing context -- it was created
    in a WordPerfect information processing context in which everyone
    was already struggling with trying to integrate two equal and
    equally entrenched but opposite CPU architectures into
    larger and larger networks. The problem predated Unicode, it
    shaped the discussion about byte order of Unicode when it was
    created, and it continues to be a problem today, as both architectures
    continue to exist in a world where essentially *ALL* computers are
    networked on a global scale now.

    --Ken


