RE: UTF-8 'BOM'

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Jan 20 2005 - 15:46:27 CST

    Geoffrey continued:

    > My comments need clarification. I meant that Microsoft insisted
    > on writing datafiles containing Unicode data in little-endian
    > format. I have no concerns about other data, indeed I was
    > writing binary data in little-endian format in the 80s myself.

    It wasn't just Microsoft. It was anyone writing significant
    applications running on Intel architecture circa 1990. It
    was WordPerfect, too -- and as Rick pointed out, the BOM
    as a concept came from WordPerfect, *not* from Microsoft.

    I myself worked at a cross-platform, networked application
    company, Metaphor, at the time. Its OS and all its applications
    ran on both Intel- and Motorola-based hardware. Because it *was*
    cross-platform and networked, it chose a big-endian format for
    all server storage of data, and just put up with the byte-swapping
    when reading such data into workstations running on the Intel
    hardware.

    But if you were an application working in a networked but
    Intel-only architecture context, then writing data (including
    Unicode characters) in native format made perfect sense. What
    would not have made any sense would be for an application that
    otherwise filed its integral data in little-endian form to make
    an exception for Unicode plain text and file *that* data only in
    big-endian form. That would have made for big messes indeed.
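
    For concreteness, here is a rough C sketch (mine, purely
    illustrative, not anything from actual 1990 code) of the two
    filing conventions -- one routine writes a 16-bit code unit in
    whatever order the CPU uses natively, the other forces the
    big-endian "canonical" order on output:

        #include <stdio.h>
        #include <stdint.h>

        /* File a 16-bit code unit in the machine's native byte order:
           on Intel hardware the least significant byte lands first. */
        static void write_native(FILE *f, uint16_t unit)
        {
            fwrite(&unit, sizeof unit, 1, f);
        }

        /* File the same unit forced into big-endian order, regardless
           of the architecture the program happens to be running on. */
        static void write_bigendian(FILE *f, uint16_t unit)
        {
            unsigned char bytes[2];
            bytes[0] = (unsigned char)(unit >> 8);   /* most significant byte first */
            bytes[1] = (unsigned char)(unit & 0xFF);
            fwrite(bytes, 1, 2, f);
        }

    An Intel-only application that used the first routine for all of
    its other 16-bit data but the second one just for Unicode text
    would be maintaining two filing conventions side by side -- the
    kind of mess I mean.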

    >
    > But back when TUS 1.0 came out, I read the bit about Public
    > Interchange to mean anything outside the confines of a program's
    > core memory.

    That is incorrect.

    > Whether it was on the wire or written to a file
    > that some other program could read, it should be in big-endian
    > format.

    And that is an incomplete interpretation of the intent at
    the time.

    > I cheered at this because I had lots of experience
    > with inter-platform data exchange and so such a statement
    > meant that there would be one fewer worry in dealing with
    > multi-byte codepoint representations. And then down the road
    > when I heard that Microsoft didn't do it that way I lamented.

    But as I indicated, for Microsoft to have attempted to isolate
    Unicode plain text data and treat its byte order differently from
    other 16-bit integral data types would have just made for a huge
    mess. They did the only logical thing they could have, in my
    opinion.

    >
    > Now it could be my interpretation of The Right Way back then
    > was faulty, but I know that my colleagues at the time came
    > to the same interpretation of TUS 1.0. I have fuzzy memories
    > that a wider circle of people also read it the way I just
    > described but I won't lay claim to it. If our learned elders
    > care to step forward to confirm or deny my interpretation I
    > would be appreciative. Whatever the case I won't beat the
    > poor horse anymore.

    O.k., for those who don't have access to Unicode 1.0, here is
    what the standard said at the time, in October, 1991. Most of
    this reflects decisions taken by the Unicode Working Group
    in 1990, even prior to the incorporation of the Unicode
    Consortium in January, 1991.

    ============ quote Unicode 1.0, pp. 22-23 ======================

    Unicode code points are 16-bit quantities. Machine architectures
    differ in the ordering of whether the most significant byte or
    the least significant byte comes first. These are known as
    "big-endian" and "little-endian" orders, respectively.

    The Unicode standard does not specify any order of bytes or bits
    inside the 16-bit sequence of a Unicode code point. However, in
    Public Interchange and in the absence of any information to the
    contrary provided by a higher protocol, a conformant process
    may assume that Unicode character sequences it receives are in
    the order of the most significant byte first.

    NOTE: The majority of all Interchange occurs with processes
    running on the same or a similar configuration. This makes
    intra-domain Interchange of Unicode text in the domain-specific
    byte order fully conformant, and limits the role of the
    canonical byte order to Interchange of Unicode text across
    domain, or where the nature of the originating domain is
    unknown. Processes may prefix data with U+FEFF BYTE ORDER MARK,
    and a receiving process may interpret that character as
    verification that the text arrived with the byte order expected
    by the receiving process. Alternatively, on receiving U+FFFE,
    the receiving process may recover text data by attempting to
    re-read it in byte-swapped order.

    ================================================================
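
    In C terms, the receiving-process logic that NOTE describes
    amounts to something like the following sketch (the function
    names are mine, purely illustrative):

        #include <stddef.h>
        #include <stdint.h>

        /* Recover text that arrived in the opposite byte order by
           swapping the two bytes of every 16-bit code unit. */
        static void byteswap_units(uint16_t *units, size_t count)
        {
            for (size_t i = 0; i < count; i++)
                units[i] = (uint16_t)((units[i] >> 8) | (units[i] << 8));
        }

        /* Examine the first code unit: U+FEFF confirms the expected
           byte order, U+FFFE means the data should be re-read in
           byte-swapped order. Returns a pointer past the BOM so the
           caller sees plain text. */
        static const uint16_t *interpret_incoming(uint16_t *units, size_t count)
        {
            if (count == 0)
                return units;
            if (units[0] == 0xFFFE)      /* byte-swapped BOM */
                byteswap_units(units, count);
            if (units[0] == 0xFEFF)      /* BOM in expected order: skip it */
                return units + 1;
            return units;                /* no BOM: rely on a higher protocol */
        }

    After a swap, the first unit reads as U+FEFF, so the second test
    strips the mark in that case as well.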

    There is nothing here about limiting little-endian Unicode to
    a "program's core memory". This text was written in full knowledge
    that implementers were planning to interchange Unicode text in
    little-endian format on networks -- in fact the sentence about "Interchange
    of Unicode text in the domain-specific byte order" being fully
    conformant was specifically put in to guarantee to the Microsofts,
    IBMs, and WordPerfects of the world that what they were going to
    be doing was recognized and allowed by the standard.

    The big-endian order was referred to as "canonical" only to
    indicate that in the absence of any *other* information, you
    should assume that was what you were dealing with. And the BOM
    convention was invented as a way of signalling byte order in
    contexts where you might not have any other reliable information
    disambiguating it. More text from Unicode 1.0:

    ============ quote Unicode 1.0, p. 123 =========================

    U+FEFF. This Unicode special character is defined to be a signal
    of correct byte-order polarity. An application may use this signal
    character to explicitly enable the "big-endian" or "little-endian"
    byte order to be determined in Unicode text which may exist in
    either byte order (for example in networks which mix Intel and
    Motorola or RISC CPU architectures for data storage). U+FEFF is
    the "correct" or legal order; finding a value U+FFFE is a signal
    that text of the "incorrect" byte order for an interpreting process
    has been encountered.

    =================================================================

    O.k., that should be pretty clear about the original intent of
    the standard being agnostic regarding which byte order was to
    be used, *and* being clear that either order could and would be
    found in use in networks interchanging Unicode text.

    This text was not created by happenstance. It reflected the
    consensus decision by the committee at the time that both orders
    would be in use *and* that the standard must not disallow either
    order as being conformant in interchange, providing that the
    recipient of the data understood what order it was dealing with
    and so could correctly interpret the data as Unicode characters.

    People can lament this reality until the cows come home, and wonder
    about an alternative universe in which all Unicode data was always
    in big-endian order in all contexts. But Unicode wasn't created
    in a perfect information processing context -- it was created
    in a WordPerfect information processing context in which everyone
    was already struggling with trying to integrate two equal and
    equally entrenched but opposite CPU architectures into
    larger and larger networks. The problem predated Unicode, it
    shaped the discussion about byte order of Unicode when it was
    created, and it continues to be a problem today, as both architectures
    continue to exist in a world where essentially *ALL* computers are
    networked on a global scale now.

    --Ken


