RE: Translated IUC10 Web pages: Experimental Results

From: Martin J. Duerst (mduerst@ifi.unizh.ch)
Date: Mon Feb 10 1997 - 08:40:53 EST


On Sun, 9 Feb 1997 Asmus Freytag wrote:

> All the nice historical discussions on chip architecture aside, you
> will find that both The Unicode Standard and ISO 10646 are
> explicit in specifying MSB order ONLY WHERE data is "serialized into
> bytes". This was done deliberately to allow conformant APIs to be
> defined without requiring the parameters in memory buffers to be in Big
> endian form on a little endian processor. (A string of Unicode characters
> is a string of short integers, not a string of twice as many bytes).

Exactly. Having to keep 16-bit quantities in memory in a byte order
other than the one the machine assumes would be a mess beyond limits.

> On disk, if you view data as byte streams (as UN*X systems tend
> to do), you could argue that data are serialized and therefore Big endian.
> On the other hand, think about memory mapping files. On systems
> like NT these are one of the few ways for processes to share memory buffers,
> if not the only one. Again, it's essential to allow conformant applications
> to be written w/o requiring the transposition of memory buffers.

This is where problems start. If you are using a file just for
sharing memory among processes, such as, in some way, in the case
of the Clipboard, then it's all fine. But plain text files are rarely
used for sharing memory, and much more for exchanging data among
different machines and different software. And in this case,
serialization is important, and is quite cheap compared to
the mess you get without it.
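
To illustrate how cheap such serialization is, here is a minimal
sketch in Python. The function name and the with_bom parameter are
made up for illustration and are not taken from any implementation
discussed in this thread. It writes each 16-bit quantity most
significant byte first, using shifts only, so the result does not
depend on the byte order of the machine it runs on.

```python
def serialize_big_endian(code_units, with_bom=True):
    """Turn a sequence of 16-bit code units into octets, MSB first.

    Hypothetical helper for illustration. Only shifts and masks are
    used, so the output is the same on big- and little-endian hosts.
    """
    out = bytearray()
    if with_bom:
        # U+FEFF written MSB first gives the octets FE FF.
        out += bytes([0xFE, 0xFF])
    for unit in code_units:
        if not 0 <= unit <= 0xFFFF:
            raise ValueError("not a 16-bit quantity: %r" % (unit,))
        out.append(unit >> 8)      # most significant byte first
        out.append(unit & 0xFF)    # least significant byte second
    return bytes(out)
```

For example, serialize_big_endian([0x0041, 0x3042]) yields the six
octets FE FF 00 41 30 42 on any processor.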

> All this was discussed both in Unicode and SC2/WG2 many years ago
> when the conformance requirements were first established.

Definitely. Just that the importance of exchanging files across
the internet, with the growth of the Web, wasn't known at that time.

> When you start considering networks, you need a protocol to
> distinguish 'nativist' usage from 'canonical' usage. This is where
> the BOM convention 0xFEFF / 0xFFFE was introduced.

That's one way to do it. But the internet in general uses another
approach. It is to specify a common format. For example, in the
case of HTTP, the use of the BOM as a magic number is not very
important, because the right thing to do is to use an appropriate
"charset" parameter in the HTTP header. And that parameter
clearly means BIG-endian, and nothing else. In this case,
0xFFFE is just a defence against ignorant programmers that
never even heard much about endianness and don't know how
to handle it. It's not there so that big companies who are
(supposedly) at the forefront of technology and have ample
experienced staff can find an excuse for a bad job.

> Unfortunately, the usage for the BOM character as signature is a
> recommendation only, not normatively required. That's true for ISO 10646
> and Unicode. If I write files for access in 16-bit chunks (instead of a
> one byte at a time) these files can be BOM-lessly conformant since
> I am not 'serializing' the bytes of each Unicode character.

The question is not whether you serialize or not. The question is
whether some standard specifies serialization, and whether the
recipients assume serialization. There is
no trace in a file of whether the program that wrote it out wrote
it in 16-bit chunks or one byte at a time. The file systems, not
only on UNIX, and the networks, all treat data as a stream of
octets. No file system and no network I know treats data as
16-bit items.
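
A small sketch of this point, in Python, with function names invented
for illustration: the same four octets are read once under the
assumption of big-endian order and once under little-endian order.
Nothing in the octets themselves says which reading the writer
intended.

```python
def as_big_endian(octets):
    """Pair up octets into 16-bit values, first octet most significant."""
    return [(octets[i] << 8) | octets[i + 1] for i in range(0, len(octets), 2)]


def as_little_endian(octets):
    """Pair up octets into 16-bit values, first octet least significant."""
    return [(octets[i + 1] << 8) | octets[i] for i in range(0, len(octets), 2)]


# Four octets as they might sit in a file or arrive over a network.
octets = bytes([0x41, 0x00, 0x42, 0x00])

big = as_big_endian(octets)        # [0x4100, 0x4200]
little = as_little_endian(octets)  # [0x0041, 0x0042], i.e. "AB"
```

Both results are perfectly valid sequences of 16-bit values, which is
why the recipient needs either a specified byte order or a BOM.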

> While you therefore can't throw the conformance clause at such an
> application, it is indeed bad practice to produce plain text files without
> a BOM.

Yes, definitely.

> I would argue that this is true not only for applications on Little
> Endian, but ALSO for big endian processors.

Of course. Ideally, you write your application so that it runs on
little-endian as well as big-endian processors, and only writes
big-endian files including a BOM. My own prototype system, with
altogether probably around 2 man-years of effort (including
Japanese input and Arabic and Tamil rendering among other things),
actually does so, without compile-time or run-time flags. Why is
this so difficult to get right for such a big company?

> One reason is to give
> everybody a chance to distinguish known from unknown byte orders, the other
> reason, also mentioned explicitly in both ISO 10646 and the
> Unicode Standard is to aid in distinguishing plain text 8-bit files from
> plain text Unicode files.

The BOM, as a magic number, is extremely valuable. Implemented correctly,
it can be a great help to promote Unicode. Imagine the following scenario:

Multilingual/multiscript user dealing with several encodings.
For traditional encodings, this means either that the user has
to specify/guess the encoding each time a file is read in, or
that the application will get it wrong most of the time. For
Unicode, with a BOM, it means that without any work by the
user, the file is just read in nicely. Multilingual users
will very quickly get to appreciate this nice feature of
Unicode, even if they might not care about the rest of
Unicode's advantages.
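
A minimal sketch of how an application might behave in this scenario,
in Python. The function name, the returned labels and the choice of
"latin-1" as the fallback are assumptions made for illustration, and
Python's "utf-16-be" / "utf-16-le" codecs stand in here for 16-bit
Unicode text. A file that starts with a BOM is read as Unicode without
asking the user; anything else falls back to whatever traditional
encoding the user or the application has chosen.

```python
BOM_BIG_ENDIAN = b"\xfe\xff"     # U+FEFF serialized MSB first
BOM_LITTLE_ENDIAN = b"\xff\xfe"  # U+FEFF with its two octets swapped


def read_text(octets, fallback_encoding="latin-1"):
    """Decode file contents, using the BOM as a magic number.

    Returns (text, label). Hypothetical helper: the labels and the
    fallback encoding are illustrative assumptions.
    """
    if octets[:2] == BOM_BIG_ENDIAN:
        return octets[2:].decode("utf-16-be"), "unicode-big-endian"
    if octets[:2] == BOM_LITTLE_ENDIAN:
        return octets[2:].decode("utf-16-le"), "unicode-little-endian"
    # No BOM: the encoding has to be guessed or supplied by the user.
    return octets.decode(fallback_encoding), fallback_encoding
```

With this, read_text(b"\xfe\xff\x00A") and read_text(b"\xff\xfeA\x00")
both give the text "A" with no user interaction, while a file without
a BOM is only as right as the fallback guess.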

> (For example NT made the initial choice to
> overload *.TXT as a common extension for both Unicode and
> non-Unicode plain text files. As a consequence, Notepad does not read
> Unicode files w/o a BOM (even little endian ones)).
>
> On the web a higher level of precision is needed. Specifying protocols
> in terms of serialized byte streams and therefore requiring MSB
> canonical byte ordering increases data security and in these situations,
> the overhead of transposition (in the worst case twice) is fully acceptable.
> It's hard to find any arguments there.

Very true. Just that nowadays, almost every file written out at one
time may end up on the web sooner or later. So it's better to do
the work up front than to produce confusion and bad rumors.

> PS: for folks interested in historical footnotes: the BOM was proposed
> by WordPerfect at the time. While MS certainly implements it, it is not
> their invention. As the convention is defined in both ISO 10646 and the
> Unicode Standard, I would consider it inappropriate to label it with
> any company's name as some of the participants in this discussion
> have done.

Nice to know. I don't think anybody has named the BOM after any
company in the recent discussion.

Regards, Martin.



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:34 EDT