RE: Translated IUC10 Web pages: Experimental Results

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sun Feb 09 1997 - 04:40:09 EST


All the nice historical discussions on chip architecture aside, you
will find that both The Unicode Standard and ISO 10646 are
explicit in specifying MSB order ONLY WHERE data is "serialized into
bytes". This was done deliberately to allow conformant APIs to be
defined without requiring the parameters in memory buffers to be in Big
endian form on a little endian processor. (A string of Unicode characters
is a string of short integers, not a string of twice as many bytes).
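
To make that concrete, here is a minimal C sketch of the serialization
boundary (my own illustration; the helper name write_units_msb appears in
neither standard). In memory the code units stay in the processor's native
order; MSB-first order applies only when they are flattened into bytes:

    #include <stddef.h>
    #include <stdio.h>

    /* Illustrative helper, not from either standard: serialize a
       buffer of 16-bit Unicode code units into bytes, high-order
       byte first, regardless of the host CPU's native byte order. */
    static size_t write_units_msb(const unsigned short *units,
                                  size_t count, FILE *out)
    {
        size_t i;
        for (i = 0; i < count; ++i) {
            if (fputc((units[i] >> 8) & 0xFF, out) == EOF) break;
            if (fputc(units[i] & 0xFF, out) == EOF) break;
        }
        return i;    /* code units actually written */
    }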

On disk, if you view data as byte streams (as UN*X systems tend to do),
you could argue that the data are serialized and therefore big-endian.
On the other hand, think about memory-mapped files. On systems like NT
these are one of the few ways for processes to share memory buffers, if
not the only one. Again, it's essential to allow conformant applications
to be written without requiring the transposition of memory buffers.

All this was discussed both in Unicode and SC2/WG2 many years ago
when the conformance requirements were first established.

When you start considering networks, you need a protocol to distinguish
'native' usage from 'canonical' usage. This is where the BOM convention
was introduced: a leading U+FEFF read back as 0xFEFF confirms the
expected byte order, while 0xFFFE reveals that the bytes are reversed.
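
A hedged sketch of how a reader of a serialized byte stream can apply
that convention (the names detect_bom and byte_order are mine, not from
either standard):

    #include <stddef.h>

    typedef enum { ORDER_MSB_FIRST, ORDER_LSB_FIRST, ORDER_UNKNOWN } byte_order;

    /* Inspect the first two bytes of a serialized Unicode stream.
       A leading U+FEFF identifies the byte order: read in the right
       order it is 0xFEFF; byte-reversed it appears as 0xFFFE, which
       is not a valid character, so the reader knows to swap. */
    static byte_order detect_bom(const unsigned char *buf, size_t len)
    {
        if (len >= 2) {
            if (buf[0] == 0xFE && buf[1] == 0xFF) return ORDER_MSB_FIRST;
            if (buf[0] == 0xFF && buf[1] == 0xFE) return ORDER_LSB_FIRST;
        }
        return ORDER_UNKNOWN;   /* no signature; fall back to a
                                   protocol default or a guess */
    }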

Unfortunately, the use of the BOM character as a signature is a
recommendation only, not normatively required. That's true for both
ISO 10646 and Unicode. If I write files for access in 16-bit chunks
(instead of one byte at a time), these files can be conformant without
a BOM, since I am not 'serializing' the bytes of each Unicode character.

While you therefore can't throw the conformance clause at such an
application, it is indeed bad practice to produce plain text files
without a BOM. I would argue that this is true not only for applications
on little-endian processors, but ALSO for big-endian ones. One reason is
to give everybody a chance to distinguish known from unknown byte
orders; the other reason, also mentioned explicitly in both ISO 10646
and the Unicode Standard, is to aid in distinguishing plain text 8-bit
files from plain text Unicode files. (For example, NT made the initial
choice to overload *.TXT as a common extension for both Unicode and
non-Unicode plain text files. As a consequence, Notepad does not read
Unicode files without a BOM, even little-endian ones.)

On the web, a higher level of precision is needed. Specifying protocols
in terms of serialized byte streams, and therefore requiring canonical
MSB-first byte ordering, increases data security, and in these
situations the overhead of transposition (in the worst case twice, once
at each end) is fully acceptable. It's hard to find any arguments
against that.
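
The transposition itself is cheap; a sketch (again illustrative only):

    #include <stddef.h>

    /* Swap the two bytes of each 16-bit code unit in place. A
       little-endian sender transposes once into canonical MSB order;
       a little-endian receiver transposes once more on arrival,
       which is the "worst case twice" mentioned above. */
    static void swap_units(unsigned short *units, size_t count)
    {
        size_t i;
        for (i = 0; i < count; ++i)
            units[i] = (unsigned short)((units[i] << 8) | (units[i] >> 8));
    }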

A./

PS: For folks interested in historical footnotes: the BOM was proposed
by WordPerfect at the time. While MS certainly implements it, it is not
their invention. As the convention is defined in both ISO 10646 and the
Unicode Standard, I would consider it inappropriate to label it with any
company's name, as some of the participants in this discussion have
done.


