Re: Last Call: UTF-16, an encoding of ISO 10646 to Informational

From: Paul Keinanen (keinanen@sci.fi)
Date: Mon Aug 16 1999 - 05:00:22 EDT


At 15:34 15.8.1999 -0700, Markus Kuhn wrote:
>Patrik Fältström wrote on 1999-08-15 15:55 UTC:
>> I.e. from my point of view, Paul tries to register three different names
>> which can be used in MIME:
>>
>> UTF-16
>> UTF-16LE
>> UTF-16BE
>>
>> As area director, I need to know whether it is wrong or right to do this
>> registration.
>
>Very clear answer:
>
>It is WRONG to register both a bigendian and a littleendian variant
>of UTF-16.
>
>Reasons:
>
> a) it has been long-established practice to use *exclusively* bigendian
> convention in ISO, ITU, IETF, ECMA, and Internet RFC protocols

While Internet RFCs are big-endian, it is twisting the truth to claim that
the other organisations have a long-established practice of big-endianness.
Look at any bit-serial protocol since Baudot (R)TTY: RS-232 and its CCITT
(currently ITU-T) variants, SDLC, HDLC, X.25 etc. all transmit the
least significant bit first.

Thus, when a 16-bit character is sent in big-endian byte order, the bits
actually transmitted are: LSB of the most significant byte, intermediate
bits of the most significant byte, MSB of the most significant byte, LSB of
the least significant byte, intermediate bits and finally MSB of the least
significant byte :-).
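
The ordering described above can be sketched in a few lines of Python. This
is only an illustration, assuming each octet is serialised LSB first as on
an RS-232 style link; the helper name wire_bits and the sample character
U+20AC are made up for this example.

```python
def wire_bits(data):
    """Return the bits of `data` in transmission order.

    Assumption: bytes are sent in the order given, and each byte is
    shifted out least significant bit first.
    """
    bits = []
    for byte in data:
        for i in range(8):
            bits.append((byte >> i) & 1)  # bit 0 (LSB) goes out first
    return bits


code_point = 0x20AC                            # sample 16-bit character
big_endian = code_point.to_bytes(2, "big")     # most significant byte first
little_endian = code_point.to_bytes(2, "little")

print("BE bytes on the wire:", wire_bits(big_endian))
print("LE bytes on the wire:", wire_bits(little_endian))
```

With big-endian byte order the 16 bits come out in neither ascending nor
descending significance; with little-endian byte order they come out in
strictly ascending significance.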

If 16-bit Unicode some day completely replaces the 8-bit character sets,
then there is not much need for computer architecture designers to create
byte-addressable computers; instead, the smallest directly addressable
unit would be a 16-bit chunk (word, half-word or whatever it might be
called). Communication system designers can then use 16-bit chunks
(instead of 8-bit octets) as their basic unit. I have already seen UARTs in
which the number of data bits can be programmed between 5 and 14, and I
do not think it will take long until the upper limit is extended
to 16 bits, making it possible to transmit a Unicode character between the
start and stop bits. No doubt the bits would be transferred LSB first.

In order not to start a holy endian war on this list, I think I should not
comment on the other points. I have used both little-endian and big-endian
architectures; each has its advantages and disadvantages, so it is hard
for me to see why some people, especially in the Unix user community,
promote one endianness so strongly.

Anyway, UTF-16BE and UTF-16LE are good, unambiguous names, and I do not see
why such an ambiguous name as UTF-16 should be registered at all.
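
To illustrate the ambiguity, here is a small Python sketch. It shows only
how Python's standard codecs happen to treat the three labels, not what the
registration under discussion specifies; the sample character is arbitrary.

```python
text = "A"  # U+0041

be = text.encode("utf-16-be")    # fixed byte order: 00 41
le = text.encode("utf-16-le")    # fixed byte order: 41 00
plain = text.encode("utf-16")    # Python prepends a byte order mark and
                                 # uses the platform's native byte order

print(be.hex(), le.hex(), plain.hex())

# The same two octets read with the wrong byte order give a different
# character: 00 41 taken as little-endian is U+4100, not U+0041.
misread = be.decode("utf-16-le")
print(hex(ord(misread)))
```

The explicit names leave nothing to guess, whereas the plain label needs
either a byte order mark or an agreed default before the octets can be
interpreted.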

Paul Keinänen


