Markus Kuhn <Markus.Kuhn@cl.cam.ac.uk> wrote:
> It is a real pity that this went into Unicode and we have now ended
> up with the BOM mess and almost a dozen different encoding forms:
> UCS-2, UCS-4, UTF-1, UTF-7, UTF-8, UTF-16, UTF-32, UTF-16BE, UTF-16LE,
> UTF-32BE, UTF-32LE.
Only the last four have anything to do with byte order.
- UCS-2 is the 16-bit-only version of the "true" UCS-4, with no support
  for surrogates. It is the "original" Unicode.
- UTF-16 is UCS-2 with surrogate support added.
- UTF-32 is UCS-4 constrained to Unicode character semantics (as opposed
  to ISO 10646) and a range of U-00000000 to U-0010FFFF.
- UTF-1 was created to allow systems built around 8-bit characters to
  migrate to Unicode with less pain. (That means Unix and Linux as
  much as Windows.)
- UTF-8 is the much-improved version of UTF-1.
- UTF-7 was created to allow Unicode in e-mail despite the presence of
  network nodes that can't EVEN deal with the 8th bit. (Unix's hands
  are MUCH dirtier than Microsoft's here.)
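
To make the differences concrete, here is a minimal Python sketch that
encodes one character outside the 16-bit range in several of the forms
listed above. The sample character (U+10400) and the helper name
hex_of() are my own choices for illustration. Python ships no UCS-2 or
UTF-1 codec, so those two are left out; UCS-2 could not represent this
character at all, since it has no surrogates.

```python
# Illustrative sketch: one supplementary character in several encoding forms.
# The sample character and the helper name are assumptions for this example.

SAMPLE = "\U00010400"  # U+10400, outside the 16-bit range


def hex_of(text, encoding):
    """Return the encoded bytes of text as a lowercase hex string."""
    return text.encode(encoding).hex()


utf8 = hex_of(SAMPLE, "utf-8")         # four bytes, no byte-order issue
utf16be = hex_of(SAMPLE, "utf-16-be")  # surrogate pair D801 DC00, big-endian
utf16le = hex_of(SAMPLE, "utf-16-le")  # same pair, bytes swapped in each unit
utf32be = hex_of(SAMPLE, "utf-32-be")  # one 32-bit unit, big-endian
utf32le = hex_of(SAMPLE, "utf-32-le")  # same unit, little-endian

# UTF-7 keeps every byte below 0x80, which is the point of the format.
utf7 = "\u00e9".encode("utf-7")
seven_bit_clean = all(b < 0x80 for b in utf7)

if __name__ == "__main__":
    for name, value in [("UTF-8", utf8), ("UTF-16BE", utf16be),
                        ("UTF-16LE", utf16le), ("UTF-32BE", utf32be),
                        ("UTF-32LE", utf32le)]:
        print(f"{name:9} {value}")
    print("UTF-7 is 7-bit clean:", seven_bit_clean)
```

Only the UTF-16 and UTF-32 rows change when the byte order changes; the
UTF-8 bytes are the same on every machine.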
And while the co-existence of big-endian and little-endian systems that
must communicate with each other is certainly a mess, I hardly consider
the BOM itself to be a mess. It's an elegant solution to an existing
problem that the developers of Unicode and ISO 10646 did not create,
but did anticipate.
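
As a rough sketch of what a receiver can do with the BOM, the Python
below checks the first bytes of a buffer against the BOM constants in
the standard codecs module. The function names sniff_bom() and
decode_with_bom(), and the UTF-8 fallback, are assumptions made for this
example, not anything the standards prescribe. The UTF-32 marks are
tested first because the little-endian UTF-32 BOM (FF FE 00 00) begins
with the little-endian UTF-16 BOM (FF FE).

```python
import codecs

# Longest BOMs first: FF FE 00 00 (UTF-32LE) starts with FF FE (UTF-16LE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, default="utf-8"):
    """Strip a leading BOM if present and decode with the byte order it signals.

    The fallback encoding is an assumption; real protocols usually say
    what to do when no BOM is present.
    """
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or default)


if __name__ == "__main__":
    little = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")
    big = codecs.BOM_UTF16_BE + "hi".encode("utf-16-be")
    print(sniff_bom(little), decode_with_bom(little))
    print(sniff_bom(big), decode_with_bom(big))
```

One caveat with this approach: a UTF-16LE stream whose first character
after the BOM is U+0000 looks exactly like a UTF-32LE BOM, so a sniffer
like this one would misread it.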
-Doug Ewell
Fullerton, California