Re: Unicode and end users

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Thu Feb 14 2002 - 10:15:24 EST


-----BEGIN PGP SIGNED MESSAGE-----

Doug Ewell wrote:
> "Lars Kristan" <lars.kristan@hermes.si> wrote:
> > This again makes me think that UTF-8 and UTF-16 are not both Unicode.

No charset/CEF should be called Unicode; that would be ambiguous and
inaccurate. Unicode is the name of a standard, a Coded Character Set,
and an Abstract Character Repertoire.

> > Maybe UTF-16 is 'more' Unicode right now, because of the past. But
> > maybe UTF-8 will be 'more' Unicode in the future, because it can
> > contain invalid sequences and these can be properly interpreted by
> > someone at a later time.
> > Unless UTF-16 has that same ability, it will lose the battle of being
> > an 'equally good Unicode format'.

The UTFs are not equally good, for most application-dependent definitions
of "good".

In hindsight, BTW, it would have been possible to fit all of the alphabets,
abugidas and abjads in modern use into two bytes, and the rest into three
bytes (up to 0x410C0 = 266432 characters, which is plenty of room for
ideographs and historic or constructed scripts) while still preserving
US-ASCII compatibility, and almost all of the other nice properties that
UTF-8 has. (The exception is that naïve substring searching could find a
match starting part-way through a character - but it would be easy to
reject false matches by looking at the previous byte.)
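A minimal sketch of that previous-byte check, assuming the byte layout given
in the table in this message (every non-final byte of a character is >= 0xC0,
every final byte is < 0xC0). The function name find_char_aligned is
hypothetical, and the needle is assumed to be a non-empty, complete encoded
string:

```python
def find_char_aligned(haystack: bytes, needle: bytes, start: int = 0) -> int:
    """Return the first offset where needle occurs on a character boundary.

    A match at offset pos starts a character only if pos is 0 or the byte
    before it is a final byte (< 0xC0). If the previous byte is >= 0xC0,
    the match begins part-way through a longer character and is rejected.
    Returns -1 when there is no aligned match.
    """
    pos = haystack.find(needle, start)
    while pos != -1:
        if pos == 0 or haystack[pos - 1] < 0xC0:
            return pos
        # False match inside a multi-byte character: keep looking.
        pos = haystack.find(needle, pos + 1)
    return -1


# The two-byte character C1 85 also appears as the tail of the three-byte
# character C5 C1 85, so a naive search reports a match at offset 1.
text = b"\xc5\xc1\x85" + b"\xc1\x85"
naive = text.find(b"\xc1\x85")
aligned = find_char_aligned(text, b"\xc1\x85")
```

Because the needle itself ends in a final byte, an accepted match also ends
on a character boundary, so only the start needs checking.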

Here's how it could have been done, avoiding irregular sequences:

  Byte sequence                   Code point sequence
  -------------                   -------------------
  0xxxxxxx                     -> xxxxxxx
  10xxxxxx                     -> 0x80 + xxxxxx
  11xxxxxx 10yyyyyy            -> 0xC0 + xxxxxxyyyyyy
  11xxxxxx 11yyyyyy 10zzzzzz   -> 0x10C0 + xxxxxxyyyyyyzzzzzz
  1*(11xxxxxx) 0yyyyyyy        -> <error mark>, yyyyyyy
Note that the last byte of every valid encoded character is < 0xC0, and
therefore it's possible to resynchronise by going backwards or forwards
(up to two bytes) until you hit such a byte. This format is infinitely
extensible, but 266432 characters should be sufficient, at least until
we meet the aliens :-)
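To make the table concrete, here is a rough Python sketch of an encoder and
decoder for this hypothetical format. It is only an illustration under some
assumptions of mine: the names encode, decode, OFFSETS, LIMIT and ERROR_MARK
are invented, the <error mark> is represented by -1, sequences with more than
two lead bytes are treated as errors (the possible extension to longer forms
is not implemented), and lead bytes left dangling at the end of input also
yield an error mark. The "code points" are positions in this scheme's own
numbering, not Unicode code points.

```python
# Offsets added to the payload for 0, 1 and 2 lead bytes (from the table).
OFFSETS = (0x80, 0xC0, 0x10C0)
LIMIT = 0x410C0      # one past the largest three-byte code point
ERROR_MARK = -1      # stand-in for <error mark>; an assumption of this sketch


def encode(cp: int) -> bytes:
    """Encode one code point of the hypothetical scheme."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:                      # 0xxxxxxx
        return bytes([cp])
    if cp < 0xC0:                      # 10xxxxxx
        return bytes([0x80 | (cp - 0x80)])
    if cp < 0x10C0:                    # 11xxxxxx 10yyyyyy
        v = cp - 0xC0
        return bytes([0xC0 | (v >> 6), 0x80 | (v & 0x3F)])
    if cp < LIMIT:                     # 11xxxxxx 11yyyyyy 10zzzzzz
        v = cp - 0x10C0
        return bytes([0xC0 | (v >> 12),
                      0xC0 | ((v >> 6) & 0x3F),
                      0x80 | (v & 0x3F)])
    raise ValueError("code point out of range for three bytes")


def decode(data: bytes) -> list:
    """Decode bytes into a list of code points and ERROR_MARK entries."""
    out = []
    i, n = 0, len(data)
    while i < n:
        j = i
        while j < n and data[j] >= 0xC0:   # gather lead bytes (11xxxxxx)
            j += 1
        leads = data[i:j]
        if j == n:                         # lead bytes with no final byte
            out.append(ERROR_MARK)
            break
        final = data[j]
        if final < 0x80:                   # ASCII byte
            if leads:                      # 1*(11xxxxxx) 0yyyyyyy
                out.append(ERROR_MARK)
            out.append(final)
        elif len(leads) > 2:               # longer forms not handled here
            out.append(ERROR_MARK)
        else:
            v = 0
            for b in leads:
                v = (v << 6) | (b & 0x3F)
            v = (v << 6) | (final & 0x3F)
            out.append(v + OFFSETS[len(leads)])
        i = j + 1
    return out
```

For example, decode(encode(cp)) gives back [cp] for any cp below LIMIT, and a
lead byte followed directly by an ASCII byte decodes to an error mark followed
by that ASCII character.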

[It's interesting that RFC 373, written in 1972, anticipates a universal
coded character set and estimates that "17 bits should be enough even to
include Chinese", which is absolutely spot on. The above format gives
just over 18 bits.]
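The capacity figures quoted in this message can be checked with a few lines
of Python. This is only a sanity check of the arithmetic; the variable names
are mine:

```python
import math

one_byte = 128 + 64        # 0xxxxxxx plus 10xxxxxx
two_byte = 64 ** 2         # 11xxxxxx 10yyyyyy
three_byte = 64 ** 3       # 11xxxxxx 11yyyyyy 10zzzzzz

total = one_byte + two_byte + three_byte
bits = math.log2(total)    # equivalent number of bits of code space

print(hex(one_byte), hex(one_byte + two_byte), hex(total), total)
print(round(bits, 2))
```

The running totals 0xC0 and 0x10C0 are the offsets used in the table, and the
final total is 0x410C0 = 266432, slightly more than 2^18 = 262144.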

There would be 0x1000 = 4096 two-byte characters, which is sufficient to
encode Latin, Greek/Coptic, Cyrillic, Armenian, Devanagari, Bengali, Gurmukhi,
Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam, Sinhala, Thai, Lao, Tibetan,
Myanmar, Georgian, Ethiopic, Cherokee, Tagalog, Hanunoo, Buhid, Tagbanwa,
Khmer, Mongolian, Hangul (using cluster Jamo), Hiragana, Katakana, Bopomofo,
Hebrew, Arabic, Syriac, Thaana, and a respectable subset of mathematical
characters, symbols and punctuation. They wouldn't even need to be squashed
up very much, compared to the current encodings. Of the modern scripts, only
Unified Canadian Aboriginal Syllabics (UCAS), Yi, and Han would need 3 bytes.

The above assumes a fully decomposed encoding, with the most commonly used
combining marks encoded in the 64 non-ASCII single-byte codes from 0x80..0xBF.
It would be a bit more complicated to transcode *to* a legacy charset using
precomposed characters, but normalisation would be much easier.

IOW, all of the UTFs we have are decidedly suboptimal. Not that I'm
suggesting changing them now.

> I don't think the fact that invalid sequences are possible in UTF-8 and
> not in UTF-16 makes UTF-8 inferior, or any less "Unicode."

Invalid (more precisely ill-formed) sequences certainly are possible in
UTF-16: unpaired surrogates are ill-formed.
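As a quick illustration, Python's built-in "utf-16-le" codec rejects unpaired
surrogates when decoding, so a well-formedness check can be sketched as below.
The helper name is_well_formed_utf16le is hypothetical; the check also
rejects input with an odd number of bytes.

```python
def is_well_formed_utf16le(data: bytes) -> bool:
    """Return True if data decodes as strict little-endian UTF-16."""
    try:
        data.decode("utf-16-le")
    except UnicodeDecodeError:
        return False
    return True


paired = b"\x3d\xd8\x00\xde"      # D83D DE00: a valid surrogate pair
lone_high = b"\x00\xd8\x41\x00"   # D800 followed by 'A': unpaired high
lone_low = b"\x00\xdc"            # DC00 alone: unpaired low

results = [is_well_formed_utf16le(s) for s in (paired, lone_high, lone_low)]
```

Here only the first sequence is accepted; the other two are ill-formed UTF-16.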

> It was designed that way. Invalid sequences always represent a problem,
> just like line noise. They should not be treated as a normal situation.

Yes, exactly.

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPGt+bjkCAxeYt5gVAQFmGwf/aI3Pt3tvhponXZN1lEzXzY9SgxvpTqQk
b6xwR/e0l9L3a08o9NpRaC8OIj7TXazDBycz+c0YcnWKXVJ5zYF2AezAuCOl+DQh
mWS6TfqnqFH/k2iXnxqrxfcysErCfKRvhiuhNrTcZ0fRVJ0oxGselxeDXWiT9IZZ
GCxILsKqbUtHR1rKopHdxyFWXnycmYLfFy5Mca3v6GCI65O8nyScwq1njGJQf31D
vCcaNcAK/AQ2UTED/XaixD3EdSSOsftjBRe8/JPpfDuxiRbyzGm+5Q03ed9011HO
E1j7A2NxZyIPfIoGXJBMpGOxUGCyBa3pDN8k8/PQ37pfG2WE5rdJ2g==
=/DHN
-----END PGP SIGNATURE-----


