Re: Unicode Myths

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Tue Apr 09 2002 - 16:05:57 EDT


-----BEGIN PGP SIGNED MESSAGE-----

Mark Davis wrote:
> Thanks to the many people who suggested Myths. I have posted a new
> version on
>
> http://www.macchiato.com/slides/UnicodeMyths.ppt
>
> with new ones included after slide 8.

Here are my comments.

Slide 5
 - Unassigned characters are characters (clause C6 in Chapter 3 of the
   Standard notwithstanding).
   Search the standard for "unassigned character"; it occurs several
   times. Also, several clauses and definitions are incorrect or
   incomplete if unassigned code points do not correspond to characters:
   at least C9, C10, C13, D1, D6, D7, D9 (which should not restrict to
   "graphic characters"), D11, D13, D14, D17..D24, D28, and the note
   after D29.
 - Format control characters are also characters.
 - Private-use characters are definitely characters.
 - The values D800..DFFF are not valid code point values, they are UTF-16
   code unit values (the valid Unicode code point space is 0..D7FF union
   E000..10FFFF.)
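
The code point space just described is easy to get wrong in software; a minimal sketch (my code, following the view of the space stated above, in which D800..DFFF are excluded):

```python
# Hypothetical helper (mine, not from the slides): tests membership in
# the space 0..D7FF union E000..10FFFF described above.
def is_valid_code_point(cp: int) -> bool:
    # D800..DFFF are UTF-16 surrogate code unit values, not code points.
    return (0x0000 <= cp <= 0xD7FF) or (0xE000 <= cp <= 0x10FFFF)

print(is_valid_code_point(0x0041))    # True
print(is_valid_code_point(0xD800))    # False: surrogate range
print(is_valid_code_point(0x10FFFF))  # True: last valid code point
print(is_valid_code_point(0x110000))  # False: beyond the code space
```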

In computer jargon, "characters" are, by definition, the things that are
enumerated in coded-character-sets (regardless of whether or not they are
displayed as simple spacing glyphs, have control functions, are not yet
assigned, or have any other strange properties). Apart from the unfortunate
"noncharacter" terminology (which would have been better called "internal-use
characters"), all valid Unicode code points *do* correspond to characters
in this sense.

Note that there is no conflict between this jargon meaning of "character",
and its original meaning as a unit of text. The following definition may
be helpful:

  An "orthocoding" is a set of rules for representing texts in some language
  or symbology as sequences of characters. That is, an orthocoding relates texts
  to sequences of characters, in the same way that an orthography relates
  texts to arrangements of glyphs.

When someone asks the FAQ about how to encode Indic half-forms in Unicode,
for instance, they have in mind an orthocoding in which the half-forms are
characters, and are asking how that orthocoding relates to the Unicode
orthocodings for Indic scripts (even if they don't think of the question
in that way). Input methods/keyboard layouts also effectively define an
orthocoding in which each keystroke corresponds to a "character".

The fact that orthocodings not designed for computers normally don't use
control characters (except possibly for "new line", "new paragraph", etc.)
does not mean that the controls in Unicode, ASCII, etc. are not characters.
The practical effect of defining controls not to be characters would just
be to require awkward constructions like "sequence of characters or controls"
all the time, instead of just "sequence of characters".

While we're on this subject, it's also redundant to say "abstract character":
*all* characters are abstractions, and the definition of this term (D3 in
Chapter 3 of the Unicode Standard) doesn't mean anything different to plain
"character", as defined above.

Slide 8
 - the first bullet should say "grapheme cluster != code point".
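
To illustrate the distinction (my example, not from the slides): one user-perceived character, i.e. one grapheme cluster, may be several code points:

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT displays as a single 'é'.
s = "e\u0301"
print(len(s))                                # 2 code points
print(len(unicodedata.normalize("NFC", s)))  # 1 code point after composition
```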

Slide 13
 - the International currency symbol is a poor analogy to a hypothetical
   "decimal point" character: that symbol has a defined appearance
   different to any specific currency symbol, and currency symbols
   shouldn't be automatically changed according to locale anyway, unlike
   decimal points (e.g. €1.50 never means the same as $1.50, but "1,5"
   can mean the same as "1.5").
 - a better argument against encoding a "decimal point" character is that
   it isn't distinguished on keyboards (so ',' and '.' would still have
   to be interpreted according to context anyway).
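
A sketch of that second point (the function name is my own invention, purely illustrative): since keyboards offer only ',' and '.', a parser has to choose the decimal separator from a locale setting or other context, not from a dedicated character:

```python
# Illustrative only: the decimal separator is supplied out-of-band
# (e.g. from the locale), because the input cannot distinguish it.
def parse_decimal(text: str, decimal_sep: str) -> float:
    if decimal_sep == ",":
        # Treat '.' as a grouping separator and ',' as the decimal point.
        text = text.replace(".", "").replace(",", ".")
    return float(text)

print(parse_decimal("1,5", ","))  # 1.5
print(parse_decimal("1.5", "."))  # 1.5
```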

Slide 17
 - the definition of "compatibility composite" is incorrect.
 - support for round-trip transcoding does *not* require encoding of
   compatibility characters, in most cases (because only strings need
   to be round-tripped, not individual characters). Most compatibility
   characters are unnecessary and should not have been encoded.
 - what security risks? Any risks due to encoding of "look-alike"
   characters have nothing to do with *compatibility* characters per se;
   they occur also for non-compatibility look-alike characters (e.g.
   Latin U+0061 'a' and Cyrillic U+0430 'а', which obviously cannot
   be sensibly unified).
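
That look-alike pair can be demonstrated directly (my example): the two characters render almost identically but are distinct code points, which is the root of "homograph" risks, independent of compatibility characters:

```python
import unicodedata

latin_a = "\u0061"  # LATIN SMALL LETTER A
cyr_a   = "\u0430"  # CYRILLIC SMALL LETTER A; visually near-identical
print(latin_a == cyr_a)         # False: distinct code points
print(unicodedata.name(cyr_a))  # CYRILLIC SMALL LETTER A
```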

In any case, IMHO compatibility decomposables are *almost* all bad.
(The decomposable CJK radicals, for example, are not bad, but should
not have compatibility decompositions in the first place.) Whether or
not you agree with this, it is a supportable opinion, not a myth. I'm
sure that I can defend this point of view if anyone wants to discuss it
in more detail.

Slide 24
 - a pure 16-bit design was *not* possible, even without composed and
   compatibility characters. This should have been recognised from the
   start. RFC 373 (published in 1972!) correctly estimated that
   17 bits were required for a universal coded-character-set. (It might
   not sound as though there is very much difference between 16 and
   17 bits, but there is.)

Arguably, the 16-bit design was a serious mistake - a variable length
ASCII-compatible encoding (say 1 to 3 bytes, which allows a code space
of ~18 bits with alphabetic scripts in the 2-byte subspace), would
have fit much better into existing practice, and could have been treated
as just another "codepage" extending ASCII, rather than requiring
completely new APIs. (Think about how long it took for Windows 9x to
properly support two API sets.) At the very least, 16 bits was always
going to impose undesirable constraints and compromises.

One of the advantages of having a code space larger than the number of
characters that are actually required, is that it provides enough room
to designate the most important properties (for example, major category,
case, combining class, bidirectional class, and line/word/grapheme break
properties) to sufficiently large unassigned ranges. Unicode doesn't do
this (except for default bidirectional class), but it could have done if
the *original* design had had a large enough code space (>= 18 bits),
and that would have had many advantages.

Slide 29
 - there are 1,112,064 valid Unicode code points, not 1,114,112.
   (D800..DFFF are not valid code points.)
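
The arithmetic behind that count:

```python
total      = 0x10FFFF + 1         # 1,114,112 values in 0..10FFFF
surrogates = 0xDFFF - 0xD800 + 1  # 2,048 values in D800..DFFF
print(total - surrogates)         # 1112064
```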

Slide 33
 - the first bullet point is correct - human writing systems are complex.
   However, the second bullet point is extremely dubious: there was no
   need for multiple byte orders (it would have been perfectly reasonable
   to specify a fixed byte order for external representation), multiple
   encoding and normalisation forms, or for the vast majority of compatibility
   characters. There is a considerable amount of *unnecessary* complexity in
   Unicode that is not imposed by the problem of defining a universal
   coded-character-set. Some of that complexity would have been difficult to
   avoid without the benefit of hindsight, but some could and should have
   been avoided.
 - "Yen vs backslash" doesn't belong in the list because it is not a
   complexity of Unicode. U+005C is unambiguously backslash; the fact that
   0x5C can mean either yen or backslash in, e.g. the IANA "Shift_JIS"
   charset, is a problem with the definition of that charset (which could
   only be fixed by changing the IANA charsets registry), and has nothing
   to do with Unicode.
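
Both points above can be shown concretely; a small demonstration (mine, not from the slides; the Shift_JIS behaviour shown is that of Python's own codec, not a claim about the IANA registry entry):

```python
import unicodedata

# Multiple byte orders and normalisation forms for the same text:
s = "\u00e9"                           # é, one code point
print(s.encode("utf-16-be").hex())     # 00e9
print(s.encode("utf-16-le").hex())     # e900
nfd = unicodedata.normalize("NFD", s)  # 'e' + combining acute accent
print(len(s), len(nfd))                # 1 2

# Yen vs backslash: the charset *mapping*, not Unicode, decides what
# byte 0x5C means. Python's shift_jis codec, for one, maps it to
# U+005C (backslash).
print(b"\x5c".decode("shift_jis") == "\u005c")  # True
```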

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPLNF2TkCAxeYt5gVAQEP+Af+P6NlrhpRaCu/nJnhfippikMK/Yf8mPw9
t4ZNf9b+SKrXPolK3fhLZruGxFJ8doBf7waL2qMyajUzlqDBc6KkltaEDLThl4DM
UNgdJwgJiNpqFsgOP2f6ruRcOfmSrqPT7F0l9rghS+doP6tED/9Kx9xVUNPSA3HN
+iIXl0A+0nnUJvAhV7IaurrH3cWTPUFRNerciHqBA5PXSahqYKvB+JhbjNO+lZYS
4Mc/sM9VNwB6oyzo9sasmywEZMCSOONA2erMMloo7KIRFUvo3eTyi4gqWQfECsOB
cfd73myFeTTyz8HtY5IqQq1TRAvVt8FJRyJnHMt2CZtSq4Yi161iuw==
=91Sv
-----END PGP SIGNATURE-----



This archive was generated by hypermail 2.1.2 : Tue Apr 09 2002 - 17:03:48 EDT