Mark Davis wrote:
> Thanks to the many people who suggested Myths. I have posted a new
> version on
>
> http://www.macchiato.com/slides/UnicodeMyths.ppt
>
> with new ones included after slide 8.
Here are my comments.
Slide 5
- Unassigned characters are characters (clause C6 in Chapter 3 of the
Standard notwithstanding).
Search the standard for "unassigned character"; it occurs several
times. Also, several clauses and definitions are incorrect or
incomplete if unassigned code points do not correspond to characters:
at least C9, C10, C13, D1, D6, D7, D9 (which should not restrict to
"graphic characters"), D11, D13, D14, D17..D24, D28, and the note
after D29.
- Format control characters are also characters.
- Private-use characters are definitely characters.
- The values D800..DFFF are not valid code point values; they are UTF-16
code unit values (the valid Unicode code point space is 0..D7FF union
E000..10FFFF).
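The range claim above can be expressed as a small predicate (a Python sketch; the function name is my own, not from any library):

```python
def is_valid_code_point(cp):
    # Valid in the sense argued above: 0..D7FF union E000..10FFFF.
    # D800..DFFF are UTF-16 code unit values (surrogates), excluded here.
    return 0x0000 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF

print(is_valid_code_point(0x0041))    # True  ('A')
print(is_valid_code_point(0xD800))    # False (high surrogate code unit)
print(is_valid_code_point(0x10FFFF))  # True  (last code point)
print(is_valid_code_point(0x110000))  # False (out of range)
```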
In computer jargon, "characters" are, by definition, the things that are
enumerated in coded-character-sets (regardless of whether or not they are
displayed as simple spacing glyphs, have control functions, are not yet
assigned, or have any other strange properties). Apart from the unfortunate
"noncharacter" terminology (which would have been better called "internal-use
characters"), all valid Unicode code points *do* correspond to characters
in this sense.
Note that there is no conflict between this jargon meaning of "character",
and its original meaning as a unit of text. The following definition may
be helpful:
An "orthocoding" is a set of rules for representing texts in some language
or symbology as sequences of characters. I.e. an orthocoding relates texts
to sequences of characters, in the same way that an orthography relates
texts to arrangements of glyphs.
When someone asks the FAQ about how to encode Indic half-forms in Unicode,
for instance, they have in mind an orthocoding in which the half-forms are
characters, and are asking how that orthocoding relates to the Unicode
orthocodings for Indic scripts (even if they don't think of the question
in that way). Input methods/keyboard layouts also effectively define an
orthocoding in which each keystroke corresponds to a "character".
The fact that orthocodings not designed for computers normally don't use
control characters (except possibly for "new line", "new paragraph", etc.)
does not mean that the controls in Unicode, ASCII, etc. are not characters.
The practical effect of defining controls not to be characters would just
be to require awkward constructions like "sequence of characters or controls"
all the time, instead of just "sequence of characters".
While we're on this subject, it's also redundant to say "abstract character":
*all* characters are abstractions, and the definition of this term (D3 in
Chapter 3 of the Unicode Standard) doesn't mean anything different to plain
"character", as defined above.
Slide 8
- the first bullet should say "grapheme cluster != code point".
Slide 13
- the International currency symbol is a poor analogy to a hypothetical
"decimal point" character: that symbol has a defined appearance
different to any specific currency symbol, and currency symbols
shouldn't be automatically changed according to locale anyway, unlike
decimal points (e.g. €1.50 never means the same as $1.50, but "1,5"
can mean the same as "1.5").
- a better argument against encoding a "decimal point" character is that
it isn't distinguished on keyboards (so ',' and '.' would still have
to be interpreted according to context anyway).
Slide 17
- the definition of "compatibility composite" is incorrect.
- support for round-trip transcoding does *not* require encoding of
compatibility characters, in most cases (because only strings need
to be round-tripped, not individual characters). Most compatibility
characters are unnecessary and should not have been encoded.
- what security risks? Any risks due to encoding of "look-alike"
characters have nothing to do with *compatibility* characters per se;
they occur also for non-compatibility look-alike characters (e.g.
Latin U+0061 'a' and Cyrillic U+0430 'а', which obviously cannot
be sensibly unified).
In any case, IMHO compatibility decomposables are *almost* all bad.
(The decomposable CJK radicals, for example, are not bad, but should
not have compatibility decompositions in the first place.) Whether or
not you agree with this, it is a supportable opinion, not a myth. I'm
sure that I can defend this point of view if anyone wants to discuss it
in more detail.
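The look-alike risk mentioned above is easy to demonstrate. The sketch below approximates the script of each letter by the first word of its Unicode character name (the real Script property is more precise; this is only an illustration):

```python
import unicodedata

def name_scripts(text):
    # Collect the first word of the character name for each letter;
    # for Latin/Cyrillic look-alikes this is enough to spot mixing.
    return {unicodedata.name(ch).split()[0]
            for ch in text
            if unicodedata.category(ch).startswith("L")}

legit = "paypal"
spoof = "p\u0430ypal"        # U+0430 CYRILLIC SMALL LETTER A replaces 'a'
print(name_scripts(legit))   # {'LATIN'}
print(name_scripts(spoof))   # {'CYRILLIC', 'LATIN'} -- mixed scripts
```

Note that neither U+0061 nor U+0430 is a compatibility character, which is exactly the point: look-alike risk is independent of compatibility status.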
Slide 24
- a pure 16-bit design was *not* possible, even without composed and
compatibility characters. This should have been recognised from the
start. RFC 373 (published in 1972!) correctly estimated that
17 bits were required for a universal coded-character-set. (It might
not sound as though there is very much difference between 16 and
17 bits, but there is.)
Arguably, the 16-bit design was a serious mistake: a variable-length
ASCII-compatible encoding (say 1 to 3 bytes, which allows a code space
of ~18 bits with alphabetic scripts in the 2-byte subspace) would
have fit much better into existing practice, and could have been treated
as just another "codepage" extending ASCII, rather than requiring
completely new APIs. (Think about how long it took for Windows 9x to
properly support two API sets.) At the very least, 16 bits was always
going to impose undesirable constraints and compromises.
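The ~18-bit figure is plausible arithmetic. Here is one hypothetical 1..3-byte layout (not UTF-8; the split points are my own illustration), in which the lead byte determines the sequence length and trail bytes may be anything in 80..FF:

```python
ascii_chars = 0x80                          # 1 byte: 00..7F encode themselves
trails      = 0x100 - 0x80                  # 128 possible trail byte values
two_byte    = (0xF0 - 0x80) * trails        # 112 leads * 128   = 14,336
three_byte  = (0x100 - 0xF0) * trails ** 2  # 16 leads * 128^2  = 262,144
total = ascii_chars + two_byte + three_byte
print(total, total > 2 ** 18)               # 276608 True -- just over 18 bits
```

The 2-byte subspace of ~14,000 code points is comfortably enough for the alphabetic scripts, as the text suggests.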
One of the advantages of having a code space larger than the number of
characters that are actually required, is that it provides enough room
to designate the most important properties (for example, major category,
case, combining class, bidirectional class, and line/word/grapheme break
properties) to sufficiently large unassigned ranges. Unicode doesn't do
this (except for default bidirectional class), but it could have done if
the *original* design had had a large enough code space (>= 18 bits),
and that would have had many advantages.
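The range-designation idea amounts to a sorted table plus binary search. The ranges below follow the real default bidirectional class rules (unassigned code points in certain blocks default to R or AL), but the code itself is only a sketch of the technique:

```python
import bisect

# (start, end, default bidirectional class) for ranges whose unassigned
# code points default to R or AL; everything else defaults to L here.
RANGES = [
    (0x0590, 0x05FF, "R"),    # Hebrew block range
    (0x0600, 0x07BF, "AL"),   # Arabic block range
]
STARTS = [r[0] for r in RANGES]

def default_bidi_class(cp):
    # Find the last range starting at or before cp, then check membership.
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0 and RANGES[i][0] <= cp <= RANGES[i][1]:
        return RANGES[i][2]
    return "L"

print(default_bidi_class(0x05EB))  # 'R'  (in the Hebrew range)
print(default_bidi_class(0x0041))  # 'L'
```

With a large enough code space, the same table-driven scheme could have covered category, case, combining class, and break properties for unassigned ranges, as the text proposes.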
Slide 29
- there are 1,112,064 valid Unicode code points, not 1,114,112.
(D800..DFFF are not valid code points.)
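The arithmetic behind that count:

```python
total_code_points = 0x110000           # 0..10FFFF: 1,114,112 values
surrogate_values  = 0xE000 - 0xD800    # D800..DFFF: 2,048 values
print(total_code_points - surrogate_values)  # 1112064
```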
Slide 33
- the first bullet point is correct: human writing systems are complex.
However, the second bullet point is extremely dubious: there was no
need for multiple byte orders (it would have been perfectly reasonable
to specify a fixed byte order for external representation), multiple
encoding and normalisation forms, or for the vast majority of compatibility
characters. There is a considerable amount of *unnecessary* complexity in
Unicode that is not imposed by the problem of defining a universal
coded-character-set. Some of that complexity would have been difficult to
avoid without the benefit of hindsight, but some could and should have
been avoided.
- "Yen vs backslash" doesn't belong in the list because it is not a
complexity of Unicode. U+005C is unambiguously backslash; the fact that
0x5C can mean either yen or backslash in, e.g. the IANA "Shift_JIS"
charset, is a problem with the definition of that charset (which could
only be fixed by changing the IANA charsets registry), and has nothing
to do with Unicode.
--
David Hopwood <david.hopwood@zetnet.co.uk>
Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip
This archive was generated by hypermail 2.1.2 : Tue Apr 09 2002 - 17:03:48 EDT