Re: Unicode Myths

From: Mark Davis (mark@macchiato.com)
Date: Fri Apr 12 2002 - 12:30:13 EDT

Previous message: Ben Monroe: "Re: Please help: Unicode sig in Hotmail"
In reply to: Peter_Constable@sil.org: "Re: Unicode Myths"
Next in thread: Rick Cameron: "RE: Unicode Myths"

You raise a couple of issues in your mail, one having to do with
terminology and the other having to do with the disposition of
surrogates.

A. Terminology.

The term "noncharacter" is jargon that refers to a small set of code
points (66 in fact, see
http://www.macchiato.com/unicode/statistics.htm; I recently added
sample characters to that page also).

This is to be distinguished from code points that are *not interpreted
as characters*. The latter are listed in C4-C6 of
http://www.unicode.org/unicode/uni2book/ch03.pdf (amended by
http://www.unicode.org/unicode/reports/tr27/#conformance).

In the more precise terminology of code point/unit (rather than the
older code value), this list would be:

a. surrogate code points: U+D800..U+DFFF
b. noncharacter code points: U+FFFF, etc.
c. unassigned code points

There is a bit of fuzziness around the term 'unassigned'. None of the
above three types of code point has characters assigned to it. The
surrogate and noncharacter code points are permanently reserved, and
can't ever--now or in the future--have characters assigned to them,
whereas the unassigned code points can have characters assigned to
them in the future.

Unfortunately, the UCD has a slightly different cut. It has always
equated Cn with 'unassigned', and Cn includes noncharacters. So the
UCD sense of 'unassigned' includes both (b) and (c) above.
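
To make the three categories concrete, here is a minimal Java sketch.
It is only an illustration: the class and method names are made up,
and the noncharacter ranges (U+FDD0..U+FDEF plus the last two code
points of each plane) come from the Unicode Standard rather than from
the text above. It also shows the UCD cut: java.lang.Character.getType
reports the general category, so a noncharacter such as U+FFFF comes
back as UNASSIGNED (Cn), while a surrogate code point comes back as
SURROGATE (Cs).

```java
// Illustrative sketch; CodePointKinds and its methods are hypothetical names.
final class CodePointKinds {

    // (a) surrogate code points: U+D800..U+DFFF
    static boolean isSurrogateCodePoint(int cp) {
        return cp >= 0xD800 && cp <= 0xDFFF;
    }

    // (b) noncharacter code points: U+FDD0..U+FDEF, plus the last two
    // code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, ...).
    // These ranges are taken from the Unicode Standard (assumption: they
    // are not spelled out in the message itself).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Counts noncharacters over the whole code space U+0000..U+10FFFF.
    static int countNoncharacters() {
        int n = 0;
        for (int cp = 0; cp <= 0x10FFFF; cp++) {
            if (isNoncharacter(cp)) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        // 32 code points in U+FDD0..U+FDEF plus 2 per plane * 17 planes.
        System.out.println(countNoncharacters()); // prints 66

        // The UCD general category lumps noncharacters into Cn.
        System.out.println(Character.getType(0xFFFF) == Character.UNASSIGNED); // prints true
        System.out.println(Character.getType(0xD800) == Character.SURROGATE);  // prints true
    }
}
```

Category (c) is deliberately left out of the sketch, since which code
points are still unassigned depends on the version of Unicode that a
given runtime implements.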

Clearly we need to clean up this terminology a bit. There is a
proposal to change the terminology so that 'unassigned' means
'unassigned, to characters'. In that case, (a), (b), and (c) would all
be 'unassigned'. A new term, 'undesignated', would replace (c). In
that case, we would change the definition of Cn to be 'noncharacter or
undesignated'.

(Myself, I find the use of the term 'unassigned code point' as
characterizing (a), (b), and (c) to be counter-intuitive. If I have a
set of X's, and a subset that I called unassigned-X's, I would expect
the latter to be not designated for any use whatsoever. I'd rather
find another term for 'unassigned-to-characters'. And 'anticharacter'
sounds too apocalyptic! ;-)

The term 'noncharacter' would be the obvious choice--as indicated by
your email!--but is taken by (b). That is a bit unfortunate: it is
also not a great term for that set of code points, since there are
other code points that are 'not characters'. Given the actual usage,
(b) would perhaps be better called internal-use code points. They are
not to be interchanged, but can be used internally; for example, as
sentinel values. That is, I'd prefer the following taxonomy:

1. characters (those code points associated with abstract characters)
  a. letters
  b. numbers
  ...
2. noncharacters (those code points not associated with abstract
characters)
  a. surrogate code points U+D800..U+DFFF
  b. internal-use code points U+FFFF, etc.
  c. unassigned code points

B. Surrogates

I can understand your point of view, and had Unicode not developed the
way it did, I might agree with you. The editorial committee is trying
to come up with recommendations to the UTC about how to clarify the
situation in U4.0 (but, as Ken said, this is *very* preliminary -- the
editorial committee has yet to agree, let alone the UTC).

The issue is that surrogate code units (and the reservation of the
corresponding code points -- code positions in 10646) were
specifically designed to allow for smooth interoperability between
UCS-2 and UTF-16. See 5.5 in
<http://www.unicode.org/unicode/uni2book/ch05.pdf>. Any sequence of
code units from 0000 to FFFF is allowed in UCS-2. Correspondingly,
isolated surrogate code points are specifically allowed in UTF-16. It
is thus perfectly legal for a Java char datatype or String object
(or C# equivalents), which is specified to hold Unicode code units, to
have an isolated surrogate code unit in it. Code like:

myString.append(myCodeUnit);

myCodeUnit = myString.charAt(3);

is perfectly conformant, even though myCodeUnit could conceivably
contain a surrogate code unit. Ken's suggested text tightens up the
definition of UTFs, but we need to work through all the ramifications
so that such usage in Java, C# or other 16-bit implementations remains
conformant!
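
As a rough, self-contained version of the two statements above, the
following Java sketch appends an isolated surrogate code unit and
reads it back unchanged. The class and method names are made up for
illustration. Since java.lang.String itself has no append method, the
sketch uses StringBuilder for the append step; that substitution is my
assumption about what the shorthand stands for.

```java
// Illustrative sketch; IsolatedSurrogateDemo and roundTrip are hypothetical names.
final class IsolatedSurrogateDemo {

    // Appends one 16-bit code unit after "abc" and reads it back
    // from index 3, mirroring the append/charAt pair in the text.
    static char roundTrip(char codeUnit) {
        StringBuilder sb = new StringBuilder("abc");
        sb.append(codeUnit);
        String s = sb.toString();
        return s.charAt(3);
    }

    public static void main(String[] args) {
        char lone = (char) 0xD800; // a high surrogate with no partner
        char back = roundTrip(lone);

        // The code unit survives storage and retrieval untouched.
        System.out.println(back == lone);                // prints true
        System.out.println(Character.isSurrogate(back)); // prints true
    }
}
```

Nothing in the sketch validates pairing; the string simply holds code
units, which is the behavior that any tightened definition of the UTFs
would have to keep conformant.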

It would be perfectly reasonable, on the other hand, to make
modifications such as amending C10 to allow the deletion of surrogate
code points, thus allowing their removal whenever converting to or
from UTFs (like UTF-8) that do not support them. For example, the
[...] text could be included below.

C10 A process shall make no change in a valid coded character
representation other than the possible replacement of character
sequences by their canonical-equivalent sequences or the deletion of
noncharacter code points [or surrogate code points], if that process
purports not to modify the interpretation of that coded character
sequence.
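
A process taking advantage of such an amended C10 might look roughly
like the Java sketch below, which drops noncharacter code points and
isolated surrogate code points before text is handed to a converter.
This is only one possible reading: the class and method names are
hypothetical, the noncharacter ranges come from the Unicode Standard
rather than from this message, and well-formed surrogate pairs are
kept because they stand for supplementary characters.

```java
// Illustrative sketch; InterchangeFilter and its methods are hypothetical names.
final class InterchangeFilter {

    // Noncharacters: U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF in every plane
    // (ranges assumed from the Unicode Standard).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of s without noncharacters and isolated surrogates.
    static String stripForInterchange(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); ) {
            // codePointAt combines a well-formed surrogate pair into one
            // supplementary code point; an unpaired surrogate is returned
            // as its own value in U+D800..U+DFFF.
            int cp = s.codePointAt(i);
            i += Character.charCount(cp);

            boolean isolatedSurrogate = cp >= 0xD800 && cp <= 0xDFFF;
            if (!isolatedSurrogate && !isNoncharacter(cp)) {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "a" + (char) 0xD800 + "b" + (char) 0xFFFF + "c";
        System.out.println(stripForInterchange(dirty)); // prints abc
    }
}
```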

Mark
http://www.macchiato.com

----- Original Message -----
From: <Peter_Constable@sil.org>
To: <unicode@unicode.org>
Sent: Thursday, April 11, 2002 15:16
Subject: Re: Unicode Myths

>
> Mark:
>
> A suggestion: On slide 5, I would be inclined not to differentiate
> surrogates from non-characters. That only confuses people, I think,
> regarding the relationships between codepoints and the various
> encoding forms. Even if they are formally still distinguished in the
> Std, I contend that they really should *not* be, and that from the
> perspective of novices trying to make sense of Unicode, surrogates
> should be discussed *only* in terms of code units in the context of a
> discussion of UTF-16.
>
>
> - Peter
>
>
> ---------------------------------------------------------------------------
> Peter Constable
>
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <peter_constable@sil.org>
>
>
>
>
>
> On 04/06/2002 04:21:24 PM "Mark Davis" wrote:
>
> >Thanks to the many people who suggested Myths. I have posted a new
> >version on
> >
> >http://www.macchiato.com/slides/UnicodeMyths.ppt
> >
> >with new ones included after slide 8.
> >
> >It still needs a bit of work yet. In particular, if someone can get
> >me a list of the 7 turtles or many grass radicals, that would make a
> >good example. I also want to reorder them.
> >
> >Any other suggestions on it are welcome; including tongue-in-cheek
> >ones!
> >
> >Mark
> >
> >
>
>
>

