Re: Unicode Myths

From: Mark Davis (mark@macchiato.com)
Date: Tue Apr 09 2002 - 21:21:54 EDT


BTW, I have a page up on
http://www.macchiato.com/unicode/statistics.htm that shows the current
breakdown of code points according to General Category (and which of
those have different normalizations according to the 4 different
forms).

Mark
—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: <david.hopwood@zetnet.co.uk>
Cc: <unicode@unicode.org>; <kenw@sybase.com>
Sent: Tuesday, April 09, 2002 16:33
Subject: Re: Unicode Myths

> David Hopwood provided various comments on Mark Davis' Unicode
> Myths slides. I'm sure Mark will respond in some way, but I
> have some counter-comments on the part having a bearing on
> the Unicode character encoding model.
>
> > Slide 5
>
> I'm not sure what you are on about here. Mark's Myth #5 was stated as
> "Every Unicode code point represents a character", and the slide
> bullets are just talking points towards explanation of the different
> major categories for the code points; some of them encode characters,
> and some of them do not.
>
> Granted, the text of Unicode 3.0 is murky about all this. That is
> all being worked on for Unicode 4.0, to bring the definitions in line
> with the general framework of the Character Encoding Model. The
> revision of Chapter 3, in particular, will not be available for UTC
> review for a while yet, but much of this is guaranteed to be more
> clearly stated in the next edition.
>
> The short synopsis of the "Standard Model" is:
>
> abstract character
>
> Those entities which are to be encoded. They can be, in principle,
> anything, from letters of alphabets, to undisplayed format or other
> control functions, to roundtrip conversion clones. They are "what
> gets encoded".
>
> repertoire
>
> A set of abstract characters.
>
> codespace
>
> A range of nonnegative integers used for encoding. (For Unicode,
> this range is 0..0x10FFFF, inclusive.)
>
> code point
>
> A value within the codespace.
>
> encoding
>
> 1. The process of associating ("mapping") abstract characters with
> code points.
> 2. The result of associating a particular repertoire with code
> points. (aka coded character set, "CCS")
>
> encoded character
>
> An abstract character together with the code point value it has
> been mapped to.
>
> code unit
>
> A numerical unit associated with a fixed-width data type (generally,
> 8-bit, 16-bit, or 32-bit, because of computer architecture
> considerations), used in character encoding forms.
>
> character encoding form
>
> A mapping from the set of integers used in a CCS to a set of
> sequences of code units. (Unicode has 3 encoding forms: UTF-8,
> UTF-16, UTF-32.)
>
> Unicode scalar value
>
> Because of the nature of the definition of UTF-16, not all code
> points in the Unicode codespace can be represented in the Unicode
> character encoding forms. And because of that, a concept called the
> Unicode scalar value is used; that refers to the subset of integers
> used in the Unicode CCS, namely 0..0xD7FF, 0xE000..0x10FFFF. The
> Unicode scalar values are the subset of integers in the Unicode
> codespace that constitute the domain for definition of the 3 Unicode
> encoding forms.
>
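[Editorial sketch, not part of the original message.] The definitions
of code unit, character encoding form, and Unicode scalar value quoted
above can be illustrated with a few lines of Python. The function names
is_scalar_value and code_unit_counts are hypothetical, chosen for this
example only; the codec names are the ones shipped with Python's
standard library.

```python
def is_scalar_value(cp: int) -> bool:
    """True for integers in 0..0xD7FF or 0xE000..0x10FFFF."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


def code_unit_counts(cp: int) -> dict:
    """Count the code units each encoding form needs for one scalar value."""
    if not is_scalar_value(cp):
        raise ValueError(f"U+{cp:04X} is not a Unicode scalar value")
    ch = chr(cp)
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit code units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit code units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit code units
    }


def encodable(cp: int, codec: str = "utf-8") -> bool:
    """Check whether Python's codec accepts this code point on its own."""
    try:
        chr(cp).encode(codec)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for cp in (0x0041, 0x20AC, 0x10000):
        print(f"U+{cp:04X}", code_unit_counts(cp))
    # A surrogate code point is inside the codespace but outside the
    # domain of the encoding forms, so the codec refuses it.
    print("U+D800 encodable:", encodable(0xD800))
```

The surrogate case is the point of the sketch: U+D800 is a code point,
yet none of the three encoding forms is defined for it.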
> Code point categorization
>
> To make sense of the categorization of code points, I make use of
> three concepts: assignment, allocation, and designation.
>
> Assignment refers to the status of a code point as having an
> abstract character associated with it by the standard.
>
> Ordinary encoded characters (code points which have been mapped
> to abstract characters), control characters (code points which
> have been mapped to abstract characters which in turn are
> placeholders for control functions specified by other standards),
> and private use code points are all *assigned* code points. All
> *assigned* code points may have character properties associated
> with them, since they are associated with abstract characters.
>
> Note that the PUA characters are very funny animals, in that their
> entire meaning is undefined and they have no names, but they have
> to be treated as assigned characters to make sense. Effectively,
> we have
>
> <<Abstract character for private use 1>> assigned to U+E000
> <<Abstract character for private use 2>> assigned to U+E001
> ...
> <<Abstract character for private use 137468>> assigned to U+10FFFD
>
> Noncharacter code points, surrogate code points, and reserved
> (no mapping to an abstract character) code points are all
> *un*assigned.
>
> Allocation refers to the categorization of code points into types
> and subtypes for assignment.
>
> Allocation can span assigned and unassigned code points. Character
> blocks are pre-allocated spans of code points, conceptually
> associated with particular groups of characters. The Devanagari
> block is *allocated* to Devanagari. That means that the character
> encoding committees will only assign Devanagari script characters
> to the unassigned code points within the allocated block. (Not all
> blocks are so clear about their allocation semantics, but the
> general concept should be clear.)
>
> Designation refers to the formal specification of usage to a
> code point by the standard. All assigned code points, as well as
> noncharacter code points and surrogate code points, have their
> usage formally and normatively specified by the standard.
> Reserved code points, on the other hand, are *un*designated as to
> usage. They are simply reserved for future designation, and in
> principle could become any of the other designated types or even
> a new designated type that does not currently exist.
>
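[Editorial sketch, not part of the original message.] The assignment
and designation distinctions above can be written down as a small
classifier. This is an illustration under stated assumptions, not the
standard's own algorithm: the names Status and classify are
hypothetical, the noncharacter ranges (U+FDD0..U+FDEF plus the last two
code points of every plane) are taken from the Unicode Standard rather
than from this message, and whether a given code point counts as
"reserved" depends on the Unicode version bundled with the running
Python (unicodedata.unidata_version).

```python
import unicodedata
from typing import NamedTuple


class Status(NamedTuple):
    kind: str         # which type of code point this is
    assigned: bool    # has an abstract character mapped to it
    designated: bool  # usage is formally specified by the standard


def classify(cp: int) -> Status:
    """Classify a code point along the assignment/designation axes."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode codespace")
    if 0xD800 <= cp <= 0xDFFF:
        return Status("surrogate", assigned=False, designated=True)
    # Noncharacters: U+FDD0..U+FDEF and U+nFFFE / U+nFFFF in each plane.
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        return Status("noncharacter", assigned=False, designated=True)
    category = unicodedata.category(chr(cp))
    if category == "Co":
        return Status("private use", assigned=True, designated=True)
    if category == "Cc":
        return Status("control", assigned=True, designated=True)
    if category == "Cn":
        # No abstract character mapped in this Unicode version.
        return Status("reserved", assigned=False, designated=False)
    return Status("encoded character", assigned=True, designated=True)


if __name__ == "__main__":
    for cp in (0x0041, 0x0009, 0xE000, 0xD800, 0xFFFF, 0x70000):
        print(f"U+{cp:04X}", classify(cp))
```

Only the "reserved" answer can change over time: a later version of
the standard may assign a character to a code point that is reserved
today, while the other types are fixed by designation.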
> Now, in the context of that statement of the model, let me consider
> your claims:
>
> > - Unassigned characters are characters (clause C6 in Chapter 3 of
> > the Standard notwithstanding).
>
> This claim appears to make no sense -- but that is the result of
> your different use of the term "character".
>
> What I can stipulate is:
>
> Unassigned abstract characters are abstract characters.
>
> That is, we don't have to encode an entity to have given it a
> status of "thing to be encoded" as a character. It can be
> an acknowledged member of a repertoire before it is encoded.
>
> Unassigned code points are code points.
>
> This simply means that a code point does not have an abstract
> character mapped to it; it is surely a code point nonetheless.
>
> What I could not stipulate would be:
>
> Unassigned code points are characters.
>
> Unassigned code points are neither abstract characters per se,
> nor do they have abstract characters mapped to them.
>
> > Search the standard for "unassigned character"; it occurs several
> > times.
>
> This is mostly a careless usage in the earlier text, and in nearly
> all cases will be replaced by "unassigned code point" in future
> versions.
>
> > Also, several clauses and definitions are incorrect or
> > incomplete if unassigned code points do not correspond to
> > characters: at least C9, C10, C13, D1, D6, D7, D9 (which should
> > not restrict to "graphic characters"), D11, D13, D14, D17..D24,
> > D28, and the note after D29.
>
> There are various infelicities in some of those clauses and
> definitions, some of which have been addressed in Unicode 3.1 and
> Unicode 3.2, and more of which will be clarified in Unicode 4.0.
> However, I disagree with your main point, since, in principle,
> unassigned code points *cannot* "correspond to characters". That
> would contradict the concept of assignment.
>
> > - Format control characters are also characters.
>
> Assuredly, yes. And I don't think Mark was claiming otherwise.
>
> > - Private-use characters are definitely characters.
>
> Also, yes. See above.
>
> > - The values D800..DFFF are not valid code point values,
>
> Incorrect. See above for the distinction between code point values
> and Unicode scalar values.
>
> > they are UTF-16
> > code unit values (the valid Unicode code point space is 0..D7FF
> > union E000..10FFFF.)
>
> > In computer jargon, "characters" are, by definition, the things
> > that are enumerated in coded-character-sets (regardless of whether
> > or not they are displayed as simple spacing glyphs, have control
> > functions, are not yet assigned, or have any other strange
> > properties).
>
> I would agree with this. This is what the Unicode Standard means by
> "abstract character".
>
> > Apart from the unfortunate
> > "noncharacter" terminology (which would have been better called
> > "internal-use characters"),
>
> No -- internal use *code points*.
>
> > all valid Unicode code points *do* correspond to characters
> > in this sense.
>
> Incorrect. What I think you are trying to say, translated into my
> terminology, is that all Unicode code points aside from surrogate
> code points (your "invalid") and noncharacter code points correspond
> to abstract characters. This is true for *assigned* code points,
> of course, since that is what I *mean* by assigned. But it is not
> true for the reserved code points, which are unassigned -- and which
> thereby cannot be considered to be (associated with encoded)
> characters.
>
> Using the term "character" for an unassigned, reserved code point
> just blurs the terminological distinction between character and
> code point unacceptably. U+70000 is assuredly a valid Unicode code
> point, but it is not a *character* until and unless the UTC and WG2
> assign something to it.
>
> > Note that there is no conflict between this jargon meaning of
> > "character", and its original meaning as a unit of text.
>
> Well, no conflict if you mean that they are different usages,
> applicable to different domains of consideration.
>
> But they are certainly con*fus*ing and are commonly confused by
> people who do not understand how character encoding standards work.
>
> > While we're on this subject, it's also redundant to say
> > "abstract character":
>
> Nope. It is a deliberate usage to distinguish between
> "character" as entity to be encoded and "character" as encoded
> entity.
>
> > *all* characters are abstractions,
>
> Of course. All the better then to identify them as abstract
> characters. ;-)
>
> > and the definition of this term (D3 in
> > Chapter 3 of the Unicode Standard) doesn't mean anything different
> > to plain "character", as defined above.
>
> Nope. Abstract character is a deliberately constrained term.
> "Character" has multiple, and occasionally ambiguous, usages in the
> text of the Unicode Standard and in general discussion about
> character encoding, even by the experts.
>
> > Slide 29
> > - there are 1,112,064 valid Unicode code points, not 1,114,112.
> > (D800..DFFF are not valid code points.)
>
> Nope.
>
> Unicode has 1,114,112 code points.
>
> There are 1,112,064 Unicode scalar values.
>
> --Ken
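
[Editorial sketch, not part of the original message.] The two totals in
the Slide 29 exchange follow directly from the ranges given earlier in
the message, and the arithmetic can be checked in a few lines. The
constant names are made up for this example. The private-use ranges
used here (U+E000..U+F8FF plus the last two planes up to their
noncharacters) come from the Unicode Standard; the message itself names
only the first and last private-use code points.

```python
# Ranges are half-open in Python, hence the "+ 1" on each upper bound.
CODESPACE = range(0x0000, 0x10FFFF + 1)
SURROGATES = range(0xD800, 0xDFFF + 1)
PRIVATE_USE = (
    range(0xE000, 0xF8FF + 1),       # BMP private use area
    range(0xF0000, 0xFFFFD + 1),     # plane 15, minus its two noncharacters
    range(0x100000, 0x10FFFD + 1),   # plane 16, minus its two noncharacters
)

code_points = len(CODESPACE)
surrogate_code_points = len(SURROGATES)
scalar_values = code_points - surrogate_code_points
private_use_code_points = sum(len(r) for r in PRIVATE_USE)

if __name__ == "__main__":
    print("code points:            ", code_points)
    print("surrogate code points:  ", surrogate_code_points)
    print("Unicode scalar values:  ", scalar_values)
    print("private use code points:", private_use_code_points)
```

The difference between the two disputed figures is exactly the 2,048
surrogate code points, which are code points but not scalar values.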
>
>
>


