Lars Marius Garshol followed up:
> Hmmmm. OK. So combining diacritics are also abstract characters?
Yes, clearly.
Each encoded character in the Unicode CCS, ipso facto, associates
an abstract character with a code point.
So U+0300 COMBINING GRAVE ACCENT associates the code point U+0300
with an abstract character {grave accent mark that attaches above
a base form}. We have agreed to associate, further, a normative
name COMBINING GRAVE ACCENT with that encoded character, to
facilitate transcoding U+0300 COMBINING GRAVE ACCENT to and from
any other CCS which might include the same abstract character.
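For the programmatically inclined, that association is directly
visible in, for example, Python's unicodedata module; a minimal
sketch:

    import unicodedata

    # The normative name is part of the encoded character; it is the
    # handle that lets us line U+0300 up with the same abstract
    # character in some other CCS.
    assert unicodedata.name('\u0300') == 'COMBINING GRAVE ACCENT'
    assert unicodedata.lookup('COMBINING GRAVE ACCENT') == '\u0300'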
> (I
> was also unclear on ZWNJ and similar things, but you explicitly
> mention that above, so...)
Yep, they're all abstract characters, by the nature of the beast.
>
> | (Note above -- abstract characters are also a concept which applies
> | to other character encodings besides the Unicode Standard, and not
> | all encoded characters in other character encodings automatically
> | make it into the Unicode Standard, for various architectural
> | reasons.)
>
> Right. So VIQR, for example, also has abstract characters, then?
Yes, I think the character encoding model is broad enough to apply
to any CCS, not just the Unicode Standard. That's basically why we
can transcode between Unicode and legacy standards.
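A minimal sketch of such a transcoding, with Python's codecs and
Latin-1 standing in for an arbitrary legacy CCS:

    # The byte 0xE9 in ISO 8859-1 and the code point U+00E9 in Unicode
    # both encode the same abstract character {small e with acute
    # accent}, which is what makes the lossless roundtrip possible.
    legacy = b'\xe9'
    text = legacy.decode('latin-1')
    assert text == '\u00e9'
    assert text.encode('latin-1') == legacy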
> However, it does raise a new problem. Isn't the definition of 'string'
> in the XPath specification then wrong?
>
> Strings consist of a sequence of zero or more characters, where a
> character is defined as in the XML Recommendation [XML]. A single
> character in XPath thus corresponds to a single Unicode abstract
> character with a single corresponding Unicode scalar value (see
> [Unicode]); [...]
> <URL: http://www.w3.org/TR/xpath#strings >
>
> As far as I can tell, one of these two claims must be wrong. That is,
> either a single XPath character does not necessarily correspond to a
> single Unicode abstract character, or else a single XPath character
> need not correspond to a single scalar value.
No, I think it is correct, if somewhat convoluted. Basically, it
is trying to say that each XPath character corresponds to a single
Unicode encoded character. By my discussion in the previous note,
a Unicode encoded character maps a (single) abstract character to
a (single) code point [and because of the constraints of the UTF
definitions, the only valid code points are the Unicode scalar
values].
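You can watch that correspondence directly in any environment whose
strings are sequences of scalar values; a quick illustration in
Python:

    # One character per scalar value, even when two characters combine
    # into a single perceived letter-plus-accent.
    s = 'e\u0300'            # e followed by COMBINING GRAVE ACCENT
    assert len(s) == 2
    assert [hex(ord(c)) for c in s] == ['0x65', '0x300']

    # A supplementary-plane character is still one character with one
    # scalar value, even though UTF-16 represents it with two code
    # units.
    g = '\U0001d11e'         # MUSICAL SYMBOL G CLEF
    assert len(g) == 1 and ord(g) == 0x1D11E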
>
> Does that sound reasonable?
The trick for understanding Unicode -- once you have the
basic encoding principle down -- is to realize that the ideal
one-to-one mapping situation is violated in practice for a variety
of reasons. In the clarification about abstract characters that
I quoted earlier in this thread, I abridged further discussion
about all the edge cases. For those strong of stomach, read on.
--Ken
================ formulation of CCS and exceptions ============
What is the CCS?
The CCS is a function f.
The domain X of the function is a repertoire of abstract characters.
The codomain Y of the function is the codespace.
For each x [abstract character in the repertoire], f(x)
[a code point in the codespace] is assigned to x.
Ideally, the CCS is one-to-one: two different abstract characters [x]
are not mapped to the same code point [f(x)], and two different
code points are not assigned to the same abstract character.
The CCS is not onto, since there are many code points which are not
assigned to any abstract character.
In other words, conceptually a CCS is an injection. What we keep
doing is expanding the domain (but not the codomain) of the injection.
(I.e., we keep adding encoded characters, but don't expand the
codespace.)
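The same point in executable form, for those who like to poke at the
data: a sketch that samples the first few blocks, with Python's
unicodedata standing in for the standard's name/code point data:

    import unicodedata

    # Build a sample of the CCS as a mapping from abstract characters
    # (approximated here by their normative names) to code points.
    f = {}
    for cp in range(0x0000, 0x0500):
        name = unicodedata.name(chr(cp), None)  # None => no named character
        if name is not None:
            f[name] = cp

    # Injective: no two named characters share a code point.
    assert len(set(f.values())) == len(f)

    # Not onto: much of the sampled codespace is unassigned.
    print(len(f), 'of', 0x0500, 'sampled code points are assigned')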
In actuality, the Unicode CCS is messier. As we know, it is not really
one-to-one in practice, for the following reasons (a short code sketch
after the list illustrates cases a) through d)):
a. There are instances where two different abstract characters
are mapped to the same code point. These are the "false
unifications", and usually engender some more-or-less
vociferous discussion about requiring a disunification.
(Cf. U+0192, which unified f-hook with the florin sign.)
There are also legacy overunifications which result in
intentionally ambiguous characters: U+002D hyphus,
U+0027 apostroquote, U+0060 gravquote, and the like.
False unifications engender controversy in inverse proportion
to the length of time people have lived with their ambiguities,
so Unicode tends to be more smothered by nitpicking about
the examples which are introduced by Unicode itself.
b. There are instances where the same abstract character is
mapped to more than one code point. These are the "false
disunifications", and constitute the set of compatibility
characters that we give singleton canonical mappings to
in UnicodeData.txt. The most notorious of these cases is
A-ring/Angstrom, but by far the largest set of these
is located in (where else?) Han -- the Han compatibility
characters. Note also that the Han characters in the
unified set that are distinguished by the source separation
rule are also false disunifications, but ones we live with
and do not provide canonical mappings for, since they are
needed for roundtripping to Asian standards that maintain
the false disunifications. But to outsiders, the distinction
between these two types of false disunifications must seem
impenetrably mysterious.
c. When we move beyond the assignment of code points per se
(i.e. "encoding") to consider the representation of text
as sequences of encoded characters, it turns out that
many abstract characters also have alternative representations
either as single encoded characters or as a sequence of
encoded characters. These are the "precomposed characters"
and are indicated in the standard by the presence of
a canonically equivalent sequence (>1 character). This doesn't
impact the functional definition of the CCS per se, but it
creates a significant layer of complexity in textual
representation (i.e., in *use* of the standard), and is the
fundamental reason why a normalization algorithm is required.
d. Conversely, there are instances where an abstract character,
for one technical reason or another, has been represented
piecewise in legacy practice. These are the "glyph parts",
and their presence in the standard is indicated by
derogatory language about their encoding. ;-) The classic
example of this is the upper and lower integral parts.
Each glyph part becomes an abstract character by virtue of
being "that which is encoded" in the standard, but the
intent is that the actual abstract character of which they
are pieces be represented by appropriate cellwise or other
layout mechanisms. No formal equivalence mappings or
normalizations are provided for glyph parts, since the
text model for their use lies outside the scope of the
linear/logical text model assumed by Unicode for text
representation.
Note that the line between c) and d) is, however,
rather arbitrary, and seems to be defined mostly by
productivity, rather than any truly axiomatic criteria.
In some sense C and cedilla are just glyph parts of the
abstract character C-cedilla, but we choose to consider them
(proper) abstract characters in their own right because of
their productive usage and patterning.
e. Furthermore, there are many lurking, hidden layers of
equivalence between various abstract characters. Some of these
are historic in nature, some are the result of established
practice, and some are just as yet undiscovered mistakes. Now
that Extension B Han has been added, we have an extremely
fertile ground for this kind of issue. Note, for example,
Richard Cook's identification of U+382F as a kaishu form
of a seal form of U+4EE5, which is also written as U+5DF2.
There is **lots** more where that came from in Extension B,
and the more historic scripts that get encoded, the more
the edge case identification issues are going to bite us.
The issue for the encoding is that after the abstract
characters are formally defined in the standard by virtue
of the encoding, people may change their minds about the
identification of the abstract character(s) involved, come
to question one or more aspects of the encoding, and require
various new kinds of equivalencing and normalization to get
around what should have been a straightforward one-to-one
encoding in the first place.
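And here is the sketch promised above: a few lines of Python, via the
unicodedata module, letting the character data itself exhibit cases
a) through d):

    import unicodedata

    # a) False unification: one code point serving two abstract
    #    characters. U+0192 has a single identity in the standard,
    #    whether used as f-hook or as the florin sign.
    assert unicodedata.name('\u0192') == 'LATIN SMALL LETTER F WITH HOOK'

    # b) False disunification: two code points for one abstract
    #    character. U+212B ANGSTROM SIGN carries a singleton canonical
    #    mapping to U+00C5, so normalization folds the pair together.
    assert unicodedata.normalize('NFC', '\u212b') == '\u00c5'

    # c) Precomposed character: one abstract character with two
    #    canonically equivalent representations.
    assert unicodedata.normalize('NFD', '\u00e8') == 'e\u0300'
    assert unicodedata.normalize('NFC', 'e\u0300') == '\u00e8'

    # d) Glyph part: encoded and named, but with no canonical mapping
    #    back to the integral sign it is a piece of.
    assert unicodedata.name('\u2320') == 'TOP HALF INTEGRAL'
    assert unicodedata.decomposition('\u2320') == ''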
So while, in principle, the CCS is an injection, in practice it
is actually a fairly messy mapping that takes all kinds of legacy
practice into account. Because of the size of the Unicode
Standard, a small percentage departure from the ideal one-to-one
state can still represent a large absolute number of divergences
that we have to deal with in implementations. (And in explanations
in the standard, for that matter!)