Re: Abstract character?

From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jul 23 2002 - 18:41:49 EDT


Lars Marius Garshol followed up:

> Hmmmm. OK. So combining diacritics are also abstract characters?

Yes, clearly.

Each encoded character in the Unicode CCS, ipso facto, associates
an abstract character with a code point.

So U+0300 COMBINING GRAVE ACCENT associates the code point U+0300
with an abstract character {grave accent mark that attaches above
a base form}. We have agreed to associate, further, a normative
name COMBINING GRAVE ACCENT with that encoded character, to
facilitate transcoding U+0300 COMBINING GRAVE ACCENT to and from any
other CCS which might include the same abstract character.
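
(To see the association concretely -- this is just a small Python
sketch using the stock unicodedata module, nothing more:)

    import unicodedata

    # U+0300 is an encoded character: a code point tied to an abstract
    # character, with a normative name assigned on top of that.
    cp = 0x0300
    print("U+%04X %s" % (cp, unicodedata.name(chr(cp))))
    # -> U+0300 COMBINING GRAVE ACCENT

    # Applied after a base form it behaves like any other abstract
    # character; 'e' + U+0300 is canonically equivalent to U+00E8.
    print(unicodedata.normalize("NFC", "e\u0300") == "\u00E8")   # True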

> (I
> was also unclear on ZWNJ and similar things, but you explicitly
> mention that above, so...)

Yep, they're all abstract characters, by the nature of the beast.

>
> | (Note above -- abstract characters are also a concept which applies
> | to other character encodings besides the Unicode Standard, and not
> | all encoded characters in other character encodings automatically
> | make it into the Unicode Standard, for various architectural
> | reasons.)
>
> Right. So VIQR, for example, also has abstract characters, then?

Yes, I think the character encoding model is broad enough to apply
to any CCS, not just the Unicode Standard. That's basically why we
can transcode between Unicode and legacy standards.
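
(A trivial transcoding illustration in Python -- VIQR has no codec in
the standard library, so Latin-1 stands in here for "some legacy CCS":)

    # The same abstract character, LATIN SMALL LETTER E WITH ACUTE,
    # mapped through two different coded character sets.
    text = "\u00E9"
    legacy = text.encode("latin-1")           # 0xE9 in the legacy CCS
    assert legacy.decode("latin-1") == text   # roundtrips losslessly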

> However, it does raise a new problem. Isn't the definition of 'string'
> in the XPath specification then wrong?
>
> Strings consist of a sequence of zero or more characters, where a
> character is defined as in the XML Recommendation [XML]. A single
> character in XPath thus corresponds to a single Unicode abstract
> character with a single corresponding Unicode scalar value (see
> [Unicode]); [...]
> <URL: http://www.w3.org/TR/xpath#strings >
>
> As far as I can tell, one of these two claims must be wrong. That is,
> either a single XPath character does not necessarily correspond to a
> single Unicode abstract character, or else a single XPath character
> need not correspond to a single scalar value.

No, I think it is correct, if somewhat convoluted. Basically, it
is trying to say that each XPath character corresponds to a single
Unicode encoded character. By my discussion in the previous note,
a Unicode encoded character maps a (single) abstract character to
a (single) code point [and because of the constraints of the UTF
definitions, the only valid code points are the Unicode scalar
values].
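
(The same counting is visible directly in Python 3, where a str is
likewise a sequence of scalar values -- again just a sketch:)

    # One "character" per scalar value, regardless of how many marks
    # or graphemes the reader perceives.
    precomposed = "\u00E8"      # e-grave as a single encoded character
    decomposed  = "e\u0300"     # e + combining grave accent

    print(len(precomposed))     # 1 -- one scalar value
    print(len(decomposed))      # 2 -- two scalar values, one perceived character

    # Supplementary-plane characters are still single scalar values:
    print(len("\U0001D11E"))    # 1 -- MUSICAL SYMBOL G CLEF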
 
>
> Does that sound reasonable?

The trick for understanding Unicode -- once you have the
basic encoding principle down -- is to realize that the ideal
one-to-one mapping situation is violated in practice for a variety
of reasons. In the clarification about abstract characters that
I quoted earlier in this thread, I abridged further discussion
about all the edge cases. For those strong of stomach, read on.

--Ken

================ formulation of CCS and exceptions ============

What is the CCS?

The CCS is a function f.

The domain X of the function is a repertoire of abstract characters.

The codomain Y of the function is the codespace.

For each x [abstract character in the repertoire], f(x)
[a code point in the codespace] is assigned to x.

Ideally, the CCS is one-to-one: two different abstract characters [x]
are never mapped to the same code point [f(x)], and two different
code points are never assigned to the same abstract character.

The CCS is not onto, since there are many code points which are not
assigned to any abstract character.

In other words, conceptually a CCS is an injection. What we keep
doing is expanding the domain (but not the codomain) of the injection.
(I.e., we keep adding encoded characters, but don't expand the
codespace.)
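
(If it helps, the shape of that function can be modeled in a few lines
of Python; the repertoire names below are informal stand-ins, not
normative names:)

    # A toy CCS: a function from a repertoire of abstract characters
    # (the domain) into the codespace (the codomain).
    codespace = range(0x0000, 0x110000)

    ccs = {
        "latin small letter a":   0x0061,
        "combining grave accent": 0x0300,
        "cherokee letter a":      0x13A0,
    }

    # Injective: no two abstract characters share a code point.
    assert len(set(ccs.values())) == len(ccs)

    # Not onto: most code points are assigned to no abstract character.
    assert len(ccs) < len(codespace)

    # Encoding more characters grows the domain, not the codomain.
    ccs["latin small letter b"] = 0x0062
    assert all(cp in codespace for cp in ccs.values())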

In actuality, the Unicode CCS is messier. As we know, it is not really
one-to-one in practice:

   a. There are instances where two different abstract characters
      are mapped to the same code point. These are the "false
      unifications", and usually engender some more-or-less
      vociferous discussion about requiring a disunification.
      (Cf. U+0192, which unified f-hook with the florin sign.)
      There are also legacy overunifications which result in
      intentionally ambiguous characters: U+002D HYPHEN-MINUS
      ("hyphus"), U+0027 APOSTROPHE ("apostroquote"), U+0060 GRAVE
      ACCENT ("gravquote"), and the like.
      False unifications engender controversy in inverse proportion
      to the length of time people have lived with their ambiguities,
      so Unicode tends to get smothered by nitpicking mostly about
      the examples which Unicode itself introduced.

   b. There are instances where the same abstract character is
      mapped to more than one code point. These are the "false
      disunifications", and constitute the set of compatibility
      characters that we give singleton canonical mappings to
      in UnicodeData.txt. The most notorious of these cases is
      A-ring/Angstrom (see the sketch after this list), but by far
      the largest set of these is located in (where else?) Han --
      the Han compatibility
      characters. Note also that the Han characters in the
      unified set that are distinguished by the source separation
      rule are also false disunifications, but ones we live with
      and do not provide canonical mappings for since they are
      practical for roundtripping to Asian standards that maintain
      the false disunifications. But to outsiders, the distinction
      between these two types of false disunifications must seem
      impenetrably mysterious.

   c. When we move beyond assignation of code points per se
      (i.e. "encoding") to consider the representation of text
      as sequences of encoded characters, it turns out that
      many abstract characters also have alternative representations
      either as single encoded characters or as a sequence of
      encoded characters. These are the "precomposed characters"
      and are indicated in the standard by the presence of
      a canonically equivalent sequence (>1 character). This doesn't
      impact the functional definition of the CCS per se, but it
      creates a significant layer of complexity in textual
      representation (i.e., in *use* of the standard), and is the
      fundamental reason why a normalization algorithm is required
      (also illustrated in the sketch after this list).

   d. Conversely, there are instances where an abstract character,
      for one technical reason or another, has been represented
      piecewise in legacy practice. These are the "glyph parts",
      and their presence in the standard is indicated by
      derogatory language about their encoding. ;-) The classic
      example of this is the upper and lower integral parts.
      Each glyph part becomes an abstract character by virtue of
      being "that which is encoded" in the standard, but the
      intent is that the actual abstract character of which they
      are pieces be represented by appropriate cellwise or other
      layout mechanisms. No formal equivalence mappings or
      normalizations are provided for glyph parts, since the
      text model for their use lies outside the scope of the
      linear/logical text model assumed by Unicode for text
      representation.

      Note that the line between c) and d) is, however,
      rather arbitrary, and seems to be defined mostly by
      productivity, rather than any truly axiomatic criteria.
      In some sense C and cedilla are just glyph parts of the
      abstract character C-cedilla, but we choose to consider them
      (proper) abstract characters in their own right because of
      their productive usage and patterning.

   e. Furthermore, there are many lurking, hidden layers of
      equivalence between various abstract characters. Some of these
      are historic in nature, some are the result of established
      practice, and some are just as yet undiscovered mistakes. Now
      that Extension B Han has been added, we have an extremely
      fertile ground for this kind of issue. Note, for example,
      Richard Cook's identification of U+382F as a kaishu form
      of a seal form of U+4EE5, which is also written as U+5DF2.
      There is **lots** more where that came from in Extension B,
      and the more historic scripts that get encoded, the more
      the edge case identification issues are going to bite us.
      The issue for the encoding is that after the abstract
      characters are formally defined in the standard by virtue
      of the encoding, people may change their minds about the
      identification of the abstract character(s) involved, come
      to question one or more aspects of the encoding, and require
      various new kinds of equivalencing and normalization to get
      around what should have been a straightforward one-to-one
      encoding in the first place.
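
To make b) and c) concrete, here is a small Python sketch (again just
the stock unicodedata module) showing the Angstrom singleton and a
precomposed character with its canonical sequence:

    import unicodedata

    # b) False disunification: U+212B ANGSTROM SIGN has a singleton
    #    canonical mapping to U+00C5, so normalization folds the two
    #    code points together.
    angstrom, a_ring = "\u212B", "\u00C5"
    print(unicodedata.decomposition(angstrom))                 # '00C5'
    print(unicodedata.normalize("NFC", angstrom) == a_ring)    # True

    # c) Precomposed character: U+00C5 is canonically equivalent to
    #    the sequence U+0041 + U+030A, which is why comparing text
    #    requires a normalization algorithm.
    print(unicodedata.normalize("NFD", a_ring) == "A\u030A")   # True
    print(unicodedata.normalize("NFC", "A\u030A") == a_ring)   # True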

So while, in principle, the CCS is an injection, in practice it
is actually a fairly messy mapping that takes all kinds of legacy
practice into account. Because of the size of the Unicode
Standard, a small percentage departure from the ideal one-to-one
state can still represent a large absolute number of divergences
that we have to deal with in implementations. (And in explanations
in the standard, for that matter!)


