Re: Error in definition of "compatibility character"?

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Oct 26 2001 - 15:24:58 EDT


David Hopwood said:

> I think the correct definition of a compatibility character is a
> character with a compatibility decomposition that differs from its
> canonical decomposition (i.e. NFKC(c) != NFC(c)). Am I right?

Actually, what you mean here is NFKD(c) != NFD(c), which is
implicitly what Mark Davis was agreeing with. There is no reason
to get the *re*composition (and composition exclusions) of
NFKC and NFC mixed into the pot, too.
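
The decomposition-only test can be sketched in a few lines of Python. This assumes the standard unicodedata module, whose data reflects whatever Unicode version ships with the interpreter; the helper name is_decomposition_compat is made up for illustration.

```python
import unicodedata


def is_decomposition_compat(ch: str) -> bool:
    """True when the full compatibility decomposition of ch differs
    from its full canonical decomposition, i.e. NFKD(c) != NFD(c)."""
    return unicodedata.normalize("NFKD", ch) != unicodedata.normalize("NFD", ch)


# U+FB01 LATIN SMALL LIGATURE FI carries a <compat> mapping.
print(is_decomposition_compat("\ufb01"))
# U+00E9 LATIN SMALL LETTER E WITH ACUTE has only a canonical mapping.
print(is_decomposition_compat("\u00e9"))
```

Only the two decomposed forms are compared, so recomposition and the composition exclusions never enter the test.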

>
> (Note that it wouldn't be correct to define a compatibility character
> simply as a character that has a "<...> ..." entry in the decomposition
> field of the UCD; a counterexample is U+03D3.)

This whole issue of what is a "compatibility character" has
gotten increasingly convoluted over the years.

First of all, as Mark pointed out, there are two quite distinct
usages of the term in the standard currently.

1. (decomposition) compatibility character

  That is what D21 is about, and is derived on the basis of
  the presence or absence of compatibility decompositions.

2. (legacy) compatibility character

  These are characters that were included in the standard for
  compatibility with other standards, for crossmapping, or
  for other legacy interoperability reasons. Sometimes they
  have compatibility mappings, sometimes they have canonical
  mappings (see, e.g., all the CJK compatibility ideographs),
  and sometimes they have no mappings to other Unicode characters.
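
The three kinds of mapping can be told apart by looking at the decomposition field itself. A rough sketch, assuming Python's unicodedata module (which exposes that field as a string); mapping_kind is a hypothetical helper, and the sample characters are chosen only to show one field of each kind:

```python
import unicodedata


def mapping_kind(ch: str) -> str:
    """Classify the decomposition field of ch as it appears in the UCD:
    'compatibility' if it starts with a <tag>, 'canonical' if it is
    untagged, 'none' if the field is empty."""
    field = unicodedata.decomposition(ch)
    if not field:
        return "none"
    return "compatibility" if field.startswith("<") else "canonical"


# U+F900 is a CJK compatibility ideograph with a canonical mapping.
print(mapping_kind("\uf900"))
# U+FB01 LATIN SMALL LIGATURE FI has a <compat> mapping.
print(mapping_kind("\ufb01"))
# U+0041 LATIN CAPITAL LETTER A has no mapping at all.
print(mapping_kind("A"))
```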

The text of the standard is being rewritten to make the distinction
between these two uses of the term clear.

But as you have pointed out, there is a fuzziness in the sense
of (decomposition) compatibility character, since the decompositions
of characters are the result of recursive application of the
decomposition mappings defined in UnicodeData.txt.
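
U+03D3 shows the effect of the recursion. A small sketch, again assuming Python's unicodedata module; field and hexes are illustrative helper names, not anything from the standard:

```python
import unicodedata


def field(cp: int) -> str:
    """Raw decomposition field for a code point, as in UnicodeData.txt."""
    return unicodedata.decomposition(chr(cp))


def hexes(s: str) -> str:
    """Render a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)


top = field(0x03D3)    # untagged (canonical) mapping of U+03D3
inner = field(0x03D2)  # tagged mapping of its first element, U+03D2
nfd = hexes(unicodedata.normalize("NFD", chr(0x03D3)))
nfkd = hexes(unicodedata.normalize("NFKD", chr(0x03D3)))

print("U+03D3 field:", top)
print("U+03D2 field:", inner)
print("NFD :", nfd)
print("NFKD:", nfkd)
```

The field for U+03D3 has no tag, but its first element does have a tagged mapping, so the full compatibility decomposition goes one step further than the canonical one.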

Since canonical mappings can decompose to elements that
have compatibility mappings, or vice versa, the sets of
characters defined by applying the full decompositions,
i.e.,

  the set of all c where (NFD(c) != c)

  the set of all c where (NFKD(c) != c) and (NFKD(c) != NFD(c))

are not exactly the same as the sets of characters defined
by the type of decomposition mappings in UnicodeData.txt, i.e.,

  the set of all c where there exists a decomposition mapping
      in UnicodeData.txt and that decomposition mapping does
      *not* have a compatibility formatting tag

  the set of all c where there exists a decomposition mapping
      in UnicodeData.txt and that decomposition mapping does
      have a compatibility formatting tag
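
The four sets can be computed and compared directly. This is only a sketch under assumptions: it uses Python's unicodedata module in place of parsing UnicodeData.txt, so the counts depend on the Unicode version bundled with the interpreter, and the variable names are invented here.

```python
import sys
import unicodedata

canonical_by_nf = set()   # NFD(c) != c
compat_by_nf = set()      # NFKD(c) != c and NFKD(c) != NFD(c)
canonical_by_tag = set()  # mapping present, no formatting tag
compat_by_tag = set()     # mapping present, with formatting tag

for cp in range(sys.maxunicode + 1):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # skip surrogate code points
    ch = chr(cp)
    nfd = unicodedata.normalize("NFD", ch)
    nfkd = unicodedata.normalize("NFKD", ch)
    if nfd != ch:
        canonical_by_nf.add(cp)
    if nfkd != ch and nfkd != nfd:
        compat_by_nf.add(cp)
    mapping = unicodedata.decomposition(ch)
    if mapping:
        if mapping.startswith("<"):
            compat_by_tag.add(cp)
        else:
            canonical_by_tag.add(cp)

# Characters whose full decompositions differ although their own
# mapping carries no compatibility tag (U+03D3 is one of them).
only_by_nf = compat_by_nf - compat_by_tag

print(len(canonical_by_nf), len(canonical_by_tag))
print(len(compat_by_nf), len(compat_by_tag))
print(sorted(f"U+{cp:04X}" for cp in only_by_nf)[:10])
```

The canonical pair may also differ for a reason unrelated to tags: characters that are decomposed algorithmically (Hangul syllables) change under NFD but may report no mapping through this API.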

In my opinion, rather than just "fixing" the D21 definition
of "compatibility character" to match one or the other
of these, we need a further clarification of the distinctions,
and if necessary new terminology to make it easier to know
which of these sets we are talking about.

--Ken


