Compatibility and Politics (was Re: Digraphs as Distinct Logical Units)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Aug 08 2002 - 20:43:16 EDT


Roozbeh asked:

> > Expecting the compatibility decompositions to serve this purpose
> > effectively is overvaluing what they can actually do.
>
> I would love to hear your opinion about what compatibility decompositions
> *are* for, then. I feel a little confused here.

They are helpful annotations to an earlier version of the standard
that got swept up first by changing expectations and then were caught
in a normative stasis trap by the normalization specification.

Originally, they were a shorthand way of saying things like:

"This character is not really a 'good' Unicode character -- it
should be thought of as a font variant of X."

"This character is not really a 'good' Unicode character -- it
should be thought of as effectively representing the sequence of
X, Y, and Z."

And so on.

The terminology of "compatibility character" confused everyone,
including the people writing the standard, since it meant, on the
one hand, characters that didn't really fit the Unicode text model,
but which were encoded for compatibility with important standards,
mostly for ease of round-trip conversion. On the other hand, it
came to mean characters that had compatibility decompositions, once
those were officially specified in the Unicode 2.0 publication, since
most "compatibility characters" had "compatibility decompositions".
This situation was further confused by the abortive early attempt to
encode "compatibility characters" in a "compatibility zone", which
resulted in people assuming that if a character was in that zone
it automatically *was* a compatibility character and (later) that
it should also have a compatibility decomposition.

However, compatibility decompositions were originally assigned
pretty much by a seat-of-the-pants method, without a clear
implementation model to guide all of the decisions. As the UTC
approached the critical milestone of Unicode 3.0 (and normalization),
many of the earlier decompositions were refined and further
rationalized, but they still retained some of the helter-skelter
context of their annotational origins.

The intuition was that the compatibility decompositions "sort of"
made sense for such things as fallback, loose comparison (e.g.
for collation and searching), normalizing, and such. However,
when detailed specifications started to be written for such
things, guided by implementation experience, it turned out
that the compatibility decompositions were typically in the
ballpark, as it were, but not correct in detail for any one
purpose, let alone all purposes.
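As a concrete illustration of "in the ballpark, but not correct in detail," here is a sketch in Python using the standard library's unicodedata module (the helper name is mine): NFKD gets you part of the way toward loose comparison, but case differences, for instance, survive it entirely.

```python
import unicodedata

def loose_eq(a: str, b: str) -> bool:
    """Loose comparison: compatibility-decompose, then case-fold.

    NFKD alone folds the fi ligature (U+FB01) to 'f' + 'i', but it
    does nothing about case -- U+00DF LATIN SMALL LETTER SHARP S has
    no decomposition at all, so a separate case fold is still needed.
    """
    nfkd = lambda s: unicodedata.normalize("NFKD", s)
    return nfkd(a).casefold() == nfkd(b).casefold()

print(loose_eq("\ufb01nd", "find"))        # True: ligature folded by NFKD
print(loose_eq("stra\u00dfe", "strasse"))  # True: but only via casefold()
```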

And the publication of UAX #15 Normalization drastically turned
things on their head. Instead of being annotational, and "fixable",
compatibility decompositions became part of the normative
definition of NFKD and NFKC, and became "unfixable", because of
the requirements of normalization stability.

So post-Unicode 3.0, the right way to think of the compatibility
decomposition mappings is as the normative data used to define
NFKD and NFKC. They bear some resemblance to relationships
between characters and character sequences that may be useful
in other processes, but in *all* cases should not be taken as
a sufficiently precise set of classifications and equivalences
for other processes -- there are always going to be exceptions,
particularly since compatibility decompositions can no longer
be "fixed" as a result of tuning based on implementation experience.

> > > providing backup rendering when they lack the glyph,
> >
> > This seems unlikely to be particularly helpful in this *particular*
> > case.
>
> Believe me, it really is. I'm implementing char-cell rendering for Arabic
> terminals, and when it comes to Arabic ligatures, since I don't want to
> get into a mess of double-width things, I just decompose the ligature
> and render the equivalent string. It's not as genuine as it might be, but it's
> automatic, simple, clean, and conformant.

For this kind of application, then, you simply add on decompositions
for whatever else cannot be conveniently rendered in a char-cell.
Arabic terminal applications have often already departed from what
the Unicode Standard specifies in the way of compatibility decompositions
by doing special handling of character "tails" in a separate cell,
for example. Note that there isn't any compatibility mapping for
U+FEB1 (isolated seen) --> U+FEB3 (initial seen) + U+FE73 (tail fragment),
even though that might be what an Arabic terminal could do for display.

It isn't non-conformant with the Unicode Standard to transform
Unicode characters to alternate representations -- such as
a glyph stream for terminal rendering -- it would only be
nonconformant to *claim* that such a glyph stream is NFKD data
when it departs from that specification.
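A minimal sketch of that approach in Python (the table of extra mappings and the function name are mine, purely illustrative): start from NFKD, then layer on whatever application-specific fallbacks the standard's data does not supply.

```python
import unicodedata

# Hypothetical application-specific fallbacks for cases where the
# standard supplies no compatibility mapping (e.g. special handling
# of letter tails in a separate cell); left empty here.
EXTRA_FALLBACKS: dict = {}

def fallback_glyphs(text: str) -> str:
    """Replace unrenderable ligatures with an equivalent glyph string."""
    out = []
    for ch in text:
        out.append(EXTRA_FALLBACKS.get(ch) or
                   unicodedata.normalize("NFKD", ch))
    return "".join(out)

# U+FEFB (LAM-ALEF ligature, isolated form) decomposes to LAM + ALEF:
print(fallback_glyphs("\ufefb") == "\u0644\u0627")  # True
```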

> Some other point: We like to discourage the usage of Arabic Presentation
> Forms, don't we?

Of course. They are compatibility characters for working with the
existing legacy code pages that encoded Arabic that way.

> That is mentioned in TUS 3.0 at the end of the chapter
> about Arabic. All the characters in the Arabic Presentation Forms blocks
> have these decompositions, exactly for this. Only three lack one: U+FD3E
> and U+FD3F, the Ornate Parentheses, which got there by mistake (and are
> mentioned in the text), and U+FE73, which is a half-character (and could
> not have one in any way).
>
> By not providing a compatibility decomposition, we are making the proposed
> character a healthy and normal character, just like Arabic letters or
> symbols.

Nope. See my above discussion for the distinctions. Presence or absence
of a compatibility decomposition is not criterial for "this is a 'good'
Unicode character" or "this is a 'bad' Unicode character." There are plenty
of waaaay worse Unicode characters, encoded for a variety of legacy or
even political reasons, but which have no compatibility decompositions.

And some of the characters with compatibility decompositions, such as
U+00A0 NO-BREAK SPACE are considered essential parts of many Unicode
applications -- and nobody seriously considers them to be 'bad' characters.

It is a good idea for people to stop thinking of the presence of
a compatibility mapping as the mark of Cain -- it is more correctly
now just a piece of normative data used in the definition of
normalization in UAX #15.

> It won't be a compatibility character like Chinese and Japanese
> ones, or other Arabic ligatures, but a new beast encouraged to be used.

Correct. It isn't a duplicate of something already encoded, and it
has a reasonable implementation rationale, so it isn't born
"predeprecated", like some of the junk that gets into the standard.
 
> Why don't we encode it in the 06xx block then?

Because it is another word ligature symbol like the others in
the FDFX column, and because there no longer are officially
good neighborhoods and bad neighborhoods in the BMP. Putting like
things with like in the increasingly crowded BMP area is basically
doing a favor to font implementers and builders of character property
tables, for the most part, as well as simplifying the task of
structuring the explanations needed in the documentation of the
standard.

> > > reading a text stream aloud, and things like that,
> >
> > And this requires much more than just some raw access to an
> > NFKD normalization of the text stream to make any sense, for
> > any real application.
>
> Of course, but look how nice it is now: In whatever encoding it is, just
> pass it through a converter to Unicode NFKC, and then you will have
> something very clean and consistent to work with. Why bother with
> difficulties of the various character encodings? This applies to almost
> every similar application which is not rendering-oriented.

NFKC is not "Cleanicode". It has all kinds of problems when
you study it in detail: with respect to compatibility with
markup, with respect to format distinctions which should or
should not be maintained under various circumstances, and with
respect to the various incompatible kinds of foldings that
get applied under the same process. One uses NFKC as a raw processing
form only with great trepidation, since it is easy to destroy a
distinction that you (or the consumer of your data) may assume
was important for preservation.
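A few concrete cases of distinctions NFKC erases, again in Python with the standard library:

```python
import unicodedata

nfkc = lambda s: unicodedata.normalize("NFKC", s)

print(nfkc("x\u00b2"))  # 'x2' -- SUPERSCRIPT TWO becomes a plain digit
print(nfkc("\u2460"))   # '1'  -- CIRCLED DIGIT ONE loses its circle
print(nfkc("\u00bd"))   # VULGAR FRACTION ONE HALF becomes '1' + FRACTION SLASH + '2'
```

Whether any of these distinctions mattered depends entirely on the consumer of the data, which is exactly the problem.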

> > Implementation practice since then has suggested that compatibility
> > decompositions for these Arabic word ligatures used symbolically are not
> > much help -- and if anything just provoke edge-case failures for
> > implementations.
>
> I don't get you. They have definitely been a help to me. What are the
> other difficulties (other than the decomposition buffer size you just
> mentioned)?

As guidelines, sure. I'm not suggesting it was a bad idea in the
first place to indicate all these kinds of character equivalencies
that got associated with various of the compatibility characters
in the standard. You just cannot assume that compatibility mappings
can be used without discrimination and refinement for particular
processes.

Another example of a complication for the decompositions of the Arabic
word ligatures would come from assuming that compatibility decompositions
should be mapped onto input methods. That would be *correct* in the
case of a two-character ligature -- say something like FCA6 THEH WITH MEEM --
one could expect to type THEH ... MEEM ... and then have automatic ligature
formation under certain circumstances. But the word ligature symbols
are different. Nobody really expects to have to type:
0635 0644 0649 0020 0627 0644 0644 0647 0020 0639 0644 064A 0647 0020
0648 0633 0644 0645 and then have ligature formation scoop up the
entire sequence to create the SALLALLAHOU ALAYHE WASALLAM symbol, do
they? No, if you are using such a special word ligature symbolically,
as in a regular header for documents, or such, then you expect to have
the symbol on its own key (or the moral equivalent thereof). And the
implementers of fonts and rendering systems can't reasonably be expected
to search for these few extraordinary edge cases and deal with them
automatically, when instead they should be focussed on the more regular
2- and 3-element ligatures.
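The scale of that edge case is easy to see in code (Python, standard library):

```python
import unicodedata

# U+FDFA ARABIC LIGATURE SALLALLAHOU ALAYHE WASALLAM: one symbol
# whose compatibility decomposition is an 18-character phrase.
symbol = "\ufdfa"
phrase = unicodedata.normalize("NFKD", symbol)
print(len(symbol), len(phrase))  # 1 18
```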
 
> Also, please note that we cannot remove all those decompositions; we can
> only make the implementer's job a little harder by breaking the model (and
> encouraging the use of the Arabic Presentation Forms block), or we can help
> him a little.
>
> > Nope. The UTC wouldn't do it, nor have the Pakistani delegates
> > working with the UTC and WG2 asked for it.
>
> It was in the first proposals, IIRC. They were not formal, of course.

The first "proposals" were actually simply background documents about
UZT, rather than well-formed proposals for actual encodings. Those
came later.

> > Not everything that gets into a national standard gets into Unicode.
>
> Undoubtedly, but being in a national standard helps a lot. So OK, are you
> telling me that it is not just for compatibility, it is a legitimate
> character that could have got accepted by WG2 even if it was not in UZT?

I expect so, actually, given its usage. A similar (but not identical)
BISMALLAH was requested early for the Thaana script, and I expect that
the decision about that will eventually be revisited, as well.

> BTW, whose was the suggestion to not provide a compatibility decomposition
> for the character?

At this point, who can recall who was the first to raise their hand? ;-)
Essentially it was a consensus decision by the committee, with little
dissent that I can recall.

> > But I think you may be overestimating the caving in going on here.
> > The UTC is still pushing back on another proposal to disunify Urdu
> > digits, for example -- those did *not* get accepted by WG2, nor do I
> > expect they will pass muster in future UTC meetings.
>
> Yes, I'm doing that on purpose. I was talking with a Pakistani expert
> about UZT before the Dublin meeting. He told me that his colleagues will
> propose the character to WG2, and I told him it's impossible, WG2 has
> already passed something about not encoding more Arabic ligatures, unless
> it is in a pre-90s standard. He told me: "You only need to push hard. Just
> insist enough, they will surrender".

No one doubts that there is a political aspect to character encoding.
After all, this works in an international context, with lots of competing
interests and individuals in two different large committees.

But it isn't as simple as just pushing and insisting. In the end, you
have to convince two committees that there is *some* technical merit
to the proposal. Totally off-the-wall stuff doesn't get in, no matter
how hard you push. You haven't seen a *real* game of political character
encoding hardball if you haven't seen the 7-member North Korean delegation
at the Beijing WG2 meeting insisting on the complete reencoding and
renaming of all Korean characters in 10646!

> I'm just a geek who prefers technical excellence to political reasons. I
> do try my best for implementing this in our national standard committees,
> and I somehow expect this from the UTC. I just like to see more of that
> resistance against grass radicals.

:-) I pushed *very* hard to avoid the proliferation of grass radicals
in the standard. In the end, I lost on that one -- and we ended up
with more grass radicals, anyway. I consider that one more chapter in
the sorry history of mistakes in de jure and de facto Japanese encoding
standards. But you win some and you lose some.

--Ken



This archive was generated by hypermail 2.1.2 : Thu Aug 08 2002 - 19:03:38 EDT