Re: Furigana

From: Doug Ewell (dewell@adelphia.net)
Date: Mon Aug 12 2002 - 23:09:28 EDT


Kenneth Whistler <kenw at sybase dot com> wrote:

>> Surely all Unicode/10646 characters are expected to be preserved in
>> interchange. What have I got wrong, Ken?
>
> Your expectation that this stuff will actually work that way.
>
> Yes, the characters will be preserved in interchange. But the
> most likely result you will get is:
>
> <anchor1>text<anchor2>annotation<anchor3>
>
> where the anchors will just be blorts. You should not expect that
> the whole annotation *framework* will be implemented, and certainly
> not that these three characters will suffice for "nice[ly] marked
> up... furigana".
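
For concreteness, the three anchors in question are U+FFF9 INTERLINEAR
ANNOTATION ANCHOR, U+FFFA INTERLINEAR ANNOTATION SEPARATOR, and U+FFFB
INTERLINEAR ANNOTATION TERMINATOR. A toy Python sketch of the pattern
(the helper name and the sample text are mine, purely for illustration):

ANCHOR     = "\uFFF9"  # INTERLINEAR ANNOTATION ANCHOR
SEPARATOR  = "\uFFFA"  # INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\uFFFB"  # INTERLINEAR ANNOTATION TERMINATOR

def annotate(base, ruby):
    # <anchor1>base text<anchor2>annotation<anchor3>
    return ANCHOR + base + SEPARATOR + ruby + TERMINATOR

text = "私は" + annotate("漢字", "かんじ") + "を読む"

# A renderer that implements the framework stacks the kana over the
# kanji; one that merely preserves the characters shows the base text,
# the kana, and three blorts inline.
print([f"U+{ord(c):04X}" for c in text])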

I don't have any problem with the idea that many, or even all, of
today's applications lack meaningful support for the interlinear
annotation characters and will display them as blorts, and I doubt that
Michael expects widespread support for them either. What worries me is
what Ken says next:

> These animals are more like U+FFFC -- they are internal anchors
> that should not be exported, as there is no general expectation
> that once exported to plain text, a receiver will have sufficient
> context for making sense of them in the way the originator was
> dealing with them internally.
>
> By rights, this whole problem of synchronizing the internal anchor
> points for various ruby schemes should have been handled by
> noncharacters -- but that mechanism was not really understood
> and expanded sufficiently until after the interlinear annotation
> characters were standardized.

This moves the entire issue out of the realm of poor support and into
the big, dark, scary cavern of pre-deprecation.
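
As I read it, the internal-only approach Ken has in mind would be
roughly the following; the choice of the noncharacter U+FDD0 and the
helper names are mine, just to make the idea concrete:

INTERNAL_ANCHOR = "\uFDD0"  # a noncharacter: fine in memory, never interchanged

def mark(text, positions):
    # Drop the internal anchor in at each offset (for ruby runs, etc.).
    out, prev = [], 0
    for pos in sorted(positions):
        out.append(text[prev:pos] + INTERNAL_ANCHOR)
        prev = pos
    out.append(text[prev:])
    return "".join(out)

def export(text):
    # Strip every internal anchor before the text leaves the process.
    return text.replace(INTERNAL_ANCHOR, "")

internal = mark("漢字を読む", [0, 2])
assert export(internal) == "漢字を読む"   # nothing internal leaks out

Nothing like that ever shows up in interchanged text, which is exactly
the property Ken says these three characters were supposed to have.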

Unicode 3.0 doesn't say exactly what Ken says. Unicode 3.0 (p. 326)
says the annotation characters should only be used under "prior
agreement between the sender and the receiver because the content may be
misinterpreted otherwise." Fine, no problem; those are the same rules
that apply to the PUA. Ken, though, seems to say they shouldn't be
exported at all, and furthermore they shouldn't even have been encoded
in the first place, except that the noncharacters (which explicitly
mustn't be interchanged) hadn't been invented yet.

This sounds like Plane 14, or the combining Vietnamese tone marks, all
over again -- Unicode (and/or WG2) invents a mechanism, but then wishes
it hadn't, or thinks of a better way, so the mechanism is "strongly
discouraged" and eventually deprecated. (Not that I liked the separate
Vietnamese tone marks; don't get me wrong.)

Some groups, like IDN and the security mavens, criticize Unicode for its
perceived "instability." A lot of the attention seems to revolve around
gray areas of normalization and bidi, or confusable glyphs (what I call
"spoof buddies"). Can I suggest that a potentially larger source of
instability comes from the creation of characters and encoding
mechanisms that are subsequently discouraged or deprecated because maybe
they weren't fully thought out in the first place? The approval process
in Unicode, and especially WG2, is a slow one, and some of these "on
second thought" decisions race ahead of the approval process, so that
the mechanisms are already doomed by the time of publication.

Everybody will welcome the new conventional, graphical-type characters
and scripts that are coming with Unicode 4.0. But maybe before
standardizing another COMBINING GRAPHEME JOINER or other control-type
character, it would be prudent to study the angles even more thoroughly
and carefully, and make *damn* sure the character is going to be usable
and not discouraged or even deprecated at birth.

(No, I have never been involved in the character standardization
process -- but I *have* been on committees that encoded other types of
things too hastily and then had to find a way to "take back" their
decision.)

-Doug Ewell
 Fullerton, California


