Re: The result of the plane 14 tag characters review.

From: John H. Jenkins (
Date: Thu Nov 14 2002 - 10:49:08 EST

    On Wednesday, November 13, 2002, at 12:07 AM, George W Gerrity wrote:

    > In an effort to unify all character and pictographs, the decision was
    > made to unify CJK characters by suppressing most variant forms. That
    > turns out to be the single greatest objection from users -- especially
    > Japanese -- and somehow we need a low-level way of indicating the
    > target language in the context of multilingual text.
    > The plane 14 tags seem to be appropriate to do this, giving a hint to
    > the font engine as to a good choice of alternate glyphs, where
    > available.

    A couple of points.

    1) There are two kinds of variant problems arising from Unihan. The
    objections based on them are usually stated as follows:

    Japanese readers will be forced to read Japanese text with Chinese
    glyphs!

    Mr. Watanabe won't be able to insert the variant glyph for his name
    that he prefers into a document!

    The first objection is, and always has been, a non-issue, and is the
    only aspect of the problem that the Plane 14 tags could hope to deal
    with. The issue is not a language one, but a locale one, to begin
    with. Moreover, the typical practice in Japanese typography (at least)
    is to use Japanese-preferred glyphs even when displaying Chinese text.
    Japanese users do *not* expect the text to switch back-and-forth
    between Chinese and Japanese glyphs as the language varies.

    Given this, the best solution to the problem is to use fonts aimed at
    the specific locale. This means that a Japanese user who goes to read
    her email at an Internet café in Hong Kong may see unexpected glyphs,
    true, but it really handles 99.99+% of the problem.

    I should note that as Unicode-based systems such as Windows XP and
    Mac OS X are becoming more common in Japan, there is less concern
    being expressed on this point.

    The second objection could not be solved by the Plane 14 tags. The two
    solutions that are possible are to separately encode every glyphic
    variant which someone, somewhere, sometime may find necessary to
    distinguish in plain text, or to use variant markers. It is the latter
    solution which the UTC has adopted.

    2) From a technical standpoint, the Plane 14 tags do not really lend
    themselves to use with the main complex script font engines available.
    I don't know enough about Graphite to really speak to it, but in the
    case of OpenType and AAT it is true that protocols are already
    available to use Japanese/SC/TC/Korean/Vietnamese glyphs for a run of
    text. These existing protocols, however, depend on information
    external to the text itself.

    To keep the information internal to the text, or, more accurately,
    internal to the glyph stream, one would have to have the ability to
    enter a state once a certain character (or glyph) is encountered and
    remain in that state indefinitely. Neither OpenType nor AAT allows
    this. OpenType does not use a state engine internal to the glyph
    stream for processing, and AAT resets the state at the beginning of
    each line.

    What would have to happen is that the rendering engine would have to
    find these characters within the text stream, massage the text data so
    as to remove them and mark the text with the equivalent higher-level
    information, and then render the result.
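
    As a rough illustration of that preprocessing step, here is a minimal
    Python sketch. It assumes only the language-tag mechanism (U+E0001
    LANGUAGE TAG, tag characters U+E0020..U+E007E, U+E007F CANCEL TAG).
    The function names and the (language, text) run representation are
    hypothetical and are not part of Uniscribe, ATSUI, or any font format.

```python
LANGUAGE_TAG = 0xE0001               # U+E0001 LANGUAGE TAG
CANCEL_TAG = 0xE007F                 # U+E007F CANCEL TAG
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E  # tag characters mirroring ASCII


def make_language_tag(lang):
    """Encode e.g. 'ja' as a Plane 14 language tag sequence (helper for demos)."""
    return chr(LANGUAGE_TAG) + "".join(chr(0xE0000 + ord(c)) for c in lang)


def split_language_runs(text):
    """Strip Plane 14 tags; return a list of (language_or_None, visible_text)."""
    runs = []
    buf = []
    current_lang = None
    i, n = 0, len(text)

    def flush():
        # Emit the text collected so far under the language currently in force.
        if buf:
            runs.append((current_lang, "".join(buf)))
            buf.clear()

    while i < n:
        cp = ord(text[i])
        if cp == LANGUAGE_TAG:
            flush()
            i += 1
            if i < n and ord(text[i]) == CANCEL_TAG:
                # LANGUAGE TAG + CANCEL TAG cancels the language tag.
                current_lang = None
                i += 1
            else:
                tag = []
                while i < n and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                    tag.append(chr(ord(text[i]) - 0xE0000))
                    i += 1
                current_lang = "".join(tag) or None
        elif cp == CANCEL_TAG:
            # A bare CANCEL TAG cancels all tags in force.
            flush()
            current_lang = None
            i += 1
        elif TAG_FIRST <= cp <= TAG_LAST:
            i += 1  # stray tag character with no introducer: drop it
        else:
            buf.append(text[i])
            i += 1
    flush()
    return runs
```

    Each run's language would then be handed to the renderer the same way
    a font or point size is, which is the "higher-level information"
    mentioned above.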

    The problem here is that libraries such as Uniscribe and ATSUI, which
    provide Unicode rendering, do not deal with the text as a whole
    (at least, this is definitely true with ATSUI and is probably true with
    Uniscribe, although I don't know for sure). That is, the Plane 14 tag
    may be found in the first paragraph of the text, but when the client
    hands the text off to the library, they may hand off only a later
    portion because that's all that needs to be drawn. The library then
    does not have access to this information and will not render the text
    correctly.

    This basically means that the onus is on the client to detect these
    tags in the text and make appropriate adjustments when it hands off
    the text to Uniscribe or ATSUI for rendering. As
    such, there is no real advantage gained by having these tags embedded
    directly in the text over having them in the same layer as font, point
    size, and other typographic preferences. Indeed, it becomes
    inconvenient to have them in a different layer as it means that the
    client has to do *two* levels of processing to derive this information,
    rather than just one.
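
    To make the hand-off problem concrete, here is a small Python sketch
    of what the client would have to do before drawing only a later
    portion of the text: scan from the start of the document to find the
    language tag still in force at the draw offset. The function names are
    hypothetical, and the sketch assumes the offset does not fall in the
    middle of a tag sequence.

```python
LANGUAGE_TAG = 0xE0001               # U+E0001 LANGUAGE TAG
CANCEL_TAG = 0xE007F                 # U+E007F CANCEL TAG
TAG_FIRST, TAG_LAST = 0xE0020, 0xE007E  # tag characters mirroring ASCII


def encode_language_tag(lang):
    """Encode e.g. 'ja' as a Plane 14 language tag sequence (helper for demos)."""
    return chr(LANGUAGE_TAG) + "".join(chr(0xE0000 + ord(c)) for c in lang)


def language_in_effect(text, offset):
    """Return the language tag in force just before text[offset], or None."""
    lang = None
    i = 0
    while i < offset:
        cp = ord(text[i])
        i += 1
        if cp == CANCEL_TAG:
            lang = None
        elif cp == LANGUAGE_TAG:
            if i < offset and ord(text[i]) == CANCEL_TAG:
                lang = None
                i += 1
            else:
                start = i
                while i < offset and TAG_FIRST <= ord(text[i]) <= TAG_LAST:
                    i += 1
                lang = "".join(chr(ord(c) - 0xE0000) for c in text[start:i]) or None
    return lang


# The tag sits in the first paragraph; only the second one needs drawing.
document = encode_language_tag("ja") + "first paragraph\n" + "second paragraph"
draw_from = document.index("second")

# The client, which sees the whole document, can recover the language ...
lang_for_client = language_in_effect(document, draw_from)
# ... but a library handed only the later portion has nothing to find.
lang_for_library = language_in_effect(document[draw_from:], 0)
```

    Once the client has done this scan, it already holds the language as
    out-of-band information, which is where font and point size live
    anyway.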

    John H. Jenkins