Re: Encoding Personal Use Ideographs

From: James Kass (thunder-bird@earthlink.net)
Date: Sun Nov 04 2007 - 00:23:33 CST


    Philippe Verdy wrote,

    > James Kass wrote:
    > > When the IDS order is top-to-bottom, the appropriate IDC is used (⿱).
    > > When the IDS order is bottom-to-top, then correct IDCharacters exist : ⿶
    > > and, possibly ⿺.
    > > (...)
    > > But, 峯 # ⿶夆山. "⿶夆山" would not be a valid IDS.
    >
    > In 峯, it's clear that using the ⿶ IDC for describing it is not correct.
    > The second component 山 is not embedded within the first one, but
    > really stacked on top of it. So we can only describe it from top-to-bottom
    > using the ⿱ IDC, and then the encoding order in the IDS is reversed,
    > and not logical. This is a defect in this case, and one would need to have
    > another variant of the ⿱ IDC whose order is reversed from bottom to
    > top.

    Characters composed of components which are stacked vertically
    are always written top-to-bottom. An IDS for any such character
    will always match that order. This is not a defect and there would
    be no use in IDS for an IDC with reversed bottom-to-top order.

    I apologize if I was unclear.

    Incidentally, characters composed of components aligned side-by-side
    are always drawn starting with the left-most component. The IDS order
    follows the written order in this case, as well.
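
    As a rough illustration of the ordering described above, here is a
    minimal Python sketch that reads an IDS in prefix order and lists its
    components in the order they appear. It assumes only the twelve IDCs
    U+2FF0..U+2FFB, with U+2FF2 and U+2FF3 taking three operands and the
    rest two; the names parse_ids and components are made up for this
    example. It checks syntax only: it cannot tell whether a given IDC
    actually suits the shape of a character, so "⿶夆山" would parse even
    though it does not describe 峯.

```python
# Assumption: only the IDCs U+2FF0..U+2FFB are handled.
# U+2FF2 and U+2FF3 take three operands; the others take two.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3
IDC_ARITY["\u2FF3"] = 3


def parse_ids(ids):
    """Parse a prefix-notation IDS into a nested tuple (idc, operand, ...)."""

    def parse(pos):
        if pos >= len(ids):
            raise ValueError("truncated IDS")
        ch = ids[pos]
        pos += 1
        if ch not in IDC_ARITY:
            # Any non-IDC character is treated as a component.
            return ch, pos
        operands = []
        for _ in range(IDC_ARITY[ch]):
            node, pos = parse(pos)
            operands.append(node)
        return (ch, *operands), pos

    tree, end = parse(0)
    if end != len(ids):
        raise ValueError("trailing characters after IDS")
    return tree


def components(tree):
    """Return the components in the order they occur in the IDS."""
    if isinstance(tree, str):
        return [tree]
    result = []
    for child in tree[1:]:
        result.extend(components(child))
    return result


# 山 sits on top of 夆, so the IDS lists 山 first.
print(components(parse_ids("⿱山夆")))
```

    With ⿱ the first operand is the upper component and with ⿰ it is the
    left-hand one, so the component list comes out in the same order as
    the written order.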

    > Using ⿶ or ⿺ for this in the IDS is really a hack: if it preserves the
    > logical order, it does not correctly encode the description.
    >
    > There's only one conclusion: the IDS and the logical ideographs do not
    > encode the same thing. The mapping between the two is not one-to-one,
    > but often one-to-many, or many-to-one, or many-to-many; when
    > these exceptions can count as far as 20% of the existing encoded
    > ideographs, we can really conclude that it is best to always avoid
    > the existing IDS.
    >
    > Maybe it will be possible to have better IDS that allow one-to-one
    > mappings, but this won't be possible without adding new IDC
    > characters to exhibit more properties: the effective layout of a
    > representative character that is uniquely identifiable, even if it
    > has several other presentations, that would also have their unique
    > IDS; the choice between the IDS to use for the same ideograph
    > would be mostly a matter of localization (notably between
    > Simplified and Traditional Chinese, but also within Modern
    > Japanese, or regional Japanese dialects, or historical variants).
    >
    > I'm still convinced that it will be possible to have a one-to-one
    > mapping between a future IDS standard and all ideographs,
    > if the mapping incorporates locale selectors: this locale selector
    > would allow to select which IDS is representative of a given
    > ideograph, which other IDS are considered equivalent, and
    > which other IDSs are equivalent in one locale but not in another
    > (so that the distinctive subsets would require the encoding of
    > variant selectors for these ideographs, to disambiguate the
    > cases).
    >
    > In fact I do think that the only need for registering variants
    > for ideographs, is to allow distinctions between groups of possible
    > IDS-represented glyphs. Other variants that are only
    > typographical don't need to be registered, as long as there's no
    > distinction in some CJKV language or dialect.

    Experts are studying the ideographic description characters in
    an effort to correct any deficiencies. Likewise, experts are
    studying character components which are not yet encoded
    as single characters in Unicode.

    Leaving locale issues aside, the reason given for registering
    variation sequences for CJK is to give users the option of
    preserving variant forms in plain text of items which would
    otherwise be unified.

    If IDS use would accommodate roughly 80% of CJK characters,
    and if Unicode allows applications to form glyphs for IDSequences,
    and if users need to represent as-yet-unencoded or never-to-be-
    encoded "characters" right now in plain text, is there a problem
    in using IDSequences to do so?

    If people seek a compositional model for forming Chinese characters
    in computer text, and one exists in the form of IDS (however
    imperfect), is there anything wrong with using IDS for the
    80% of the cases which IDS can cover?

    And, if IDS are used in this fashion, would the pressure to encode
    potentially tens of thousands more ideographs be lessened? In
    other words, could 80% of as-yet-unencoded characters be covered
    with IDS and never need to be encoded at all, leaving only 20% which
    would have to be assigned code points? And, likewise, could 80%
    of future proposals for CJK variation sequences be handled well
    with IDSequences?

    Best regards,

    James Kass


