Component Based Han Ideograph Encoding (WAS: Level of Unicode support required for various languages)

From: Ed Trager (ed.trager@gmail.com)
Date: Sat Oct 27 2007 - 11:28:06 CDT

    Hi, everyone,

    Although a component-based system of encoding Han ideographs clearly
    did not happen --and is not going to happen-- in the Unicode Standard,
    there is no reason why such a system and standard could not now be
    devised --along with reference implementations-- by an enterprising
    community of people worldwide interested in creating a new, possibly
    competing, and certainly less-limiting future standard for the
    encoding of textual information using Han ideographs.

    One can rather easily imagine an Open Source-style project which would
    set out to define a new and independent standard for encoding Han
    ideographs based on their components and the relative positioning of
    those components.

    Any ideographs so encoded which map to ideographs currently encoded in
    Unicode could simply be rendered using existing Unicode CJK fonts
    which already contain the relevant "precomposed" glyphs.
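
    To make the fallback to existing fonts concrete, here is a minimal
    sketch in Python. It assumes component sequences are written in a
    prefix notation borrowed from Unicode's Ideographic Description
    Characters (U+2FF0 means "left to right"); that notation, the
    PRECOMPOSED table and the render_form function are all hypothetical
    stand-ins for illustration, not part of any existing standard for
    this purpose.

```python
# Hypothetical lookup table: component sequence -> precomposed character.
# U+2FF0 is the Ideographic Description Character "left to right".
PRECOMPOSED = {
    "\u2ff0\u5973\u99ac": "\u5abd",  # U+5973 beside U+99AC maps to U+5ABD
}


def render_form(sequence: str) -> tuple[str, str]:
    """Decide how a component sequence should be rendered.

    Returns ("precomposed", char) when an existing Unicode glyph can be
    reused, or ("compose", sequence) when a composing engine is needed.
    """
    char = PRECOMPOSED.get(sequence)
    if char is not None:
        return ("precomposed", char)
    return ("compose", sequence)
```

    A real table would have to be built from a complete decomposition
    database; the single entry above is only there to show the shape of
    the lookup.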

    As for those ideographs not yet encoded in Unicode, or those rare
    historical or modern oddities and variants which will never be encoded
    in Unicode, such a system would need to provide a "composing engine"
    capable of doing at least a half-decent job at composing ideographs
    from the set of base components. Writing such an engine would be a
    great challenge, which might make it even more likely to actually
    happen, as smart people everywhere on the planet generally enjoy a
    good challenge :-) .
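
    As a very rough idea of the simplest possible first step for such an
    engine, the sketch below splits the em square into boxes, one per
    component. Everything in it is an assumption made for illustration:
    the prefix notation using U+2FF0 (left to right) and U+2FF1 (above to
    below), the equal half-and-half splits, the y axis growing downward,
    and the names Box and layout. A half-decent engine would also need to
    reshape the components themselves, which this does not attempt.

```python
from dataclasses import dataclass

LEFT_RIGHT = "\u2ff0"  # first component on the left, second on the right
TOP_BOTTOM = "\u2ff1"  # first component on top, second below


@dataclass(frozen=True)
class Box:
    x: float
    y: float
    w: float
    h: float


def _layout(seq: str, box: Box):
    """Lay out one prefix-notation expression; return (placements, rest)."""
    if not seq:
        raise ValueError("sequence ended early")
    head, rest = seq[0], seq[1:]
    if head == LEFT_RIGHT:
        first = Box(box.x, box.y, box.w / 2, box.h)
        second = Box(box.x + box.w / 2, box.y, box.w / 2, box.h)
    elif head == TOP_BOTTOM:
        first = Box(box.x, box.y, box.w, box.h / 2)
        second = Box(box.x, box.y + box.h / 2, box.w, box.h / 2)
    else:
        # A plain component fills the box it was given.
        return [(head, box)], rest
    placed_first, rest = _layout(rest, first)
    placed_second, rest = _layout(rest, second)
    return placed_first + placed_second, rest


def layout(seq: str, box: Box = Box(0.0, 0.0, 1.0, 1.0)):
    """Return a list of (component, Box) pairs for a whole sequence."""
    placed, rest = _layout(seq, box)
    if rest:
        raise ValueError("trailing characters after a complete sequence")
    return placed
```

    Calling layout("\u2ff0\u5973\u99ac") gives the first component the
    left half of the unit square and the second the right half; nested
    operators subdivide the boxes recursively.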

    Such a "composing engine" could eventually be tied into existing or
    future text layout and font rasterizing engines, thus allowing
    noodle-eaters everywhere to write about how tasty that dish of
    "biang2 biang2" noodles* they had yesterday was, or parents to name
    their cute babies using uniquely cute ideographs of their own
    invention, or enterprising marketeers to gain market share by
    inventing new ideographs for their "As Seen On TV" products.

    Of course, there would also be many important real-world and scholarly
    applications if such a standard and system existed. :-)

    (* http://en.wikipedia.org/wiki/Biang_biang_noodles )

    -- Ed Trager

    > On Oct 25, 2007, at 11:41 PM, vunzndi@vfemail.net wrote:
    >
    > An even more efficient solution, as far as code points go, would have
    > been to encode the components of Chinese characters, not precomposed
    > characters. This would take up over 10 thousand code points to encode
    > the current 70 thousand Unicode characters, and include over 80% of
    > all CJKV submissions. In this case new submissions would be
    > restricted to new components. This way all CJKV would be in the BMP.
    >

    On 10/27/07, vunzndi@vfemail.net <vunzndi@vfemail.net> wrote:
    > Dear Gerrit,
    >
    > IMHO you are correct, the biggest obstacle was not technical, but
    > other factors.
    >
    > John
    >
    > Quoting Gerrit Sangel <z0idberg@gmx.de>:
    >
    > > Excuse me if I am wrong, but according to Wikipedia, the original Cangjie
    > > method mastered this in the 80s or so. And I do not think the computers at
    > > that time were really sophisticated.
    > >
    > > Could it not have been solved like the ligatures in TeX? I mean, TeX masters
    > > some features other apps still cannot do now.
    > >
    > > I think, a possibility would have been to store the text like 女
    > > (U+5973) and 馬
    > > (U+99AC) and generate 媽 (U+5ABD) via some kind of ligatures. This could then
    > > be stored in the font, which describes that if 女 is followed by 馬 and a
    > > character for "next character" it should generate 媽.
    > >
    > > This could have then spanned the ordinary CJK range, but if some kind
    > > of "unknown" character is typed in, it could still be stored (maybe with
    > > inferior display quality, but still it would not have needed a code
    > > point).
    > >
    > > Regards
    > > Gerrit Sangel
    > >
    > > Am Freitag 26 Oktober 2007 schrieb John H. Jenkins:
    > >> it would
    > >> have required technical support beyond the abilities of then-current
    > >> systems, it would have made East Asian texts take even *more* space
    > >> than they do now and made them more difficult to process.
    > >


