RE: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

From: Kent Karlsson (kentk@cs.chalmers.se)
Date: Mon Nov 24 2003 - 06:29:32 EST

    ...
    > >> Of course, no compression format applied to jamos could
    > >> even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
    > >> syllable.
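
    To put numbers on the quoted claim, here is a minimal Python sketch. It
    assumes the comparison is between one precomposed syllable and its
    canonical (NFD) jamo decomposition. The sample syllable U+D55C and the
    helper name `sizes` are illustrative choices, not anything from the thread.

```python
import unicodedata

# U+D55C is a precomposed Hangul syllable chosen as an example.
syllable = "\uD55C"

# NFD splits it into conjoining jamos: U+1112 (L), U+1161 (V), U+11AB (T).
jamos = unicodedata.normalize("NFD", syllable)

def sizes(text):
    """Return the encoded length of `text` in bytes for UTF-16 and UTF-8."""
    return {
        "utf-16": len(text.encode("utf-16-be")),  # big-endian, no BOM
        "utf-8": len(text.encode("utf-8")),
    }

print(len(jamos), sizes(syllable), sizes(jamos))
```

    In UTF-16 the precomposed form takes 2 bytes against 6 for the three
    jamos, so a compressor working on jamos has to recover a factor of
    three just to break even.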

    I wonder why Hangul would need compression over and above
    any other alphabetic script... It already has quite a lot of compression
    in the form of precomposed syllables. I think we had better start a project
    for allocating precomposed "syllables" for many other scripts,
    precomposed Latin script syllables, precomposed Greek script
    syllables, precomposed Tamil script syllables (most of the Brahmic
    derived scripts are especially disadvantaged, from a 'compression'
    viewpoint, by the virama characters), etc. That should take up much
    of the excess space in the unused planes (3-13, decimal).
    Unfortunately that means 4 bytes per non-Hangul syllable (before
    byte-oriented compression is done), but that could be compensated
    by using an SCSU-like approach, just with bigger windows.

            No, this was not serious ;-)
            /kent k

    PS
    Hangul syllables are "LVT" (actually (L+)(V+)(T*)), not TLV.
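
    The L, V, T order can be seen in the arithmetic decomposition of the
    precomposed syllables. The following is a minimal Python sketch using
    the constants from the Unicode Standard's Hangul syllable decomposition
    algorithm. It covers only the precomposed block (one L, one V, at most
    one T), not the more general (L+)(V+)(T*) sequences, and the function
    name `decompose` is an illustrative choice.

```python
# Constants from the Unicode Standard's Hangul decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total

def decompose(syllable):
    """Split one precomposed Hangul syllable into (L, V, T) jamos.

    T is the empty string when the syllable has no trailing consonant.
    """
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trail = chr(T_BASE + t_index) if t_index else ""
    return lead, vowel, trail

print([hex(ord(j)) for j in decompose("\uD55C")])
```

    The leading consonant is the most significant "digit" of the syllable
    index and the trailing consonant the least significant, which matches
    the L, V, T ordering.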


