Re: Korean compression (was: Re: Ternary search trees for Unicode dictionaries)

From: Mark Davis (mark.davis@jtcsv.com)
Date: Sun Nov 23 2003 - 13:24:59 EST


    >Of course, no compression format applied to jamos could
    > even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
    > syllable.

    This needs a bit of qualification. An arithmetic compression would do
    better, for example, or even just a compression scheme that exploited the
    most frequent jamo sequences. Perhaps the above is better phrased as 'no
    simple byte-level compression format...'.
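
    A minimal sketch of why that qualification holds: there are only 11,172
    possible modern syllables, so even a uniform model needs about 13.5 bits
    per syllable, and an order-0 frequency model over real text does much
    better still. The Python below (the function name is mine, purely
    illustrative) measures that entropy; an arithmetic coder driven by such a
    model approaches it, comfortably under UTF-16's fixed 16 bits.

        import math
        from collections import Counter

        def syllable_entropy_bits(text):
            # Order-0 entropy, in bits per symbol, of the precomposed Hangul
            # syllables (U+AC00..U+D7A3) occurring in `text`.
            syllables = [ch for ch in text if 0xAC00 <= ord(ch) <= 0xD7A3]
            if not syllables:
                return 0.0
            counts = Counter(syllables)
            total = len(syllables)
            return -sum((n / total) * math.log2(n / total)
                        for n in counts.values())

        # An arithmetic coder built on this model costs roughly
        # syllable_entropy_bits(text) / 8 bytes per syllable.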

    Mark
    __________________________________
    http://www.macchiato.com
    ► शिष्यादिच्छेत्पराजयम् ◄

    ----- Original Message -----
    From: "Doug Ewell" <dewell@adelphia.net>
    To: "Unicode Mailing List" <unicode@unicode.org>
    Cc: "Jungshik Shin" <jshin@mailaps.org>; "John Cowan" <jcowan@reutershealth.com>
    Sent: Sat, 2003 Nov 22 22:53
    Subject: Korean compression (was: Re: Ternary search trees for Unicode
    dictionaries)

    > Jungshik Shin <jshin at mailaps dot org> wrote:
    >
    > >> The file they used, called "arirang.txt," contains over 3.3 million
    > >> Unicode characters and was apparently once part of their "Florida
    > >> Tech Corpus of Multi-Lingual Text" but subsequently deleted for
    > >> reasons not known to me. I can supply it if you're interested.
    > >
    > > It'd be great if you could.
    >
    > Try
    > http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt
    > first. If that doesn't work, I'll send you a copy. It's over 5
    > megabytes, so I'd like to avoid that if possible.
    >
    > >> The statistics on this file are as follows:
    > >>
    > >> UTF-16                6,634,430 bytes
    > >> UTF-8                 7,637,601 bytes
    > >> SCSU                  6,414,319 bytes
    > >> BOCU-1                5,897,258 bytes
    > >> Legacy encoding (*)   5,477,432 bytes
    > >> (*) KS C 5601, KS X 1001, or EUC-KR
    > >
    > > Sorry to pick on this (when I have to thank you). Even with the
    > > coded character set vs. character encoding scheme distinction set aside
    > > (that is, if we just think in terms of character repertoire), KS C 5601/
    > > KS X 1001 _alone_ cannot represent any Korean text unless you're
    > > willing to live with double-width spaces, Latin letters, numbers and
    > > punctuation. (Since you wrote that the file apparently has full stops and
    > > spaces in ASCII, it does include characters outside KS X 1001.) On the
    > > other hand, EUC-KR (KS X 1001 + ISO 646:KR/US-ASCII) can. Actually, I
    > > suspect the legacy encoding used was Windows code page 949 (or JOHAB/
    > > Windows-1361?), because I can't imagine that, out of over 2 million
    > > syllables, there is not a single syllable outside the character
    > > repertoire of KS X 1001.
    >
    > Sorry, I should have noticed on Atkin and Stansifer's data page
    > (http://www.cs.fit.edu/~ryan/compress/) that the file is in EUC-KR. All
    > I knew was that I was able to import it into SC UniPad using the option
    > marked "KS C 5601 / KS X 1001, EUC-KR (Korean)".
    >
    > >> I used my own SCSU encoder to achieve these results, but it really
    > >> wouldn't matter which was chosen -- Korean syllables can be encoded
    > >> in SCSU *only* by using Unicode mode. It's not possible to set a
    > >> window to the Korean syllable range.
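
    For readers not steeped in SCSU, two things are at work here: the
    window-offset table in UTR #6 simply skips the big ideographic and
    syllabic ranges (roughly U+3400..U+DFFF), and even if it did not, a
    dynamic window covers only 128 consecutive code points against a
    repertoire of 11,172. A trivial sketch of the size arithmetic (nothing
    here comes from any particular encoder):

        # Hangul Syllables block vs. one SCSU dynamic window.
        HANGUL_FIRST, HANGUL_LAST = 0xAC00, 0xD7A3
        WINDOW_SIZE = 128                     # one dynamic window = 128 code points

        block_size = HANGUL_LAST - HANGUL_FIRST + 1       # 11,172 syllables
        windows_needed = -(-block_size // WINDOW_SIZE)    # ceiling division -> 88

        # With only 8 dynamic window slots available, precomposed syllables
        # can only be emitted in Unicode mode, i.e. 2 bytes each.
        print(block_size, windows_needed)                 # 11172 88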
    > >
    > > Now that you've told me you used NFC, isn't this situation similar to
    > > Chinese text? How do BOCU and SCSU work for Chinese text? Japanese
    > > text might do slightly better with Kana, but isn't likely to be much
    > > better.
    >
    > Well, *I* didn't use NFC for anything. That's just how the file came to
    > me. And yes, the situation is exactly the same for Chinese text, except
    > I suppose that with 20,000-some basic Unihan characters, plus Extension
    > A and B, plus the compatibility guys starting at U+F900, one might not
    > realistically expect any better than 16 bits per character. OTOH, when
    > dealing with the 11,172 Hangul syllables interspersed with Basic Latin, I
    > imagine there is some room for improvement over UTF-16.
    >
    > I'm intrigued by the improved performance of BOCU-1 on Korean text, and
    > I'm now interested in finding a way to achieve even better compression
    > of Hangul syllables, using a strategy *not* much more complex than SCSU
    > or BOCU and *not* involving huge reordering tables. Your assistance,
    > and anyone else's, would be welcome. Googling for "Korean compression"
    > or "Hang[e]ul compression" turns up practically nothing, so there is a
    > chance to break some new ground here.
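
    One possible starting point for such an experiment (a sketch of my own,
    not something proposed in this thread) is the arithmetic decomposition
    the Unicode Standard defines for precomposed syllables: each code point
    in U+AC00..U+D7A3 factors into a leading-consonant index (0..18), a
    vowel index (0..20), and a trailing-consonant index (0..27, with 0
    meaning no trailing consonant). Those three small numbers are easy
    targets for a frequency model, whereas a naive fixed packing still
    needs 15 bits.

        # Standard arithmetic decomposition of a precomposed Hangul syllable.
        S_BASE, L_COUNT, V_COUNT, T_COUNT = 0xAC00, 19, 21, 28
        N_COUNT = V_COUNT * T_COUNT          # 588
        S_COUNT = L_COUNT * N_COUNT          # 11,172

        def lvt_indices(syllable):
            s = ord(syllable) - S_BASE
            if not 0 <= s < S_COUNT:
                raise ValueError("not a precomposed Hangul syllable")
            return s // N_COUNT, (s % N_COUNT) // T_COUNT, s % T_COUNT

        # lvt_indices('\ud55c') -> (18, 0, 4): HIEUH + A + NIEUN ("han")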
    >
    > John Cowan <cowan at mercury dot ccil dot org> responded to Jungshik's
    > comment about Kana:
    >
    > > The SCSU paper claims that Japanese does *much* better in SCSU than
    > > UTF-16, thanks to the kana.
    >
    > The example in Section 9.3 would appear to substantiate that claim, as
    > 116 Unicode characters (= 232 bytes of UTF-16) are compressed to 178
    > bytes of SCSU.
    >
    > Back to Jungshik:
    >
    > >> Only the large number of spaces and full
    > >> stops in this file prevented SCSU from degenerating entirely to 2
    > >> bytes per character.
    > >
    > > That's why I asked. What I'm curious about is how SCSU and BOCU
    > > of NFD (and what I and Kent [2] think should have been NFD with the
    > > possible code point rearrangement of the Jamo block to facilitate a smaller
    > > window size for SCSU) would compare with uncompressed UTF-16 of NFC
    > > (SCSU/BOCU isn't much better than UTF-16). The back of an envelope
    > > calculation gives me 2.5 ~ 3 bytes per syllable (without the code
    > > point rearrangement to put them within a 64 character-long window [1])
    > > so it's still worse than UTF-16. However, that's not as bad as ~5
    > > bytes (or more) per syllable without SCSU/BOCU-1. I have to confess
    > > that I just have a very cursory understanding of SCSU/BOCU-1.
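
    For what it's worth, that back-of-the-envelope figure can be reproduced
    roughly as follows (the LV/LVT split and the switching overhead are my
    guesses, not measurements from the file):

        # Rough reconstruction of the 2.5 ~ 3 bytes/syllable estimate.
        lv_share, lvt_share = 0.5, 0.5   # syllables without / with a trailing jamo (assumed)
        jamos_per_syllable = 2 * lv_share + 3 * lvt_share   # 2.5 on average
        bytes_per_jamo = 1               # once a dynamic window covers the jamos in use
        switch_overhead = 0.3            # extra tag bytes, since the jamos span two windows

        print(jamos_per_syllable * bytes_per_jamo + switch_overhead)  # ~2.8, inside 2.5 ~ 3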
    >
    > When this file is broken down into jamos (NFD), SCSU regains its
    > supremacy:
    >
    > UTF-8:    17,092,140 bytes
    > BOCU-1:    8,728,553 bytes
    > SCSU:      7,750,957 bytes
    >
    > And you are correct that SCSU (and for that matter, BOCU-1) performance
    > would have been better if the jamos used in modern Korean had been
    > arranged to fit in a 128-character window (64 would not have been
    > necessary). As it is, SCSU does have to do some switching between the
    > two windows. Of course, no compression format applied to jamos could
    > even do as well as UTF-16 applied to syllables, i.e. 2 bytes per
    > syllable.
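
    The window arithmetic behind this is easy to check (my own sketch; the
    ranges below are the conjoining jamo that NFD emits for modern
    syllables):

        MODERN_L = range(0x1100, 0x1113)    # 19 leading consonants
        MODERN_V = range(0x1161, 0x1176)    # 21 vowels
        MODERN_T = range(0x11A8, 0x11C3)    # 27 trailing consonants

        modern_count = len(MODERN_L) + len(MODERN_V) + len(MODERN_T)
        span = 0x11C2 - 0x1100 + 1

        print(modern_count)   # 67  -> too many for a 64-character window,
                              #        but would fit one 128-character window
        print(span)           # 195 -> as allocated, they straddle two SCSU
                              #        windows, forcing the switching Doug sees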
    >
    > -Doug Ewell
    > Fullerton, California
    > http://users.adelphia.net/~dewell/


