From: Jungshik Shin (jshin@mailaps.org)
Date: Sat Nov 22 2003 - 21:50:44 EST
On Thu, 20 Nov 2003, Doug Ewell wrote:
> Jungshik Shin <jshin at mailaps dot org> wrote:
>
> >> In my experience, SCSU usually does perform somewhat better than
> >> BOCU-1, but for some scripts (e.g. Korean) the opposite often seems
> >> to be true.
> >
> > Just out of curiosity, which NF did you use for your uncompressed
> > source Korean text, NFC or NFD when you got the above result?
> > I guess I'll know in a week or so when your paper is out, but...
Thanks a lot for providing the details.
> although I did reproduce their results. The file they used, called
> "arirang.txt," contains over 3.3 million Unicode characters and was
> apparently once part of their "Florida Tech Corpus of Multi-Lingual
> Text" but subsequently deleted for reasons not known to me. I can
> supply it if you're interested.
It'd be great if you could.
> The file is all in syllables, not jamos, which I guess means it's in
> NFC.
Yes, it's in NFC, then.
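A quick sanity check along those lines, as a rough sketch in Python
(the file name and encoding below are just placeholders, not what you
actually used):

    text = open('arirang.txt', encoding='utf-16').read()
    syllables = sum(1 for c in text if 0xAC00 <= ord(c) <= 0xD7A3)  # precomposed
    jamos = sum(1 for c in text if 0x1100 <= ord(c) <= 0x11FF)      # conjoining jamo
    print(syllables, jamos)

If the conjoining-jamo count is zero and the syllable count is in the
millions, the file is effectively NFC as far as Hangul is concerned.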
> The statistics on this file are as follows:
>
> UTF-16 6,634,430 bytes
> UTF-8 7,637,601 bytes
> SCSU 6,414,319 bytes
> BOCU-1 5,897,258 bytes
> Legacy encoding (*) 5,477,432 bytes
> (*) KS C 5601, KS X 1001, or EUC-KR
Sorry to pick on this (when I really should be thanking you). Even
with the coded character set vs. character encoding scheme distinction
set aside (that is, thinking purely in terms of character repertoire),
KS C 5601/KS X 1001 _alone_ cannot represent ordinary Korean text
unless you're willing to live with double-width versions of the space,
Latin letters, numbers, and punctuation (and since you wrote that the
file apparently has full stops and spaces, which are ASCII, it does
include characters outside KS X 1001). On the other hand, EUC-KR
(KS X 1001 + ISO 646:KR/US-ASCII) can. Actually, I suspect the legacy
encoding used was Windows codepage 949 (or JOHAB/Windows-1361?),
because I can't imagine that, out of over 2 million syllables, not a
single one falls outside the character repertoire of KS X 1001.
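To illustrate the repertoire difference, here's a small sketch using
Python's 'euc_kr' and 'cp949' codecs. The syllable U+B620 is, if I
remember correctly, one of the modern syllables outside the 2,350
precomposed syllables of KS X 1001 but inside Windows codepage 949:

    s = '\ub620'   # a syllable believed to be outside KS X 1001
    try:
        s.encode('euc_kr')                # KS X 1001-based repertoire
    except UnicodeEncodeError:
        print('not representable in EUC-KR')
    print(s.encode('cp949'))              # Unified Hangul Code: succeeds

So if even one such syllable appears among the millions in the file,
the legacy encoding must have been CP949 (or JOHAB) rather than plain
EUC-KR.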
> I used my own SCSU encoder to achieve these results, but it really
> wouldn't matter which was chosen -- Korean syllables can be encoded in
> SCSU *only* by using Unicode mode. It's not possible to set a window to
> the Korean syllable range.
Now that you've told me it's in NFC, isn't this condition similar to
that of Chinese text? How do BOCU-1 and SCSU perform on Chinese text?
Japanese text might do slightly better thanks to kana, but isn't
likely to be much better.
> Only the large number of spaces and full
> stops in this file prevented SCSU from degenerating entirely to 2 bytes
> per character.
That's why I asked. What I'm curious about is how SCSU and BOCU-1
applied to NFD (and to what Kent and I [2] think NFD should have been,
with a possible code point rearrangement of the jamo block to allow a
smaller window size for SCSU) would compare with uncompressed UTF-16
of NFC (SCSU/BOCU-1 of NFC isn't much better than UTF-16). A
back-of-the-envelope calculation gives me 2.5 ~ 3 bytes per syllable
(without the code point rearrangement to put the jamos within a
64-character window [1]), so it's still worse than UTF-16 of NFC.
However, that's not as bad as the ~5 bytes (or more) per syllable that
NFD would take without SCSU/BOCU-1. I have to confess that I have only
a very cursory understanding of SCSU/BOCU-1.
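Here is how I arrive at that rough figure, as a sketch in Python; the
per-jamo byte costs in the comments are my own rough assumptions about
SCSU's single-byte mode and window-change bytes, not measurements:

    import unicodedata

    nfc = '\uc544\ub9ac\ub791 \uc544\ub77c\ub9ac\uc694'  # sample NFC text
    nfd = unicodedata.normalize('NFD', nfc)

    syllables = sum(1 for c in nfc if 0xAC00 <= ord(c) <= 0xD7A3)
    jamos = sum(1 for c in nfd if 0x1100 <= ord(c) <= 0x11FF)
    print(jamos / syllables)   # a bit over 2 jamos per syllable on average

    # Leading consonants (U+1100..) and vowels (U+1161..) can share one
    # 128-character SCSU window, while trailing consonants (U+11A8..)
    # need another, so figure ~1 byte per jamo plus a window-change byte
    # or two for each syllable with a trailing consonant: roughly
    # 2.5 ~ 3 bytes per syllable, against a flat 2 bytes per syllable
    # for UTF-16 of NFC.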
As for why I'm interested in NFD, one of the reasons is that it is
easier to implement incremental search (among many other things) in
terms of NFD. The only __editor__ I've ever seen that supports
incremental search for Korean per letter (as opposed to per syllable)
is Hemacs (Hangul Emacs). Not having per-letter incremental search in
Korean is like not being able to match 'apples/apple/apply',
'interesting/interested', and 'learned' with 'appl', 'interest', and
'learn'.
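To make the analogy concrete, here is a toy sketch in Python using
unicodedata (the strings are just examples): a query of one or two
letters fails to match the start of a syllable when compared as NFC,
but matches jamo by jamo once both sides are put in NFD.

    import unicodedata

    def starts_with_letters(text, query):
        # compare jamo by jamo rather than syllable by syllable
        nfd = lambda s: unicodedata.normalize('NFD', s)
        return nfd(text).startswith(nfd(query))

    print('\uac15'.startswith('\uac00'))            # False: NFC, 강 vs. 가
    print(starts_with_letters('\uac15', '\uac00'))  # True: ᄀ+ᅡ+ᆼ begins with ᄀ+ᅡ

That is exactly the 'appl' matching 'apples' situation above.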
Jungshik
[1] By encoding only basic consonants and vowels, we could have put
all Hangul letters in a 64-character block. With all modern consonants
and vowels encoded (whether complex or not) and only basic
consonants/vowels encoded for archaic letters, it'd be slightly too
large to fit into a 64-character block (19 leading consonants + 21
vowels + 27 trailing consonants already comes to 67). If we encoded
consonants just once (instead of twice, as leading and trailing, which
is similar to the Tibetan case), we could make it even more compact.
[2] I found that KIM Kyeong-seok (Korean representative to ISO/IEC
JTC1/SC22/WG20 and JTC1/SC2/WG2) had also proposed a similar change to
NFD. I'm not sure why it was submitted to JTC1 instead of the UTC (NOT
that we can remedy the problem now).