Re: Compression through normalization

From: Doug Ewell (dewell@adelphia.net)
Date: Wed Dec 03 2003 - 11:32:07 EST

Next message: Stefan Persson: "Re: MS Windows and Unicode 4.0 ?"

Previous message: Arcane Jill: "Free Fonts"
In reply to: Philippe Verdy: "RE: Compression through normalization"
Next in thread: Philippe Verdy: "Re: decomposable Hangul jamos (was: Compression through normalization)"
Reply: Philippe Verdy: "Re: decomposable Hangul jamos (was: Compression through normalization)"
Reply: Jungshik Shin: "Re: Compression through normalization"

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> I still think that we could try to use only LV syllables but not LVT
> syllables to reduce the set of Hangul characters used if this helps
> the final compressor.

Aha, LV syllables. Now we are talking about something that exists and
can be used in the manner you describe. It won't help SCSU or BOCU-1
compression, but it might improve the performance of a Huffman or
arithmetic implementation that can handle more than 256 characters, as
you stated below.

> It's true that the LV syllables are discontinuous in the large Hangul
> johab syllable block. But it could reduce the number of needed codes
> in compression lookup dictionaries and would limit the number of
> table resets by exhausting the lookup table less often, and it would
> also allow finding compressible similarities in the text stream at
> much shorter distances than within a text using a lot of LVT
> syllables. So the impact of the LV syllables being spread across the
> johab set would still be low.

True. Don't try it with SCSU, though, because you'd be constantly
jumping between single-byte and Unicode mode (or using four bytes for
every LVT syllable). And don't try it with BOCU-1, because every switch
between the jamos block and the syllable block will cost three bytes.

> I will try again to compress Korean using a modified NFC form that
> excludes LVT johab syllables, keeping only LV johab syllables and
> separate L, V or T jamos...

UAX #15 includes sample Java code showing, among other things, how to
compose an LV syllable plus a T jamo into an LVT syllable. It would be
relatively easy to reverse the logic, though of course the UAX does not
show that, because the result is neither NF(K)C nor NF(K)D.

Speaking of which, I just noticed that the function in SC UniPad to
compose syllables from jamos does not handle this case (LV + T = LVT).
I'll have to report that to the UniPad team.
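
For illustration, here is a minimal Python sketch of both directions:
splitting an LVT syllable into an LV syllable plus a T jamo, and
joining them again. The constants are the ones used by the Hangul
syllable arithmetic in the Unicode Standard; the function names
split_lvt and join_lv_t are my own, hypothetical ones, and this is not
the UAX #15 sample code. The split form is, as noted above, neither
NF(K)C nor NF(K)D.

```python
# Constants from the Hangul syllable arithmetic in the Unicode Standard.
S_BASE = 0xAC00   # first precomposed Hangul syllable
T_BASE = 0x11A7   # one before the first trailing-consonant (T) jamo
T_COUNT = 28      # 27 trailing consonants plus "no trailing consonant"
S_COUNT = 11172   # 19 * 21 * 28 precomposed syllables


def split_lvt(text):
    """Replace every LVT syllable with its LV syllable followed by a T jamo."""
    out = []
    for ch in text:
        s_index = ord(ch) - S_BASE
        if 0 <= s_index < S_COUNT:
            t_index = s_index % T_COUNT
            if t_index:  # nonzero means the syllable has a trailing consonant
                out.append(chr(ord(ch) - t_index))   # the LV syllable
                out.append(chr(T_BASE + t_index))    # the T jamo
                continue
        out.append(ch)
    return "".join(out)


def join_lv_t(text):
    """Reverse of split_lvt: merge an LV syllable and a following T jamo."""
    out = []
    for ch in text:
        if out:
            s_index = ord(out[-1]) - S_BASE
            t_index = ord(ch) - T_BASE
            is_lv = 0 <= s_index < S_COUNT and s_index % T_COUNT == 0
            if is_lv and 0 < t_index < T_COUNT:
                out[-1] = chr(ord(out[-1]) + t_index)
                continue
        out.append(ch)
    return "".join(out)
```

Because LV syllables sit at every 28th position of the syllable block,
a text run through split_lvt draws on at most 399 LV syllables plus 27
T jamos instead of up to 11172 distinct syllables.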

> I just have another question for Korean: many jamos are in fact
> composed from other jamos: this is clearly visible both in their name
> and in their composed glyph. What would be the linguistic impact of
> decomposing them (not canonically!)? Do Koreans really learn these
> jamos without breaking them into their components? I am thinking here
> of SSANG (double) consonants, or the initial Y or final E of some
> vowels...

This would be a good question for Jungshik or another native Korean. I
have read that Korean children learn the syllables as whole units,
rather than as an arrangement of jamos as I would see them, leading some
to think of Hangul as a featural syllabary instead of an alphabet.

> Of course I won't be able to use such decomposition in Unicode, but
> would it be possible to use it in some private encoding created with a
> m:n charset mapping from/to Unicode?

You can do absolutely anything you like in a private encoding. Bernard
Miller did:

http://www.bytext.org/

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/


This archive was generated by hypermail 2.1.5 : Wed Dec 03 2003 - 12:22:46 EST