From: Doug Ewell (dewell@adelphia.net)
Date: Wed Dec 03 2003 - 11:32:07 EST
Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:
> I still think that we could try to use only LV syllables but not LVT
> syllables to reduce the set of Hangul characters used if this helps
> the final compressor.
Aha, LV syllables. Now we are talking about something that exists and
can be used in the manner you describe. It won't help SCSU or BOCU-1
compression, but it might improve the performance of a Huffman or
arithmetic implementation that can handle more than 256 characters, as
you stated below.
> It's true that the LV syllables are discontinuous in the large Hangul
> johab syllable block. But it could reduce the number of needed codes
> in compression lookup dictionaries and would limit the number of
> table resets by exhausting the lookup table less often, and it would
> also allow finding compressible similarities in the text stream at
> much shorter distances than within a text using a lot of LVT
> syllables. So the impact of the scattered LV syllables in the johab
> set would still be low.
True. Don't try it with SCSU, though, because you'd be constantly
jumping between single-byte and Unicode mode (or using four bytes for
every LVT syllable). And don't try it with BOCU-1, because every switch
between the jamos block and the syllable block will cost three bytes.
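To put rough numbers on the reduction being discussed, here is a minimal sketch that counts the symbols a compressor would have to model. The constants come from the Hangul syllable arithmetic in the Unicode Standard; the names (S_BASE, is_lv and so on) are made up for this illustration, and the count ignores any standalone L or V jamos that might also occur in real text.

```python
# Hangul syllable arithmetic constants (Unicode Standard, conjoining jamo behavior).
S_BASE = 0xAC00                      # first precomposed syllable, U+AC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # all precomposed syllables

def is_lv(cp):
    """True if code point cp is a precomposed syllable with no trailing consonant."""
    return S_BASE <= cp < S_BASE + S_COUNT and (cp - S_BASE) % T_COUNT == 0

lv_syllables = [cp for cp in range(S_BASE, S_BASE + S_COUNT) if is_lv(cp)]
trailing_jamos = T_COUNT - 1         # U+11A8..U+11C2

print(S_COUNT)                                # 11172 syllables in the full block
print(len(lv_syllables))                      # 399 LV syllables
print(len(lv_syllables) + trailing_jamos)     # 426 symbols for LV + T
print(lv_syllables[1] - lv_syllables[0])      # 28: LV syllables are 28 code points apart
```

The last line shows why the LV syllables are discontinuous: each one is followed by its 27 LVT variants before the next LV syllable appears.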
> I will try again to compress Korean using a modified NFC form that
> excludes LVT johab syllables, keeping only LV johab syllables and
> separate L or V or T jamos...
UAX #15 includes sample Java code showing, among other things, how to
compose an LV syllable plus a T jamo into an LVT syllable. It would be
relatively easy to reverse the logic, though of course the UAX does not
show that because the result is neither NF(K)C nor NF(K)D.
Speaking of which, I just noticed that the function in SC UniPad to
compose syllables from jamos does not handle this case (LV + T = LVT).
I'll have to report that to the UniPad team.
> I just have another question for Korean: many jamos are in fact
> composed from other jamos: this is clearly visible both in their name
> and in their composed glyph. What would be the linguistic impact of
> decomposing them (not canonically!)? Do Koreans really learn these
> jamos without breaking them into their components? I think here about
> SSANG (double) consonants, or the initial Y or final E of some
> vowels...
This would be a good question for Jungshik or another native Korean. I
have read that Korean children learn the syllables as whole units,
rather than as an arrangement of jamos as I would see them, leading some
to think of Hangul as a featural syllabary instead of an alphabet.
> Of course I won't be able to use such decomposition in Unicode, but
> would it be possible to use it in some private encoding created with a
> m:n charset mapping from/to Unicode?
You can do absolutely anything you like in a private encoding. Bernard
Miller did:
-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/