RE: Still can't work out what's a "canonical decomp" vs a "compatibility decomp"

From: Jungshik Shin (jshin@mailaps.org)
Date: Fri May 09 2003 - 08:33:11 EDT

Next message: Edward C. D. Hopkins: "[Unicode] Suggestion to list owner"

Previous message: Jungshik Shin: "RE: Question ..."
In reply to: Marco Cimarosti: "RE: Still can't work out what's a "canonical decomp" vs a "compatibility decomp""

On Thu, 8 May 2003, Marco Cimarosti wrote:

> Jarkko Hietaniemi wrote:
> > Another potential Gedankenexperiment would of course be a
> > Cleanencoding, but I guess the WCode is already quite
> > good an attempt in that direction (though I must admit
> > that the WTF encoding makes me grimace a bit :-)
>
> Here is Markus' Wcode, for the benefit of new list members:
>
> http://www.mindspring.com/~markus.scherer/unicode/wcode.html

WCode, as it stands, is not 'clean' enough for me as far as the Korean
script is concerned.

     * WCode contains all Unicode characters except ones with a
       decomposition of any kind. Normalization on WCode only sorts
       combining characters in canonical order. (This removes some
       13000(?) characters from the BMP. WCode is mostly Unicode NFKD.)

If I could start from scratch, I'd remove all 'cluster Jamos' in the
U+1100 block in addition to the precomposed Hangul syllables (which are
already removed by the above provision). That leaves us with 17 (+ 4)
leading consonants, 11 medial vowels and 17 (+ 4) trailing consonants,
along with the leading Jamo filler (U+115F) and the vowel filler
(U+1160) [1], totalling 55 code points, down from over 12,000 code
points for the Korean script. That frees up a huge amount of code space
in the BMP for *much better* use. [2] It has the additional benefit of
making SCSU/BOCU better suited to Korean text represented in Jamos,
because all Jamos can fall within a single sliding window of SCSU/BOCU.
It also simplifies collation/sorting.
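
The tally above can be checked with a few lines of Python. This is only
a sketch: the counts are taken at face value from the message, the
variable names are made up for illustration, and the 128-code-point
window width is the size of an SCSU dynamic window.

```python
# Counts as stated in the message; the "17 + 4" split is used as given.
LEADING_CONSONANTS = 17 + 4
MEDIAL_VOWELS = 11
TRAILING_CONSONANTS = 17 + 4
FILLERS = 2  # leading Jamo filler U+115F and vowel filler U+1160

total = LEADING_CONSONANTS + MEDIAL_VOWELS + TRAILING_CONSONANTS + FILLERS

# An SCSU dynamic window covers 128 consecutive code points, while the
# Hangul Jamo block U+1100..U+11FF has 256 positions.
SCSU_WINDOW_SIZE = 128
JAMO_BLOCK_SIZE = 0x11FF - 0x1100 + 1

fits_in_one_window = total <= SCSU_WINDOW_SIZE
block_exceeds_window = JAMO_BLOCK_SIZE > SCSU_WINDOW_SIZE

print(total)                 # 55
print(fits_in_one_window)    # True
print(block_exceeds_window)  # True
```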

Jungshik

[1] We can cut down the number of code points further by encoding
consonants only once (and perhaps adding a trailing consonant filler).
That gives 35 code points. In this scheme, a regular Korean syllable
takes the form L+V+T+M?, where L, V and T include fillers. Similar
encodings were used in the mid-1980s on Korean Unix systems (before
KS C 5601-1987, now KS X 1001:1998).
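
A rough sketch of the footnote's scheme, in Python. The class labels
('C' for a consonant or consonant filler, 'V' for a vowel or vowel
filler, 'M' for the optional trailing mark) and the function name are
hypothetical; they only model the L+V+T+M? shape and are not taken from
any real encoding.

```python
import re

# 21 consonants encoded once, 11 vowels, and three fillers
# (leading consonant, vowel, trailing consonant).
CONSONANTS = 17 + 4
VOWELS = 11
FILLERS = 3
total = CONSONANTS + VOWELS + FILLERS

# With consonants encoded once, L and T are both class 'C'; position
# alone tells them apart because every syllable is exactly C V C M?.
SYLLABLE = re.compile(r"CVCM?")

def split_syllables(kinds):
    """Split a string of class labels into L+V+T+M? syllables."""
    syllables = []
    pos = 0
    while pos < len(kinds):
        match = SYLLABLE.match(kinds, pos)
        if match is None:
            raise ValueError("not a valid syllable at offset %d" % pos)
        syllables.append(match.group())
        pos = match.end()
    return syllables

print(total)                       # 35
print(split_syllables("CVCCVCM"))  # ['CVC', 'CVCM']
```

Because fillers occupy the empty slots, every syllable has a fixed
L, V, T prefix, which is what lets one consonant set serve both the
leading and the trailing position.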

[2] WCode already frees up 11,172 code points as it stands; my scheme
gives us back about 180-210 more.
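
For reference, the 11,172 figure is the number of precomposed Hangul
syllables, 19 x 21 x 28, and each of them decomposes arithmetically
into conjoining Jamos. A minimal Python sketch of that standard
arithmetic follows; the function name is just for illustration.

```python
import unicodedata

# Constants of the Unicode Hangul syllable decomposition arithmetic.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables

def decompose_hangul(syllable):
    """Decompose one precomposed Hangul syllable into L + V (+ T)."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    leading = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trailing = chr(T_BASE + t_index) if t_index else ""
    return leading + vowel + trailing

han = "\ud55c"  # U+D55C
jamos = decompose_hangul(han)
print(S_COUNT)                             # 11172
print([hex(ord(c)) for c in jamos])        # ['0x1112', '0x1161', '0x11ab']
print(jamos == unicodedata.normalize("NFD", han))  # True
```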

