RE: CJK combining components

Date: Wed Oct 18 2000 - 08:12:49 EDT

Doug Ewell wrote:
> Marco Cimarosti wrote:
> > Carl W. Brown:
> >> An article in the October 12, 2000 issue of Linux Weekly News
> >> tries to explain the benefit...
> Actually, that quote from Linux Weekly News came from me, not Carl.
> (I'm not trying to take credit for the research, just deflecting any
> criticism away from Carl.)

My mistake, sorry. And thanks to Doug for providing this info.

However, I was not criticizing that article (nor defending GCS!), but
rather annoying the list (once more!) about the pros and cons of treating
CJK characters as atomic units, as opposed to composed graphemes.

This topic is probably so boring because it is a chicken-and-egg problem: a
CJK ideograph is in fact a "character", just as any alphabetic letter is, but
it is also a "compound" that can be analyzed into smaller elements, much
like the jamos in a Hangul syllable, or the letters (and diacritics) in a

David Starner wrote:
> If you can decompose the CJK characters into pieces and automatically
> recompose them, what stops you from doing that for Unicode?

Yeah! Nothing can stop me! (Well, apart perhaps from time and budget
considerations, and the fact that I am not in the font business -- but
that's nobody's problem :-)

> The only problem is that you have to decompose the Unicode CJK
> characters yourself, and you still have the table look ups,
> but there's no need to carry around a huge font.

OK. But in a hypothetical encoding by components, this lookup wouldn't be
necessary at all.
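
For what it is worth, the table lookup David mentions might be sketched
roughly as follows. This is only an illustration under my own assumptions:
the DECOMPOSITION table, its three entries, and the decompose/compose
helpers are hypothetical names, and the component sequences are written
with the Ideographic Description Characters (U+2FF0, U+2FF1) purely as a
convenient notation, not as a real encoding by components.

```python
# Hypothetical decomposition table: each ideograph maps to a sequence made of
# a layout operator followed by its components. A real table would need
# thousands of entries; these three are only for illustration.
DECOMPOSITION = {
    "\u660e": "\u2ff0\u65e5\u6708",  # 明 = ⿰ 日 月 (left to right)
    "\u597d": "\u2ff0\u5973\u5b50",  # 好 = ⿰ 女 子 (left to right)
    "\u7537": "\u2ff1\u7530\u529b",  # 男 = ⿱ 田 力 (top to bottom)
}

# Reverse table, used to go from component sequences back to ideographs.
COMPOSITION = {seq: ch for ch, seq in DECOMPOSITION.items()}


def decompose(text):
    """Replace every ideograph found in the table with its components."""
    return "".join(DECOMPOSITION.get(ch, ch) for ch in text)


def compose(text):
    """Replace every known component sequence with its precomposed ideograph."""
    out = []
    i = 0
    while i < len(text):
        for seq, ch in COMPOSITION.items():
            if text.startswith(seq, i):
                out.append(ch)
                i += len(seq)
                break
        else:
            # No sequence starts here: copy the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

In an encoding by components the text would already be in the form that
decompose() returns, which is why neither table would be needed there.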

And in a hypothetical "mixed" encoding (i.e., having both precomposed
ideographs and combining elements), it would only be needed for
normalization (i.e. when you want the text to be either all precomposed or
all decomposed).
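
Unicode already behaves this way for Hangul, so the idea can be tried out
with existing tools. The sketch below uses Python's standard unicodedata
module; the function names are my own, and Hangul stands in for the
hypothetical CJK case, since Unicode defines no such decompositions for
CJK unified ideographs.

```python
import unicodedata


def all_decomposed(text):
    """Normalize to NFD: Hangul syllables become sequences of jamos."""
    return unicodedata.normalize("NFD", text)


def all_precomposed(text):
    """Normalize to NFC: jamo sequences become precomposed syllables."""
    return unicodedata.normalize("NFC", text)


syllable = "\ud55c"  # HANGUL SYLLABLE HAN
jamos = all_decomposed(syllable)

# The precomposed syllable is one code point, the decomposed form is three
# (U+1112, U+1161, U+11AB), and normalizing back gives the original.
print(len(syllable), len(jamos), all_precomposed(jamos) == syllable)
```

A text mixing both forms can be passed through either function to make
it uniform, which is the only place where the lookup comes in.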

> Even if you have to work with preexisting Unicode technology,
> you could still make the font using that technology instead of doing
> everything by hand.

Yes, I see your point: even granting that ideographic decomposition really
is useful, that usefulness does not necessarily have to lie in the encoding.

This is true, and a good point, but not necessarily a definitive argument
against the theoretical possibility of a decomposed encoding.

Compatibility with existing practice is the only argument that convinces
me (sort of) that Unicode provides the best possible encoding for CJK

_ Marco
