Chinese Hemigram Analysis (was Re: When is glyph decomposition warranted?)

From: Edward Cherlin (edward.cherlin.sy.67@aya.yale.edu)
Date: Sun Aug 29 1999 - 20:23:39 EDT


At 02:54 -0700 8/29/1999, Jon Babcock wrote:
>This subject touches on points, related to the Chinese script, that I
>have been mulling over since joining the list a few years ago.

You're definitely not the only one.

>Note
>that although it commands the majority of code points in the Unicode
>standard, Chinese still can not be fully represented using this
>standard. And this is not the omission of a few rare details, but the
>inability to represent thirty thousand Chinese graphs that are already
>found in the lexicons, plus any newly invented graphs of the future.
[snip]

>But I
>wonder if part of the problem in dealing with Chinese has not been
>confusion over this question, "When is glyph decomposition warranted?"

>Dean Snyder writes:
>
>> I tentatively suggest then, for a human language encoding scheme such as
>>Unicode (ignoring for the moment the graphic and dingbat symbol areas),
>>that glyph decomposition based upon purely visual criteria is, in
>>general, not useful, whereas glyph decomposition based upon linguistic
>>criteria MAY be useful. And the decision whether to decompose or not will
>>be based both on one's definition of "utility" and on the levels of
>>meaningful discreteness desired in the encoding.
[snip]

>For some future version of the Unicode standard, it would be nice if the
>big job of hemigramic analysis were carried out so that all the
>hemigrams of Chinese that were not already in Unicode could be
>included. Then Unicode could be used to indicate any Han character
>behind any Han glyph, even newly invented ones. In other words, it could
>be used to fully represent the Chinese script.

I first encountered research on defining CJK fonts by means of hemigrams in
1990, when I was writing a market research study of non-Latin fonts and
font technology. URW was demonstrating the idea at the Seybold conference.
It was not clear then, and it is not clear now, how well algorithmic
rendering could be done on a sequence of hemigrams plus positioning data
and hints. It was also not clear how to carry out some parts of the
analysis, given the breadth of styles in the historic record and in current
use. Some of us would like bronze, seal, and oracle bone fonts in addition
to the usual brush styles. And then, what about "grass" calligraphy?
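
To make "a sequence of hemigrams plus positioning data" concrete, here is a minimal sketch in Python. It is my own illustration, not URW's method: the names Composite, LAYOUTS, describe and leaves are hypothetical, only two layouts are modeled, and I am assuming the Ideographic Description Characters U+2FF0 (left to right) and U+2FF1 (above to below) as the serialized layout operators. It says nothing about hints or rendering quality, which is where the hard problems are.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

# Hypothetical layout names mapped to Ideographic Description Characters.
# Only two of the possible layouts are modeled in this sketch.
LAYOUTS = {
    "left_right": "\u2FF0",   # parts placed side by side
    "top_bottom": "\u2FF1",   # parts stacked vertically
}


@dataclass(frozen=True)
class Composite:
    """A character described as a layout applied to component parts."""
    layout: str
    parts: Tuple["Node", ...]


# A node is either a single component (a one-character string)
# or a nested Composite.
Node = Union[str, Composite]


def describe(node: Node) -> str:
    """Serialize a decomposition tree as a prefix-notation string."""
    if isinstance(node, str):
        return node
    return LAYOUTS[node.layout] + "".join(describe(p) for p in node.parts)


def leaves(node: Node) -> List[str]:
    """List the leaf components of a decomposition, in reading order."""
    if isinstance(node, str):
        return [node]
    found: List[str] = []
    for part in node.parts:
        found.extend(leaves(part))
    return found


# 明 is 日 beside 月; 盟 is 明 above 皿.
ming = Composite("left_right", ("日", "月"))
meng = Composite("top_bottom", (ming, "皿"))

print(describe(meng))
print(leaves(meng))
```

A tree like this records which components occur and how they are arranged, but not their proportions or stroke adjustments in a given style, which is the part that was unclear in 1990.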

>As Peter A. Boodberg wrote 45 years ago,
>
>"The number of graphemes [of the Chinese script] runs from 500 to 800,
>estimated on a purely graphic [visual] basis, and to over 2000, if
>reckoned on an organic-structural, historical, and phonosemantic basis.
>These form in bidimensional combinations a graphicon of some 50,000
>graphs or lexigrams (of which only about 10,000 are in common use.)"
>_Cedules from a Berkeley Workshop in Asiatic Philology_, 015-541120.
>[Out of print.]
>
>I think a case can be made that these 2000 or so Chinese graphemes are
>what could be found "useful, both for cultural and computational
>reasons" and that future versions of Unicode would benefit by supporting
>their use in the decomposition of Chinese glyphs.
>
>Jon
>
>--
>Jon Babcock <jon@kanji.com>

I would have uses for such graphemes, and others, for example in
documenting shape-based IMEs such as Cangjie, or describing Chinese
etymology. I have a copy of _Cangjie Shurufa Step by Step_ (ISBN
957-708-551-2) right here on my desk, so let's see. On page 63 is a chart
showing the 24 basic Cangjie characters, and for each one, the graphic
elements it can represent. The complete list contains 109 shapes. These
shapes are used throughout the book wherever characters are dissected and
then mapped to Cangjie sequences. Of course, these shapes are not all of
the same kind as the hemigrams you are asking for.
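
As a rough illustration of what such a mapping looks like in practice, here is a small Python sketch. The key table lists the 24 basic Cangjie characters as assigned to keyboard letters; it does not reproduce the 109 auxiliary shapes from the book's chart. The function name spell and the three sample codes are chosen by me for illustration, and real Cangjie decomposition rules are considerably more involved than a table lookup.

```python
# The 24 basic Cangjie characters, keyed by keyboard letter.
CANGJIE_KEYS = {
    "A": "日", "B": "月", "C": "金", "D": "木", "E": "水", "F": "火",
    "G": "土", "H": "竹", "I": "戈", "J": "十", "K": "大", "L": "中",
    "M": "一", "N": "弓", "O": "人", "P": "心", "Q": "手", "R": "口",
    "S": "尸", "T": "廿", "U": "山", "V": "女", "W": "田", "Y": "卜",
}

# A few sample codes for simple characters (illustrative subset only).
SAMPLE_CODES = {
    "明": "AB",    # 日 + 月
    "林": "DD",    # 木 + 木
    "好": "VND",   # 女 + 弓 + 木
}


def spell(code: str) -> str:
    """Turn a Cangjie key sequence into its basic-character names."""
    return "".join(CANGJIE_KEYS[key] for key in code.upper())


for char, code in SAMPLE_CODES.items():
    print(char, code, spell(code))
```

Note that 好 comes out as 女弓木: the 子 on the right is entered as shapes standing for 弓 and 木, which shows why these input-method shapes are not the same thing as etymological components.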

IMHO, convincing the entire computer industry to represent CJK data in
graphemes can only be done if someone comes out with an implementation that
clearly matches the quality of precomposed forms and saves labor, disk and
RAM space, and especially money, and then licenses it on favorable terms.

Convincing committees to add these graphemes to Unicode might be much less
difficult, but would still require evidence of an implementation in actual
use, with verifiable benefits, that could pass a scholarly review.

--
Ed Cherlin    <edward.cherlin.sy.67@aya.yale.edu>
"Everything should be made as simple as possible,
_but no simpler_."  Attributed to Albert Einstein


