IDS rendering and IDS analysis (was RE: Problems/Issues with CJK and Unicode)

From: Marco.Cimarosti@icl.com
Date: Mon Apr 10 2000 - 08:12:27 EDT


Jon Babcock wrote:
> And the Description
> Characters (p.565) call attention to the various two-dimensional
> arrangements that the hemigrams may take. This is very handy. If
> this were extended to the phonetic hemigrams at some point in
> the future, then you could represent all kanji with a couple
> thousand code points. I found that possibility attractive.

Perhaps someone new to the Unicode List may need a little background about
this? Sorry for causing the majority to yawn.

In Unicode 3.0 there is a new thing that sounds very much like an analysis
mechanism for CJK ideographs: the Ideographic Description Sequences.

Officially, IDS is not an analysis mechanism at all, because it cannot
(should not) be used to "decompose" ideographs that are already encoded. Its
intended usage, as the name states, is to provide interim "descriptions" for
ideographs that are not (yet) encoded in Unicode. I.e., it is a sort of
small meta-language for describing the shape of rare Chinese characters
while avoiding the ambiguity of a natural-language description.

The core of IDS is a handful of prefix operators (called IDCs:
http://charts.unicode.org/Web/U2FF0.html) that describe the geometric
relationship between two or three operands (side by side, stacked
vertically, etc.). Each operand may be, recursively, another IDS, or one of
the encoded ideographs. It can also be one of the ad-hoc components provided
for this purpose (called "radicals":
http://charts.unicode.org/Web/U2F00.html and
http://charts.unicode.org/Web/U2E80.html).
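The recursive structure just described can be sketched as a tiny parser.
This is only an illustration (the function names and the arity table are my
own invention, not any standard API): the IDCs U+2FF2 and U+2FF3 take three
operands, the other ten (U+2FF0..U+2FFB) take two, and any other character
is treated as a leaf operand.

```python
# Hypothetical sketch of an IDS parser; names and structure are my own.
# IDCs are prefix operators: U+2FF2 and U+2FF3 take three operands,
# the remaining ten IDCs in U+2FF0..U+2FFB take two.

TERNARY = {'\u2FF2', '\u2FF3'}
BINARY = {chr(c) for c in range(0x2FF0, 0x2FFC)} - TERNARY

def parse_ids(s, i=0):
    """Parse one IDS starting at index i; return (tree, next index)."""
    ch = s[i]
    if ch in BINARY or ch in TERNARY:
        arity = 3 if ch in TERNARY else 2
        operands = []
        i += 1
        for _ in range(arity):
            # Each operand may recursively be another IDS.
            node, i = parse_ids(s, i)
            operands.append(node)
        return (ch, operands), i
    # A leaf: an encoded ideograph or one of the ad-hoc "radicals".
    return ch, i + 1

def is_valid_ids(s):
    """True if s is exactly one well-formed IDS, with no leftovers."""
    try:
        _, end = parse_ids(s)
        return end == len(s)
    except IndexError:
        return False  # an operator ran out of operands
```

For example, U+2FF0 (left-to-right) followed by two ideographs is a
complete IDS, while U+2FF0 followed by only one operand is not.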

The first time I enquired about this on the Unicode List, the task of
"understanding" an IDS (i.e. parsing it and figuring out what the described
ideograph actually looks like) rested explicitly with the human reader of
the text.

However, the 3.0 book and other sources now timidly mention "IDS rendering".
This means that a Unicode display engine has the faculty (but *not* the
obligation!) to generate a glyph on the fly, and display it in place of the
IDS itself.

This naturally leads to Jon's thought: what if IDS, or a similar mechanism,
is generalized to all CJK characters? Wouldn't it be possible to encode any
CJK text with only a handful of combining logical units? Or, alternatively,
wouldn't it be possible to design "light" CJK fonts, containing only glyphs
for the basic graphic units?

John Cowan wrote:
> A list of all phonetics would certainly be useful, and might
> (I speak without any authority) be considered by UTC/WG2 for
> inclusion in a future version of the standard.

Lists of phonetics are certainly interesting for scholars studying the
history of Chinese and its writing. And indeed, you can find such
(tentative) lists on many books about this subject. (Among the most
interesting researchers of the past was the Swedish scholar Bernhard
Karlgren, who used phonetic components as one of the keys to reconstruct the
pronunciation of Ancient Chinese).

But I don't see how such a list could be of any use for (forgive the term) a
"synchronic" usage of CJK characters. If what you want is to encode the
structure of CJK characters, or build CJK fonts, or design and use CJK input
methods, or just learn the ideographs, you don't need to worry about whether the
"horse" component (U+2FBA or U+99AC) that you see in a particular ideograph
is a "radical", a "phonetic", or something else.
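Incidentally, the duplication between those two "horse" code points is
recorded in the character properties themselves: the Kangxi radical at
U+2FBA carries a compatibility decomposition to the unified ideograph
U+99AC, so NFKC normalization folds one into the other. A quick check in
Python:

```python
import unicodedata

# U+2FBA KANGXI RADICAL HORSE vs. U+99AC, the unified ideograph "horse".
# The radical code point has a compatibility decomposition to the
# ideograph, so NFKC normalization maps the former to the latter.
radical = '\u2FBA'
ideograph = '\u99AC'

assert radical != ideograph  # two distinct code points...
assert unicodedata.normalize('NFKC', radical) == ideograph  # ...one shape
```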

What I am trying to say is that there are several different ways to analyze
CJK ideographs, and none of them is one-size-fits-all. Depending on what you
want to do, you may choose different:

1) Atomic units. E.g.: "strokes", "components", "hemigraphs" (aka "side
components"), etc.

2) Analysis methods. E.g.: following or not the calligraphic order,
privileging or not some components over others (the so-called "radical");
acknowledging or not the "phonetic" or "ideographic" nature of some
components; considering or not the geometric position of elements, etc.

John Cowan wrote:
> Furthermore,
> anyone who wanted to undertake the heroic effort of generating
> decompositions for all the Unicode 3.0 hanzi would pile up
> massive amounts of Uni-geek cred.

How much is a Uni-geek cred worth in euros? Depending on the exchange rate,
even some anti-heroes could be interested in the business.

John Cowan wrote:
> However, I think there is rather less agreement about a canonical
> list of all phonetics than there is about radicals?

Just because radicals are used to sort dictionaries. For this purpose, a
list of 214 "radicals" was standardized centuries ago, and is still
sometimes used today.

But, out of the realm of lexicography, the concept of "radical" is very very
fuzzy. Many Western scholars prefer the term "signific", to mean any
component (not necessarily the dictionary's radical) that is thought to
convey meaning. Far Eastern scholars traditionally use more sophisticated
categories to analyze ideographs: e.g. pictographic compounds, abstract
logical compounds, categorized phonetic compounds, phonetic loans,
misspellings, etc.

I personally prefer to use the generic term "component", and to pretend I
wasn't listening whenever someone asks me to come up with a precise
definition for it :-)

_ Marco



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:01 EDT