Re: Problems/Issues with CJK and Unicode

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Apr 07 2000 - 18:53:16 EDT


Jon Babcock wrote:

> My problems with CJK Unicode are
>
> 1) that often one Han 'character' is mapped to two or three or
> more code points. IOW, CJK unification didn't go far enough.

The CJK encoding in the Unicode Standard is *not* meant to be
the definitive lexicographical treatment of Han characters (zi4).
It is meant to serve as the basis for a practical computer text
encoding for text making use of Han characters. As such it made
a number of practical choices in the unifications to allow it
to coexist with other widespread East Asian legacy encodings:

  A. Distinctions maintained in the original, normative source
     sets (themselves major national standards, mostly) were
     maintained in Unicode via the source separation rule.

  B. Compatibility characters were also added for round-trip
     conversion to other sources, despite the manifest
     appearance of such characters as "duplicates" under the
     basic unification rules.

  C. Traditional/simplified distinctions were maintained as
     separate characters, despite the known relations between
     such characters.
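
Points B and C can be observed directly in the character data. The following is a minimal sketch, assuming Python's standard `unicodedata` module. The helper names `folds_to` and `stays_distinct` are hypothetical, and U+F900/U+8C48 is just one compatibility pair picked for illustration. It checks that a compatibility ideograph folds onto its unified counterpart under canonical normalization, while a traditional/simplified pair (U+8FAD and U+8F9E) stays separate.

```python
import unicodedata


def folds_to(compat: str, unified: str) -> bool:
    """True if canonical normalization (NFC) maps `compat` onto `unified`."""
    return unicodedata.normalize("NFC", compat) == unified


def stays_distinct(a: str, b: str) -> bool:
    """True if two characters still differ after NFKC normalization."""
    return unicodedata.normalize("NFKC", a) != unicodedata.normalize("NFKC", b)


if __name__ == "__main__":
    # Point B: U+F900 is a compatibility ideograph that duplicates U+8C48.
    print(unicodedata.name("\uF900"), "->", unicodedata.name("\u8C48"))
    print("folds under NFC:", folds_to("\uF900", "\u8C48"))

    # Point C: traditional U+8FAD and simplified U+8F9E are separate characters.
    print("remain distinct:", stays_distinct("\u8FAD", "\u8F9E"))
```

The compatibility character survives only as long as the text is left unnormalized, which is what round-trip conversion needs. No normalization form relates the traditional and simplified forms to each other.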

As for many other aspects of the architecture of the Unicode
Standard, aiming for "perfection" here -- in this case, unifying
the Han characters completely, regardless of their status in
legacy source sets -- would have so maimed the standard as to
render it useless for its intended purpose.

This position is not merely a quirk of the UTC. The way the
unification was defined was the result of the work of the IRG,
chaired by China and with participants from all the East Asian
countries using Han characters, as well as the U.S.

>
> 2) that a Han 'character', i.e. a lexicographic unit (lexeme, a
> dictionary entry), is confused with a 'character' of Latin
> script, or of a syllabic script, like kana.

Even this statement is controversial. Chinese lexicography works
at different levels. The lexicographic unit of a dictionary of
zi4 (U+5B57) is a zi4, but that does not mean that a zi4 equates
to a lexeme. The lexicographic unit of a dictionary of ci2 (U+8FAD
= U+8F9E) is a ci2, and that *does* correlate fairly closely
with a lexical word, or lexeme.

Phonologically, the zi4 corresponds quite closely to the syllable
in Chinese, as well as in most of the languages which borrowed
Chinese vocabulary, since they borrowed the Chinese syllables
along with the word(s) and the characters used to write them.

Semantically, the zi4 corresponds reasonably well with the
morpheme in Chinese, although Chinese has a significant number
of bisyllabic morphemes that are written with pairs of
characters. In the borrowing languages, the situation gets
more complex, since for multimorphemic word borrowings, it is
difficult to say just how far the borrowing language
remorphologizes the Chinese borrowing and makes the morphemes
part of their own stock for further lexical building. Many
times they do, but in other instances they do not.

Orthographically, the zi4 can clearly be considered to have
graphemic status. They are clearly units of the writing
system(s), as demonstrated by the long history of writing and
typography using them. On the other hand, all users of the
system are aware that there are significant graphic pieces of
the characters, and of those pieces, the radicals surely have,
in and of themselves, graphemic status. The other pieces are
more debatable as to their status. The "phonetics" of Chinese
characters do not have obvious status. Strokes, on the other
hand, are taught, enumerated, and have significant status
since they are used as an access mechanism for dictionaries,
as well as being the writing units for the characters themselves.

> A character of a
> Latin script or a syllabic script goes to make up a
> lexicographic unit (a word, usually). The corresponding
> animal for Chinese would be the graphemes that go to make up
> the lexicographic unit. (The best choice for these might be
> the 2000 or so hemigrams (half graphs) that either stand
> alone as a 'lexeme' or combine with each other to compose all
> Chinese 'lexemes'.)

The "hemigrams" (which, by the way, do not exhaust all the
pieces you would need to combine to construct all the zi4) have
little claim to status as graphemes, compared to the zi4 themselves.

>
> 3) Because the elements of the script (the graphemes or the
> hemigrams) were not encoded as the 'characters' of Chinese,
> the majority (only in terms of quantity, not frequency of
> use) of Chinese lexemes cannot be represented by Unicode
> without recourse to the private use area and even then, there
> will still be thousands left out.

This claim is not supportable. With the recent addition of
Vertical Extension A and the imminent addition of Vertical Extension B
in Plane 2, nearly all the obvious sources of Chinese characters
(zi4), and even most of the nonobvious ones, modern and classical,
will have been represented. In addition to all the common-use, modern characters
found in the main segment of Unified Han characters, Vertical
Extension B now fills out all the major lexicographical sources:
Kangxi Dictionary, Han Yu Da Zi Dian, Ci Yuan, Ci Hai, Hanyu Da Cidian,
the Chinese Encyclopedia, the Siku Quanshu, characters from CNS
11643, planes 4, 5, 6, 7, and 15, and additional Han characters from
Hong Kong, Japan, Korea, and Vietnam, as well as others.

Furthermore, the claim that "the majority of Chinese lexemes cannot
be represented by Unicode without recourse to the private use area"
is simply wrong because of the incorrect usage of the term "lexeme"
there. But even if you were to substitute "ideograph" for "lexeme"
above, the claim is not tenable, given the inclusion of Vertical
Extension B, to pick up most of the remaining rare, mistaken,
variant, and otherwise strange forms, along with a relatively small
number of current use characters omitted from the URO and Vertical
Extension A.
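
As a rough illustration of where these repertoires sit in the code space, here is a minimal sketch that classifies a character by block. The block boundaries are an assumption: they are the ranges in the published Unicode Character Database, not figures stated in this message, and Extension B had not yet been published when this was written. The names `CJK_BLOCKS`, `han_block`, and `utf16_units` are hypothetical.

```python
# Block ranges as published in the Unicode Character Database (assumed here,
# not given in the message above).
CJK_BLOCKS = [
    ("URO", 0x4E00, 0x9FFF),             # main CJK Unified Ideographs block
    ("Extension A", 0x3400, 0x4DBF),     # "Vertical Extension A", in the BMP
    ("Extension B", 0x20000, 0x2A6DF),   # "Vertical Extension B", in Plane 2
]


def han_block(ch: str):
    """Return the label of the unified-Han block containing `ch`, or None."""
    cp = ord(ch)
    for label, lo, hi in CJK_BLOCKS:
        if lo <= cp <= hi:
            return label
    return None


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units needed to encode `ch` in UTF-16."""
    return len(ch.encode("utf-16-be")) // 2


if __name__ == "__main__":
    for ch in ("\u5B57", "\u3400", "\U00020000"):
        print(f"U+{ord(ch):04X}", han_block(ch), utf16_units(ch))
```

Characters in Plane 2 need a surrogate pair in UTF-16, so software that assumes one 16-bit unit per character will mishandle Extension B even though the characters themselves are encoded.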

--Ken


