RE: Looking up han characters

From: Marco.Cimarosti@icl.com
Date: Thu Jun 29 2000 - 15:07:13 EDT


Robert Lozyniak wrote:
> How do I look up a han character if I don't know its
> codepoint? What if all I have is its shape, or its
> EUC-JP or Shift-JIS number? There are a couple I
> want to see.

If you know the value in JIS (or any other encoding), you just need to look
it up in a conversion table. There are plenty available on the net; the
official Unicode tables for JIS are in:

        ftp://ftp.unicode.org/Public/MAPPINGS/EASTASIA/JIS/
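
If a Python interpreter is at hand, its built-in codecs can stand in for the
printed conversion table. The following is only a rough sketch: the helper
name code_point() is made up for this illustration, and the sample bytes
(the kanji "kan" of "kanji" in EUC-JP and in Shift-JIS) are my own choice,
not taken from the question.

```python
def code_point(hex_bytes: str, encoding: str) -> str:
    """Decode one character from its legacy byte value and return U+XXXX.

    hex_bytes is the encoded value written as hex digits, e.g. "B4C1".
    encoding is a Python codec name such as "euc_jp" or "shift_jis".
    """
    text = bytes.fromhex(hex_bytes).decode(encoding)
    if len(text) != 1:
        raise ValueError(f"expected exactly one character, got {len(text)}")
    return f"U+{ord(text):04X}"


if __name__ == "__main__":
    # The same kanji, given once as EUC-JP and once as Shift-JIS.
    print(code_point("B4C1", "euc_jp"))
    print(code_point("8ABF", "shift_jis"))
```

Both calls should report the same code point, since they name the same
character in two different encodings.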

If you just have the glyph (e.g., you see it in a newspaper, or on a
tattoo), then you have a more general problem: "how do I look up a
hanzi/kanji in a dictionary, if I don't know the pronunciation?"

There are several different shape-based indexing methods, but only the
following two are widely used:

1. The stroke count method

Background: the sequence of pen strokes needed to trace each hanzi, as well
as their direction and shape, is codified by the rules of Chinese
calligraphy. These rules have to be strictly observed by everybody, not just
to have a "nice hand", but because violating them would result in
unreadable characters.

        a) Count the number of strokes that are needed to trace the
character: all characters having the same count are sorted together in a
specific section.
        b) Identify the type of the first stroke (about eight types: e.g.
horizontal line, vertical line, dot, angle, etc.): within the main
stroke-count section, all characters beginning with that type of stroke are
grouped together.
        c) If there are many characters with the same count and same first
stroke, repeat point (b) for the second stroke, etc.
        d) You found it! (Depending on the dictionary, you now have the
romanization or the page number, so you can now go to the body of the
dictionary).
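
The ordering described in points (a) to (c) amounts to sorting on the pair
(stroke count, sequence of stroke types). Here is a minimal sketch of that
idea. Everything in it is an assumption made for illustration: the five
stroke classes and their order (real dictionaries use about eight types and
differ among themselves), the handful of hand-entered characters, and the
function names.

```python
# Stroke classes in an ASSUMED sort order; real dictionaries differ.
STROKE_ORDER = ["horizontal", "vertical", "left-falling", "dot", "angle"]

# Illustrative entries: character -> stroke classes in writing order.
# Right-falling strokes are lumped together with "dot" in this toy scheme.
STROKES = {
    "一": ["horizontal"],
    "十": ["horizontal", "vertical"],
    "人": ["left-falling", "dot"],
    "三": ["horizontal", "horizontal", "horizontal"],
    "土": ["horizontal", "vertical", "horizontal"],
    "大": ["horizontal", "left-falling", "dot"],
    "口": ["vertical", "angle", "horizontal"],
    "女": ["angle", "left-falling", "horizontal"],
}


def sort_key(char):
    """Order by stroke count first, then by the type of each stroke in turn."""
    strokes = STROKES[char]
    return (len(strokes), [STROKE_ORDER.index(s) for s in strokes])


def build_index():
    """Return all characters in stroke-count index order."""
    return sorted(STROKES, key=sort_key)


def candidates(count, first_strokes):
    """Steps (a)-(c): filter by count, then by the first stroke types."""
    prefix = list(first_strokes)
    return [
        c for c in build_index()
        if len(STROKES[c]) == count and STROKES[c][:len(prefix)] == prefix
    ]


if __name__ == "__main__":
    print(build_index())
    print(candidates(3, ["horizontal"]))              # step (b)
    print(candidates(3, ["horizontal", "vertical"]))  # step (c)
```

Each extra stroke type passed to candidates() narrows the list, which is
what repeating point (b) for the second stroke does on paper.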

2. The radical method

Background: most hanzi are formed by two "components", which are hanzi
themselves squeezed to fit in a single square. The first component is the
"radical" (or "signific", or "key"), and represents the general meaning of
the compound (e.g. the hanzi for "mama/mother" has a radical "female"
because, broadly speaking, a mother is a woman). The other component is the
"phonetic", and it gives a hint about the pronunciation (e.g., the "mama"
hanzi above has a "horse" phonetic, because both "mama" and "horse" sound
"ma" in Chinese, although with different tones).

        a) Look at the various parts of the hanzi, and identify the radical.
Sadly, there are no precise rules for this (although many radicals are
easily recognized by having a fixed position, which is often the left side
or top half).
        b) Look up the radical in the radical index. There is no standard
list of radicals: each dictionary has its own choice; however, the set of
214 radicals used by an early 18th century dictionary (the famous "Kangxi
Zidian") is very well known, and has been used for many other dictionaries.
The radical index itself is ordered by the stroke count method, and it gives
you a radical number.
        c) Go to the main index and find the section corresponding to that
radical number.
        d) Count the number of strokes of the remaining part (i.e. the
hanzi's total strokes minus the radical's strokes). Within the main radical
section, characters are ordered, again, with the stroke-count method based
on the remaining strokes.
        e) Accept the facts: your hanzi is not there! Your assumption about
what component constituted the radical was wrong, so go back to point (a)
and try again...
        f) You found it!

Radical indices found in dictionaries are often highly redundant, i.e. all
non-obvious characters are indexed under more than one radical, in order to
minimize the occurrence of the problem at point (e) above.
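
The guess-and-retry loop of points (a) to (f), and the effect of a redundant
index, can be sketched as follows. This is a toy model under stated
assumptions: the index holds only three hand-entered characters, the function
name look_up() is invented here, and the "is it there?" test compares against
a known target character, whereas with a paper dictionary you compare shapes
by eye.

```python
# Kangxi radical number -> stroke count of the radical itself.
RADICAL_STROKES = {38: 3, 187: 10}  # 38 = "female", 187 = "horse"

# (radical number, remaining strokes) -> characters in that sub-section.
MAIN_INDEX = {
    (38, 3): ["好"],
    (38, 10): ["媽"],
    (187, 2): ["馮"],
}


def look_up(target, total_strokes, radical_guesses, index=MAIN_INDEX):
    """Try each guessed radical in turn, as in steps (a) to (f).

    Returns (radical, remaining_strokes, guesses_tried); the first two are
    None when no guess leads to the target.
    """
    tried = []
    for radical in radical_guesses:
        tried.append(radical)
        remaining = total_strokes - RADICAL_STROKES[radical]  # step (d)
        if target in index.get((radical, remaining), []):
            return radical, remaining, tried                  # step (f)
        # step (e): not there, so loop and try the next guess
    return None, None, tried


if __name__ == "__main__":
    # First guess "horse" fails, second guess "female" succeeds.
    print(look_up("媽", 13, [187, 38]))

    # A redundant index also lists the character under the wrong guess,
    # so the first attempt already finds it.
    redundant = dict(MAIN_INDEX)
    redundant[(187, 3)] = ["媽"]
    print(look_up("媽", 13, [187, 38], index=redundant))
```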

The three blocks of ideographs in Unicode are ordered with the radical
method, using the classical 214 Kangxi keys. So, theoretically, if you are
provided with a printed list of the 214 keys, you could work through the
Unicode charts directly. In practice, however, this is impossible because
(I) you don't have the section (radical) and sub-section (count) headings in
the Unicode table, (II) the Kangxi order of Unicode blocks is not very
consistent, especially when simplified characters pop in, and (III) you
have, of course, no redundancy to stop you from looping over and over on
wrong assumptions.
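
As a side note, a list of the 214 keys can also be produced by program.
Unicode has a separate Kangxi Radicals block starting at U+2F00, with the
radicals in Kangxi order, so radical number n sits at U+2F00 + n - 1. The
sketch below relies on that block and on Python's unicodedata module; the
helper name kangxi_radical() is made up for this example.

```python
import unicodedata


def kangxi_radical(number: int) -> str:
    """Return the Kangxi Radicals block character for radical 1..214."""
    if not 1 <= number <= 214:
        raise ValueError("Kangxi radical numbers run from 1 to 214")
    return chr(0x2F00 + number - 1)


if __name__ == "__main__":
    for n in (38, 187):
        radical = kangxi_radical(n)
        # NFKC maps the radical character to the ordinary ideograph
        # that has the same shape.
        ideograph = unicodedata.normalize("NFKC", radical)
        print(n, f"U+{ord(radical):04X}", unicodedata.name(radical),
              f"U+{ord(ideograph):04X}")
```

This only gives you the section names; it does not solve problems (I) to
(III), because the ideograph blocks themselves still carry no headings.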

The Unicode book contains a proper radical index (with redundancy, and all
the rest) to help you locate ideographs. Sadly, it does not contain a
stroke count index, which is clearly much easier for beginners.

Finally, there is a really cool site where you can experiment with both
methods:

        http://www.zhongwen.com

Here are sample searches for the "mama" ideograph above:

1. Stroke count method sample (http://www.zhongwen.com/s/bishu.htm)

        a) Count: 13 strokes (http://www.zhongwen.com/s/b13.htm).
        b) First stroke: an angle ("<"), so it is towards the end of the list
(http://www.zhongwen.com/d/182/x253.htm).
        c) Second stroke: skip.
        d) Found! Now click on "Unihan"
(http://charts.unicode.org/unihan/unihan.acgi$0x5ABD) for info about code
point U+5ABD.

2. Radical method sample (http://www.zhongwen.com/s/bushou.htm)

        a) Identify the radical: we assume it is "horse".
        b) Look up the radical: 10-stroke section.
        c) Go to radical section: 187 (http://www.zhongwen.com/s/r187.htm).
        d) Count remaining part: 3 strokes, "female" component.
        e) Ooops, it's not there... Loop!
        a) Identify the radical: we now bet it is "female".
        b) Look up the radical: 3-stroke section.
        c) Go to radical section: 38 (http://www.zhongwen.com/s/r38.htm).
        d) Count remaining part: 10 strokes, "horse" component.
        e) Now it's the correct radical!
(http://www.zhongwen.com/d/182/x253.htm)
        f) Found! Click on "Unihan"
(http://charts.unicode.org/unihan/unihan.acgi$0x5ABD).
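
To double-check the result of either search, the hexadecimal value at the end
of the Unihan link can be turned back into a character. A small sketch using
only the Python standard library; the function name is invented here, and it
simply assumes that the code point follows the last "$" in the URL, as it
does in the links above.

```python
import unicodedata


def code_point_from_unihan_url(url: str) -> int:
    """Extract the code point that follows the last '$' in the URL."""
    hex_part = url.rsplit("$", 1)[1]   # e.g. "0x5ABD"
    return int(hex_part, 16)           # base 16 accepts the "0x" prefix


if __name__ == "__main__":
    url = "http://charts.unicode.org/unihan/unihan.acgi$0x5ABD"
    cp = code_point_from_unihan_url(url)
    print(f"U+{cp:04X}", chr(cp), unicodedata.name(chr(cp)))
    # The radical method's arithmetic: radical strokes + remaining strokes.
    print(3 + 10)
```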

I hope this helps.
_ Marco


