Re: New BMP characters (was Re: [very OT] Documentation: beyond 65

From: Thomas Chan (thomas@atlas.datexx.com)
Date: Wed Feb 21 2001 - 00:10:29 EST

Next message: DougEwell2@cs.com: "Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in"
Previous message: John Hudson: "Re: Perception that Unicode is 16-bit (was: Re: Surrogate space in"
Maybe in reply to: Joel Rees: "Re: New BMP characters (was Re: [very OT] Documentation: beyond 65"

On Tue, 20 Feb 2001, Kenneth Whistler wrote:

> rees@server.mediafusion.co.jp asked:
> > I just signed up today after receiving a note from a fellow macperler that
> > 3.1 (extension B) included 40,000+ new Kanji. I checked unicode.org and
>
> For the area of your concern, a very important data file, Unihan.txt,
> is not yet posted, so that is a file where we may have to respond to
> last-minute bug reports.

Will that file contain volume/page/position references to the _Kangxi
Zidian_, _Hanyu Da Zidian_, and other sources, as appropriate? (e.g.,
kIRGKangXi, kIRGHanyuDaZidian, et al.)
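
If the file does carry such references, pulling them out should be
simple. Below is a minimal sketch only, assuming a tab-separated layout
of code point, field name, and value, and assuming that kIRGKangXi values
look like pppp.cc0 (page, position, and a final digit that is 1 for a
"virtual" position). The sample lines and the helper name parse_kangxi
are made up for illustration and are not taken from the actual file.

```python
# Hypothetical sample in the assumed "U+XXXX<TAB>field<TAB>value" layout.
SAMPLE = (
    "U+3400\tkIRGKangXi\t0078.010\n"
    "U+3400\tkIRGHanyuDaZidian\t10015.030\n"
)


def parse_kangxi(text):
    """Return {code point: (page, position, is_virtual)} for kIRGKangXi lines."""
    refs = {}
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        code_point, field, value = line.split("\t", 2)
        if field != "kIRGKangXi":
            continue  # other fields would need their own value layout
        page, rest = value.split(".")
        # Assumed value layout: pppp.cc0, where a final "1" marks a virtual position.
        refs[int(code_point[2:], 16)] = (int(page), int(rest[:2]), rest[2] == "1")
    return refs


print(parse_kangxi(SAMPLE))
```

A kIRGHanyuDaZidian value would need a leading volume digit split off as
well, which this sketch does not attempt.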

From what I've seen of a mapping table, this basic information wasn't
available for the G- sources (most or all non-electronic).

> > And, this is kind of important to me and my company, how much consideration
> > has been given to the mojikyo?
>
> By the Unicode Consortium, very little. You would have to contact
> the IRG members to see what impact the mojikyo had on the repertoire
> submitted by Japan for unification.

There is not much public information to go by, but it seems to me that
there would be very little, if that is the route that Mojikyou's Han
characters had to be funneled through for inclusion in CJK Extension B.
But the needs addressed by Mojikyou may still be largely addressed by
Unicode 3.1, through different channels.

A little note on Mojikyou first. Mojikyou's ~90,000 characters are not
all Han characters--I do not know exactly how many aren't. Of those that
are Han characters, 20,902 of them are surely already in Unicode because
they are the original URO set. Finally, the first ~50,000 codepoints in
Mojikyou are those of the _Dai Kanwa Jiten_ dictionary (aka Morohashi,
and hereafter referred to as such), which has overlap with some Chinese
non-electronic sources (e.g., dictionaries), so there is a chance that a
Han character in Mojikyou is incidentally in Unicode because of a submission
from a Chinese source. Overlap with Morohashi would be one partial way to
gauge comprehensiveness.

Section 10.1 of PDUTR #27 "Unicode 3.1"[1] (2000.1.17) gives the sources of
the 42,711 new characters as:

  KangXi dictionary ideographs (including the addendum) not already
    encoded in the BMP
  Hanyu Da Zidian ideographs not already encoded in the BMP
  Ci Yuan
  Ci Hai
  Hanyu Da Cidian
  Chinese Encyclopedia
  Founder Press System
  Siku Quanshu
  Hong Kong Supplementary Character Set
  CNS 11643-1992, 4th plane
  CNS 11643-1992, 5th plane
  CNS 11643-1992, 6th plane
  CNS 11643-1992, 7th plane
  CNS 11643-1992, 15th plane
  JIS X 0213: 2000, level 3
  JIS X 0213: 2000, level 4
  PKS 5700-3: 1998
  TCVN 5773: 1993
  VHN 01: 1998
  VHN 02: 1998

[1] http://www.unicode.org/unicode/reports/tr27/

No figures are given, unfortunately. The first source, the _Kangxi
Zidian_ dictionary, is a pre-modern (1716) Chinese dictionary, and there
is a good chance that most of its characters overlap with the
1950's (or possibly a more recent edition) Morohashi, compiled by
sinologist MOROHASHI Tetsuji. (It is not clear from PDUTR #27 what the
"addendum" refers to, but see comments later below.) The second source,
the _Hanyu Da Zidian_, is a 1980's dictionary from mainland China, and
just as massive in its coverage of characters as the aforementioned Kangxi
and Morohashi, and is more recent, so there is also a good chance of
overlap. The next four sources, the _Ci Yuan_ dictionary, the _Ci Hai_
dictionary, the _Hanyu Da Cidian_ dictionary, and the "Chinese
Encyclopedia" (no Chinese name given here) are all 20th century
sources--it is not clear which editions were used, though. The
seventh source, "Founder Press System", is not transparent to me, but
sounds like a publishing package. The eighth source, the _Siku Quanshu_,
is an 18th century Chinese collectanea; however, it is not clear if this is
solely the original work, or includes later supplements (to include
later works, or works censored at time of publication of the original).
The remaining sources are Hong Kong's HKSCS, which is probably only of
interest for its coverage of Cantonese dialect characters (Mojikyou's
coverage is reportedly lacking in this area); Taiwan's CNS 11643, which
may also be promising, especially in planes 5 and 6 for archaic
characters; Japan's JIS X 0213; South Korea's PKS 5700 (I do not know what
the relevance of this and the previous may be to a Mojikyou user); and
Vietnam's TCVN 5773, which contributes Vietnam-specific chu+~ no^m
characters, of interest, since Mojikyou is perhaps one of the few means to
process and publish chu+~ no^m at present, and VHN 01 and VHN 02 (two
others from Vietnam, but I don't know anything about them).

(I'm not singling Unicode out for criticism for lack of details about
editions and such, as the IRG and ISO sources are just as vague.)

(Abbreviations like G_BK for "Chinese Encyclopedia", G_FZ for "Founder
Press System", and G_4K for "Siku Quanshu" might not seem mnemonic, unless
one knows that "Chinese Encyclopedia" is really "Zhongguo Baike
Quanshu"--"baike" being 'myriad (lit. hundred) fields-of-study'; "Founder
Press System" is "Fangzheng Paiban Xitong"; and the "siku" of "Siku
Quanshu" refers to its four main bibliographic categories, "si" meaning
'four'.)

However, the "Proposed New Characters--Pipeline Table" page[2]
(2001.2.5) says:

  These constitute all remaining unencoded ideographs from the Kangxi
  Dictionary, the Han Yu Da Zidian, a set of 6356 characters from Japan,
  908 Hong Kong government characters, 169 characters from Korea, 29,794
  characters from TCA in Taiwan, and 4050 characters from Vietnam.

[2] http://www.unicode.org/unicode/alloc/Pipeline.html

It's not clear how the sources in PDUTR #27 relate to the figures in the
Pipeline table, but if Mojikyou was funneled through JIS X 0213 solely, or
through whatever sources the 6356 figure represents, then that is not very
much. The lion's share really seems to come from two Chinese dictionaries
(_Kangxi Zidian_ and _Hanyu Da Zidian_) and Taiwan's CNS 11643, which may
still be applicable if the interest is in historic characters.
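
To put rough proportions on that, here is a small sketch that sets each
figure named on the Pipeline page against the 42,711 total. The Pipeline
page does not say whether its figures overlap with each other or with the
two dictionaries, so the percentages are only a rough guide; the variable
names are mine.

```python
# Figures quoted from the Pipeline page; the two dictionaries have no figure there.
pipeline = {
    "Japan": 6356,
    "Hong Kong": 908,
    "Korea": 169,
    "TCA (Taiwan)": 29794,
    "Vietnam": 4050,
}
TOTAL_EXT_B = 42711  # total new characters, per PDUTR #27

for source, count in pipeline.items():
    share = 100 * count / TOTAL_EXT_B
    print(f"{source:<14}{count:>7}{share:>7.1f}%")

# Sum of the figures that the page gives explicitly.
named_total = sum(pipeline.values())
print("sum of named figures:", named_total)
```

On these figures the Japanese set comes to roughly 15% of the total,
against roughly 70% for TCA.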

Meanwhile, N777 "CJK B Cover Note"[3] (2000.12.20) gives the breakdown of
the 42,711 as:

  18,486 Ideographs in Kangxi Dictionary, including the one [sic] in the
    Addendum of the Dictionary ["buyi" --TC], that have not been not [sic]
    encoded in UCS/BMP
  28,914 Ideographs in Han Yu Da Zidian that have not been not [sic]
    encoded in UCS/BMP
  Unique Hanzi from
    Ci Yuan, 66 Ideographs
    Ci Hai, 247 Ideographs
    Hanyu Da Cidian, 553 Ideographs
    Chinese Encyclopedia, 86 Ideographs
    Founder Press System, 65 Ideographs
    Siku Quanshu, 522 Ideographs
  Ideographs from Hong Kong, Korea, TCA and Vietnam ...
  ... 1,081 Hanzi from Hong Kong Supplementary Character Set
  ... from JIS X 0213: 302 Kanji ...
  ... 166 Hanja from PKS 5700-3: 1998
  ... 5642 Hanja from DPR Korea Standards, KPS 9566-97 Hanja and KPS
    10721-2000 Hanja ...
  ... 30,177 Hanzi from TCA-CNS 11643-1992/4th plane
                        TCA-CNS 11643-1992/5th plane
                        TCA-CNS 11643-1992/6th plane
                        TCA-CNS 11643-1992/7th plane
                        TCA-CNS 11643-1992/15th plane
  ... 4,232 ChuNom [sic] from TCVN 5773: 1993
                              VHN 01: 1998
                              VHN 02: 1998

[3] http://www.cse.cuhk.edu.hk/~irg/N777_CJK_B_CoverNote.pdf

These numbers are clearly pre-unification, and for some reason North
Korean standards are mentioned here, but not in the previous two sources.
Since only 302 are listed as coming from JIS X 0213, there are probably
others submitted by Japan that overlap with the first two Chinese
dictionaries--which may incidentally overlap greatly with Morohashi.
Also, this source identifies the _Kangxi Zidian_ appendix "buyi" by name
(although not the "beikao" one).
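
As a quick check on the pre-unification reading, the per-source counts in
N777 can simply be added up and compared with 42,711. This is only
arithmetic on the figures quoted above; the dictionary keys are my own
labels.

```python
# Per-source counts as quoted from the N777 cover note.
n777 = {
    "Kangxi Zidian (incl. addendum)": 18486,
    "Hanyu Da Zidian": 28914,
    "Ci Yuan": 66,
    "Ci Hai": 247,
    "Hanyu Da Cidian": 553,
    "Chinese Encyclopedia": 86,
    "Founder Press System": 65,
    "Siku Quanshu": 522,
    "HKSCS": 1081,
    "JIS X 0213": 302,
    "PKS 5700-3": 166,
    "KPS 9566-97 / KPS 10721-2000": 5642,
    "CNS 11643-1992 planes 4-7, 15": 30177,
    "TCVN 5773 / VHN 01 / VHN 02": 4232,
}
TOTAL_EXT_B = 42711  # characters actually in CJK Extension B

listed = sum(n777.values())
print("listed:", listed)
print("excess over Extension B:", listed - TOTAL_EXT_B)
```

The listed counts come to 90,539, more than twice the 42,711 characters
encoded, so a large share of the submissions must coincide across
sources.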

There are, certainly, some parts of Mojikyou that are not addressed, such
as the proto-Chinese jiagu 'shell and bone' script (Jpn. "koukotsu")--I
notice WG2 N2314 "Graphic representation of the Roadmap to the SMP, Plane
1 of the UCS"[4] (2001.1.9) says:

  Jiaguwen (Bone and Shell) script is unified with CJK

whereas Mojikyou does not unify them. (Can someone tell me when WG2 made
this unification decision?--the roadmap didn't say so in earlier
versions.)

[4] http://www.egt.ie/standards/iso10646/plane1-roadmap-table.html

From what I've seen of the Mojikyou package, it contains what would be
considered z-level variants, so those would also not be addressed by
Unicode.

It seems it really comes down to defining what subset of Mojikyou one is
interested in or is using. I've heard of input projects of historic
materials that formerly used CNS 11643 which switched to Mojikyou, which
suggests a significant overlap (and perhaps existence of mapping tables);
Vietnamese chu+~ no^m appear to be well-covered; etc.

Thomas Chan
tc31@cornell.edu

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:19 EDT