Making use of UTF-16 area for CJK

From: Jake Morrison (jake@sequent.cprbei.cdc.com)
Date: Tue Aug 13 1996 - 22:31:22 EDT


There has been a lot of argument about the merits of
Han unification. I feel that the most important thing
at this point is that Unicode/ISO 10646 be usable for
the majority of people and a valid choice for
implementation on a national basis in Asian countries.
With Han unification, we have recorded the most common
CJK characters and encoded them in the BMP where simple
UCS-2 software can access them. Now we have to handle the rest.

Most of the remaining CJK characters can be placed in one
of three categories: names, national/local variants
(e.g., Vietnamese or Cantonese characters) and rare/archaic
characters interesting only to scholars.

I think the best solution is to allocate parts of the
UTF-16 area in blocks to the standards organizations in the
individual countries and to scholarly groups. For example,
Taiwan's CNS 11643 currently holds more than 50,000
characters, with more on the way. Simply give them a block
big enough to hold these characters (excluding those already
encoded in the BMP).
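
To make the block idea concrete, here is a rough sketch in Python
of handing out contiguous blocks above the BMP. Everything in it
is an assumption for illustration only: the function names, the
4096-code-point alignment, and the request sizes (50,000 simply
echoes the CNS 11643 figure above; the other size is made up).
It is not a proposal for actual code point assignments.

```python
# Hypothetical sketch: carve the UTF-16 (surrogate-addressable) area
# into contiguous blocks, one per requesting body.
FIRST = 0x10000   # first code point reachable through surrogate pairs
LAST = 0x10FFFF   # last code point reachable through surrogate pairs


def allocate_blocks(requests, align=0x1000):
    """Assign each (name, character_count) request a contiguous block.

    Block sizes are rounded up to a multiple of `align` (an assumed
    value) so that every block starts on a tidy boundary.
    """
    blocks = {}
    next_free = FIRST
    for name, count in requests:
        size = -(-count // align) * align  # ceiling to a multiple of align
        start = next_free
        end = start + size - 1
        if end > LAST:
            raise ValueError("no room left for " + name)
        blocks[name] = (start, end)
        next_free = end + 1
    return blocks


def block_of(code_point, blocks):
    """Return the name of the block holding code_point, or None."""
    for name, (start, end) in blocks.items():
        if start <= code_point <= end:
            return name
    return None


if __name__ == "__main__":
    # Sizes are illustrative, not real repertoire counts.
    table = allocate_blocks([("CNS 11643", 50000), ("Scholarly", 60000)])
    for name, (start, end) in table.items():
        print("%-10s U+%05X..U+%05X" % (name, start, end))
```

Rounding each block up leaves a little slack at the end, which
would give a national body room to add characters later without
asking for a second block.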

This will make Unicode immediately usable for any given
country. Unicode can be chosen with the confidence that any
local character worthy of recording in the national set will
be included. Local needs will be satisfied--once a block is
allocated for Hong Kong, there will be no complaints from Mandarin
speakers that a Cantonese character should not be included.

There will of course be some duplication. But in the case of
names, the correct area to look for any given character will
be clear--if you are looking for the character for a rare
Japanese name, look in the Japanese block.

For scholarly purposes, there will always be argument about
whether two characters are the same or different. There is
probably still a need to standardize a large block of archaic
characters. One possibility is to encode one of the
comprehensive traditional scholarly dictionaries wholesale.
A scholar can then refer to the dictionary area first, then
think of encoding the character in his or her private use area.
  
With roughly 1,000,000 code points in the UTF-16 area, we have
some room. Let's use it to solve some problems. By encoding
blocks of characters, we could cover the requirements of
99.9 percent of users while avoiding political squabbles
that would delay introduction of these characters indefinitely.
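
For anyone who wants to check the arithmetic behind the size of
the UTF-16 area, here is a small Python sketch. The function
names are my own; the surrogate ranges (D800-DBFF and DC00-DFFF)
and the 0x10000 offset are the standard UTF-16 definitions.

```python
# Size of the UTF-16 area and the surrogate-pair mapping.
HIGH_START, HIGH_END = 0xD800, 0xDBFF   # high (leading) surrogates
LOW_START, LOW_END = 0xDC00, 0xDFFF     # low (trailing) surrogates


def surrogate_capacity():
    """Number of code points addressable by one surrogate pair."""
    highs = HIGH_END - HIGH_START + 1   # 1024 high surrogates
    lows = LOW_END - LOW_START + 1      # 1024 low surrogates
    return highs * lows


def to_surrogates(code_point):
    """Split a code point in U+10000..U+10FFFF into (high, low)."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is outside the UTF-16 area")
    offset = code_point - 0x10000
    return HIGH_START + (offset >> 10), LOW_START + (offset & 0x3FF)


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - HIGH_START) << 10) + (low - LOW_START)


if __name__ == "__main__":
    print(surrogate_capacity())   # 1024 * 1024 = 1048576
    high, low = to_surrogates(0x20000)
    print("%04X %04X" % (high, low))
```

So the area holds 1024 x 1024 = 1,048,576 code points, a little
more than the round million.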

Giving each national group their own block of characters might
just be enough of an incentive that they would be willing to
go outside of the BMP. Resulting software support for UTF-16
or UCS-4 would open the door for real multilingual computing
with a common character set.

Regards,
Jake Morrison

-------
Control Data Systems, Asia/Pacific Region E-mail: Jacob.Morrison@cdc.com
6/F, 131 Nanking E. Rd., Sec. 3 Phone: 886-2-715-2222
Taipei, Taiwan Fax: 886-2-7129197


