From: John H. Jenkins (jenkins@apple.com)
Date: Thu Oct 25 2007 - 23:06:27 CDT
On Oct 25, 2007, at 8:14 PM, David Starner wrote:
> On 10/25/07, vunzndi@vfemail.net <vunzndi@vfemail.net> wrote:
>> I am aware that the original aim of Unicode was to have all 'useful'
>> characters in the BMP. However, as far as CJKV characters are
>> concerned, this has not been done; rather, characters have been added
>> on a first come, first serve basis.
>
> The character set standards of China, Taiwan, Japan and Korea were
> completely included in the BMP. The sets of characters that computer
> users of CJKV characters were actually using are all in the BMP. That
> was not a first come, first serve policy.
Perhaps we have a different sense of what "first come, first serve"
means. To me, the fact that the PRC, Taiwan, Japan, and South Korea
already had well-established and widely used character set standards
means that their immediate needs got covered first. Vietnam, North
Korea, and didn't have their character sets even under way at the
time, so naturally their needs came later. There was not a general
survey of "useful" CJKV characters (if that term even means anything)
made before doing additions. If there had been, then nothing in
IICore would be in plane 2.
>> If the allocation of CJKV codepoints continues
>> to be done in this way, then modern CJKV coverage will require not
>> only BMP and plane 1 support but also, in the future, plane 3 support.
>
> (Should be plane 2, BTW.)
>
No, he meant plane 3. If the current explosion of extremely rare Han
characters continues, we'll have to start putting them in plane 3
before long.
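For readers less familiar with the plane numbering used in this exchange, here is a minimal sketch of how a code point maps to its plane. The helper names `plane` and `plane_range` are made up for illustration and are not part of any standard library.

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound of the Unicode code space


def plane(cp: int) -> int:
    """Return the plane number of a code point (0 is the BMP)."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError(f"not a Unicode code point: {cp:#x}")
    # Each plane holds 0x10000 code points, so the plane is the high bits.
    return cp >> 16


def plane_range(n: int) -> tuple[int, int]:
    """Return the first and last code point of plane n (hypothetical helper)."""
    if not 0 <= n <= 16:
        raise ValueError(f"not a Unicode plane: {n}")
    start = n << 16
    return start, start | 0xFFFF


if __name__ == "__main__":
    for n in (0, 2, 3):
        first, last = plane_range(n)
        print(f"plane {n}: U+{first:04X}..U+{last:04X}")
```

Under this numbering, Extension B begins at U+20000 in plane 2, and plane 3 would start at U+30000.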
> If it continues to be done in what way? They currently have teams of
> experts sorting through the body of writing in Han ideographs, finding
> new distinct ideographs, and identifying what most needs encoding.
> Short of God handing the next set of Han ideographs down from Mt.
> Sinai on stone tablets, I don't know what improvements can be made.
There is actually considerable room for improvement.
First of all, the experience of Extension C showed that there was a
serious QA problem in the IRG. The amount of effort involved in
identifying unifiable pairs entirely by hand left the whole process
error-prone. This has largely been corrected with Extension D work.
Secondly, the whole issue of "distinct ideographs" is getting nastier
and nastier as the IRG has to deal with increasingly rare characters
of uncertain provenance and meaning. So long as the IRG continues to
treat each "distinct" ideograph as something that needs independent
encoding, this is going to be a problem that plagues us.
If, for example, we'd had the concept of variation selectors as an
established part of the standard during the Extension B work, the IRG
could have saved literally thousands of code points which are now
dedicated to obscure variants found in the Hanyu Da Zidian. If we
abandon the idea that every distinct ideograph requires separate
encoding, we could speed up the whole process, improve the quality of
work, and -- most important -- make implementation much simpler.
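As a rough illustration of the variation-selector approach described above, the sketch below represents a variant as a base ideograph followed by a selector, rather than as a separately encoded character. The base character is an arbitrary choice, the function name is hypothetical, and nothing here asserts that this particular pair is a registered variation sequence.

```python
# Supplementary variation selectors VS17..VS256 occupy U+E0100..U+E01EF.
VS17 = 0xE0100
VS256 = 0xE01EF

BASE = 0x845B  # a CJK unified ideograph in the BMP, chosen only as an example


def variation_sequence(base: int, selector: int) -> str:
    """Build a base + selector string (illustrative helper, not a library API)."""
    if not VS17 <= selector <= VS256:
        raise ValueError(f"not a supplementary variation selector: {selector:#x}")
    return chr(base) + chr(selector)


seq = variation_sequence(BASE, VS17)

# The variant costs no new ideograph code point: the sequence reuses the
# existing base character and appends a selector.
print([f"U+{ord(ch):04X}" for ch in seq])
```

Software that does not recognize the sequence can still fall back to the base character, which is part of why this route is simpler to implement than a new code point per variant.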
=====
John H. Jenkins
jenkins@apple.com