Proposed Updates to Unihan.txt

From: Doug Schiffer (laotzuREMOVE_THIS@dreamscape.com)
Date: Sun Feb 14 1999 - 02:02:18 EST


I've been working on a project that involves the CCCII character set,
and I've noticed that the unihan.txt database contains translation
information for many CCCII glyphs, but there are more unified Han
characters in CCCII whose CCCII codepoints are not recorded.

The CCCII character set allows the same glyph representation to exist at
multiple codepoints. As an example, the Unicode codepoint
U+6D38 corresponds to CCCII codepoint 224854 as well as 2E4D3D. In
cases like this, I suggest standardizing on the CCCII codepoint with the
lowest value as the definitive match. The unihan.txt database appears
to do this in _most_ cases already.

I propose to change the unihan.txt database in the following ways, with
respect to the CCCII character set:

1) The kCCCII tag will explicitly hold the CCCII codepoint with the
smallest value that corresponds to the given Unicode codepoint.

2) An optional kAlternateCCCII tag would be created to contain information
about duplicate CCCII codepoints. There could be from 0 to N instances
of the kAlternateCCCII tag for a given Unicode codepoint.

3) An optional kVariantCCCII tag would be created to contain information
about CCCII codepoints that are stylistic variations of the given Unicode
codepoint.
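
To make items 1 and 2 concrete, here is a rough Python sketch of how the
smallest-value rule might be applied. The function name cccii_entries and
the single-space separator are my own assumptions for illustration (the
separator simply mirrors the examples in this message); nothing here is
part of unihan.txt itself.

```python
def cccii_entries(unicode_cp, duplicates, sep=" "):
    """Return proposed-format lines for one Unicode codepoint.

    unicode_cp: string such as "U+6D38".
    duplicates: CCCII codepoints (hex strings) that encode the same glyph.
    sep: field separator (assumed; a single space mirrors the examples).
    """
    if not duplicates:
        return []
    # Compare numerically rather than as text, so mixed-case hex
    # strings still sort in the intended order.
    ordered = sorted({d.upper() for d in duplicates}, key=lambda h: int(h, 16))
    # Item 1: the smallest value becomes the definitive kCCCII match.
    lines = [sep.join((unicode_cp, "kCCCII", ordered[0]))]
    # Item 2: every remaining duplicate becomes a kAlternateCCCII entry.
    lines += [sep.join((unicode_cp, "kAlternateCCCII", alt)) for alt in ordered[1:]]
    return lines


for line in cccii_entries("U+6D38", ["2E4D3D", "224854"]):
    print(line)
```

Stylistic variants (item 3) are left out on purpose, since deciding what
counts as a variant needs human judgment rather than a numeric comparison.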

As an example of the new format, the entry for U+8F9F would contain:

U+8F9F kCCCII 215B5C
U+8F9F kAlternateCCCII 275E6B
U+8F9F kAlternateCCCII 275F69
U+8F9F kAlternateCCCII 4B5C54

As an example of the use of the kVariantCCCII tag:

U+904D kCCCII 215C3F
U+904D kVariantCCCII 275C3F
U+904D kVariantCCCII 393D70
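
For anyone who wants to consume entries in this format, a minimal reader
might look like the following Python sketch. The function name parse_cccii
and the dictionary layout are hypothetical, and the code assumes
whitespace-separated fields as in the two examples above.

```python
from collections import defaultdict

SAMPLE = """\
U+8F9F kCCCII 215B5C
U+8F9F kAlternateCCCII 275E6B
U+8F9F kAlternateCCCII 275F69
U+8F9F kAlternateCCCII 4B5C54
U+904D kCCCII 215C3F
U+904D kVariantCCCII 275C3F
U+904D kVariantCCCII 393D70
"""


def parse_cccii(text):
    """Collect the three proposed CCCII tags per Unicode codepoint."""
    table = defaultdict(
        lambda: {"kCCCII": None, "kAlternateCCCII": [], "kVariantCCCII": []}
    )
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 2)  # codepoint, tag, value
        if len(parts) != 3:
            continue  # not a well-formed entry
        cp, tag, value = parts
        if tag == "kCCCII":
            table[cp]["kCCCII"] = value  # single definitive match
        elif tag in ("kAlternateCCCII", "kVariantCCCII"):
            table[cp][tag].append(value)  # 0 to N instances allowed
    return dict(table)


entries = parse_cccii(SAMPLE)
print(entries["U+8F9F"]["kAlternateCCCII"])
print(entries["U+904D"]["kVariantCCCII"])
```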


