From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Dec 17 2002 - 16:53:43 EST
Marco commented:
> Another key point, IMHO, is verifying the following claim contained in the
> proposal document:
>
> "Tibetan BrdaRten characters are structure-stable characters widely
> used in education, publication, classics documentation including Tibetan
> medicine. The electronic data containing BrdaRten characters are
> estimated beyond billions. Once the Tibetan BrdaRten characters are encoded
  ^^^^^^^^^^^^^^^^^^^^^^^^^
> in BMP, many current systems supporting ISO/IEC10646 will enable Tibetan
> processing without major modification. Therefore, the international standard
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
> Tibetan BrdaRten characters will speed up the standardization and
> digitalization of Tibetan information, keep the consistency of
> implementation level of Tibetan and other scripts, develop the Tibetan
> culture and make the Tibetan culture resources shared by the world." [BTW,
> billions of what!?]
The Chinese delegation at the WG2 meeting agreed with a restatement of
this as "gigabytes of data". Exactly what kind of data, they did not say,
but in principle that could consist of a few medium-size databases. It
almost certainly does not consist of billions of *documents*.
> I'd propose the following:
>
> 1. Find all the available technical details about this BrdaRten
> encoding.
One additional detail for people. The BrdaRten stacks are currently
implemented, in the Founders System software in Tibet, as an extension
to GB 2312.
> 2. Come up with a precise machine-readable mapping file between
> BrdaRten encoding to *decomposed* Unicode Tibetan, possibly accompanied by a
> sample conversion application.
> Reasons: (a) to make it easy to migrate BrdaRten legacy data to
> Unicode; (b) to easily update existing BrdaRten applications to export
> Unicode text; (c) to easily retrofit new Unicode applications to import
> BrdaRten text.
See the key words "without major modification" above. If the BrdaRten
stacks were encoded in Unicode, they would automatically become part
of GB 18030 (because of the UTF-like nature of that strange standard).
However, the catch is that the actual code points for Unicode/10646 are
not predictable or controllable by the Chinese NB. That means that the
final code points in GB 18030 are also not predictable -- and almost
certainly are not the same as those used by the current GB 2312 extension
in Tibet. And *that* means that the current "characters ... estimated
beyond billions" will have to be migrated to a new encoding, anyway,
once the systems are updated to GB 18030. That is the reason for the
quibble word "major" in the phrase above. All the data will be reencoded,
but the transition GB 2312 + Tibetan extension ==> GB 18030 containing
Tibetan extension is viewed as "just a mapping" and not a major system
modification.
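
To make the "just a mapping" step concrete, here is a minimal sketch of a table-driven converter between 2-byte legacy stack codes and *decomposed* Unicode Tibetan. The legacy byte values are hypothetical (the actual code points of the GB 2312 extension are not given in this thread), the function names are invented for illustration, and the sketch handles only stack codes, not the surrounding GB 2312 or ASCII text. Only the Unicode sequences are real Tibetan code points.

```python
# Hypothetical mapping: the 2-byte legacy values are invented for
# illustration; the Unicode sequences are real Tibetan code points.
LEGACY_TO_UNICODE = {
    b"\xa8\xc1": "\u0f66\u0f90",        # SA + SUBJOINED KA
    b"\xa8\xc2": "\u0f66\u0f90\u0fb1",  # SA + SUBJOINED KA + SUBJOINED YA
    b"\xa8\xc3": "\u0f62\u0f92",        # RA + SUBJOINED GA
}
UNICODE_TO_LEGACY = {u: b for b, u in LEGACY_TO_UNICODE.items()}


def legacy_to_unicode(data: bytes) -> str:
    """Convert a run of 2-byte legacy stack codes to decomposed Unicode."""
    if len(data) % 2:
        raise ValueError("expected an even number of bytes")
    out = []
    for i in range(0, len(data), 2):
        code = data[i:i + 2]
        if code not in LEGACY_TO_UNICODE:
            raise ValueError(f"unmapped legacy code: {code.hex()}")
        out.append(LEGACY_TO_UNICODE[code])
    return "".join(out)


def unicode_to_legacy(text: str) -> bytes:
    """Convert back, preferring the longest matching stack at each position."""
    longest = max(len(u) for u in UNICODE_TO_LEGACY)
    out, i = [], 0
    while i < len(text):
        for n in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in UNICODE_TO_LEGACY:
                out.append(UNICODE_TO_LEGACY[chunk])
                i += n
                break
        else:
            raise ValueError(f"no legacy code for U+{ord(text[i]):04X}")
    return b"".join(out)
```

The reverse direction needs longest-match lookup because one stack's decomposed sequence can be a prefix of another's; a real converter would also have to decide what to do with sequences that have no legacy stack at all.
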
The alternative (and even scarier) prospect is that the existing GB 2312
Tibetan extension code points would be forced as is into a new version
of GB 18030, invalidating the mapping for the existing code points,
and creating a completely new version of GB 18030 that would have to
be supported as a different "code page" from the existing GB 18030. This
would start us down the road to an indefinite number of distinct GB 18030
mappings, probably not properly labeled in interchange, with predictably
large numbers of interoperability problems (likely to dwarf the JIS
yen sign/backslash kinds of problems). The reason this prospect is even
thinkable is that any existing implementation of the BrdaRten stacks
in a GB 2312 extension would surely be using 2-byte character encodings,
and a transition to 4-byte GB 18030 character encodings would likely
disrupt the existing implementations significantly.
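
The size difference can be checked with Python's standard `gb2312` and `gb18030` codecs. This is only a sketch: it uses a decomposed Tibetan stack as it is encoded in Unicode today, since the byte values of the legacy Tibetan extension are not available here, and the variable names are illustrative.

```python
han = "\u4e2d"                # a CJK ideograph in the GB 2312 repertoire
stack = "\u0f66\u0f90\u0fb1"  # decomposed stack: SA + SUBJOINED KA + SUBJOINED YA

han_gb2312 = han.encode("gb2312")
han_gb18030 = han.encode("gb18030")
stack_gb18030 = stack.encode("gb18030")

# GB 2312 two-byte codes carry over unchanged into GB 18030.
print(han_gb2312 == han_gb18030, len(han_gb18030))

# Each Tibetan code point takes a four-byte GB 18030 sequence, so a stack
# held in one 2-byte legacy code grows to several 4-byte units.
print(len(stack_gb18030))

# The encoding still round-trips any Unicode text.
print(stack_gb18030.decode("gb18030") == stack)
```
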
The question for Unicoders is whether introduction of significant
normalization problems into Tibetan (for everyone) is a worthwhile tradeoff
for this claimed legacy ease of transition for one system, when it is
clear that all existing legacy data using these precomposed stacks is
going to have to be either reencoded anyway or surrounded by migration
filters for new systems.
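
The kind of normalization issue at stake can be seen with a precomposed letter that Tibetan already has. A small sketch using Python's standard `unicodedata` module: U+0F43 TIBETAN LETTER GHA has a canonical decomposition and is excluded from composition, so normalization never produces it. Whether newly encoded BrdaRten stacks would be given canonical decompositions is not stated above, so this illustrates the general problem rather than the behavior of the proposed characters.

```python
import unicodedata

precomposed = "\u0f43"       # TIBETAN LETTER GHA (precomposed)
decomposed = "\u0f42\u0fb7"  # GA + SUBJOINED HA

# The two spellings are different code point sequences...
print(precomposed == decomposed)

# ...and both NFD and NFC turn the precomposed form into the sequence.
nfd = unicodedata.normalize("NFD", precomposed)
nfc = unicodedata.normalize("NFC", precomposed)
print(nfd == decomposed, nfc == decomposed)
```

Without a canonical decomposition, a precomposed stack and its sequence would never compare equal under normalization; with one, normalized text would not retain the precomposed code point.
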
--Ken