From: Kenneth Whistler (kenw@sybase.com)
Date: Tue Jan 20 2004 - 14:27:09 EST
John Jenkins tried to present some usage cases for Han FVS
combinations, and Mike Ayers responded with a bunch more questions:
> Ummm - if this simplified form were used at all, wouldn't it already
> be encoded? Isn't there a process for getting such encoded? Has this
> process broken down, or have some of its assumptions been shown invalid?
If that simplified form were used at all, it would be *in use*, not
necessarily encoded. Not all Chinese printed material has gone
through a computer encoding to be set in type, and even material
that is represented via computerized typesetting may have been
set in fonts that apply regular simplification rules to some glyphs
that may not actually occur in the GB standards for these things.
> Huh? You forgot the part about "the font designer psychically
> already knew how Mr. Turtle draws his name and encoded the glyph for it,
The fact is that thousands of such oddball variants already *do* exist
in print, which means that some "font designer" someplace already did
so. Well, the instance in "print" may actually be a handwritten or
carved form. They are less likely to occur at random in modern computer
fonts, but even there, more or less random collections of "gaiji" get
added to the fonts and then may be used in one context or another.
> ... Are you saying
> that there is a known limit to the number of character variants, and that
> there is an establishable correspondence between these variants such that a
> logical connection between a variant and one of a set of FVS is possible?
> Call me skeptical...
The real problem that the committee is dealing with is that there are
a number of significant collections of such kinds of variants,
particularly in Japan. And ways need to be found to interoperate with
software that implements such lists, lest de facto alternate
encodings spring up that would undermine the case for universal usage of
Unicode in East Asia.
To date, extensions to Unicode that include variants of already-encoded
characters have ended up just adding more variants as
"unified" Han characters. But carried too far, that dilutes the
identity of the core character itself.
The alternative being investigated is to consider such things as
turtle-variant-17 to simply be representable by a sequence such
as <2A6C9, E0180>, rather than having to add yet *another* variant
turtle character on its own.
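
As a rough illustration of what such a sequence looks like to software, here is a minimal Python sketch. It builds the example sequence <2A6C9, E0180> and pairs base characters with a trailing variation selector. The helper names are hypothetical, the sequence is only the illustrative one above (not a registered assignment), and the selector ranges covered are just U+FE00..U+FE0F and U+E0100..U+E01EF.

```python
from typing import List, Optional, Tuple

BASE = 0x2A6C9      # base ideograph from the example sequence above
SELECTOR = 0xE0180  # variation selector from the example sequence above


def is_variation_selector(cp: int) -> bool:
    """True for U+FE00..U+FE0F and U+E0100..U+E01EF (other selectors ignored)."""
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def split_variation_sequences(text: str) -> List[Tuple[str, Optional[str]]]:
    """Pair each character with the variation selector that follows it, if any.

    A selector with no preceding base (or following another selector) is
    kept as its own entry rather than dropped.
    """
    pairs: List[Tuple[str, Optional[str]]] = []
    for ch in text:
        if is_variation_selector(ord(ch)) and pairs and pairs[-1][1] is None \
                and not is_variation_selector(ord(pairs[-1][0])):
            pairs[-1] = (pairs[-1][0], ch)
        else:
            pairs.append((ch, None))
    return pairs


# The variant is two code points in plain text, not one new character.
turtle_variant = chr(BASE) + chr(SELECTOR)
pairs = split_variation_sequences(turtle_variant)
```

Software that does not know the selector can still fall back to the base character, which is the interoperability point of using a sequence instead of a separately encoded variant.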
> Whoa, Nellie!
>
> Did "represent newly discovered characters" creep into the mission
> statement of plain text when I wasn't looking?
This has *always* been part of the agenda of the encoding committees.
If you are representing Han data as Unicode plain text, and you
run into a "newly discovered character", you are stuck. Your options
are:
1. Use a "geta" (U+3013), i.e. throw up your hands and punt.
2. Use an Ideographic Description Sequence to get an approximate
description as a substitute.
3. Ask the character encoding committees to encode the character
(a process that will take a long while).
4. Ask the character encoding committees to make the character
representable by a designated variation sequence (a process
that may also take a long while, but which could short-circuit
things considerably if the known lists of these things were
all processed ahead of time).
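
Options 1 and 2 are the only ones available immediately in plain text, and can be sketched in Python as follows. This is a sketch under assumptions: the function names are hypothetical, only the left-to-right description character U+2FF0 is used, and the components in the usage line are an arbitrary illustration, not a description of any real unencoded character.

```python
from typing import Optional, Tuple

GETA_MARK = "\u3013"          # option 1: placeholder for an unrepresentable character
IDC_LEFT_TO_RIGHT = "\u2ff0"  # Ideographic Description Character, left-to-right


def describe_left_right(left: str, right: str) -> str:
    """Option 2: approximate an unencoded character as <U+2FF0, left, right>."""
    return IDC_LEFT_TO_RIGHT + left + right


def plain_text_fallback(components: Optional[Tuple[str, str]] = None) -> str:
    """Use a description sequence when components are known, else a geta."""
    if components is None:
        return GETA_MARK
    return describe_left_right(*components)


# Hypothetical example: a character built from U+4EBB beside U+9F9C.
approximation = plain_text_fallback(("\u4ebb", "\u9f9c"))
placeholder = plain_text_fallback()
```

Neither result identifies the character uniquely: the geta carries no information at all, and the description sequence is only an approximate description, which is why options 3 and 4 exist.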
--Ken