From: John Jenkins (jenkins@apple.com)
Date: Wed Jan 21 2004 - 13:13:33 EST
On Jan 21, 2004, at 6:36 AM, Andrew C. West wrote:
> If a simplified form of a given CJK ideograph is used, then it deserves
> encoding properly. There are newly-coined simplified forms in CJK-B and
> CJK-C, so why not add newly used simplified forms to CJK-C or wherever
> if they are really needed? To borrow Michael's term, this use of
> variation selectors is simply pseudo-coding.
>
Well, first of all, there were a *lot* of mistakes made in Extension B.
And Extension C isn't encoded yet. The UTC intends to lobby WG2 to do
the encoding of such forms via variation selectors.
The whole point of using variation selectors is that the line between
character and glyph can sometimes be a fuzzy one, and Han is probably
the worst case. In the case of TC and SC, it's just as easy (in many
cases, where there's a one-to-one, algorithmic relationship) to see the
two forms as glyphic avatars of a single, Platonic character. Such a
representation, via variation selectors, aids a number of processes,
such as fuzzy searching, text-to-speech, and so on, because you don't
require new tables to do a match.
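As a rough illustration of why no new tables are needed: if a variant is
represented as a base ideograph followed by a variation selector, a matcher
can simply ignore the selector. The Python sketch below assumes that
representation. The function names are made up for this example, and the
U+9F9C + VS1 sequence is hypothetical, not a registered variation sequence.

```python
# Variation selectors occupy two ranges:
#   U+FE00..U+FE0F   (VS1-VS16)
#   U+E0100..U+E01EF (VS17-VS256)
def is_variation_selector(ch: str) -> bool:
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF


def strip_variation_selectors(text: str) -> str:
    """Drop every variation selector, leaving only the base characters."""
    return "".join(ch for ch in text if not is_variation_selector(ch))


def fuzzy_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring variation selectors."""
    return strip_variation_selectors(a) == strip_variation_selectors(b)


base = "\u9f9c"            # the turtle ideograph U+9F9C
variant = "\u9f9c\ufe00"   # hypothetical variant: U+9F9C followed by VS1

# The two strings differ code point by code point, yet still match
# once the selector is ignored.
print(base == variant)
print(fuzzy_equal(base, variant))
```

With directly encoded variants, the same match would need a separate
variant-mapping table that has to be kept up to date.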
Indeed, right now I have to periodically run checks on the Unihan
database to make sure that TC/SC pairs have the same readings. It's a
pain.
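The kind of check meant here can be sketched as follows. This is only an
illustration, not the actual script: the two dictionaries are a tiny
hand-written sample (not copied from the Unihan files) standing in for
fields such as kSimplifiedVariant and kMandarin, and one entry is left
empty on purpose so that the check has something to report.

```python
# Illustrative sample only, NOT real Unihan data.
# Traditional code point -> simplified counterpart (cf. kSimplifiedVariant).
SAMPLE_SIMPLIFIED = {
    "U+9F9C": "U+9F9F",  # turtle
    "U+9F8D": "U+9F99",  # dragon
}

# Code point -> set of readings (cf. kMandarin). The entry for U+9F99 is
# deliberately empty to show what a mismatch looks like.
SAMPLE_READINGS = {
    "U+9F9C": {"guī"},
    "U+9F9F": {"guī"},
    "U+9F8D": {"lóng"},
    "U+9F99": set(),
}


def mismatched_pairs(simplified, readings):
    """Return (traditional, simplified) pairs whose reading sets differ."""
    return sorted(
        (trad, simp)
        for trad, simp in simplified.items()
        if readings.get(trad, set()) != readings.get(simp, set())
    )


print(mismatched_pairs(SAMPLE_SIMPLIFIED, SAMPLE_READINGS))
```

If both forms were a single code point plus an optional variation
selector, there would be only one set of readings and nothing to keep in
sync.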
From an end-user perspective, there is *NO DIFFERENCE* between
representing these characters using variation selectors and direct
encoding. They can show up in input methods and fonts just the same.
> 1. Unicode Design Principle 3: "The Unicode Standard encodes
> characters, not glyphs."
> This is a simple glyph variant. I insist on writing the "A" in my name
> with two cross-bars. Will the UTC kindly accommodate me by providing an
> appropriate standardised variant for U+0041? (In fact, come to think of
> it I have idiosyncratic ways of writing all of the letters in my
> name ...)
>
Well, a personal name ideograph is perhaps not the best example, since
the size of the "personal name" problem is unknown. IIRC nobody's won
Rick's contest yet. The goal was to come up with an instance where
some people make a distinction and others don't. In any event, the
example is not entirely tongue-in-cheek. First of all, all three of my
Cantonese-English dictionaries contain a variant turtle ideograph which
isn't encoded yet. (I haven't looked in Extension C, BTW.) Secondly,
the original Korean proposal for Extension C contained literally dozens
of variant turtle ideographs.
The difficulty here -- and this leads into the third example -- is that
the Koreans derived their characters from a soft copy of the Korean
tripitaka. Now, I would assert that these variant turtles are probably
just variant turtles, chosen idiosyncratically by the scribe for
whatever reason. (Rather the way that 16th and 17th century English
books have fairly random and inconsistent spelling.) If it is
absolutely necessary to embody this variation, it would be better to
use rich text. Unfortunately, it's impossible to know for certain
whether this is the case or not, and so variation selectors are
available to make a distinction possible in plain text for those who
care about it.
Granted, epigraphy is tough on plain text. As Unicode starts to deal
with dead scripts, we have to deal with the issues it raises.
Variation selectors are one way of doing it.
> The plain fact of the matter is that the *character* turtle is already
> encoded, and if someone wants to use a different glyph form for this
> character then he or she should design their own font with the
> appropriate glyph mapped to U+9F9C or U+9F9F.
>
Or any of the other turtles we already have.
========
John H. Jenkins
jenkins@apple.com
jhjenkins@mac.com
http://homepage.mac.com/jhjenkins/
This archive was generated by hypermail 2.1.5 : Wed Jan 21 2004 - 15:00:45 EST