From: vunzndi@vfemail.net
Date: Thu Nov 20 2008 - 06:34:09 CST
Quoting "Julian Bradfield" <jcb+unicode@inf.ed.ac.uk>:
> One thing I don't really understand is the basis for the difference of
> approach between alphabetic(-ish) and Han.
>
> The UTC has said, no more precomposed characters.
>
> On the other hand, the IRG is still encoding more and more obscure
> hanzi, although surely the vast majority of them are describable using
> ideographic description sequences, mostly in a canonical way. (And for
> those characters with two equally obvious decompositions, I'm sure one
> could impose a reasonable canonicalization criterion to choose one.)
>
IDS characters are by definition not combining characters: treating
them as such would make them effectively stateful, which is a route
Unicode does not wish to follow. Therefore, in Unicode terms, CJK
characters cannot be decomposed using IDS.
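
To make the distinction concrete, here is a minimal sketch using Python's
standard unicodedata module. The choice of U+6D77 and the description
sequence U+2FF0 U+6C35 U+6BCF is only an illustrative assumption; any
encoded hanzi and a description of it would behave the same way.

```python
import unicodedata

# Combining case: "e" + U+0301 COMBINING ACUTE ACCENT composes under NFC.
decomposed_e_acute = "e\u0301"
composed_e_acute = unicodedata.normalize("NFC", decomposed_e_acute)

# IDS case: U+2FF0 (left-to-right layout) followed by U+6C35 and U+6BCF
# describes the shape of U+6D77, but it is not a decomposition of it,
# so normalization leaves the sequence untouched.
ids_sequence = "\u2ff0\u6c35\u6bcf"
encoded_hanzi = "\u6d77"
normalized_ids = unicodedata.normalize("NFC", ids_sequence)


def describe(ch):
    """Return (general category, canonical combining class) for ch."""
    return (unicodedata.category(ch), unicodedata.combining(ch))


print(composed_e_acute == "\u00e9")     # the accent combined with the base
print(describe("\u0301"))               # a nonspacing combining mark
print(describe("\u2ff0"))               # an ordinary symbol, class 0
print(normalized_ids == ids_sequence)   # the IDS is left as it was
print(normalized_ids == encoded_hanzi)  # and is not equal to the hanzi
```

The accent is a combining mark that normalization folds into its base
letter, whereas the IDS character is an ordinary symbol and the sequence
stays three separate characters.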
Actually it would be wrong to say that the newer characters are all
more and more obscure. They are characters not in the larger
dictionaries; these include names of places, surnames, and characters
used in various dialects. Characters are processed by the IRG on a
first come, first served basis; therefore "really obscure" characters
submitted in the 1990s are already encoded, whereas some useful
"everyday" characters have yet to be encoded. Since Extension B at the
turn of the century, the average time for a proposal of new CJK
characters to be processed has effectively become 12 years.
Some additions to CJK characters are effectively like adding a new
script to Unicode. For example, take an area that I know a little
about, the CJK characters used by the Zhuang. Zhuang, the mother tongue
of over 10 million people, who rank somewhere in the fifties among the
largest people groups in the world, is traditionally written using CJK
ideographs; these characters have yet to be systematically encoded.
When eventually added as CJK ideographs, the name will just be CJK
ideograph U+XXXXX, the significance hidden by the naming convention.
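
The naming convention can be seen with a short sketch, again assuming
Python's unicodedata module. The two code points are arbitrary examples
(one from the basic block, one from Extension B), and ideograph_name is
a hypothetical helper introduced only for this illustration.

```python
import unicodedata


def ideograph_name(code_point):
    """Return the formal Unicode character name for a code point."""
    return unicodedata.name(chr(code_point))


# The name is derived from the code point alone; it says nothing about
# the language, region, or community that uses the character.
for cp in (0x6D77, 0x20000):
    print(f"U+{cp:04X}", ideograph_name(cp))
```

Whatever a character's origin, its name carries only the code point.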
Nevertheless this does make CJK ideographs an open-ended set. If one
limits the number of components to, say, 300 and four components make
a character, there are then 300x300x300x300 = 8,100,000,000 possible
characters, 0.001% of which is 81,000 characters, close to the current
Unicode count. This illustrates that the combinations of components
people actually use are rather limited. This raises the question as to
whether allowing unlimited combinations is an appropriate model.
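
The arithmetic can be checked with a few lines of Python. The figures of
300 components, four positions, and 81,000 characters are the rough
assumptions used in the paragraph above, not measured values.

```python
# Rough model: every character is an ordered choice of four components
# drawn from a fixed inventory of 300.
components = 300
positions = 4
possible = components ** positions

# Characters taken as the approximate number actually in use.
encoded_estimate = 81_000
share_percent = 100 * encoded_estimate / possible

print(f"{possible:,} possible combinations")
print(f"{share_percent:.3f}% of them would account for {encoded_estimate:,} characters")
```

Under these assumptions 81,000 characters correspond to about 0.001% of
the combinations the model allows.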
Yes, if CJK ideographs had been encoded as composites it would have
made the encoding process much easier, but it would have made
everything else more work. CJK ideographs are in some respects the
exception that proves the rule.
> Why are IDSes seen as a stop-gap measure until the described hanzi is
> separately encoded, whereas combining diacritics are seen as the
> definitive way to do things?
>
Please note, as stated above, that IDS do not combine; they are legacy
rather than stop-gap.
Regards
John Knightley