Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)

From: John H. Jenkins (jenkins@apple.com)
Date: Fri Mar 15 2002 - 14:03:16 EST

Previous message: jarkko.hietaniemi@nokia.com: "RE: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)"
In reply to: Dan Kogai: "Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)"
Next in thread: Dan Kogai: "Re: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)"
Next in thread: jarkko.hietaniemi@nokia.com: "RE: Synthetic scripts (was: Re: Private Use Agreements and Unapproved Characters)"

On Friday, March 15, 2002, at 09:15 AM, Dan Kogai wrote:

> You may say this can be resolved by regarding each Kanji not as a
> character but a word (lexically speaking this does make sense) then use
> some sort of ligature to represent one. That way you can reduce the
> number of code point down to the number of Bushu.

Actually, this is not quite true.

Composition schemes for Han are straightforward in theory but very
difficult to put into practice. For one thing, most of the phonetic
elements aren't bushu at all. For another, specifying the geometries
involved is very complex, and since most of the phonetic elements
available themselves can be broken down into smaller pieces, you have
difficulty specifying where to stop.

Composition schemes for Han have even more difficulty with simple
processes such as equivalence testing, since two glyphic forms for the
same character may have slightly different decompositions. Lexical
processing is also complicated by the inherent presence of
variable-length units.

Unicode 3.0+ does have a mechanism for *describing* ideographs which are
not yet encoded. You may want to examine that to get a fuller account of
the complexities involved in encoding Han via decomposition.
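
To make the equivalence problem concrete, here is a minimal Python sketch
built on the Ideographic Description Characters (U+2FF0..U+2FFB). The
helper names are hypothetical, the arity check is a simplification that
ignores any length limits the standard imposes, and the two descriptions
of U+6E56 are my own illustration rather than anything from a published
table. It shows only that two reasonable decompositions of one character
are different strings of different lengths, and that normalization does
nothing to reconcile them.

```python
import unicodedata

# Ideographic Description Characters, U+2FF0..U+2FFB (added in Unicode 3.0).
IDC_FIRST, IDC_LAST = 0x2FF0, 0x2FFB
# U+2FF2 and U+2FF3 take three operands; the other operators take two.
TRINARY = {"\u2FF2", "\u2FF3"}

def is_idc(ch):
    """True if ch is one of the description operators."""
    return IDC_FIRST <= ord(ch) <= IDC_LAST

def is_well_formed(ids):
    """Check prefix-notation arity only (hypothetical helper)."""
    needed = 1                      # one component is expected at the start
    for ch in ids:
        if needed == 0:             # trailing material after a complete description
            return False
        needed -= 1                 # this character fills one open slot
        if is_idc(ch):
            needed += 3 if ch in TRINARY else 2
    return needed == 0

# Two descriptions of U+6E56: one stops at the phonetic element,
# the other breaks that element down one level further.
shallow = "\u2FF0氵胡"
deep = "\u2FF0氵\u2FF0古月"

print(is_well_formed(shallow), is_well_formed(deep))
print(len(shallow), len(deep), shallow == deep)
# Normalization leaves both sequences alone; neither becomes the encoded character.
print(unicodedata.normalize("NFC", deep) == "湖")
```

Any process that wants to treat the two sequences as the same character
has to carry its own decomposition data and decide for itself where to
stop, which is exactly the difficulty described above.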

> But this approach has already failed when Unicode 2.0 decided to give
> all theoretically possible Hangul distinct code points, unlike Unicode
> 1.0 which used ligature model to represent one char. As a result Hangul
> now even has more code points than Traditional Chinese.

Actually, Unicode 2.0 does *not* give every theoretically possible Hangul
syllable a distinct code point. There are a number of archaic ones which
cannot be represented in the precomposed form.

The number of precomposed Hangul in Unicode is only 11,172, which is
rather fewer than the number of characters in even pre-existing
traditional Chinese character sets such as Big Five. Unicode 1.1 through
2.1 had 20,902 Han ideographs, and the number now tops 70,000. The vast
majority of the characters added so far have been added for traditional
Chinese: fuller coverage of CNS 11643 and complete coverage of the KangXi
dictionary. At a rough guess, there are more than five times as many
characters for traditional Chinese in Unicode as for Korean.
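
The 11,172 figure falls straight out of the arithmetic for modern
syllables: 19 leading consonants, 21 vowels, and 28 trailing positions
(27 consonants plus "none"). The Python sketch below uses the standard
constants for the precomposed block; the function name is my own. It
also shows a syllable with an archaic leading consonant (U+1140 is the
example I picked) staying as a jamo sequence under NFC, since no
precomposed form exists for it.

```python
import unicodedata

# Standard constants for the precomposed Hangul syllable block.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28  # T_COUNT includes "no trailing consonant"

def compose(l_index, v_index, t_index=0):
    """Map modern jamo indices to the precomposed syllable (hypothetical helper)."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

total = L_COUNT * V_COUNT * T_COUNT
print(total)  # 11172

# A modern leading consonant plus vowel composes to a single code point.
modern = unicodedata.normalize("NFC", "\u1100\u1161")
# An archaic leading consonant (U+1140) has no precomposed syllable,
# so the sequence stays as two conjoining jamo.
archaic = unicodedata.normalize("NFC", "\u1140\u1161")
print(len(modern), len(archaic))  # 1 2
```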

> With this Unicode Consortium has lost a good reason to reject new
> proposal to add more characters. If elvish get the code points why not
> real, alive language get more?
> CJK has made the greatest compromise -- the compromise that hardly paid
> off in consequence -- when Unicode was first created. They accepted the
> code point sharing though that hardly make sense linguistically.

Unihan makes perfect sense linguistically. The fundamental unity of the
characters has never been questioned, just whether or not the typical
writing styles found in various East Asian locales are sufficient to
justify deunifying them.

> Then Unicode 2.0 and Hangul Expansion, then Surrogate Pair. What's next?
> Making Unicode 128 bit like IPv6 address so you can include Tengwar and
> Klingon with less objection? I can't help but say give me a break.

May I suggest that you update your understanding of Unicode to something
more current than Unicode 2.0?

Meanwhile, there is no anticipated need for any expansion of the Unicode
code space beyond its current 1,000,000+ available code points. The
issues regarding Tengwar and Klingon have nothing to do with room or the
lack thereof.
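
For anyone who wants to check the figure, the size of the code space is
simple arithmetic: 17 planes of 65,536 code points each. The Python
sketch below works it out and shows how a supplementary code point maps
onto a surrogate pair in UTF-16. The variable and function names are my
own, and U+20000 is just a convenient example from Plane 2.

```python
import sys

PLANES = 17
PLANE_SIZE = 0x10000

code_points = PLANES * PLANE_SIZE        # 1,114,112 in total
surrogates = 0xDFFF - 0xD800 + 1         # 2,048 reserved for UTF-16
scalar_values = code_points - surrogates # what can actually hold characters

def to_surrogates(cp):
    """Split a supplementary code point into a high/low surrogate pair."""
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

print(code_points, scalar_values)
print([hex(unit) for unit in to_surrogates(0x20000)])
# Python's own upper bound matches the top of the code space.
print(hex(sys.maxunicode))
```

Even after setting aside surrogates, private use areas, and
noncharacters, that leaves far more room than the 70,000-odd ideographs
encoded so far.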

> I confess I enjoyed this thread of whether Tengwar should be include in
> Unicode. It's fun. It's cute. But isn't this too much for those who
> accepted the compromise for UNIcode? Tengwar should wait till more
> critical issues are resolved. Many (including me ) would be pissed if
> Tengwar be added BEFORE Ciao-Ciao's poetries and Man-Yo-Shu become
> encodable in Unicode.

Do you have any specific examples of characters in Ciao-Ciao's poetry or
the Manyoshu which are missing? If so, you've got a couple of weeks to
propose them before the door closes on new characters for Han Extension C.

==========
John H. Jenkins
jenkins@apple.com
jenkins@mac.com
http://homepage.mac.com/jenkins/


This archive was generated by hypermail 2.1.2 : Fri Mar 15 2002 - 13:28:22 EST