RE: "Giga Character Set": Nothing but noise

From: Carl W. Brown (cbrown@xnetinc.com)
Date: Sat Oct 14 2000 - 20:39:36 EDT


Doug,

The problem with languages like Korean is that they carry a lot of
history. With today's font technology there is no reason to have
precomposed characters. If you were to start over again, with no
interest in compatibility with existing code pages, you could drop the
precomposed characters.

This may be what they mean when they talk about being more efficient.

You can come close to selecting Han characters based on radicals. They
probably have a way to select among duplicate matches. Then you could
cut the character set down to bopomofo or even to Latin pinyin. Much
more efficient to do Chinese with about 40 characters.
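
For illustration only, a phonetic lookup with selection among duplicate
matches might look like the sketch below (Python, with a candidate
table invented for the example; no real input-method dictionary):

    # Toy pinyin input-method sketch; the candidate lists are
    # invented for illustration, not taken from a real dictionary.
    candidates = {
        'ma': ['\u5988', '\u9a6c', '\u9ebb'],   # several Han matches for "ma"
        'zhong': ['\u4e2d', '\u949f'],
    }

    def lookup(pinyin, choice=0):
        """Return the chosen Han character among duplicate matches."""
        return candidates[pinyin][choice]

    print(lookup('ma', choice=1))   # the user picks the second candidate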

Hey, with octal I can do everything with 8 characters, or in binary
with two.

Carl

-----Original Message-----
From: Doug Ewell [mailto:dewell@compuserve.com]
Sent: Friday, October 13, 2000 9:59 PM
To: Unicode List
Subject: Re: "Giga Character Set": Nothing but noise

John Jenkins <jenkins@apple.com> wrote:

>> Have we figured out yet what part of "Hamlet" the Giga people claim
>> cannot be encoded in Unicode?
>
> I had to do some head scratching on that one. I finally figured out
> that it was meant rhetorically. Would the inability to encode Hamlet
> be acceptable? No. So why foist on the world a character set (viz.,
> Unicode) that can't handle Chinese properly? Isn't Chinese as
> important as English?
>
> At least, I think that's what they meant.

Yes, I finally figured that out after reading the white paper and doing
a general Web search on Coventive and their "Giga Character Set." As
Ken pointed out, they are based in Taiwan and have the usual focus on
"efficient" CJK encoding and language-specific Han glyphs, along with a
deep conviction that Western-based organizations couldn't possibly get
this stuff right if they tried.

In the white paper, they tip their hand by continually referring to
"display codes" as if displaying glyphs were the only thing character
codes were used for. (What about input, storage, comparison,
collation, etc.?)

There are several misstatements about Unicode, ranging from merely
ignorant to -- David Starner had the right word for it -- outright
slanderous. First, of course, is the premise that "16-bit" Unicode has
room for only 65,536 characters. Most of the perceived shortcomings of
Unicode are based on this falsehood and can be quickly dismissed.
There is also a statement that contiguous ranges of Unicode code units
are assigned to languages, when in fact Unicode maintains a studied
ignorance of language and doesn't even require all characters in the
same script to be encoded in the same block.
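
For the record, the surrogate mechanism is what takes "16-bit" Unicode
past 65,536: a pair of 16-bit code units addresses any of the 1,048,576
supplementary code points. A minimal sketch of the arithmetic (Python
used here purely for illustration):

    # Map a supplementary code point (U+10000..U+10FFFF) to the
    # UTF-16 surrogate pair that encodes it.
    def to_surrogates(cp):
        assert 0x10000 <= cp <= 0x10FFFF
        offset = cp - 0x10000
        high = 0xD800 + (offset >> 10)     # high (lead) surrogate
        low = 0xDC00 + (offset & 0x3FF)    # low (trail) surrogate
        return high, low

    print([hex(u) for u in to_surrogates(0x20000)])
    # ['0xd840', '0xdc00']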

Of course, there is the usual claim that "Unicode can not easily
include the new characters that continue to be formed." Try telling
someone who was in Boston or Athens recently that Unicode's rigid
structure doesn't permit the addition of new characters! Then, another
news flash: Unicode doesn't provide for the reality that "the
directionality of written language can vary." So I guess that means
the Bidirectional Algorithm, the Bidirectional Category field in
UnicodeData.txt, the directional override codes, etc. don't actually
exist.
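
Anyone can check this for themselves: the Bidirectional Category from
UnicodeData.txt is exposed by ordinary libraries. A quick demonstration
(Python's unicodedata module, again just for illustration):

    import unicodedata

    # Bidirectional Category straight from the character database:
    # Latin capital A, Hebrew alef, Arabic-Indic digit one.
    for cp in (0x0041, 0x05D0, 0x0661):
        print(hex(cp), unicodedata.bidirectional(chr(cp)))
    # 0x41 L
    # 0x5d0 R
    # 0x661 AN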

You gotta love the separate, proprietary, *patented* algorithms that
are created to handle each specific language's "peculiarities." Note
how English, French, Spanish, German, Italian, and Portuguese -- all
at least 98% covered by Latin-1 -- each have their own GCS encodings.
When do you suppose we will see the Basque, Sami, Azeri, Yi, Thaana,
etc. algorithms? When Coventive unilaterally decides to support them?
(Ah, but they have thrown in Klingon, just to prove it can be done.)

And, of course, Coventive claims to have improved display performance
dramatically -- 1500x for Korean! -- by composing glyphs dynamically
from component pieces rather than referencing a precomposed glyph from
a "behemoth look-up table." (Do they think some kind of search must
take place to locate the glyph for code point U+mumble?) Conveniently
ignored are the facts that not all CJK characters are decomposable in
this way, that dynamic composition imposes a severe performance hit on
searching and sorting, and that an approach like this would only work
for CJK in any event.
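
No search is needed, of course: a font's cmap maps a code point
straight to a glyph index. A toy sketch of the idea (the glyph numbers
here are invented; real fonts use segmented or hashed tables, but
either way the lookup is direct, not a scan):

    # Hypothetical cmap entries with invented glyph indices.
    cmap = {0x4E00: 1137, 0xAC00: 2201}

    def glyph_for(cp):
        return cmap.get(cp, 0)   # 0 = .notdef; constant-time lookup

    print(glyph_for(0xAC00))   # 2201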

An article in the October 12, 2000 issue of Linux Weekly News
<http://lwn.net/bigpage.php3> tries to explain the benefit: "Many
Asian characters are composites, made up of one or more simpler
characters. Unicode simply makes a big catalog of characters, without
recognizing their internal structure; GCS apparently handles things in
a more natural manner." However, the article does not go on to specify
just what is better, more efficient, or more "natural" about the GCS
approach.

(BTW, an article in the online Taipei Times mentioned that GCS assigns
4 bytes for each code point. So who's inefficient now?)
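
The arithmetic is easy to check. Counting the bytes for a short mixed
Latin/Han sample (Python once more, purely for illustration):

    # 'Hamlet ' plus two Han characters: 9 characters total.
    sample = 'Hamlet \u4e2d\u6587'
    print(len(sample.encode('utf-8')))      # 13 bytes in UTF-8
    print(len(sample.encode('utf-16-be')))  # 18 bytes in UTF-16
    print(4 * len(sample))                  # 36 bytes at a flat 4 bytes/character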

I am sorely tempted to point out that their criticism of CJK glyph
unification in Unicode could be addressed by judicious use of Plane 14
tags, but no matter; Giga is DOA. It is false economy, it attempts to
solve perceived CJK problems by introducing bogus distinctions, it
considers only one aspect of character code processing (display) while
ignoring all others, and it is the patented, proprietary work of one
company. We will never have to worry about Giga, and in a year or so
we will forget it ever existed.
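
For the curious, the Plane 14 tags mentioned above (per RFC 2482) are
just ordinary characters: LANGUAGE TAG (U+E0001) followed by the tag
text shifted into the tag-character range. A minimal sketch:

    # RFC 2482 tag characters mirror printable ASCII at 0xE0000.
    def language_tag(tag):
        return '\U000E0001' + ''.join(chr(0xE0000 + ord(c)) for c in tag)

    tagged = language_tag('zh-TW') + '\u4e2d'   # tag a Han character as zh-TW
    print([hex(ord(c)) for c in tagged])
    # ['0xe0001', '0xe007a', '0xe0068', '0xe002d', '0xe0054', '0xe0057', '0x4e2d']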

-Doug Ewell
 Fullerton, California


