Re: [Proposal] Extended UTF-16 by using Plane 14

From: Geoffrey Waigh (anzu@home.com)
Date: Wed Apr 14 1999 - 03:18:32 EDT


Christian Wittern wrote:
>
> Geoffrey wrote:
>
> > I understand the problem quite well. They want to adapt Unicode to encode
> > a large swath of non-Unicode type data and are now asking the rest of the
> > world to modify their software so that this non-Unicode data will be
> > processed in some fashion correctly.
>
> Hmm, err, this is not exactly the case. Two things are converging here:
> One is the need to deal with characters not currently in any standard. If
> the standard is the current Han-Character Set of Unicode, this represents
> less than 0,1% of most texts. I therefore don't think at this point, that
> they need to be encoded in a standard. They need to be represented in the
> texts, however. Also, to avoid loss of information, the texts are usually
> input with the variants actually used in the printed versions, while for the
> publication, depending on the needs of the intended user community, a
> process of normalization and unification is applied.

I'll take it that what you dispute in my statements above is my understanding
of the problem, and you may well be right, as new surprises are unveiled
daily. However, the Unicode standard is rather adamant about not
being for glyph encoding and is firmly minded that characters be unified.
So however useful glyph variant data is to their project, it is not Unicode
type data. Any scheme that proposes extending UTF-16 and putting that in
the standard is putting a strong onus on Unicode developers to cope with it.

> The other thing is, that a group in Japan has spent the last ten years to
> collect a large number of Han-Characters. They created a set of Truetype
> fonts for more than 90 000 characters, which is available for free download
> at their website (http://www.mojikyo.gr.jp). This collection contains the
> Unicode Han area as one of its many subsets. We find it now convenient to
> use Unicode as a base character set and reference characters from this set
> as necessary. This provides a way to deal with these characters before they
> are encoded by standard bodies (and in fact will provide information on
> usage and frequency, which will help the standard bodies to decide, which
> characters to put into the appropriate slots. In this sense, it is of course
> a temporary solution). Now, although we don't want to use the characters
> from this collection that are already in Unicode, it still seems to be the
> easiest and best, to make this whole collection wholesale available to the
> users. This is what Mr. Maedera is trying to add to his Unicode Editor.

Cobbling together tools to make it easier to write standards proposals is
a handy thing, though Michael Everson seems to do an amazing job despite
his plaints of missing tools. I'm a bit puzzled as to how large the
CJK standards proposal community is and why its very
specialized needs must be handled by very general, widespread software. Mr.
Maedera mentioned that he is using a Win32 API which only accepts UTF-16.
While this sounds like it makes implementation much easier than bypassing
these APIs, it does not sound like justification to redefine UTF-16 and have
Microsoft and everyone else update their APIs so as to minimize the impact
on the implementation of this editor and associated tools.

> > There are already 2 encodings for
> > ISO-10646 which will allow them to store huge quantities of non-Unicode
> > data compatibly. Given the implication that they would not be using the
> > existing CJK in the BMP, I would think that both UTF-8 and UCS-4 are more
> > space and processing efficient than a stream of what would be mostly
> > 12 octet sequences in their data. UTF-16 was designed for something
> > different from what they are trying to accomplish.
>
> As explained above, your implication is wrong.

You are quite right. I erroneously assumed, after you stated that the Private
Use behemoth duplicated 30 000 characters, that they would actually be used. It
appears from your latest description (I won't be so bold as to assert there
isn't another clause waiting in the wings) that these 30 000 characters are
duplicated but are immediately deprecated to limited (if any?) use since
there are shorter encoding sequences available in the BMP? I'm sure some
people on the list are striving hard to find sympathy for the tightness
of coding space issues you are running into because of it.
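
To make the sequence lengths under discussion concrete, here is a minimal
Python sketch that counts the octets one character occupies in UTF-8, UTF-16
and UCS-4. The two sample code points (a Han character in the BMP and a
Plane 15 private-use code point) and the helper name encoded_sizes are
illustrative assumptions, not taken from the proposal. Python's "utf-32-be"
codec stands in for UCS-4 here.

```python
# Sample code points are illustrative choices, not taken from the proposal.
SAMPLES = {
    "BMP Han, U+4E2D": "\u4e2d",
    "Plane 15 private use, U+F0000": "\U000f0000",
}


def encoded_sizes(ch):
    """Return the octet count of one character in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # The "-be" codecs emit no byte order mark, so only the
        # character itself is counted.
        "utf-16": len(ch.encode("utf-16-be")),
        "ucs-4": len(ch.encode("utf-32-be")),
    }


for label, ch in SAMPLES.items():
    print(label, encoded_sizes(ch))
```

Under these assumptions the BMP character takes 3, 2 and 4 octets
respectively, while the Plane 15 code point takes 4 octets in all three
forms.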

Despite the convenience the flexibility of the Unicode design has allowed
for your project, it was not intended to serve as the sole encoding
mechanism for the richness of data you want. A more efficient use of
coding space is certainly going to be more work, including the need to
have someone write extra code in that editor, but it isn't going to
require convincing the rest of the world to modify their systems either.

And to touch on one of Mr. Maedera's points, I strongly suspect that if
UTF-16 runs out of codepoints, quite a few people will deem it a failure
as a character encoding because there are just not that many characters on
this planet as Unicode currently defines them.
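
For a sense of scale, the arithmetic behind the UTF-16 code space with the
existing surrogate mechanism can be sketched as follows. The constant names
are illustrative; the ranges are the standard surrogate blocks
(U+D800..U+DBFF and U+DC00..U+DFFF).

```python
# Size of the Basic Multilingual Plane (U+0000..U+FFFF).
BMP_SIZE = 0x10000

# Surrogate blocks reserved inside the BMP for pairing.
HIGH_SURROGATES = 0xDC00 - 0xD800  # 1024 values, U+D800..U+DBFF
LOW_SURROGATES = 0xE000 - 0xDC00   # 1024 values, U+DC00..U+DFFF

# Every high/low pair addresses one code point beyond the BMP.
supplementary_code_points = HIGH_SURROGATES * LOW_SURROGATES

# BMP code points that remain usable as characters on their own.
bmp_non_surrogates = BMP_SIZE - HIGH_SURROGATES - LOW_SURROGATES

total_addressable = bmp_non_surrogates + supplementary_code_points

print(supplementary_code_points)  # 1048576 (16 planes of 65536)
print(total_addressable)          # 1112064
```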

Geoffrey Waigh


