Re: Glyphs of new Unicode 3.0 symbols

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Nov 30 1998 - 14:55:00 EST


Roman suggested:

>
> Speaking of Unicode 3.0 (thank you all for the many enlightening
> details!) I would like to express my wish for the following additions
> to the Unicode 3.0 CD-ROM for implementor's convenience:
>

> 2. Add an "age" field to the unidata.txt to specify since which
> Unicode version each character has been defined:
> "1.0", "1.1", "2.0", "2.1", or "3.0"

This is under active consideration for a much revised and extended
form of the Unicode Character Database data to accompany the release
of the Unicode Standard, Version 3.0. However, do not expect it to
simply be an additional field for the UnicodeData-X.Y.Z.txt file. The
format and field content of that file have been fixed for long enough that
there are multiple implementations out there that parse it with
particular assumptions about its format. There is an ongoing discussion,
but chances are that new data files will be introduced, with similar,
but new formats, for additional information provided about characters
in the future.
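
To make the compatibility concern concrete, here is a minimal sketch of the kind of parser that is already deployed, next to a loader for a separate age file. The 15-field count matches the semicolon-delimited UnicodeData format; the two-column "code;age" layout, the sample age value, and the names `parse_record` and `parse_age_line` are assumptions made up for illustration, not a format anyone has decided on.

```python
EXPECTED_FIELDS = 15  # semicolon-delimited fields per UnicodeData record


def parse_record(line: str) -> dict:
    """Parse one UnicodeData-style record, insisting on the fixed field count."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != EXPECTED_FIELDS:
        # A strict parser like this one breaks as soon as a field is appended.
        raise ValueError(f"expected {EXPECTED_FIELDS} fields, got {len(fields)}")
    return {"code": int(fields[0], 16), "name": fields[1], "category": fields[2]}


def parse_age_line(line: str) -> tuple:
    """Parse one line of a HYPOTHETICAL separate age file: 'XXXX;age'."""
    code, age = line.strip().split(";")
    return int(code, 16), age


record = "0025;PERCENT SIGN;Po;0;ET;;;;;N;;;;;"
print(parse_record(record))

# Appending an age field to the existing record trips the strict parser...
try:
    parse_record(record + ";1.0")
except ValueError as err:
    print("strict parser rejected extended record:", err)

# ...while a separate file leaves existing parsers untouched.
print(parse_age_line("0025;1.0"))
```

Existing code keeps reading the old file unchanged, and only tools that care about the new property need to read the new one.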

>
> 3. Add an "ASCII transliteration" mapping to each Unicode character
> so that it can be rendered readable in ASCII contexts

This suggestion got thoroughly chewed over last week. Suffice it to
say that this is *way* down the priority list for those of us working
on the properties, attributes, and sundry characteristics of characters.
I consider this to be A) a black hole, and B) a great opportunity for
the vendors and industrious entrepreneurs to come up with appropriate
solutions for different classes of applications and groups of customers.
It is certainly not ripe for an ad hoc standardization by the
Unicode Consortium.

>
> 4. Make the names.txt equivalent to the book's charts by illustrating
> it with UTF-8 characters, for example
>
> 0025 % PERCENT SIGN
> x (arabic percent sign - 066A ٪)
> x (per mille sign - 2030 ‰)
> x (per ten thousand sign - 2031 ‱)

This is, of course, a fairly simple thing to do, but it has annoying
edge cases, since there are four digit years and four digit standards
citations in the file that have to be filtered so they don't produce
erroneous conversions. (For an example of the problem, see the note under
U+0197 in the Unicode Standard, Version 2.0.)
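
One way to sidestep the false positives is to convert a four-digit value only when it sits in the cross-reference position, rather than anywhere in the line. The sketch below assumes cross-reference lines look like "x (name - XXXX)", as in the quoted example; the pattern and the function name `annotate` are illustrative assumptions, not the tooling actually used to produce the names list.

```python
import re

# Assumed cross-reference shape: optional indent, "x (", a name, " - ",
# four uppercase hex digits, then ")". Anything else is left alone.
XREF = re.compile(r"^(\s*x \(.* - )([0-9A-F]{4})\)\s*$")


def annotate(line: str) -> str:
    """Append the referenced character to a cross-reference line.

    Expects a line without its trailing newline. Lines that do not match
    the cross-reference shape (years, standards numbers, notes) are
    returned unchanged.
    """
    m = XREF.match(line)
    if m is None:
        return line
    prefix, code = m.groups()
    cp = int(code, 16)
    if 0xD800 <= cp <= 0xDFFF:
        # Surrogate code points cannot be encoded in UTF-8; skip them.
        return line
    return f"{prefix}{code} {chr(cp)})"


print(annotate("x (arabic percent sign - 066A)"))
print(annotate("* see ISO 2047 and DIN 66 213, 1998"))
```

The second call is returned untouched because its four-digit numbers are not in the cross-reference position. A real converter would still need rules for the other line types in the file.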

The transformation from the format of the text-only version of the
names list to the formatted, final version of the names list is fairly
complex and subtle. We will certainly again be placing the text-only
version of the names list on the CD-ROM, but the amount of special-purpose
massaging we do to it is a matter of resource contention with other
tasks for publication.

>
> 6. Add mapping tables for the other ISO standards listed as source
> standards in chapter R.1 but not in mappings/iso*/

As someone else speculated, much of this information is not just
"available" and being held back -- it is implicit in mountains of
standards documents, explicit but scattered in various vendors'
implementations of mappings, but not sitting ready somewhere to just
stick on the CD-ROM.

We'll put what we have available, but even reviewing and updating the
sometimes outdated information in the tables we *do* have is going
to be a major task.

Frank asked:

> > > 237E BELL SYMBOL
> >
> If this is an approved addition, and it is indeed a picture of a
> bell, it can be unified with the "Picture of Bell" character in the
> "Additional Control Pictures for Unicode" proposal.

This one is from ISO 2047 (see also DIN 66 213). Yes, it could (and should)
have been unified with U+2407 SYMBOL FOR BELL, but that is not what the
ISO committee decided. But this is not the only instance in which a graphic
representation of a control code has taken on a life of its own as a
separate graphic character. Think of U+237E BELL SYMBOL now as a cute
little mushroom with legs in the technical symbols area. Apropos for
representation of a door buzzer, or whatever... But if the terminal
graphics proposal needs both a "BEL" and a character for a picture of
a bell, this is it.

Roman asked:

> And is U+3004 JAPANESE INDUSTRIAL STANDARD SYMBOL no corporate symbol?

Yep, but there are always exceptions. This one is in Unicode because,
although this symbol is not in JIS standards (X 0208, for example), it
is universally used in Japanese JIS dictionaries as a little symbol to
indicate the JIS value of a character. So even if someone could point to
a claim that this is a trademarked logo, it has been genericized by
usage.

Tim Partridge asked:

> Back to the subject of what would be useful on the Unicode 3.0 CD, how about
> a list of the characters used by various languages? (Perhaps with
> classifications like "essential" and "only in foreign words".) Could the
> European subsetters be persuaded to contribute their data? The Cyrillic and
> Arabic blocks also merit attention.

This would be a nice thing to have, but is also a tremendous amount of
work and an open-ended project, since there are disagreements about the
status of various letters even within well-known languages, and there
are potentially thousands of languages to deal with.

As for whether the European subsetters could be persuaded to contribute
their data, it might be more efficient for us to simply point at their
results for European languages when they stabilize and are available
in a public place. [CEN Workshop Agreement (CWA) on Alphabets of Europe]

--Ken Whistler


