The "wrong" font (was RE: Japan opposes...)

From: Edward Cherlin (edward.cherlin.sy.67@aya.yale.edu)
Date: Sat Apr 29 2000 - 02:58:19 EDT


The following two cases are essentially the same, so I am answering
them together. In both cases we have characters which are
historically the same, but have significantly different glyphs in
different national standards. In both cases, the national standards
may each have one glyph out of the pair, so neither national standard
can be used to write software that will display in accordance with a
different national preference. However, software that uses both
standards and can convert between them can behave properly.

Unicode contains code points for both characters, and therefore can
be used to write software that handles more than one national
preference. However, it is still not possible to satisfy both
national preferences *in the same rendering*, that is, in the same
place on the same screen at the same time, or on the same piece of
paper.

Unicode thus can do better than either national standard, but is
blamed for not being totally, unreasonably, impossibly perfect.

Case 1

At 11:14 AM -0500 4/27/2000, Beeler, George W., Ph.D. wrote:
[snip]

>Consider the following [This scenario is dreamed up by a language-naive
>American who has tried to understand the problem as explained by our
>Japanese colleagues.] --
>
>Suppose I run a health clinic in Japan, and a Korean comes to me for health
>care and provides an identification number. My system, which uses Japanese
>fonts, sends this number to a master patient index for verification. The
>master index returns to my system a message with the indexed name in
>Unicode. I display or print the patient's name for verification, but the
>verification fails because I used the right codes, but the wrong fonts. As
>a result, the patient says that the display is not his name. This failure
>to match can be very serious, as I must now suspect that the identification
>number was incorrect, and therefore may not find (or trust) prior clinical
>data about this person.

A person of Korean descent living in Japan will soon learn what
Japanese glyphs to look for. A Korean tourist or visiting
businessperson is probably not in the system at all, but is aware
that Korean names can be mangled in Japanese. If a Korean font is
available, and the display can be changed, the patient will recognize
the Korean forms of the character, and there will be no further
problem. If no Korean font is available, then the problem cannot be
corrected in software, regardless of the character set used.

The font change on the display is possible either on a system that
has both Korean and Japanese fonts and can convert between the two
national encodings, or on a Unicode-based system. The Unicode
solution is much easier to implement, since the stored text needs no
conversion, only a different font.

>In order to execute this scenario successfully, I would need to have used
>Korean fonts. But how does my system know this from the Unicode-based
>message? You suggested that "multilingual documents necessarily contain
>tags for sections in different languages." But our messages are not tagged
>documents, they are streams of information. To accomplish the same result,
>we need to go "beyond Unicode only" and tag each phrase (character?) with
>the language in addition to the Unicode character identity.

The display problem can be solved at the user interface level, with
an option to change screen fonts, or within the database, with a
field indicating font preference for foreign patients.
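
A minimal sketch of the database-field approach, in Python. The field
name display_lang, the font names, and the font_for helper are all
hypothetical, chosen only for illustration. The point is that the
name's code points are stored once, and the font choice rides along as
separate data.

```python
from dataclasses import dataclass

# Hypothetical font names; substitute whatever fonts the system really has.
FONT_BY_LANG = {
    "ja": "ExampleMincho-JP",
    "ko": "ExampleBatang-KR",
    "zh": "ExampleSong-CN",
}
DEFAULT_LANG = "ja"  # assumption: the clinic's system defaults to Japanese


@dataclass
class PatientName:
    text: str                         # Unicode code points, identical for every language
    display_lang: str = DEFAULT_LANG  # hypothetical per-patient preference field


def font_for(name: PatientName) -> str:
    """Pick a font from the language preference, falling back to the default."""
    return FONT_BY_LANG.get(name.display_lang, FONT_BY_LANG[DEFAULT_LANG])


# U+91D1 U+76F4 is used here only as a sample string of Han characters.
local = PatientName("\u91d1\u76f4")
visitor = PatientName("\u91d1\u76f4", display_lang="ko")

print(font_for(local))             # ExampleMincho-JP
print(font_for(visitor))           # ExampleBatang-KR
print(local.text == visitor.text)  # True: the stored characters do not change
```

Switching the display for one patient then means changing one field,
not re-encoding the name.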

>Does not this really mean that Unicode has 'failed' to accomplish the goal,
>as it cannot, in this circumstance, stand alone, as intended?

Unicode is not intended to stand alone. There are some applications
for unformatted Unicode text files, but most applications deal in
formatted data, where the programmer can choose fonts in advance or
provide a choice of fonts to the user.

>As you said,
>"Unicode plus language tagging and language-specific fonts is the solution,
>not the problem." Is not part of the problem to avoid double-coding
>character streams?

I don't know what you mean by "double-coding". The language of a text
cannot be reliably inferred from the text, any more than you can tell
from the name whether any particular John Monroe was born in the UK,
the U.S., Argentina, or anywhere else in the world. All multilingual
applications require language tagging.

>To me, the key is in your phrase "... exercised about a single character
>that happens to have different glyphs in language-specific fonts, ..."

Yes, many of them cite two forms of one single character as the
supposed fatal flaw in the entire theory and practice of Unicode. In
the cases brought up here, the supposedly unacceptable "foreign"
glyph turned out to be well-established in the language, though
perhaps more common in a previous century.

>What
>we are trying to enable is human-to-human communication, meaning fonts are
>critical, in a context that is frequently multi-lingual.

Yes, font changes are critical in rendering multiple languages.
English-language fonts will not do for Polish, Russian fonts will not
do for Serbian, and Chinese fonts will not do for Japanese.

>Am I way off base here? And, if not, do you know of strategies that others
>have used to get around this limitation?

The Unicode standard very clearly states that formatting must be done
by means other than character encoding. This is not a limitation, but
a simple necessity. We cannot encode the size of a character in its
Unicode code point, and we cannot encode its language either.
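
As a sketch of what "means other than character encoding" can look
like for a message stream rather than a tagged document: carry the
language beside the text as its own field. The JSON layout and the
field names lang and text below are assumptions made for illustration,
not any real messaging standard.

```python
import json


def make_field(text: str, lang: str) -> dict:
    """Bundle a text value with its language tag as separate data."""
    return {"lang": lang, "text": text}


def encode_message(fields: list) -> bytes:
    """Serialize the fields; the text itself is plain UTF-8 with no inline markup."""
    return json.dumps(fields, ensure_ascii=False).encode("utf-8")


def decode_message(data: bytes) -> list:
    """Recover the list of tagged fields from the wire format."""
    return json.loads(data.decode("utf-8"))


message = encode_message([
    make_field("\u76f4", "ja"),
    make_field("\u76f4", "ko"),
])

for field in decode_message(message):
    # Same code point in both fields; only the out-of-band tag differs.
    print(field["lang"], [f"U+{ord(ch):04X}" for ch in field["text"]])
```

The receiver sees identical character data in both fields and uses the
tag, not the text, to decide which font to apply.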

>Yours .... George Beeler
>
>George W. Beeler, Jr., Ph.D.
>Division Chair, Mayo Clinic
>Siebens 7
>200 First St. SW
>Rochester, MN 55902 USA
>
>email: beeler@mayo.edu
>voice: 507-284-9129
>fax: 507-284-0796

Case 2

At 6:34 AM -0800 4/27/2000, Brendan Murray/DUB/Lotus wrote:
>One objection I received recently was that one might send an e-mail from
>Japan to a recipient in China

In Japanese or Chinese? Well, let that pass.

>and the glyph would change.

This is possible with or without Unicode.

>If this Chinese
>user then printed out the document, this would contain the incorrect glyph,

The glyph cannot be incorrect in itself. In the situation described,
it is correct for the user who printed the message.

>which would cause the earth to stop spinning on its axis if this were
>snail-mailed back to the Japanese originator.

If the glyph had not changed, the printout would have been incorrect
for the recipient, and the sky would have fallen instead.

The problem is that we have hypothetical Japanese and Chinese
correspondents, neither of whom has ever heard that character sets
and fonts differ between their countries, and who have completely
chauvinistic responses when they first encounter the difference. None
of this has anything to do with Unicode.

Unicode neither creates nor solves a clash of cultures. That's a
people problem.

>While I understand that many
>people are jealous of their names, the fact that people don't use accents
>in English is not interpreted as a personal insult by those who have
>accents on their names;

No, it is interpreted as the usual cluelessness on the part of mail
software vendors. Or Microsoft. :)

>similarly most Japanese people don't take the use
>of a Chinese ideograph as a personal insult.

Unfortunately, it is the tiny but loud minority that does take
offense that creates the problem.

>Presumably spelling a Japanese
>person's name using Latin characters must be much more insulting - at least
>the Kanji originally were borrowed from China, while Latin has been foisted
>on the language by the limitations of technology.

Conversely, they appreciate foreigners who get their names rendered
in Kanji on business cards, or who take Japanese names. ("Charin
Mokurai desu. Doozo yoroshiku.")

>I'm willing to predict the next wave of objections, once the Kanji used in
>names have been encoded in Unicode: they'll complain that the fact that
>they're off the BMP means a) Unicode considers Japanese names to be
>unimportant and b) the data is all twice as long as it needs to be. Oh,
>don't forget the new symbols: the snowman in JIS X 0213 is full-width,
>while that in U+2603 is half-width. I have no doubt that we'll receive
>plenty of complaints about these too.
>
>B=

Actually, they will be switched over to Unicode by their operating
system, application, and font vendors without being able to tell the
difference. Anything that can be done in a mixture of national
standards can be done in Unicode. Fonts will soon routinely have
three or more embedded encodings--font vendor codes, one or more code
pages or national standards, and Unicode. It may take as long to get
rid of the code pages and national standards in new fonts as it is
taking to get rid of the DOS underpinnings of Windows (15 years and
counting), but it will happen.
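
On the "twice as long" objection quoted above, the arithmetic is easy
to check. A small Python sketch, using U+76F4 as a sample BMP character
and U+20000 (the first code point of plane 2) as a sample character
outside the BMP; the choice of sample characters is mine.

```python
bmp_char = "\u76f4"       # inside the Basic Multilingual Plane
supp_char = "\U00020000"  # outside the BMP (plane 2)

for label, ch in (("BMP", bmp_char), ("non-BMP", supp_char)):
    utf16 = ch.encode("utf-16-be")  # big-endian, no byte order mark
    utf8 = ch.encode("utf-8")
    print(f"{label}: U+{ord(ch):04X} "
          f"UTF-16 {len(utf16)} bytes, UTF-8 {len(utf8)} bytes")
```

In UTF-16 a character outside the BMP takes a surrogate pair, so it is
indeed twice the size of a BMP character, but only for those
characters, not for the data as a whole.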

And they will still complain. Never mind.

Ed Cherlin
Generalist
Men of one idea, like a hen with one chick,
and that a duckling.--Henry David Thoreau


