Characters not Glyphs (Was Re: Reviewing IETF documents)

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Apr 16 2001 - 15:18:20 EDT


Harald Alvestrand responded:

> At 23:42 15.04.2001 +0200, Florian Weimer wrote:
> >DougEwell2@cs.com writes:
> >
> > > I hope that the claim of "multiple UTF-8 representations" does
> > > indeed refer to glyphs, in the sense that Unicode contains both
> > > precomposed characters and separable elements, halfwidth and
> > > fullwidth ASCII variants, etc.
> >
> >Yes, I was referring to (sequences of) Unicode characters which
> >represent glyphs looking very similar.
>
> for the least disgusting treatment of this subject in the IETF, refer to
> draft-ietf-idn-nameprep-03.txt.

Neither normalization nor "nameprep" is intended to resolve away all
issues of glyph ambiguity in rendered Unicode.

Normalization is a *character* folding, not a *glyph* folding. And as
John Cowan noted in a separate thread on this topic initiated by Florian
Weimer, no character normalization is ever going to fold away already
encoded character "lookalikes" in distinct scripts -- with the classic
case being lowercase Latin o, lowercase Greek omicron "o", and lowercase
Cyrillic o. Since these three (and many other letters) are *deliberately*
made to have identical or near-identical glyphs in multiple-script
harmonized fonts such as Times-Roman, there is no way you are ever
going to avoid situations involving visually ambiguous glyphs that cannot
be mechanically transcribed (or optically scanned) back from rendered
copy without sophisticated context analysis and heuristics. And even
then the results could be deliberately spoofed.
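To make the point concrete, here is a minimal Python sketch (using the
standard unicodedata module) showing that normalization leaves these
three lookalike letters as three distinct code points:

    import unicodedata

    # Three visually near-identical lowercase letters from three scripts.
    lookalikes = [
        ("\u006F", "LATIN SMALL LETTER O"),
        ("\u03BF", "GREEK SMALL LETTER OMICRON"),
        ("\u043E", "CYRILLIC SMALL LETTER O"),
    ]

    for ch, name in lookalikes:
        # Normalization is a *character* folding: each of these is already
        # a distinct canonical character, so NFC maps it to itself.
        nfc = unicodedata.normalize("NFC", ch)
        print(f"{name}: U+{ord(ch):04X} -> NFC U+{ord(nfc):04X}")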

And depending on the resolution of your fonts, there are many other
accidental lookalikes out there that would easily fool optical scanners
or naive users. Who is going to know that U+16B7 RUNIC LETTER GEBO GYFU G ("X")
is not U+0058 LATIN CAPITAL LETTER X in a sans-serif font? For that matter,
how hard is it to mix up Latin-1 'x' with Latin-1 '×' MULTIPLICATION SIGN?
You are basically at the mercy of the font designer to make distinctions
you feel are critical to keeping characters apart.
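Even compatibility normalization does not help here. A quick check with
Python's unicodedata module confirms that NFKC, the most aggressive
standard folding, keeps both of the pairs above apart:

    import unicodedata

    pairs = [
        ("\u0058", "\u16B7"),  # LATIN CAPITAL LETTER X vs. RUNIC LETTER GEBO GYFU G
        ("\u0078", "\u00D7"),  # LATIN SMALL LETTER X vs. MULTIPLICATION SIGN
    ]

    for a, b in pairs:
        # NFKC folds compatibility variants (e.g. fullwidth forms), but
        # these lookalikes are canonically distinct and stay distinct.
        same = unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)
        print(f"U+{ord(a):04X} vs U+{ord(b):04X}: folded together? {same}")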

Nameprep just adds case folding and a filtering step on top of
normalization. The filtering is aimed at removing characters that
would result in proscribed characters after normalization (such as
characters whose decompositions would include U+002E FULL STOP '.',
and thereby screw up domain name syntax).
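As a rough illustration only (the real draft defines its own mapping and
prohibition tables; the PROHIBITED set below is a stand-in, not the
actual table), a nameprep-like pipeline looks something like this in
Python:

    import unicodedata

    # Stand-in prohibition set; the nameprep draft defines full tables.
    PROHIBITED = {"\u002E"}  # U+002E FULL STOP breaks domain-name label syntax

    def nameprep_sketch(label: str) -> str:
        folded = label.casefold()                           # case folding
        normalized = unicodedata.normalize("NFKC", folded)  # normalization
        for ch in normalized:                               # filtering step
            if ch in PROHIBITED:
                raise ValueError(
                    f"prohibited character U+{ord(ch):04X} after normalization")
        return normalized

    print(nameprep_sketch("Example"))   # -> 'example'
    # U+2024 ONE DOT LEADER only becomes '.' after NFKC, which is why the
    # check has to run on the *normalized* string:
    try:
        nameprep_sketch("\u2024")
    except ValueError as e:
        print(e)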

But nameprep does *not* do glyph folding, nor could it even conceivably
do so, in a fontless context.

>
> we would love to see UNICODE and/or ISO give stable and useful references
> to get us out of trouble in this space.

But if wishes were horses...

I cannot see how it is the responsibility of the Unicode Consortium
or of JTC1/SC2/WG2 to eliminate the problem that in the union of
all writing systems and symbologies of the world, present and past,
there are many entities that have similar or even identical appearances.
And since characters are not glyphs, and since font designers have the
freedom to design how they will for particular markets and users,
those committees could not in principle even predict what pairs of
characters will end up having the same sets of dots turned on in
particular rasterizations of particular glyphs at particular resolutions
using particular font technologies on particular devices.

If you are working on an Internet protocol that is that sensitive to
glyph distinctiveness, then you need to be thinking outside the
box somewhat. Your requirement is then for constrained repertoires
with OCR fonts, or for some other completely unambiguous rendering
mechanism such as bar codes.

It is a complete fantasy to presume that a universal character encoding,
intended for representation of text in *all* writing systems, can
be made visually completely unambiguous on rendering by some fiat
of normalization magic.

> (note: stable is VERY important....people don't love updates that require
> changing software in every PC and PDA in the world....)

I've got no complaint with stability. Any standard or supporting
algorithm that the Unicode Technical Committee produces should be as
stable as humanly feasible. The UTC is well aware of the costs you are
alluding to.

However, getting back to Florian Weimer's concern -- there is never
going to be any *standard* solution provided by the UTC for the
"problem" of making it possible for 3-line PDA displays on
cell phones to display 94,140 distinct Unicode characters
intelligibly and unambiguously to all users of such devices. It
is just the nature of the beast when it comes to dealing with
a *universal* character set.

--Ken


