Re: Identifiers

From: DougEwell2@cs.com
Date: Mon Apr 16 2001 - 14:07:20 EDT


Florian Weimer <fw@deneb.enyo.de> wrote:

> > It will always be necessary for people to think a bit when creating
> > their email addresses,...
>
> Well, you can't expect people to know most of Unicode just to choose
> an email address. :-/

and then later:

> > In general, the problem is unsolvable. There are several look-alikes
> > among the Cyrillic, Greek, Latin and Cherokee blocks, among others.
>
> And those are not equivalent under normalization? That's a pity.

As others have explained, Unicode does not specify (nor should it) any type
of "normalization" mechanism to equate similar-looking glyphs that belong to
different scripts.

One of the primary purposes of Unicode is to support many scripts with the
same character set, instead of requiring different 8-bit code pages for
Western European Latin, Eastern European Latin, Latin + Greek, Latin +
Cyrillic, etc. As a result, if this were a Unicode document (encoded in
UTF-8 or by other means), it could contain the glyph H by itself and you
might not have any visual way to tell whether it was:

  - U+0048 LATIN CAPITAL LETTER H
  - U+0397 GREEK CAPITAL LETTER ETA
  - U+041D CYRILLIC CAPITAL LETTER EN

or even

  - U+13BB CHEROKEE LETTER MI

There is nothing wrong with this, because as humans we normally have no need
to identify the script of a single isolated glyph, or else we have some
context to help us make that determination (such as, H comes after G).
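
For anyone who wants to check the four code points listed above, here is a
minimal sketch using Python's standard unicodedata module. The helper names
(describe, normalized_forms) are made up for this illustration. It assumes a
Python 3 interpreter whose Unicode database includes the Cherokee block, and
it only shows that none of the four standard normalization forms maps one of
these letters onto another.

```python
import unicodedata

# The four look-alike code points named in the list above.
LOOKALIKES = ["\u0048", "\u0397", "\u041D", "\u13BB"]

def describe(ch):
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

def normalized_forms(ch):
    """Return the set of results of applying NFC, NFD, NFKC and NFKD."""
    forms = ("NFC", "NFD", "NFKC", "NFKD")
    return {unicodedata.normalize(form, ch) for form in forms}

for ch in LOOKALIKES:
    # Each letter normalizes only to itself, so the scripts stay distinct.
    print(describe(ch), "unchanged by normalization:", normalized_forms(ch) == {ch})
```

The characters are written as escapes so that the source itself stays
unambiguous, which is rather the point of this thread.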

I don't know what a person would intend by deliberately inserting a
similar-looking Greek or Cyrillic letter in the middle of some Latin text. I
do have at least one KOI8-R document which has Latin A and O in place of the
proper Cyrillic versions, but that just shows that this is a multi-script
issue that has little or nothing to do with Unicode.
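
The KOI8-R case can be sketched the same way. This assumes Python 3 and its
built-in "koi8_r" codec; the names PAIRS and koi8r_bytes are made up for the
example. It shows only that the Latin and Cyrillic forms of A and O are
already different bytes in an 8-bit code page, with no Unicode text file
involved.

```python
# Latin and Cyrillic look-alike pairs: capital A and capital O.
PAIRS = [("A", "\u0410"), ("O", "\u041E")]

def koi8r_bytes(ch):
    """Encode one character to its KOI8-R byte sequence (hypothetical helper)."""
    return ch.encode("koi8_r")

for latin, cyrillic in PAIRS:
    # Each pair looks alike on screen but is stored as a different byte.
    print("Latin byte:", koi8r_bytes(latin).hex(),
          "Cyrillic byte:", koi8r_bytes(cyrillic).hex())
```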

-Doug Ewell
 Fullerton, California


