Re: Identifiers

From: Florian Weimer (fw@deneb.enyo.de)
Date: Mon Apr 16 2001 - 14:37:26 EDT


DougEwell2@cs.com writes:

> > > In general, the problem is unsolvable. There are several look-alikes
> > > among the Cyrillic, Greek, Latin and Cherokee blocks, among others.
> >
> > And those are not equivalent under normalization? That's a pity.
>
> As others have explained, Unicode does not specify (nor should it) any type
> of "normalization" mechanism to equate similar-looking glyphs that belong to
> different scripts.

There should be a method to overcome the source separation rule, which
may have saved certain identical characters from unification.

> - U+0048 LATIN CAPITAL LETTER H
> - U+0397 GREEK CAPITAL LETTER ETA
> - U+041D CYRILLIC CAPITAL LETTER EN
> - U+13BB CHEROKEE LETTER MI

If these were Han glyphs, they would have been unified, wouldn't they? ;-)
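
For anyone who wants to check the point about normalization, here is a
minimal sketch using Python's standard unicodedata module. The names
LOOKALIKES and distinct_after are made up for this illustration. It
only counts how many distinct strings remain after applying each
normalization form to the four characters listed above.

```python
import unicodedata

# The four look-alike characters quoted above.
LOOKALIKES = ["\u0048", "\u0397", "\u041D", "\u13BB"]

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def distinct_after(form):
    """Count the distinct strings left after normalizing each look-alike."""
    return len({unicodedata.normalize(form, ch) for ch in LOOKALIKES})


for ch in LOOKALIKES:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

for form in FORMS:
    print(form, distinct_after(form))
```

If normalization equated any of these characters, the count for that
form would drop below four.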

> There is nothing wrong with this, because as humans we normally have no need
> to identify the script of a single isolated glyph, or else we have some
> context to help us make that determination (such as, H comes after G).

I know that lack of unification is not a problem for humans reading
some document, but it can get pretty complicated as soon as computers
are involved. I've helped people to cope with an environment where the
glyphs from ISO-8859-* were not unified, and this certainly has some
hairy consequences.

I don't think it's a general Unicode problem, but you have to know
about these issues in order to design protocols which permit a large
Unicode subset in identifiers and can nevertheless be used
successfully.
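
One possible protocol-level precaution is to reject identifiers that
mix letters from several scripts. The sketch below is only an
illustration under a loud assumption: Python's standard library does
not expose the Unicode Script property, so the first word of the
character name (LATIN, GREEK, CYRILLIC, ...) is used as a crude
stand-in. The function names scripts_in and is_single_script are
hypothetical.

```python
import unicodedata


def scripts_in(identifier):
    """Return the set of approximate script labels of the letters used.

    ASSUMPTION: the first word of the character name is treated as the
    script. This is a rough proxy, not the real Script property.
    """
    labels = set()
    for ch in identifier:
        # Only letters carry a script here; digits, underscores and
        # other punctuation are ignored.
        if unicodedata.category(ch).startswith("L"):
            name = unicodedata.name(ch, "")
            if name:
                labels.add(name.split()[0])
    return labels


def is_single_script(identifier):
    """True if all letters appear to belong to at most one script."""
    return len(scripts_in(identifier)) <= 1


print(is_single_script("paypal"))
print(is_single_script("p\u0430ypal"))  # second letter is CYRILLIC SMALL LETTER A
```

A check like this does not catch identifiers written entirely in one
script that happen to look like a string in another, so it can only be
one part of the design.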



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:16 EDT