Re: An A is an A is an A

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Aug 29 1996 - 13:18:05 EDT


keld@dkuug.dk writes (in response to kenw@sybase.com) & kenw responds in turn:

> > This discussion may be a little misleading regarding what the characters "are"
> > and what is required in processing.
>
> Yes, but please be careful to distinguish between the ISO standard
> ISO/IEC 10646 and Unicode.

Agreed.

>
> > U+0041 is ALWAYS an A, forever and forever.
> >
> > The sequence U+0041 U+0301 is canonically equivalent to U+00C1 Á (LATIN
> > CAPITAL LETTER A WITH ACUTE, if your mailer trashes that). A conformant
> > process "shall not assume that the interpretation of two canonical-
> > equivalent sequences are distinct." This means that I cannot claim
> > that I had U+0041 U+0301, but you interpreted it as U+00C1, and you're
> > wrong. It DOES NOT mean that all processing is much more complex. It
> > depends entirely on what processing is going on.
>
> This is only true in Unicode, ISO 10646 does not have this equivalence.

It is true that ISO 10646 does not rigorously define canonical equivalence.
It implies that in a Level 3 implementation, a "composite sequence" is
equivalent to a precomposed character (see Note to Clause 23.3, for example).
However, there is not enough information on which to base an implementation
of combining marks. Filling that gap is one of the reasons why the
Unicode 2.0 conformance clause spells out the definition of canonical
equivalence and the UNIDATA.TXT Unicode Character Database contains complete
information from which all canonically equivalent pairs can be derived.
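
As a rough illustration of what canonical equivalence means for a process, here is a minimal sketch using Python's standard `unicodedata` module, which is built from the Unicode Character Database. It is a present-day tool, not something referenced in this thread, and the helper name `canonically_equivalent` is hypothetical. The sketch compares the two sequences discussed above after canonical decomposition (NFD).

```python
import unicodedata

decomposed = "\u0041\u0301"   # LATIN CAPITAL LETTER A + COMBINING ACUTE ACCENT
precomposed = "\u00C1"        # LATIN CAPITAL LETTER A WITH ACUTE

def canonically_equivalent(a: str, b: str) -> bool:
    """Hypothetical helper: two strings are treated as equivalent when
    their canonical decompositions (NFD) are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# A binary comparison sees two different code point sequences...
binary_equal = (decomposed == precomposed)

# ...but a comparison that respects canonical equivalence does not.
equivalent = canonically_equivalent(decomposed, precomposed)
```

A plain buffer copy never needs this step; only processes that interpret the text (matching, collation and the like) have to decide whether to normalize first.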

The failure of 10646 to provide enough information to enable a Level 2 or
Level 3 implementation is, unfortunately, part of the driving force behind
the push to restrict implementations to Level 1 and, concomitantly, to encode
more and more precomposed letter/accent combinations as characters, because
that is the only way to access them in Level 1 implementations.

>
> > If I am doing string copies into buffers, there is no difference whatsoever.
> >
> > If I am doing text matches for other than exact binary matches, then some
> > table lookup is involved, which may require lookahead even in Level 1
> > implementations. Whether this table lookup is "much more complex" using
> > combining characters depends on your implementation of the lookup.
>
> Now you are seriously confusing things. In level 1 (of ISO/IEC 10646),
> you cannot use combining characters and there is thus not
> equivalence as you state.

That claim has nothing to do with combining characters per se. Any matching
or collating algorithm that is not one-to-one has to account for lookahead,
even without combining characters. If I want to match ß (sharp s) against an
-ss- sequence in text, for example, I have to do lookahead in the target text.
This is NO different from the lookahead done in support of combining
characters for similar matching, and there are existing implementations that
demonstrate this by using the same code for both.
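
The point about shared lookahead code can be sketched as follows. This is only an illustration under stated assumptions: the table `FOLD` and the functions `fold` and `loose_find` are hypothetical names, the table holds just two entries, and a real matcher would be driven by full collation or folding data. What it shows is that one table-driven scan with lookahead handles both the -ss- case and the combining-mark case.

```python
# Hypothetical folding table: each key is a sequence that may occur in the
# text, each value is the single match key it is folded to.
FOLD = {
    "ss": "\u00DF",        # -ss- folds to sharp s (no combining characters)
    "A\u0301": "\u00C1",   # A + combining acute folds to precomposed A-acute
}
MAX_KEY = max(len(key) for key in FOLD)

def fold(text: str) -> str:
    """Scan left to right, looking ahead up to MAX_KEY code points."""
    out = []
    i = 0
    while i < len(text):
        # Lookahead: try the longest candidate sequence first.
        for n in range(MAX_KEY, 0, -1):
            chunk = text[i:i + n]
            if chunk in FOLD:
                out.append(FOLD[chunk])
                i += len(chunk)
                break
        else:
            # No table entry starts here; pass the code point through.
            out.append(text[i])
            i += 1
    return "".join(out)

def loose_find(pattern: str, target: str) -> bool:
    """Hypothetical non-binary match: compare folded forms."""
    return fold(pattern) in fold(target)
```

With this table, `loose_find("\u00DF", "Strasse")` and `loose_find("\u00C1", "xA\u0301y")` both go through the same loop; the lookahead does not know or care whether the second code point is a combining mark.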

>
> There is no level 1 on Unicode (as far as I know), it is all level 3
> in 10646 sense.

The Unicode Standard makes no formal distinction of levels of implementation
in the sense that ISO 10646 does. However, implementations of Unicode are
free to restrict their interpretation of Unicode characters to subsets
which fall within the constraints of what would constitute a level 1
implementation of ISO 10646. A very common first implementation of
Unicode is to restrict interpretation of characters to U+0000..U+00FF (i.e.
the Latin-1 subset), which are all non-combining, and which thus constitute
a level 1 implementation in the 10646 sense. [Example: Java, until recently,
was Unicode-based but restricted to interpretation of the Latin-1 subset.]
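
To make the subset restriction concrete, here is a small sketch of the kind of check an implementation might apply. The function names are hypothetical, and using the general category "M" (marks) from Python's `unicodedata` module is an assumption standing in for 10646's own list of combining characters, which remains the authority on what Level 1 excludes.

```python
import unicodedata

def is_latin1_subset(text: str) -> bool:
    """True when every code point lies in U+0000..U+00FF."""
    return all(ord(ch) <= 0xFF for ch in text)

def has_combining_marks(text: str) -> bool:
    """Approximation: treat any character whose general category starts
    with 'M' (Mn, Mc, Me) as a combining character."""
    return any(unicodedata.category(ch).startswith("M") for ch in text)

# Every code point in the Latin-1 range, as one string.
latin1_repertoire = "".join(chr(cp) for cp in range(0x100))
```

Under this approximation, nothing in `latin1_repertoire` is a combining mark, so text restricted to that range never triggers the combining-character handling at all.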

--Ken Whistler


