From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Nov 09 2003 - 17:55:12 EST
From: "Peter Kirk" <peterkirk@qaya.org>
> >Not at all! Maybe with supplementary markup of my sentence
> >it will be clearer:
> > A "starter sequence" (defective or not) is then an
> > _unordered_ set of {
> > _ordered_ sequences of {
> > characters having the same combining class
> > }
> > }.
> >Then look at where I used the term "set" defined by this sentence, and
> >note that the term "element" refers to an element of the unordered set,
> >i.e. the "ordered sequence of characters having the same combining class".
> >
> >
> OK, this time you are right and I am wrong; although your definition
> does not include all canonically equivalent orderings of your "starter
> sequence" because it excludes ones in which a combining character in
> class b is ordered between two of class a, a not equal to b.
Here again the definition is clear: the coded sequence <a1, b, a2>
contains the unordered set { <a1, a2>, <b> }. Why would you want
a1 and a2 to be in separate elements of the set when they match the
definition of "characters having the same combining class"?
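
As a minimal sketch of this grouping (Python, standard unicodedata module;
the helper name class_groups and the use of a dict keyed by combining class
are only my illustration, not something defined by Unicode):

```python
import unicodedata

def class_groups(seq):
    """Map each combining class to the ordered run of characters having it."""
    groups = {}
    for ch in seq:
        ccc = unicodedata.combining(ch)  # canonical combining class (0 = starter)
        groups[ccc] = groups.get(ccc, "") + ch
    return groups

# <a1, b, a2>: a1 and a2 have class 230 (above), b has class 220 (below).
a1 = "\u0301"  # COMBINING ACUTE ACCENT
b = "\u0323"   # COMBINING DOT BELOW
a2 = "\u0300"  # COMBINING GRAVE ACCENT

groups = class_groups(a1 + b + a2)
# a1 and a2 land in the same element, in their coded order; b is on its own.
```

Dicts compare equal regardless of key order, which models the unordered
set, while each value keeps the coded order of the characters in its class.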
Note however that the sentence is taken out of its context, which also
states a constraint on the definition of "starter sequences". More formally:
- if the unordered set contains an element which is an
ordered sequence of characters of combining class 0 (starters),
then this sequence must contain only one character, and this
character must be the first one coded in the starter sequence.
- if such an element is present, the "starter sequence" is "non-defective",
otherwise it is "defective".
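
The two rules above can be sketched as follows (Python; the function name
classify_starter_sequence and the choice to raise ValueError for input
that is not a single starter sequence are assumptions of this sketch):

```python
import unicodedata

def classify_starter_sequence(seq):
    """Return "non-defective" or "defective" for one starter sequence."""
    if not seq:
        raise ValueError("empty sequence")
    # Positions of the characters whose combining class is 0 (starters).
    starters = [i for i, ch in enumerate(seq) if unicodedata.combining(ch) == 0]
    if not starters:
        return "defective"        # no class-0 element at all
    if starters == [0]:
        return "non-defective"    # exactly one starter, coded first
    raise ValueError("more than one starter, or starter not coded first")

with_base = classify_starter_sequence("e\u0301")    # e + COMBINING ACUTE ACCENT
without_base = classify_starter_sequence("\u0301")  # the accent alone
```

Only the combining class is consulted here; no other character property
is needed to decide between the two cases.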
An interesting property is that a defective starter sequence is necessarily
also part of a defective combining sequence.
But the converse is false: a "defective combining sequence" is not
necessarily part of any "defective starter sequence".
For example: <LF, COMBINING ACUTE> is a *non-defective* starter sequence,
but contains the defective combining sequence <COMBINING ACUTE>, after
the isolated <LF> control (which is not technically a combining sequence,
but is not defective).
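
To make the <LF, COMBINING ACUTE> case concrete, here is a rough sketch
(Python). It assumes the Unicode definition of a base character as a
graphic character (General Category L, N, P, S or Zs) that is not a
combining mark; the helper name starts_with_base is hypothetical:

```python
import unicodedata

def starts_with_base(seq):
    """True if the first character is a base character (graphic, not a mark)."""
    cat = unicodedata.category(seq[0])
    return cat[0] in "LNPS" or cat == "Zs"

lf_acute = "\n\u0301"  # <LF, COMBINING ACUTE ACCENT>

# Starter-sequence view: only the combining class of the first character matters.
first_is_starter = unicodedata.combining(lf_acute[0]) == 0

# Combining-character-sequence view: the accent needs a base character,
# and LF is a control (General Category Cc), not a base.
first_is_base = starts_with_base(lf_acute)
```

The first test succeeds and the second fails, which is the asymmetry
between the two notions of "defective".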
The value of this definition is that almost all Unicode algorithms
actually work on very basic "starter sequences", and not on "combining
character sequences", which can be parsed only once character properties
have been precisely defined.
And canonical normalization _guarantees_ to preserve *only* "starter
sequences" (defective or not), but not necessarily "combining character
sequences" (defective or not), and that's where care must be taken when
encoding text...
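
A small illustration of that preservation (Python, standard unicodedata
module; class_groups is a hypothetical helper, and the sample string is
already fully decomposed so that only canonical reordering is at work):

```python
import unicodedata

def class_groups(seq):
    """Map each combining class to the ordered run of characters having it."""
    groups = {}
    for ch in seq:
        ccc = unicodedata.combining(ch)
        groups[ccc] = groups.get(ccc, "") + ch
    return groups

# a + ACUTE (230) + DOT BELOW (220) + GRAVE (230)
coded = "a\u0301\u0323\u0300"
nfd = unicodedata.normalize("NFD", coded)

reordered = coded != nfd                                # code points moved
same_groups = class_groups(coded) == class_groups(nfd)  # elements unchanged
```

The coded order changes (the class-220 mark moves in front of the
class-230 marks), but the unordered set of per-class ordered sequences
is the same before and after. Composition under NFC, or decomposition of
precomposed characters, changes the characters themselves and is not
covered by this sketch.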
This archive was generated by hypermail 2.1.5 : Sun Nov 09 2003 - 18:30:44 EST