From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Nov 10 2003 - 06:58:02 EST
On 09/11/2003 22:45, Philippe Verdy wrote:
>From: "Peter Kirk" <peterkirk@qaya.org>
>
>>On 09/11/2003 14:55, Philippe Verdy wrote:
>>
>>>...
>>>
>>>And canonical normalization _guarantees_ to preserve *only* "starter
>>>sequences" (defective or not), but not necessarily "combining character
>>>sequences" (defective or not), and that's where care must be taken when
>>>encoding text...
>>>
>>Surely not. A combining character sequence consists of an optional base
>>character followed by one or more combining characters. Canonical
>>normalisation preserves the sequence of combining characters only,
>>although it may reorder this sequence. It also preserves without
>>reordering the juxtaposition of this sequence to the optional base
>>character. Therefore the combining character sequence is preserved.
>>
>>
>
>That's where we differ:
>The combining character sequence differs from what I define as a starter
>sequence:
>(1) by the fact that it can contain more than one class 0 character (starter),
>namely all class 0 combining characters (gc=Mn), and
>(2) by the fact that a combining character sequence cannot contain some
>class 0 characters (like unagreed PUAs, controls, and line/paragraph
>separators, which are treated individually, but not as a combining character
>sequence).
>
>The second difference is less critical for us (all it does is create
>occurrences of defective combining character sequences in the middle
>of the text), but the first one is critical here...
>
>
This does not affect my argument. A combining character sequence, as
defined, does not perfectly fit your definition "an unordered set of
sequences of characters having the same combining class." But it is
preserved under canonical normalisation. Well, perhaps that depends on what
you mean by "preserved". If you mean that its code point representation
is unchanged, that is not true of your starter sequences either. If it
means that its semantics are unchanged, it is true by definition of any
string of Unicode characters that its semantics are unchanged by
canonical normalisation, or indeed by any transformation into a
canonically equivalent form.
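
To make the two readings of "preserved" concrete, here is a minimal sketch
in Python, assuming the standard unicodedata module. The helper name
canonically_equivalent and the sample strings are illustrative only, not
anything defined by the standard. It shows a case where the code point
representation changes under NFD while canonical equivalence still holds.

```python
import unicodedata


def canonically_equivalent(s, t):
    # Two strings are canonically equivalent when their NFD forms are identical.
    return unicodedata.normalize("NFD", s) == unicodedata.normalize("NFD", t)


# "a" + COMBINING ACUTE ACCENT (class 230) + COMBINING DOT BELOW (class 220)
acute_first = "a\u0301\u0323"
# The same marks with the lower combining class first.
dot_first = "a\u0323\u0301"

nfd = unicodedata.normalize("NFD", acute_first)

# Reading 1: are the code points unchanged? Here they are not,
# because canonical reordering swaps the two marks.
code_points_changed = nfd != acute_first

# Reading 2: is the result canonically equivalent to the input?
still_equivalent = canonically_equivalent(acute_first, nfd)
```

Under the first reading nothing is guaranteed to be preserved; under the
second, every string is.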
>I still maintain that there's no terminology to designate what I call a
>starter sequence.
>
>
>
Agreed. But does it matter? It does so only if this is a meaningful unit
within Unicode. On my understanding, a sequence of combining characters
all of class >0 is meaningful because this is what canonical reordering
operates on. But such a sequence does not necessarily form a unit with
the preceding character.
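
As a rough illustration of the unit that canonical reordering operates on,
the following Python sketch splits a string into maximal runs of characters
with combining class >0. It assumes the standard unicodedata module; the
function names nonstarter_runs and reorder_run are hypothetical. It also
shows a class 0 combining character (U+034F COMBINING GRAPHEME JOINER,
gc=Mn) sitting inside a combining character sequence and splitting such a
run in two.

```python
import unicodedata


def nonstarter_runs(text):
    # Collect maximal runs of characters whose combining class is > 0.
    runs, current = [], []
    for ch in text:
        if unicodedata.combining(ch) > 0:
            current.append(ch)
        elif current:
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs


def reorder_run(run):
    # Stable sort by combining class, which is what canonical reordering
    # does within a single run (decomposition is not handled here).
    return "".join(sorted(run, key=unicodedata.combining))


plain = "a\u0301\u0323b"          # acute (230) then dot below (220)
with_cgj = "a\u0301\u034F\u0323"  # CGJ (class 0) between the two marks

plain_runs = nonstarter_runs(plain)
cgj_runs = nonstarter_runs(with_cgj)
```

In the second string the two marks fall into separate runs, so
normalisation leaves their order alone.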
-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/