Folding algorithm and canonical equivalence

From: Peter Kirk (peterkirk@qaya.org)
Date: Sat Jul 17 2004 - 17:25:09 CDT

  • Next message: Asmus Freytag: "Re: Folding algorithm and canonical equivalence"

    I was just reviewing the UTR #30 draft in response to Rick's notice
    about it. And I believe I may have found a point in which the folding
    algorithm as given may violate the principle of canonical equivalence.
    But I would like some clarification from list members before providing
    formal input on this point.

    Consider a sequence made up of a base character B and two combining
    marks M1 and M2, in which the combining class of M1 is less than that of
    M2. <B, M1, M2> and <B, M2, M1> are canonically equivalent
    representations of the same sequence, but only the former is in
    canonical order. Suppose that a folding is defined including the
    operation <B, M2> -> X, but no other relevant operations. When this
    folding is applied, according to the folding algorithms defined in
    sections 4.1.1 and 4.1.2 of the UTR #30 draft, in step (a) the sequence
    <B, M2, M1> will be folded to <X, M1> and will not be further changed,
    but the sequence <B, M1, M2> will not be changed at all by the folding
    because the sequence <B, M2> will never be found. (By contrast, a
    folding operation <B, M1> -> Y will be applied to both sequences,
    because the canonical decomposition step converts <B, M2, M1> to <B, M1,
    M2> and the folding operation is re-applied and finds a match the second
    time.) The implication is that folding of two canonically equivalent
    strings gives different (and not canonically equivalent) results.

    This is not a purely theoretical point. The Diacritic Folding as
    specified in
    http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt
    includes operations like 05D1 05BC -> 05D1, i.e. <BET, DAGESH> -> BET,
    but no general rule to delete DAGESH (or any other combining marks; I
    think there needs to be such a rule, and I have already posted a formal
    response saying that). Sequences like <BET, DAGESH, PATAH> are very
    common in Hebrew text, and commonly written in this order which is
    logically correct and preferred by current rendering technologies, but
    the canonical order is in fact <BET, PATAH, DAGESH>; thus both sequences
    will be found in data depending on whether or not it has been
    normalised. The effect of applying Diacritic Folding exactly as
    specified is that <BET, DAGESH, PATAH> is folded to <BET, PATAH>, but
    the canonically equivalent <BET, PATAH, DAGESH> is unchanged. (In fact I
    consider that both should be folded to just BET, but that is not what
    the current data file specifies.)

    I hope I have not totally misunderstood the folding algorithm here. But
    it seems to me that what is missing in the algorithm is an initial step
    of normalising the data. The introductory text to section 4 seems to
    suggest that this has been avoided because folding may need to preserve
    the distinction between NFC and NFD data - although the algorithm as
    presented does not in fact do this. Since in practice the input data is
    not necessarily in either NFC or NFD and there is no easy way to detect
    which is being used, the only meaningful approach is for the user of the
    folding to specify whether the output of the folding should be NFC or NFD.

    Of course there might be a real requirement for a folding which, for
    example, removes DAGESH when combined with BET (but not with other base
    characters) irrespective of what other combining marks might intervene.
    But such foldings would need a considerably more powerful folding algorithm.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    


    This archive was generated by hypermail 2.1.5 : Sat Jul 17 2004 - 17:27:34 CDT