Re: Umlaut and Tréma, was: Variation selectors and vowel marks

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Jul 15 2004 - 10:47:58 CDT

    Peter Kirk <peterkirk at qaya dot org> wrote:

    >> Nobody doubts that some text exists with multiple accents on vowels.
    >> Where the vowels are not Latin a,o,u, there is no issue at all, in
    >> this case, since there are no differences in German sorting for them.
    >
    > Well, yes, but http://std.dkuug.dk/jtc1/sc2/wg2/docs/n2819.pdf, does
    > not make it clear that the <CGJ, DIAERESIS> sequence is to be used
    > only with Latin a, o and u; rather it states "<CGJ, [DIAERESIS]> →
    > tréma". Perhaps the proposal needs modification to make this point
    > clear, if that is the intention.

    The wording you are looking for is in the first paragraph under the
    heading "Alternative solution":

    "The solution consists, essentially, of using U+034F COMBINING GRAPHEME
    JOINER (CGJ), in its intended semantics in 10646/Unicode, to make the
    relevant sorting, searching, and data mapping distinctions required for
    umlaut versus tréma."

    Note carefully the words "relevant" and "required." The solution
    proposed in N2819 is:

    * RELEVANT only to the characters ( ä ö ü Ä Ö Ü ) which can occur in
    German bibliographic data AND in which the diacritic may represent
    either umlaut or tréma, and

    * REQUIRED only in contexts where this distinction must be made in plain
    text.

    N2819 does not propose <CGJ, U+0308> as a general-purpose representation
    for combining tréma. It is not being proposed for the text of the
    Unicode Standard, or as a UAX or even a public-review issue. The
    solution is intended for the German bibliographers, and presumably for
    anyone else who needs ("required") to make the same ("relevant")
    distinction.
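
    As a rough illustration of how narrow that scope is, here is a minimal
    Python sketch. The helper name encode_diaeresis and the is_trema flag are
    hypothetical, not anything defined in N2819; the only parts taken from
    the proposal are the code points and the restriction to a, o and u.

```python
CGJ = "\u034f"        # U+034F COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # U+0308 COMBINING DIAERESIS

# Assumption for this sketch: the only bases where the diacritic can mean
# either umlaut or tréma are Latin a, o, u (upper and lower case).
AMBIGUOUS_BASES = set("aouAOU")


def encode_diaeresis(base: str, is_trema: bool) -> str:
    """Hypothetical helper: return base + diaeresis as a decomposed sequence.

    CGJ is inserted only when the mark is a tréma AND the base is one of
    the letters where the umlaut/tréma distinction is relevant.
    """
    if is_trema and base in AMBIGUOUS_BASES:
        return base + CGJ + DIAERESIS
    return base + DIAERESIS


# Tréma on u is marked with CGJ; tréma on e is left as a plain diaeresis.
print([f"U+{ord(c):04X}" for c in encode_diaeresis("u", is_trema=True)])
print([f"U+{ord(c):04X}" for c in encode_diaeresis("e", is_trema=True)])
```

    Whether the distinction is "required" at all is left to the caller: a
    process that does not need it in plain text would never set is_trema.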

    > Second, N2819 does not make it clear that the <CGJ, DIAERESIS>
    > sequence is to be used only for Latin script data. I would expect
    > (someone can check this of course, and without checking this is indeed
    > speculation) that there is Greek text in German bibliographic
    > databases in which the Greek diaeresis is represented in ISO 5426 as
    > tréma rather than umlaut; that would be correct because the function
    > of Greek diaeresis is separation rather than vowel modification.

    Unless there is Greek text where U+0308 can represent either a tréma OR
    an umlaut, and unless there is a need to make the distinction in plain
    text -- both of which we know not to be true -- this solution is neither
    relevant nor required.

    > And I would expect an implementer reading N2819 to conclude that all
    > ISO 5426 trémas should be converted to <CGJ, DIAERESIS> as no mention
    > is made of a restriction to Latin script or to just a, o and u.

    I would expect an implementer to read the whole document and understand
    the context in which it is intended.

    > So there is a real chance of a conversion program producing sequences
    > which could confuse normalisation, e.g. <IOTA, CGJ, DIAERESIS, ACUTE>,
    > although hopefully not <IOTA, ACUTE, CGJ, DIAERESIS> which might be a
    > real problem.

    Nobody should be rushing to build conversion programs to convert U+0308
    sequences as described in N2819, unless their client is the German
    library network. Even I won't be doing it, and you know how I am about
    conversion programs.

    > My concern as always is with the apparent inconsistency of bending the
    > normal rules or ignoring the normalisation concerns for German while
    > refusing to do more or less the same for Hebrew. I appreciate that
    > Germany is a larger and richer country than Israel and so, at least
    > for commercial interests, its concerns deserve some priority. But that
    > should not be a reason to reject as invalid or insignificant issues
    > concerning Hebrew. And the issue of avoiding incompatible
    > representation of the same data is a real one for Hebrew Holam Male
    > vs. Vav Haluma just as it is for German umlaut vs. tréma.

    Ken Whistler already tried to explain, I think twice, that this use of
    CGJ to affect collation has nothing to do with your proposal to use
    variation selectors to affect rendering of combining marks.
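
    To make the collation point concrete, here is a toy Python sort key. It
    is not the Unicode Collation Algorithm and not anything specified in
    N2819; it assumes, for illustration only, that an umlaut on a, o or u
    sorts as the base letter plus "e" while a tréma sorts as the bare base
    letter. The function name sort_key is hypothetical.

```python
import unicodedata

CGJ = "\u034f"        # U+034F COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # U+0308 COMBINING DIAERESIS


def sort_key(text: str) -> str:
    """Toy key: umlaut on a/o/u -> base + 'e'; <CGJ, U+0308> -> base only.

    All other combining marks are dropped. This is an assumed, simplified
    rule set for illustration, not a real collation tailoring.
    """
    decomposed = unicodedata.normalize("NFD", text)
    out = []
    i = 0
    while i < len(decomposed):
        ch = decomposed[i]
        if decomposed.startswith(CGJ + DIAERESIS, i + 1):
            # Tréma marked with CGJ: keep the base letter, skip both marks.
            out.append(ch.lower())
            i += 3
        elif ch in "aouAOU" and decomposed.startswith(DIAERESIS, i + 1):
            # Plain diaeresis on a/o/u is treated as umlaut.
            out.append(ch.lower() + "e")
            i += 2
        else:
            if not unicodedata.combining(ch):
                out.append(ch.lower())
            i += 1
    return "".join(out)


print(sort_key("M\u00fcller"))           # umlaut
print(sort_key("Hau\u034f\u0308y"))      # tréma marked with CGJ
```

    Nothing in this sketch touches rendering: the CGJ only changes which
    branch the key function takes.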

    And I already tried to explain, at least twice, that the N2819 solution
    does *not* affect normalization. This is explained very clearly in the
    document. You are not reading.
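
    For anyone who wants to check this rather than take my word for it, here
    is a small Python sketch using the standard unicodedata module. The
    variable names are mine; the behaviour shown is what the normalization
    forms do with a character of combining class 0 that has no decomposition,
    which is what CGJ is.

```python
import unicodedata

CGJ = "\u034f"        # U+034F COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # U+0308 COMBINING DIAERESIS
ACUTE = "\u0301"      # U+0301 COMBINING ACUTE ACCENT
IOTA = "\u03b9"       # U+03B9 GREEK SMALL LETTER IOTA

umlaut = "a" + DIAERESIS        # plain sequence, composes to U+00E4
trema = "a" + CGJ + DIAERESIS   # the sequence described in N2819

# CGJ has canonical combining class 0, so marks are not reordered across it.
assert unicodedata.combining(CGJ) == 0

# The CGJ sequence passes through every normalization form unchanged.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    assert unicodedata.normalize(form, trema) == trema

# The plain sequence still composes as usual, so the two stay distinct.
assert unicodedata.normalize("NFC", umlaut) == "\u00e4"
assert unicodedata.normalize("NFC", trema) != "\u00e4"

# The Greek sequences mentioned earlier in this thread:
seq1 = IOTA + CGJ + DIAERESIS + ACUTE
seq2 = IOTA + ACUTE + CGJ + DIAERESIS
nfc1 = unicodedata.normalize("NFC", seq1)
nfc2 = unicodedata.normalize("NFC", seq2)
print([f"U+{ord(c):04X}" for c in nfc1])
print([f"U+{ord(c):04X}" for c in nfc2])
```

    In the second Greek sequence the iota and acute compose to U+03AF ahead
    of the CGJ, and the diaeresis stays separate after it.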

    You will get nowhere at all, and lose any remaining credibility, by
    claiming that these decisions are being made based on political or
    economic favoritism rather than technical differences.

    > I am not actually asking for variation selectors with combining marks
    > because I realise that the UTC has already made a decision and is
    > unlikely to reverse it. But I am asking for some flexibility on some
    > of the principles, of the kind which has been demonstrated with umlaut
    > and tréma, and also in the Indic scripts proposal under review, in
    > order to find an acceptable solution to a real problem.

    OK, readers, whom does Peter sound like?

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


