Re: Umlaut and Tréma, was: Variation selectors and vowel marks

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Jul 15 2004 - 04:13:07 CDT

    On 15/07/2004 05:00, Asmus Freytag wrote:

    > At 01:52 PM 7/14/2004, Doug Ewell wrote:
    >
    >> It's not German data (with umlauts) that will be affected by this
    >> solution, but non-German data (with diaereses) in German bibliographic
    >> systems. That makes it a much smaller problem.
    >
    >
    > the use of diaeresis is perfectly valid for words in fields that have
    > a language ID 'German'.
    >
    >> The DIN request and the USNB solution didn't address this, because the
    >> problem to be solved was disambiguating {a, o, u}-with-tréma from
    >> {a, o, u}-with-umlaut. If there are combinations of (for example)
    >> a-with-tréma-and-something-else AND ALSO
    >> a-with-umlaut-and-something-else, then those two will need to be
    >> disambiguated somehow. But I strongly doubt that the latter case exists
    >> in German bibliographic data, though of course one never knows.
    >
    >
    > First off, there have to be corresponding entries in the sorting
    > tables used for such data, to make that distinction have the correct
    > effect. Since the sorting tables would not support anything other than
    > <BASE, CGJ, DIAERESIS> there's no reason to introduce other sequences
    > into the data.
    >
    > Secondly, the dieresis is used to indicate that two vowels are
    > pronounced separately. I haven't seen a case where the vowels would
    > already be accented.

    There are such cases, although in most (but not all) of them the vowel
    is technically not "already" accented, because the diaeresis is encoded
    closer to the base letter than the other accent. This is certainly the
    case in Greek, where a diaeresis (indicating separate pronunciation)
    and an accent commonly occur on the same vowel; there are precomposed
    forms in the Greek and Coptic block and in the Greek Extended block.
    There are also a number of precomposed forms in Latin Extended-B and
    Latin Extended Additional with both a diaeresis and another accent.
    Presumably these are used for some language or other (some for Pinyin,
    some for Livonian, others unspecified), so they may occur in German
    bibliographic data, not least because foreign words may be used within
    book titles marked as German. In that database each of them must have
    been encoded either with umlaut or with tréma (presumably according to
    whether the mark is understood as a vowel quality modification or as a
    separation), and there is at least the possibility that some
    combinations have been encoded differently in different places in the
    database. Therefore Unicode does need to consider the issue, both as a
    theoretical one (in terms of its effect on normalisation it is
    essentially equivalent to the theoretical problem with using variation
    selectors with combining characters) and potentially as a practical one.
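
    To make the encoding-order and normalisation points concrete, here is
    a minimal sketch using Python's standard unicodedata module. The
    helper names (diaeresis_is_innermost, survives_nfc) and the choice of
    sample characters are illustrative assumptions only. The sketch also
    assumes, following the reading in this thread, that <BASE, CGJ,
    DIAERESIS> stands for the tréma and plain <BASE, DIAERESIS> for the
    umlaut.

```python
import unicodedata

CGJ = "\u034f"        # COMBINING GRAPHEME JOINER
DIAERESIS = "\u0308"  # COMBINING DIAERESIS


def diaeresis_is_innermost(ch: str) -> bool:
    """True if the diaeresis directly follows the base letter in NFD."""
    nfd = unicodedata.normalize("NFD", ch)
    return len(nfd) > 1 and nfd[1] == DIAERESIS


def survives_nfc(seq: str) -> bool:
    """True if NFC normalisation leaves the sequence unchanged."""
    return unicodedata.normalize("NFC", seq) == seq


# Precomposed letters carrying a diaeresis plus another accent
# (sample selection is an assumption made for illustration).
samples = ["\u0390", "\u01d8", "\u022b", "\u1e4f"]
for ch in samples:
    marks = [unicodedata.name(c) for c in unicodedata.normalize("NFD", ch)[1:]]
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    print(f"    marks, innermost first: {marks}")
    print(f"    diaeresis innermost: {diaeresis_is_innermost(ch)}")

# Assumed convention: plain diaeresis = umlaut, CGJ + diaeresis = tréma.
umlaut = "u" + DIAERESIS
trema = "u" + CGJ + DIAERESIS

# The plain sequence composes to U+00FC under NFC, so it does not survive.
print("umlaut survives NFC:", survives_nfc(umlaut))
# CGJ has combining class 0, so it blocks composition with the base letter.
print("tréma survives NFC: ", survives_nfc(trema))
```

    For U+1E4F the tilde, not the diaeresis, sits next to the base letter,
    which is the "not all" case mentioned above. What a CGJ-marked
    diaeresis should mean when a second accent is present is exactly the
    open question, and the sketch does not attempt to answer it.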

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

