RE: Unicode Transliteration Guidelines released

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Jan 19 2008 - 17:48:54 CST


    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > behalf of Rick McGowan
    > Sent: Saturday, January 19, 2008 17:58
    > To: unicode@unicode.org
    > Subject: Unicode Transliteration Guidelines released
    >
    > The Unicode CLDR committee has released
    > "Unicode Transliteration Guidelines":
    > http://www.unicode.org/cldr/transliteration_guidelines.html

    Note the following text:
    [quote]
            Even within particular languages, there can be variants according to
            different authorities, or even varying across time (if the authority
            changes its recommendation). The canonical identifier that CLDR uses
            for these has the form:

                    source-target/variant

            The source (and target) can be a language or script, either using
            the English name or a locale code. The variant should specify the
            authority, and if necessary, the year. For example, the identifier
            for the Russian to Latin transliteration according to the UNGEGN
            would be

                    ru-und_Latn/UNGEGN, or
                    Russian-Latin/UNGEGN
    (...)
    [/quote]

    A CLDR bug about the format of this identifier has been open for quite a
    long time. It carries proposed changes, plus comments suggesting that
    giving '-' and '_' different roles is inconsistent with existing practice
    for locale identifiers, where the two are treated as equivalent.
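
    To make the separator problem concrete, here is a minimal Python sketch.
    It is not the CLDR parser; the function names split_cldr_id and
    candidate_splits are hypothetical, and the splitting rule (first '-'
    separates source from target) is my assumption about how the quoted
    format is meant to be read.

```python
def split_cldr_id(identifier):
    """Split 'source-target/variant' into (source, target, variant).

    Assumption: the first '-' separates source from target, and '_' is
    only used inside a locale code such as 'und_Latn'.
    """
    pair, _, variant = identifier.partition("/")
    source, _, target = pair.partition("-")
    return source, target, variant or None


def candidate_splits(pair):
    """List every way to cut a '-'-joined pair into (source, target)."""
    parts = pair.split("-")
    return [("-".join(parts[:i]), "-".join(parts[i:]))
            for i in range(1, len(parts))]


# With '_' kept distinct from '-', the split is unambiguous.
print(split_cldr_id("ru-und_Latn/UNGEGN"))

# If '_' and '-' are treated as equivalent (as locale-id parsers do),
# the same pair no longer has a single reading.
print(candidate_splits("ru-und_Latn".replace("_", "-")))
```

    Once the two characters are folded together, the pair "ru-und-Latn" can
    be cut either after "ru" or after "ru-und", so the boundary between
    source and target is lost.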

    Also, the placement of the variant becomes ambiguous when the
    transliteration is reversed.

    This bug was accepted by a CLDR committee member but deferred for later
    resolution. Apparently it still has that status, and has been forgotten.

    I have recently proposed a solution using another format, based on pure
    locale ids (because transliteration variants are effectively creating locale
    variants by defining an alternate orthography for the associated language):
            ru.und-Latn-UNGEGN
            und-Latn-UNGEGN.ru
    This would also drop support for full language names such as:
            Russian.Latin-UNGEGN
    (because most of these names are not part of the CLDR Root locale, and
    English names for languages are often ambiguous or could create havoc when
    a language name has to include one of the separators needed for parsing)

    The format should then become simply:
            <Source-locale-id>.<Target-locale-id>
    where both locale ids conform to the RFC definition.

    (Note that I suggest treating "." and "/" as equivalent separators between
    the two locales, just as "_" and "-" should be treated as equivalent tag
    separators within a locale id. This keeps the format compatible with
    existing locale id parsers and with resource bundle parsers or resolvers,
    where "/" could cause problems with filesystems.)
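
    As a rough illustration of the proposed format, the following Python
    sketch parses <Source-locale-id>.<Target-locale-id> while accepting "."
    or "/" between the two locales and "-" or "_" between subtags. The names
    parse_transliteration_id and canonical_form are hypothetical, and the
    sketch only splits the identifier; it does not validate the subtags
    against the RFC grammar.

```python
import re


def parse_transliteration_id(identifier):
    """Return (source_subtags, target_subtags) as two tuples of strings.

    '.' and '/' are accepted between the locales; '-' and '_' are
    accepted between subtags. Subtags are not validated.
    """
    parts = re.split(r"[./]", identifier)
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected exactly two locale ids: %r" % identifier)
    source, target = (tuple(re.split(r"[-_]", p)) for p in parts)
    if not all(source) or not all(target):
        raise ValueError("empty subtag in %r" % identifier)
    return source, target


def canonical_form(identifier):
    """Rewrite an identifier using '.' between locales and '-' in subtags."""
    source, target = parse_transliteration_id(identifier)
    return "-".join(source) + "." + "-".join(target)


# Both spellings denote the same transliteration.
print(canonical_form("ru.und-Latn-UNGEGN"))
print(canonical_form("ru/und_Latn_UNGEGN"))

# The reverse direction keeps the variant attached to the Latin locale id.
print(parse_transliteration_id("und-Latn-UNGEGN.ru"))
```

    Because the variant is a subtag of one of the two locale ids, reversing
    the transliteration only swaps the two sides of the separator, and the
    variant stays with the locale it qualifies.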


