Re: Letters for Indic transliteration

From: Mark Davis (mark.davis@jtcsv.com)
Date: Wed Jul 20 2005 - 16:38:41 CDT


    First, ISO uses the term 'transliteration' to mean a reversible transformation, and 'transcription' to mean a non-reversible one. However, not everyone follows that practice, so it is best to say 'reversible transliteration' explicitly when you mean it.

    True reversibility is a bit complicated. First, it is only reversible when the domain is restricted. That is, for example, you can have a transliteration T that maps any sequence of Greek letters and punctuation to Latin and back to Greek. But T("aα") => "aa" can't be reversed, since you don't know which started out as Latin and which Greek. This is similar to casing: toLowercase("McGowan") can't be reversed. Secondly, when you design it, the reversibility is usually in one direction. That is, you can design it to do Greek -> Latin -> Greek reversibly, but not (at the same time) Latin -> Greek -> Latin reversibly. Thirdly, the reversibility may be limited to "well-formed" strings.
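
    The domain restriction can be illustrated with a minimal Python sketch. The three-letter table and the function names below are made up for illustration; they are not ICU's rules. Latin letters simply pass through, which is exactly what makes mixed input irreversible.

```python
# Toy Greek -> Latin table; an illustration only, not ICU's rules.
GREEK_TO_LATIN = {"α": "a", "β": "b", "γ": "g"}
LATIN_TO_GREEK = {v: k for k, v in GREEK_TO_LATIN.items()}

def to_latin(text):
    # Characters outside the table (including Latin letters) pass through.
    return "".join(GREEK_TO_LATIN.get(ch, ch) for ch in text)

def to_greek(text):
    return "".join(LATIN_TO_GREEK.get(ch, ch) for ch in text)

def round_trips(text):
    return to_greek(to_latin(text)) == text

# Inside the restricted domain (Greek only) the round trip holds.
in_domain = round_trips("αβγ")
# A mixed string collides with an all-Greek one, so it cannot be reversed.
collision = to_latin("aα") == to_latin("αα")
mixed = round_trips("aα")
# Casing loses information in the same way.
case_collision = "McGowan".lower() == "Mcgowan".lower()
```

    Here in_domain and collision are true while mixed is false: once "aα" has become "aa", nothing records which letter started out as Latin.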

    Thus, for example, an initial breathing mark in (ancient) Greek gets converted to an 'h', so
    Οἱ => Hoi

    Notice also that the H then bears the capitalization. And when converting back, that breathing mark gets pushed to the second vowel. All well and good, but take the following oddly formed Greek original. Unless something special were done, it would map to the same value:
    Ὁι => Hoi
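
    The collision can be reproduced with a small Python sketch. It assumes a toy rule (any rough breathing in the word becomes a leading "h") and a two-letter table; both are simplifications for illustration and not the actual ICU Greek-Latin rules.

```python
import unicodedata

ROUGH_BREATHING = "\u0314"  # COMBINING REVERSED COMMA ABOVE (dasia)
LETTERS = {"ο": "o", "ι": "i"}  # toy table, just enough for this word

def to_latin(word):
    # Decompose so the breathing mark becomes a separate code point.
    decomposed = unicodedata.normalize("NFD", word)
    has_breathing = ROUGH_BREATHING in decomposed
    base = decomposed.replace(ROUGH_BREATHING, "")
    latin = "".join(LETTERS.get(ch.lower(), ch) for ch in base)
    if has_breathing:
        latin = "h" + latin
    # The leading letter takes over the capitalization of the source word.
    if base[:1].isupper():
        latin = latin[:1].upper() + latin[1:]
    return latin

usual = to_latin("\u039F\u1F31")    # Οἱ, breathing on the second vowel
unusual = to_latin("\u1F49\u03B9")  # Ὁι, breathing on the first vowel
```

    Both inputs come out as "Hoi", so the position of the breathing mark is lost unless the unusual form is flagged in some way.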

    And casing causes some ugly problems -- take ἡ: when capitalized, depending on the words around it, it would be best transliterated as either "Hē" or "HĒ"; e.g.
    ἡ μεταφορὰ => hē metaphorà

    Ἡ μεταφορὰ => Hē metaphorà

    Ἡ Μεταφορὰ => Hē Metaphorà

    Ἡ ΜΕΤΑΦΟΡᾺ => HĒ METAPHORÀ

    Choosing the best form depends on some assumptions as to the original text, and may not be determinable.

    What we usually do in ICU is to add an accent (one otherwise not resulting from the transliteration) to indicate an unusually formatted source, to enable reversibility. So, for example, that gets us reversibility with the forms of sigma:

    σὲ - ςὲ => sè - s̱è => σὲ - ςὲ
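
    A rough Python sketch of the marker idea for the sigma example above. It assumes that "unusual" means a final-form sigma in non-final position or the reverse, and it uses a one-entry vowel table; the function names and the position test are illustrative assumptions, not ICU's implementation.

```python
MARK = "\u0331"  # COMBINING MACRON BELOW, used here as the "unusual form" flag
VOWELS = {"\u1F72": "\u00E8"}  # ὲ -> è, toy table for this example only
VOWELS_BACK = {v: k for k, v in VOWELS.items()}

def is_final(text, i):
    # Position i counts as word-final when no letter follows it.
    return i + 1 >= len(text) or not text[i + 1].isalpha()

def to_latin(text):
    out = []
    for i, ch in enumerate(text):
        if ch in ("σ", "ς"):
            expected = "ς" if is_final(text, i) else "σ"
            # Flag the sigma only when its form does not match its position.
            out.append("s" if ch == expected else "s" + MARK)
        else:
            out.append(VOWELS.get(ch, ch))
    return "".join(out)

def to_greek(text):
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "s":
            marked = text[i + 1 : i + 2] == MARK
            end = i + (2 if marked else 1)  # first index after s and its mark
            final = end >= len(text) or not text[end].isalpha()
            usual, other = ("ς", "σ") if final else ("σ", "ς")
            out.append(other if marked else usual)
            i = end
        else:
            out.append(VOWELS_BACK.get(ch, ch))
            i += 1
    return "".join(out)

source = "σ\u1F72 - ς\u1F72"  # σὲ - ςὲ
latin = to_latin(source)
restored = to_greek(latin)
```

    With these assumptions, latin carries the marker only on the second s, and restored equals source.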

    Because the framework makes it easy to chain transformations, one can easily remove the accent (the transliterator "Greek-Latin; nfd; [\u0331] remove; nfc" does that for this case).
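
    For readers without ICU at hand, the effect of the "nfd; [\u0331] remove; nfc" part of that chain can be approximated in plain Python with the standard unicodedata module. This is only a sketch of the same three steps, and strip_marker is a hypothetical helper name.

```python
import unicodedata

def strip_marker(text, marker="\u0331"):
    # Step 1 (nfd): decompose, so the marker is always a separate code point.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2 (remove): drop every occurrence of the marker.
    stripped = decomposed.replace(marker, "")
    # Step 3 (nfc): recompose what is left.
    return unicodedata.normalize("NFC", stripped)

cleaned = strip_marker("s\u00E8 - s\u0331\u00E8")  # sè - s̱è
# Decomposing first matters for precomposed letters such as U+1E0F (ḏ).
precomposed = strip_marker("\u1E0F")
```

    The result is no longer reversible, of course; the marker was the only record of the unusual source form.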

    Where letters need to be separated, because otherwise the source would be ambiguous, we follow the Japanese and Korean transliteration standards' practice of inserting a punctuation character between them to get reversibility.

    ...πσο... ...ψο... => ...p'so... ...pso... => ...πσο... ...ψο...
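
    The separator approach can be sketched in Python as follows. The table covers only the letters in the example, the apostrophe is taken as the separator shown above, and the functions are illustrative rather than ICU's rules. Sigma forms are ignored here to keep the sketch short.

```python
SEP = "'"
SINGLE = {"π": "p", "σ": "s", "ο": "o"}  # toy table for this example only
SINGLE_BACK = {v: k for k, v in SINGLE.items()}

def to_latin(text):
    out = []
    for i, ch in enumerate(text):
        if ch == "ψ":
            out.append("ps")
        else:
            out.append(SINGLE.get(ch, ch))
            # Keep π + σ distinct from ψ by inserting a separator.
            if ch == "π" and text[i + 1 : i + 2] == "σ":
                out.append(SEP)
    return "".join(out)

def to_greek(text):
    out = []
    i = 0
    while i < len(text):
        if text.startswith("ps", i):
            out.append("ψ")
            i += 2
        elif text.startswith("p" + SEP + "s", i):
            out.append("π")
            i += 2  # skip p and the separator; the s is handled next
        else:
            out.append(SINGLE_BACK.get(text[i], text[i]))
            i += 1
    return "".join(out)

separated = to_latin("πσο")
combined = to_latin("ψο")
```

    Both forms round-trip under this sketch. A source that already contains an apostrophe between π and σ would not, which is the "well-formed" caveat again.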

    Mark

    ----- Original Message -----
    From: "Richard Wordingham" <richard.wordingham@ntlworld.com>
    To: <unicode@unicode.org>
    Sent: Wednesday, July 20, 2005 12:21
    Subject: Re: Letters for Indic transliteration

    > Andreas Prilop wrote:
    >
    > > U+090B and U+095C are different letters of the Hindi alphabet
    > > with different pronunciation. They need different Latin letters
    > > in transliteration since transliteration is supposed to be 1-to-1.
    > >
    > > U+090B is R with ring below
    > > U+095C is R with dot below
    > >
    > > They are needed *at the same time* in Hindi (and other Indic
    > > languages).
    >
    > Surely the key point of transliteration is *reversibility* (a.k.a.
    > round-tripping). For example, when transliterating Yi, 'p' and 't' serve as
    > both consonant and tone mark without any ambiguity. After all, one does not
    > use different symbols to transliterate U+090B (the independent vowel) and
    > U+0943 (the dependent vowel). So, does round-tripping actually fail if the
    > same symbol is used for U+090B and U+095C?
    >
    > Richard.