Re: Unicode transliterations (and other operations)

From: Vladimir Weinstein (weiv@jtcsv.com)
Date: Wed Jul 04 2001 - 12:52:20 EDT


Peter_Constable@sil.org writes:
> There have been some messages in this thread discussing whether something
> is transliteration or transcription. On that point I have two comments:
> first, ISO TC 46 has created definitions for these two terms that apply to
> ISO standards under their purview; these definitions can be found at
> http://www.elot.gr/tc46sc2/purpose.html. Secondly, it is my impression that
> many people use the term "transliteration" in a broader sense than the
> strict definition defined by TC 46. That appears to be the case for the
> help file associated with the ICU demo, which defines transliteration as,
> "the general process of converting characters from one particular script to
> another one". Moreover, there is a need for a term to described a

This is because the ICU implementation of transliteration actually allows for something even more general: converting characters according to a given set of rules. It can be used for both transliteration and transcription as defined by TC 46.

> For example, Kashmiri (India / Pakistan) is written in Devanagari and in
> Nastaliq-style Arabic (aka Persio-Arabic); Wolaytta (Ethiopia) is written
> in Ethiopic and Roman; Tai Dam is written in Tai Dam script, in Lao script
> and in Roman with Vietnamese-style diacritics.

Let me add Serbian to this list: it is written in both the Latin and Cyrillic scripts, with a mapping that is almost one-to-one.

In the case of Serbian:
> There are, in principle, three potential ways to deal with publishing in
> multiple writing systems:
>
> 1. Separate documents are created manually, one for each writing system.

This method is not feasible at all in the case of Serbian.

> 2. A document is created manually in one writing system, and different
> parallel documents are generated through an automated process for the other
> writing systems.

This is the most common practice, although it has some interesting consequences; see below.

> 3. A single document is created that can be displayed in terms of alternate
> writing systems using font mechanisms, possibly relying on transduction
> done within "smart" fonts.

This one is also used.

Here is the case of Serbian. It uses 30 Cyrillic letters or 30 Latin letters. However, some letters of the Latin alphabet are written as two characters. Here are the pairs:
Љ/љ (\u0409/\u0459) == Lj/lj
Њ/њ (\u040A/\u045A) == Nj/nj
Џ/џ (\u040F/\u045F) == Dž/dž (D\u017E/d\u017E)
Ђ/ђ (\u0402/\u0452) is occasionally represented in Latin as Dj/dj, but usually by Đ/đ (\u0110/\u0111)

Transliteration from Cyrillic to Latin is very easy. The only problem is the upper-case forms of the letters above, which can be transliterated either to an upper/lower-case combination or to two upper-case letters, depending on the case of the following letters.
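
The case rule just described could be sketched roughly as follows in Python. This is only an illustration, not ICU's implementation: the function name `cyr_to_lat` and the neighbour-based heuristic for choosing between "Lj" and "LJ" are assumptions made for the example.

```python
# Illustrative sketch: Serbian Cyrillic -> Latin with case-aware digraphs.
# The 30 lower-case Cyrillic letters and their Latin equivalents, in the same order.
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
LAT = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj", "m",
       "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š"]
TABLE = dict(zip(CYR, LAT))


def cyr_to_lat(text):
    out = []
    for i, ch in enumerate(text):
        latin = TABLE.get(ch.lower())
        if latin is None:
            out.append(ch)              # not a Serbian Cyrillic letter: copy as-is
        elif ch.islower():
            out.append(latin)
        elif len(latin) == 1:
            out.append(latin.upper())
        else:
            # Upper-case letter that becomes two Latin letters.
            # Assumed heuristic: use "LJ" if the next character is upper case,
            # or if no letter follows and the previous character is upper case;
            # otherwise use "Lj".
            nxt = text[i + 1] if i + 1 < len(text) else ""
            prev = text[i - 1] if i > 0 else ""
            all_caps = nxt.isupper() or (not nxt.isalpha() and prev.isupper())
            out.append(latin.upper() if all_caps else latin.capitalize())
    return "".join(out)


print(cyr_to_lat("Љубав"))   # Ljubav
print(cyr_to_lat("ЉУБАВ"))   # LJUBAV
```

A single upper-case letter standing alone (for example an initial) is ambiguous under this heuristic and comes out as "Lj".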

Transliteration of Serbian from Latin to Cyrillic is a little more complicated, even when the text is Unicode encoded, for two reasons:
1) if foreign names are not transcribed or tagged, they are simply transliterated letter by letter into Cyrillic, which is always a source of a good laugh for Serbian readers;
2) this one happens extremely rarely: in some words, a two-letter Latin combination should be transliterated to two Cyrillic letters instead of just one. This is the case with some adopted foreign words. However, it is not of much concern in everyday practice.
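
Both points can be seen in a small Python sketch of the Latin-to-Cyrillic direction. It is a sketch under assumptions: the exception table `SPLIT_AT` and the function names are hypothetical, the table holds only two sample words, and the occasional Dj/dj spelling is not handled.

```python
import re

LAT = ["a", "b", "v", "g", "d", "đ", "e", "ž", "z", "i", "j", "k", "l", "lj", "m",
       "n", "nj", "o", "p", "r", "s", "t", "ć", "u", "f", "h", "c", "č", "dž", "š"]
CYR = "абвгдђежзијклљмнњопрстћуфхцчџш"
TO_CYR = dict(zip(LAT, CYR))

# Hypothetical exception list: word -> indexes at which a letter pair
# must NOT be merged into one Cyrillic letter (n + j stays н + ј).
SPLIT_AT = {"injekcija": {1}, "konjunktura": {2}}


def word_to_cyr(word):
    keep_split = SPLIT_AT.get(word.lower(), set())
    out, i = [], 0
    while i < len(word):
        pair = word[i:i + 2].lower()
        if len(pair) == 2 and pair in TO_CYR and i not in keep_split:
            cyr, step = TO_CYR[pair], 2          # lj, nj, dž: try the pair first
        else:
            cyr, step = TO_CYR.get(word[i].lower(), word[i]), 1
        # The case of the result follows the first Latin letter consumed.
        out.append(cyr.upper() if word[i].isupper() else cyr)
        i += step
    return "".join(out)


def lat_to_cyr(text):
    # Convert word by word so the exception table can be consulted.
    return re.sub(r"\w+", lambda m: word_to_cyr(m.group()), text)


print(lat_to_cyr("Njegoš"))       # Његош
print(lat_to_cyr("injekcija"))    # инјекција (without the exception: ињекција)
print(lat_to_cyr("Shakespeare"))  # Схакеспеаре, the untagged foreign name problem
```

A real exception list would have to cover inflected forms as well; matching whole words, as done here, is only enough for a demonstration.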

An interesting but wrong practice, used by a lot of magazines that print in Cyrillic and also have a Latin Internet edition, is to store the Cyrillic version in a Latin-based encoding in which q, w, x and y stand for the Cyrillic letters that take two letters in the Latin representation; for example, W and w represent \u040A and \u045A. Foreign names, however, are not transcribed but written in their original Latin form. So, after conversion from Cyrillic to Latin, Washington becomes Njashington. Of course, if Unicode were used for storing the text, transliteration from Cyrillic to Latin would be correct and almost trivial.

My experience with transliteration is that 'pure' Unicode text is not enough for comfortable transliteration, especially for texts that tend to mix Latin and Cyrillic, as is the case with most technical texts. Some additional tagging is required to make it fully automatic. Otherwise, additional proofreading is required.

I had reasonable success writing MS Word macros that did transliteration. What helped was formatting foreign words differently, using italic or bold.
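
The same idea works outside Word with any kind of tagging. Here is a minimal Python sketch, assuming a made-up `<foreign>...</foreign>` markup for the words to be left alone (this is not the Word macro itself, and `transliterate_untagged` is a hypothetical name). The converter is passed in as a function, so any transliteration routine can be plugged in.

```python
import re


def transliterate_untagged(text, convert):
    # Splitting on a pattern with one capturing group yields alternating pieces:
    # even indexes are ordinary text, odd indexes are the tagged foreign words.
    parts = re.split(r"<foreign>(.*?)</foreign>", text)
    return "".join(part if i % 2 else convert(part)
                   for i, part in enumerate(parts))


# Tiny stand-in converter covering only the letters used in the example.
demo = str.maketrans("Pposetij", "Ппосетиј")

print(transliterate_untagged(
    "Posetio je <foreign>Washington</foreign>.",
    lambda s: s.translate(demo)))
# Посетио је Washington.
```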

Hope this makes sense,
V.

-- 
Vladimir Weinstein, IBM GCoC-Unicode/ICU  Cupertino, CA,  weiv@jtcsv.com


