Re: Unicode transliterations (and other operations)

From: Martin Heijdra (mheijdra@Princeton.EDU)
Date: Thu Jul 05 2001 - 09:01:43 EDT

Just FYI:

For a history of the practices of, and the terminology debates about,
transliteration, transcription, etc., see:

Wellisch, Hans H., 1920-, The conversion of scripts, its nature, history,
and utilization / Hans H. Wellisch. -- New York : Wiley, c1978, xviii,
509 p. : ill. ; 24 cm.

The same author has a much shorter bibliography, I think superseded by this.

Martin Heijdra

----- Original Message -----
From: <>
To: <>
Sent: Wednesday, July 04, 2001 4:37 AM
Subject: Re: Unicode transliterations (and other operations)

> On 07/02/2001 02:56:16 PM Mark Davis wrote:
> >For those interested in Transliteration (and other Unicode
> >transformations),
> >there is a new ICU web demo program on
> >
> >
> This opens an area of some interest to me and some of my colleagues.
> There have been some messages in this thread discussing whether something
> is transliteration or transcription. On that point I have two comments:
> first, ISO TC 46 has created definitions for these two terms that apply to
> ISO standards under their purview; these definitions can be found at
> Secondly, it is my impression that
> many people use the term "transliteration" in a broader sense than the
> strict definition given by TC 46. That appears to be the case for the
> help file associated with the ICU demo, which defines transliteration as,
> "the general process of converting characters from one particular script to
> another one". Moreover, there is a need for a term to describe a
> particular situation that is very common around the world, and so far as I
> know the term transliteration is the only term that comes close to
> describing that phenomenon. It is this phenomenon which is the focus of
> interest for me and my SIL colleagues: a single language that is written by
> different portions of the language community in different writing systems,
> particularly different writing systems based on different scripts.
> For example, Kashmiri (India / Pakistan) is written in Devanagari and in
> Nastaliq-style Arabic (aka Persio-Arabic); Wolaytta (Ethiopia) is written
> in Ethiopic and Roman; Tai Dam is written in Tai Dam script, in Lao script
> and in Roman with Vietnamese-style diacritics.
> This phenomenon is of particular interest and concern for applied linguists
> involved in literacy and literature development: for literacy, they might
> need to assist people in learning how to make the transition between one
> writing system and another, and they certainly need to develop different
> sets of literacy materials for each writing system (probably with
> significant duplication in content). For those working on literature
> development, there is a repeated need to publish documents in multiple
> writing systems. For large publications that are developed over long
> periods of time, such as dictionaries or translations of long works such as
> the Bible, issues of versioning and data management become particularly
> focal: the opus is going to be edited and revised literally hundreds of
> times: if one has to maintain three copies (corresponding to three writing
> systems) of a document through dozens of changes each working day over
> (say) an eight-year period, that is a lot of additional work.
> Clearly in situations such as this, there would be a significant benefit to
> be gained if it were possible for a person to create a document in one
> writing system and have the parallel documents in the other writing systems
> generated by some automated processes.
> There are, in principle, three potential ways to deal with publishing in
> multiple writing systems:
> 1. Separate documents are created manually, one for each writing system.
> 2. A document is created manually in one writing system, and different
> parallel documents are generated through an automated process for the other
> writing systems.
> 3. A single document is created that can be displayed in terms of different
> writing systems using font mechanisms, possibly relying on transduction
> done within "smart" fonts.
> (Note that I say these are *potential* possibilities; there are additional
> factors such as whether a spelling in one writing system contains adequate
> information to determine a unique spelling in a different writing system,
> that is, whether one can be generated deterministically from the other.)
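
The determinism question in the note above can be made concrete with a toy
example. What follows is only a minimal sketch under stated assumptions: the
table is invented for illustration and does not describe any real orthography,
and the function names (transliterate, invert) are hypothetical, not part of
ICU or any other library.

```python
# Toy, invented correspondence between two writing systems (an assumption,
# not a real orthography): the digraph "sh" in system A is one letter in B.
TABLE_A_TO_B = {"sh": "š", "s": "s", "h": "h", "a": "a"}


def transliterate(text, table):
    """Greedy longest-match replacement of source units with target units."""
    units = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for unit in units:
            if text.startswith(unit, i):
                out.append(table[unit])
                i += len(unit)
                break
        else:
            out.append(text[i])  # pass through anything not in the table
            i += 1
    return "".join(out)


def invert(table):
    """Return the reverse table, or None if two source units share a target."""
    reverse = {}
    for src, dst in table.items():
        if dst in reverse:
            return None  # the target spelling does not identify its source
        reverse[dst] = src
    return reverse


TABLE_B_TO_A = invert(TABLE_A_TO_B)
```

Even though this table can be inverted, the two directions are not symmetric:
the sequence "s" + "h" written in system B comes back from a round trip as the
single letter "š". That is the kind of case in which one spelling does not
carry enough information to determine the other.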
> There are plenty of cases in which the first method has been used. We have
> done some implementations of both the second and the third varieties. For
> example, last year we developed a system of the second variety that
> simultaneously supports both Ethiopic and Roman writing systems using a
> custom encoding and Worldscript and GX (yes, GX, not AAT), and that is
> being used by a linguist for work on the Koorete language in Ethiopia. Our
> SIL Hebrew font package includes the third variety as a capability: the
> Ezra "Standard Encoding" permits changing between Hebrew script and
> Roman-based transliteration / transcription (it's usually called the
> former, but it's probably the latter by TC 46's definitions) by changing
> between the Hebrew or Roman-transliteration fonts included in the package.
> Some years ago, we did a Tai Dam package using Worldscript and GX (this was
> first done as a trial to see how far these technologies could be pushed) in
> which a single encoded representation can be displayed in Tai Dam, Lao and
> Roman orthographic representations and also a Roman quasi-phonemic
> representation (direct, un-transduced representation of the encoded data),
> and changing from one to another is a matter of simply changing fonts.
> In those situations, we created these implementations using custom
> encodings. These could potentially have been based on Unicode encoding,
> however. Now, one might think, "well, displaying a Unicode character in the
> Ethiopic range using glyphs for Roman script goes against the conformance
> requirements, specifically requirement C7." That's actually not a problem,
> provided that it isn't being done unknowingly on the assumption that
> characters are being rendered without reinterpretation. The
> reinterpretation is a legitimate higher-level protocol, so implementations
> of the third variety do not constitute conformance violations.
> One more note in relation to the third method: some consideration has been
> given recently to registering an OpenType feature for specifically this
> type of implementation. Because of the nature of OpenType, there are some
> definite limitations regarding what type of "transliterations" (using the
> broader definition) are possible. For example, going between, say,
> Devanagari and Roman might not be possible in OpenType due to reordering
> issues, whereas it would be possible (assuming a deterministic mapping between
> the encoded representation and the two writing systems) in either AAT or
> Graphite.
> I'll stop at this point, saying that this is simply some background on
> things my colleagues and I have looked at some. We have quite a number of
> users who we are supporting who are dealing with these multiple-writing
> system scenarios in their work. There are a number of issues that are
> involved in any of these situations. The biggest are:
> - What does it take to have an encoded representation that contains all the
> info needed to represent multiple writing systems based on different
> scripts?
> - What usability issues are there in various possible implementations?
> So, I offer that as a discussion starter if others are interested.
> - Peter
> --------------------------------------------------------------------------
> Peter Constable
> Non-Roman Script Initiative, SIL International
> 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
> Tel: +1 972 708 7485
> E-mail: <>

This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 13:48:07 EDT