Re: Unicode transliterations (and other operations)

From: Peter_Constable@sil.org
Date: Wed Jul 04 2001 - 04:37:50 EDT


On 07/02/2001 02:56:16 PM Mark Davis wrote:

>For those interested in Transliteration (and other Unicode
>transformations), there is a new ICU web demo program on
>
>http://oss.software.ibm.com/developerworks/opensource/icu/translitdemo...

This opens an area of some interest to me and some of my colleagues.

There have been some messages in this thread discussing whether something
is transliteration or transcription. On that point I have two comments:
first, ISO TC 46 has created definitions for these two terms that apply to
ISO standards under their purview; these definitions can be found at
http://www.elot.gr/tc46sc2/purpose.html. Secondly, it is my impression that
many people use the term "transliteration" in a broader sense than the
strict definition given by TC 46. That appears to be the case for the help
file associated with the ICU demo, which defines transliteration as "the
general process of converting characters from one particular script to
another one". Moreover, there is a need for a term to describe a particular
situation that is very common around the world, and so far as I know
"transliteration" is the only term that comes close to describing it. It is
this phenomenon that is the focus of interest for me and my SIL colleagues:
a single language that is written by different portions of the language
community in different writing systems, particularly writing systems based
on different scripts.

For example, Kashmiri (India / Pakistan) is written in Devanagari and in
Nastaliq-style Arabic (aka Perso-Arabic); Wolaytta (Ethiopia) is written
in Ethiopic and in Roman; Tai Dam is written in Tai Dam script, in Lao
script, and in Roman with Vietnamese-style diacritics.
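
As an aside on the ICU demo's broad sense of "transliteration", here is a
minimal sketch of what that operation looks like programmatically, using
the ICU4J Transliterator API. It assumes a "Devanagari-Latin" system
transliterator is available in the ICU build at hand (transliterator IDs
vary by version), and the class name and example word are mine, not from
the demo:

    import com.ibm.icu.text.Transliterator;

    public class TranslitSketch {
        public static void main(String[] args) {
            // Look up a system transliterator by ID; "Devanagari-Latin"
            // is assumed to be among the IDs this ICU build installs.
            Transliterator t =
                Transliterator.getInstance("Devanagari-Latin");

            // U+0915 U+0936 U+094D U+092E U+0940 U+0930 = "Kashmir"
            // spelled in Devanagari.
            String devanagari = "\u0915\u0936\u094D\u092E\u0940\u0930";

            // Prints a romanization (something like "kasmira" with the
            // appropriate diacritics, depending on the rule set).
            System.out.println(t.transliterate(devanagari));
        }
    }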

This phenomenon is of particular interest and concern for applied
linguists involved in literacy and literature development. For literacy,
they might need to help people learn to make the transition between one
writing system and another, and they certainly need to develop different
sets of literacy materials for each writing system (probably with
significant duplication in content). For those working on literature
development, there is a repeated need to publish documents in multiple
writing systems. For large publications that are developed over long
periods of time, such as dictionaries or translations of long works such as
the Bible, issues of versioning and data management become particularly
acute: the opus will be edited and revised literally hundreds of times. If
one has to maintain three copies of a document (corresponding to three
writing systems) through dozens of changes each working day over (say) an
eight-year period, that is a great deal of additional work.

Clearly in situations such as this, there would be a significant benefit to
be gained if it were possible for a person to create a document in one
writing system and have the parallel documents in the other writing systems
generated by some automated processes.

There are, in principle, three potential ways to deal with publishing in
multiple writing systems:

1. Separate documents are created manually, one for each writing system.

2. A document is created manually in one writing system, and parallel
documents for the other writing systems are generated through an automated
process (see the sketch after this list).

3. A single document is created that can be displayed in terms of alternate
writing systems using font mechanisms, possibly relying on transduction
done within "smart" fonts.

(Note that I say these are *potential* approaches; there are additional
factors, such as whether a spelling in one writing system contains adequate
information to determine a unique spelling in a different writing system -
that is, whether one can be generated deterministically from the other.)
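
To make the second method concrete, here is a hedged sketch of such an
automated process using ICU4J's rule-based Transliterator. The three rules
are purely hypothetical toy mappings from Ethiopic syllables to a Roman
orthography - a real orthography pair would need a complete, linguistically
informed rule set - and the ID "Ethiopic-RomanOrtho" is invented for this
example:

    import com.ibm.icu.text.Transliterator;

    public class ParallelDocSketch {
        public static void main(String[] args) {
            // Toy rules: three Ethiopic syllables mapped to hypothetical
            // Roman spellings. A real system would cover the full script
            // and any context-sensitive spelling conventions.
            String rules =
                "\u1218 > ma;" +   // ETHIOPIC SYLLABLE MA
                "\u1208 > la;" +   // ETHIOPIC SYLLABLE LA
                "\u1230 > sa;";    // ETHIOPIC SYLLABLE SA

            Transliterator toRoman = Transliterator.createFromRules(
                "Ethiopic-RomanOrtho", rules, Transliterator.FORWARD);

            // The "document" here is one word; in practice the same call
            // would be applied to the text of a whole document.
            String source = "\u1218\u1208\u1230";
            System.out.println(toRoman.transliterate(source)); // "malasa"
        }
    }

The reverse direction is where the parenthetical caveat above bites: if the
Roman orthography collapses distinctions that the Ethiopic spellings make
(or vice versa), no rule set can regenerate the source deterministically.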

There are plenty of cases in which the first method has been used. We have
done some implementations of both the second and the third varieties. For
example, last year we developed a system of the second variety that
simultaneously supports both Ethiopic and Roman writing systems using a
custom encoding and WorldScript and GX (yes, GX, not AAT), and that is
being used by a linguist for work on the Koorete language in Ethiopia. Our
SIL Hebrew font package includes the third variety as a capability: the
Ezra "Standard Encoding" permits changing between Hebrew script and
Roman-based transliteration / transcription (it's usually called the
former, but it's probably the latter by TC 46's definitions) by switching
between the Hebrew and Roman-transliteration fonts included in the package.
Some years ago, we did a Tai Dam package using WorldScript and GX (this was
first done as a trial to see how far these technologies could be pushed) in
which a single encoded representation can be displayed in Tai Dam, Lao and
Roman orthographic representations, and also in a Roman quasi-phonemic
representation (a direct, un-transduced representation of the encoded
data); changing from one to another is simply a matter of changing fonts.

In those situations, we created these implementations using custom
encodings. These could, however, have been based on Unicode. Now, one might
think, "well, displaying a Unicode character in the Ethiopic range using
glyphs for Roman script goes against the conformance requirements,
specifically requirement C7." That is actually not a problem, provided it
is not done unknowingly on the assumption that characters are being
rendered without reinterpretation. The reinterpretation is a legitimate
higher-level protocol, so implementations of the third variety do not
constitute conformance violations.

One more note in relation to the third method: some consideration has
recently been given to registering an OpenType feature specifically for
this type of implementation. Because of the nature of OpenType, there are
some definite limitations on what kinds of "transliterations" (in the
broader sense) are possible. For example, going between, say, Devanagari
and Roman might not be possible in OpenType due to reordering issues,
whereas it would be possible (assuming a deterministic mapping from the
encoded representation to each of the two writing systems) in either AAT or
Graphite.

I'll stop at this point, saying that this is simply some background on
things my colleagues and I have looked at. We are supporting quite a number
of users who are dealing with these multiple-writing-system scenarios in
their work. A number of issues are involved in any of these situations. The
biggest are:

- What does it take to have an encoded representation that contains all the
info needed to represent multiple writing systems based on different
scripts? (A small sketch of the under-specification problem follows this
list.)

- What usability issues are there in the various possible implementations?
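
On the first question, here is a hedged toy illustration of why an encoded
representation can fail to contain all the needed information: if a mapping
collapses two spellings into one, there is no deterministic way back. The
rule, ID, and strings below are invented for the illustration:

    import com.ibm.icu.text.Transliterator;

    public class UnderSpecSketch {
        public static void main(String[] args) {
            // One invented rule: e-acute collapses to plain "e" in the
            // target writing system; unmatched text passes through.
            Transliterator t = Transliterator.createFromRules(
                "Toy-Collapse", "\u00E9 > e;", Transliterator.FORWARD);

            System.out.println(t.transliterate("c\u00E9t")); // "cet"
            System.out.println(t.transliterate("cet"));      // "cet"

            // Both inputs yield "cet", so an encoding based on the
            // target writing system alone could never regenerate the
            // source spelling; the encoded form must carry the
            // distinction itself.
        }
    }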

So, I offer that as a discussion starter if others are interested.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>


