Re: Synonyms in Unicode Terminology

From: Peter_Constable@sil.org
Date: Thu Jan 28 1999 - 14:14:53 EST


       Let me comment on some replies to my definition of writing
       system:

       RM>A writing system can include more than one alphabet, or
       script, as an element -- and always includes some behavior.

       I agree that a writing system includes behaviours, but my
       definition didn't allow for it to include more than one script.

       Carl-Martin wrote:

       However, the concept of "writing system" is lacking, and I
       would see it in a somewhat different way than Peter Constable
       did in his contribution...

       In my understanding, a writing system is a concept located on a
       higher level: It is the totality of graphical symbols of the
       semiotic system used by a certain language community.
       Consequently, a writing system may comprise more than one
       script, and in fact even the English (or German or French ...)
       writing system allows for e.g. Roman numbers, technical
       symbols, etc. which, in our daily written communication,
       co-occur with Latin script letters on panels, technical
       instructions, etc. No-one, however, would enumerate e.g. Roman
       numbers in the alphabet. On the other hand, the set of
       different scripts admitted in a definite writing system is
       limited, since the English language community will not
       understand e.g. a Devanagari script unit (as, say, an
       abbreviation or a technical label): no communicative value has
       been defined for this symbol.

       The "classical" example for a complex "writing system",
       according to this understanding, would be the Japanese system
       using (at least!) three scripts simultaneously.

       I dare state that in terms of information technology, "writing
       system" almost corresponds to "locale", at least as far as the
       use of graphical symbols is concerned.

       [end of quotation]

       I agree that Carl-Martin's perspective is different from mine,
       though I think his use fits with the way Rick was using the
       term.

       Let me expand a little on how I arrived at my definition. (This
       is also the definition being used by my co-workers with the
       Non-Roman Script Initiative and those developers in SIL that
       are working on implementing multilingual capabilities in SIL
       software.)

       My perspective comes from my background as a linguist
       previously working in Southeast Asia and now working as part of
       a team trying to address the needs of enabling software to work
       for *all* of the world's thousands of languages, minor as well
       as major. (You can look at www.sil.org/ethnologue/ to see what
       our mandate is.) As we have looked at these issues, it has not
       been numerals and technical symbols that have presented the
       biggest challenges. Rather, it is the world's scripts, and
       minority language orthographies based on those scripts. In what
       we have been needing to do, technical symbols (other than IPA)
       were not at all in our minds as we talked of writing systems.

       We recognised that, as people talk of scripts, they can tell
       unambiguously (assuming familiarity) that a character belongs
       to some particular script but not to others. For example,
       nobody would dispute that the character which has the Unicode
       name ETHIOPIC SYLLABLE XWA belongs to a script that most people
       call "Ethiopic". At the same time, there are many languages
       that are written using Ethiopic script, and not all of them use
       the character just mentioned. Likewise, not all of these
       languages necessarily have the same collation sequence
       (collation sequences certainly can't be the same if the
       orthographic inventories aren't the same). For that matter, for
       one of these languages, it's possible that there may be more
       than one collation sequence involved.
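
       As a rough sketch of the point above (not SIL's code), the
       Python below looks a character up by its Unicode name and
       treats an orthographic inventory as a subset of a script.
       The helper script_prefix, which guesses the script from the
       first word of the character name, and the two inventories
       are assumptions made for illustration; they are not the
       orthographies of any real language.

```python
import unicodedata

def script_prefix(ch: str) -> str:
    """Heuristic only: take the first word of the Unicode name as the script."""
    return unicodedata.name(ch).split()[0]

XWA = unicodedata.lookup("ETHIOPIC SYLLABLE XWA")
HA = unicodedata.lookup("ETHIOPIC SYLLABLE HA")

# Hypothetical inventories for two unnamed languages (placeholder data).
inventory_a = frozenset({HA, XWA})
inventory_b = frozenset({HA})

def uses_only(inventory, script: str) -> bool:
    """True if every character in the inventory belongs to the given script."""
    return all(script_prefix(ch) == script for ch in inventory)

print(script_prefix(XWA))                      # script guessed from the name
print(uses_only(inventory_a, "ETHIOPIC"))      # both inventories share a script
print(XWA in inventory_a, XWA in inventory_b)  # but differ in their contents
```

       Two inventories drawn from the same script can differ, so
       any collation sequence defined over one cannot simply be
       reused for the other.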

       The way a given script is used in a particular language
       includes an orthographic inventory that is a subset of those
       characters in the script, and it includes language-specific
       collation sequences. It can also include language-specific
       behaviour. Let me give an example with Thai script: This script
       includes certain characters which are written above other
       characters, such as SARA II, MAI EK, MAITAIKHU. Let's call this
       set Cs (combining (superior)). Now, in the implementation of
       this script for the Siamese (Std Thai) language, MAITAIKHU
       cannot co-occur with any other Cs character. This is part of
       the behaviour of Siamese writing, and software implementations
       often enforce this behaviour. This happens, for example, in
       Thai versions of Microsoft software. When Thai script is used
       for writing other languages spoken in Thailand (e.g. Bru) which
       have quite different phonology, however, it may be necessary to
       combine MAITAIKHU with other Cs characters. Thus, the writing
       behaviours of Siamese and Bru are different.
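
       The Thai example can be sketched as a small validity check.
       This is only an illustration under stated assumptions: the
       set CS holds just the three characters named above (the
       real set is larger), and the function names and the
       maitaikhu_exclusive flag are hypothetical, not taken from
       any existing software.

```python
import unicodedata

def thai(name: str) -> str:
    """Look up a Thai character by the tail of its Unicode name."""
    return unicodedata.lookup("THAI CHARACTER " + name)

KO_KAI = thai("KO KAI")        # a base consonant, used in the examples
SARA_II = thai("SARA II")
MAI_EK = thai("MAI EK")
MAITAIKHU = thai("MAITAIKHU")

# Assumption: only the three Cs characters mentioned in the text.
CS = {SARA_II, MAI_EK, MAITAIKHU}

def cs_runs(text: str):
    """Yield each maximal run of consecutive Cs characters in text."""
    run = []
    for ch in text:
        if ch in CS:
            run.append(ch)
        elif run:
            yield run
            run = []
    if run:
        yield run

def is_valid(text: str, maitaikhu_exclusive: bool) -> bool:
    """Reject MAITAIKHU stacked with another Cs character if the rule is on."""
    if not maitaikhu_exclusive:
        return True
    return all(not (MAITAIKHU in run and len(run) > 1)
               for run in cs_runs(text))

stacked = KO_KAI + MAITAIKHU + MAI_EK
print(is_valid(stacked, maitaikhu_exclusive=True))   # Siamese-style rule
print(is_valid(stacked, maitaikhu_exclusive=False))  # Bru-style rule
```

       The script and the character set are identical in both
       calls; only the language-specific rule changes.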

       As we enable our software to work with Amharic, Tigrinya and
       Gurage, Siamese, Eastern Red Karen and Bru, we need to define a
       collection of information that describes everything related to
       how a script is used for writing a particular language. This
       isn't the same as the script, because a script is implemented
       differently in different languages. It isn't the same as the
       language, which includes more than just script-related
       information, and a given language can be written with
       completely different scripts (more on that in a moment). We
       needed something below the level of script, below language, and
       "writing system" is what we chose.

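       One way to picture that collection of information is as a
       single record. The sketch below is a guess at a minimal
       shape, not a description of SIL's implementation: the class
       name, its fields, and the toy values are all assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WritingSystem:
    """Hypothetical record: how one script is used for one language."""
    language: str          # the language being written
    script: str            # exactly one script per writing system
    inventory: frozenset   # subset of the script's characters
    collations: tuple      # one or more orderings of the inventory
    behaviours: frozenset  # names of language-specific rules

    def sort_words(self, words, which: int = 0):
        """Sort words by the chosen collation sequence."""
        rank = {ch: i for i, ch in enumerate(self.collations[which])}
        return sorted(words, key=lambda w: [rank[ch] for ch in w])

# Toy data only: an invented language with two collation sequences.
toy = WritingSystem(
    language="ExampleLang",
    script="Latin",
    inventory=frozenset("abc"),
    collations=(("a", "b", "c"), ("c", "b", "a")),
    behaviours=frozenset(),
)

print(toy.sort_words(["cab", "abc", "bca"]))           # first collation
print(toy.sort_words(["cab", "abc", "bca"], which=1))  # second collation
```

       Keeping the script as a single field reflects the
       definition used here; a record that held several scripts
       would correspond to the broader sense of the term.
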
       Now, there is also the issue that a given language may be
       written with more than one script. Both Rick and Carl-Martin
       have referred to this. Carl-Martin gave Japanese as an example,
       and I suspect Japanese and Korean may have been in Rick's mind.
       I have to admit that CJK is not what I'm most familiar with,
       and it wasn't foremost in our thinking as we grappled with
       these issues. There are many cases of languages which are
       written with more than one script, but I think Japanese and
       Korean are exceptions to the norm, even if they are the cases
       most familiar to a lot of people.

       In Japanese and Korean, a single writer will use several
       scripts (Chinese characters and Hangul for Korean; Chinese
       characters, Katakana and Hiragana for Japanese), and will
       even use them in a single document, on a single page, and in
       a single sentence. In these languages, there are certain
       words that can only be written using one or the other script,
       and so a writer may be forced to alternate. There are far more
       cases in the world, however, in which a given writer will use
       exclusively one script, usually because that's the only one
       that they know, and that is the norm for their language. A few
       examples:

       - Serbo-Croatian: written by some using Latin script and others
       using Cyrillic
       - Tai Dam (spoken in Vietnam, Laos, US, France): written by
       some using traditional Tai Dam script, by others using
       Vietnamese-style Latin, and by others using Lao
       - Tai Lue (spoken in Yunnan, Laos, Thailand): written by some
       using Lanna script, by others using New Tai Lue script (a
       simplifying revision of Lanna script with enough changes that
       it should be considered a different, even if related, script)
       - Koorete (spoken in Ethiopia): written by some using Ethiopic
       script, by others using Latin
       - Wolaytta (spoken in Ethiopia): written by some using Ethiopic
       script, by others using Latin
       - Hindi/Urdu: written by some using Devanagari, by others using
       Nastaliq Arabic
       - Duruwa (spoken in India): written by some using Devanagari,
       by others using Oriya

       This is but a small sample of a situation that is evolving as a
       large number of minority languages are just beginning, or have
       yet, to become literary languages. There are other minority
       languages in Ethiopia that use both Ethiopic and Roman; there
       are other languages in India that use more than one script of
       that region, and this is probably true of some neighboring
       countries; I suspect that this situation occurs in Insular
       Southeast Asia; and there are numerous languages in Southeast
       Asia, where languages are often spoken in 2 to 4 countries and
       may also have traditional scripts, for which this is or likely
       will become the case.

       In all of these situations, a given document would generally
       appear in only one script; if more than one script is ever used
       in a single document, it would be a polyglot in which the
       different scripts are clearly separated.

       In summary, for our software development we have needed a
       term for the combination [ language x script ], and have
       called this "writing system". We have chosen to define a
       writing system as the implementation of a single script for
       a given language, since that is by far the most common case
       we will have to deal with. We will need to consider how the
       cases of
       Japanese and Korean will impact us, so it has been good for me
       that this discussion has forced me to think about these cases a
       little more.
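
       The [ language x script ] combination can be sketched as a
       list of pairs, using a few of the examples listed earlier.
       The variable and function names are hypothetical, and the
       script labels are informal strings rather than any
       standard's identifiers.

```python
from collections import defaultdict

# Each pair is one writing system in the sense used in this message.
WRITING_SYSTEMS = [
    ("Serbo-Croatian", "Latin"),
    ("Serbo-Croatian", "Cyrillic"),
    ("Tai Dam", "Tai Dam"),
    ("Tai Dam", "Latin"),
    ("Tai Dam", "Lao"),
    ("Hindi/Urdu", "Devanagari"),
    ("Hindi/Urdu", "Arabic"),
]

def scripts_by_language(pairs):
    """Group writing systems by language, keeping the listed order."""
    index = defaultdict(list)
    for language, script in pairs:
        index[language].append(script)
    return dict(index)

for language, scripts in scripts_by_language(WRITING_SYSTEMS).items():
    print(language, "->", ", ".join(scripts))
```

       Under this definition a language written in three scripts
       simply has three writing systems; the Japanese and Korean
       cases, where scripts mix within one sentence, do not fit
       this shape as neatly.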

       Before I finish, I had mentioned some other sources which gave
       a definition of writing system in line with our use, and I
       thought I'd just mention those:

       The first quotation I was thinking of comes from an article in
       the November 1998 issue of Microsoft Systems Journal,
       "Supporting Multilanguage Text Layout and Complex Scripts with
       Windows NT 5.0", by F. Avery Bishop, David C. Brown, and David
       M. Meltzer, pp. 57 - 70. On page 59 in that article, they give
       a glossary. These are two of the definitions given:

       Script: A collection of characters for displaying written text,
       all of which have a common characteristic that justifies their
       consideration as a distinct set. One script may be used for
       several different languages... and some written languages
       require multiple scripts (for example, Japanese... )...

       Writing system: The collection of scripts and orthography
       required to represent a given human language in visual media.

       Their definition of script is in agreement with mine. Their
       definition for writing system, though, is more in line with
       that given by Rick and Carl-Martin. When I first read the MSJ
       article over three months ago, I was struck most by the fact
       that they were making the important distinction between script
       and writing system, and I didn't take note of the way in which
       their definition of writing system differs from mine. So, my
       memory on the point currently under discussion was in error.
       Again, CJK wasn't a big factor in my thinking, but it very
       obviously has been an important consideration for Microsoft.

       The second source is a manuscript by Richard Sproat (to appear,
       "A Computational Theory of Writing Systems"):

       "...we will use the terms 'script', 'orthography' and 'writing
       system', in their conventional senses as follows. A 'script' is
       just a set of distinct marks conventionally used to represent
       the written form of one or more languages: crucially, one can
       speak of a script without implying its use for a given
       language... On the other hand, a writing system is a script
       used to represent a particular language. Thus 'writing system'
       implies 'writing system for a given language'. We will use the
       terms 'orthography' and 'writing system' interchangeably..."

       (Sproat adds a note here about distinctions between orthography
       and, say, technography as discussed by Mountford which I
       referred to in my original message.) It seems to me that
       Richard's definitions are precisely in agreement with mine. I
       note with interest that Richard's work looks at a variety of
       writing systems and scripts, including Russian, Belorussian,
       Korean, Chinese, Japanese, Devanagari, Pahawh Hmong, Ancient
       Egyptian, Aramaic. While he has considered CJ and K, he is
       attempting to cover scripts and writing systems in full
       generality. While his work is theoretical, I think there is
       important similarity with the practical work we are attempting
       to do in that we are developing very general implementations
       that can deal with any case.

       Peter Constable
       Non-Roman Script Initiative, SIL


