From: Andrew C. West (andrewcwest@alumni.princeton.edu)
Date: Mon Dec 16 2002 - 08:40:22 EST
As promised, here are some questions on the encoding of Mongolian that have
arisen whilst writing an input method for the Mongolian script (the questions
are relevant to the Todo, Manchu and Sibe scripts as well, but I'll restrict
myself to Mongolian for the moment). I don't know if anyone is able to answer
all of my questions, but I hope that someone on the list will be able to give me
some much needed advice.
1. Documentation
Section 11.4 of the Unicode Standard notes that a group of experts from
Mongolia, China and the West are to publish a document called "User's Convention
for System Implementation of the International Standard on Mongolian Encoding"
which will explicitly define Mongolian character shaping behaviour in full. WG2
document N1980 (http://std.dkuug.dk/jtc1/sc2/WG2/docs/n1980.doc) also states
that Mongolian, Chinese and English versions of the "User's Convention" will be
prepared by Mongolia and China. I have been unable to locate this document on
the internet. Does it exist, and if so can it be made publicly available ?
Without the aid of such a document it seems almost impossible to correctly
implement the Unicode encoding of Mongolian.
In its stead I have been using the document "Traditional Mongolian Script in the
ISO/IEC 10646 and Unicode Standards" (UNU/IIST Report No. 170, August 1999)
written by Myatav Erdenechimeg, Richard Moore and Yumbayar Namsrai as a guide to
Mongolian character shaping behaviour. It seems to provide all the information I
would expect to see in the "User's Convention", but I am not sure how
authoritative this paper is, and what its relationship is to the "User's
Convention" (if any).
2. Free Variation Selectors
The Mongolian Free Variation Selectors (U+180B, U+180C and U+180D) are used to
distinguish variant graphic forms of the same positional forms of a character. I
would say that there are three categories of variant forms governed by the
variation selectors :
A. Non-contextual variants, such as variant forms of letters that are used in
foreign words (e.g. the use of a "reclining" letter D -- U+1833 + FVS1 -- in
foreign words), and graphic variations that are due to differences between
traditional and modern orthography. Such variants must be explicitly encoded by
use of the appropriate variation selector in order for the correct form to be
selected by the rendering engine.
B. Contextual variants that are determined by the overall composition of the
word in which they are found, such as the use of the long-toothed forms of the
letters OE and UE (U+1825/1826 + FVS1) in the first syllable of a word only, or
the use of the feminine form of the letter G (U+182D + FVS3) between consonants
or the letter I (which is neutral) in a feminine word. In these cases I would
imagine that it is too much to ask the rendering engine to work out the correct
variant form, and the correct variant should be explicitly encoded using the
appropriate variation selector.
C. Contextual variants that can be determined from their neighbouring letters,
such as the medial form of the letter G with two dots that is used before a
vowel (U+182D + FVS2), or the form of the letter A that is written with a
forward tail when occurring finally after the letters B, P, F and K (U+1820 +
FVS1). In these cases is it necessary to explicitly encode the variant form with
the appropriate variation selector ? The Standard says "For cases in which the
contextual sequence of basic letters is not sufficient for a rendering engine to
uniquely determine the appropriate glyph for a particular letter, additional
format characters are provided so that the typist may specify the desired
rendering". Should we assume that the rendering engine will correctly select the
dotted form of medial G before a vowel and the dotless form before a consonant,
or would it be wiser to explicitly encode the appropriate variation selector
anyway ?
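
To make the "encode it anyway" option in category C concrete, here is a minimal
Python sketch. The code points are the ones quoted above; the helper names
(code_points, mark_medial_g) are my own hypothetical ones, and the rule is
deliberately naive: it assumes the string holds a single word made only of
basic letters, treats every medial G followed by a vowel as wanting FVS2, and
ignores word gender entirely. It is not a description of what any rendering
engine actually does.

```python
# Code points as quoted in the message; all names here are hypothetical.
FVS1, FVS2, FVS3 = "\u180B", "\u180C", "\u180D"
LETTER_A, LETTER_OE, LETTER_G, LETTER_D = "\u1820", "\u1825", "\u182D", "\u1833"
# Basic vowels A, E, I, O, U, OE, UE, EE occupy U+1820..U+1827.
VOWELS = {chr(cp) for cp in range(0x1820, 0x1828)}


def code_points(text):
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def mark_medial_g(word):
    """Insert FVS2 after each medial G that is directly followed by a vowel.

    Assumption: `word` is one word of basic letters, so "medial" simply
    means neither the first nor the last character of the string.
    """
    out = []
    for i, ch in enumerate(word):
        out.append(ch)
        is_medial = 0 < i < len(word) - 1
        if ch == LETTER_G and is_medial and word[i + 1] in VOWELS:
            out.append(FVS2)
    return "".join(out)


# Category A: the "reclining" D must always be encoded explicitly.
reclining_d = LETTER_D + FVS1
print(code_points(reclining_d))

# Category C: A + G + A gains an explicit FVS2 after the medial G.
print(code_points(mark_medial_g(LETTER_A + LETTER_G + LETTER_A)))
```

A G that is already followed by a variation selector is left alone, because the
character after it is then not a vowel, so running the function twice does not
double up the selector.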
3. Mongolian Vowel Separator
The Mongolian Vowel Separator (U+180E) is used to separate the vowels A and E
from certain preceding consonants (e.g. ...N + MVS + A = U+1828,180E,1820 ).
After MVS the vowels A and E use the forward tail variant which is physically
offset from the preceding consonant by narrow whitespace. These variant forms of
A and E are selected by the presence of a preceding MVS, and there appears to be
no need to otherwise select the variant A or E by means of a variation
selector.
However, not only does the MVS affect the following A or E, but the preceding
consonant may also take a variant form when followed by an offset A or E. This
is the case for the letters N, Q, G, J, Y and W. The variant forms of these
letters when preceding an offset A or E are given in Unicode's Standardized
Variants document (N, Q, G, J and Y are given as medial variants, but W is given
as a final variant which is perhaps wrong). My question is, should the variant
form of the consonant preceding the offset A or E be explicitly encoded using
the appropriate variation selector, or is the presence of the following MVS
sufficient for the rendering engine to select the correct variant form ?
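
To illustrate the sequences in question, here is a small Python sketch that
scans a string for the pattern consonant + MVS + A/E and reports whether the
preceding consonant is one of the six letters (N, Q, G, J, Y, W) said above to
take a variant form. The function name mvs_contexts and the lookup table are
hypothetical names of my own; the sketch only locates the contexts and takes
no position on whether a variation selector ought to be inserted there.

```python
# Code points from the Mongolian block; all names here are hypothetical.
MVS = "\u180E"
VOWELS_AFTER_MVS = {"\u1820", "\u1821"}  # A and E
# Consonants said to take a variant form before an offset A or E.
VARIANT_BEFORE_MVS = {
    "\u1828": "N",
    "\u182C": "Q",
    "\u182D": "G",
    "\u1835": "J",
    "\u1836": "Y",
    "\u1838": "W",
}


def mvs_contexts(text):
    """Find every <letter> + MVS + A/E sequence in `text`.

    Returns a list of (index_of_mvs, label, vowel) tuples, where `label`
    is the consonant's name if it is one of the six listed letters and
    None otherwise.
    """
    found = []
    for i in range(1, len(text) - 1):
        if text[i] == MVS and text[i + 1] in VOWELS_AFTER_MVS:
            label = VARIANT_BEFORE_MVS.get(text[i - 1])
            found.append((i, label, text[i + 1]))
    return found


# The example from the text: N + MVS + A = U+1828, U+180E, U+1820.
print(mvs_contexts("\u1828\u180E\u1820"))
```

An input method could use such a scan to decide where a variation selector
would have to go if explicit encoding turns out to be required.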
4. Variant forms of the Mongolian Birga
Appendix A of "Traditional Mongolian Script in the ISO/IEC 10646 and Unicode
Standards" lists four variant forms of the Mongolian Birga (U+1800) :
1st variant form = U+1800 + FVS1
2nd variant form = U+1800 + FVS2
3rd variant form = U+1800 + FVS3
4th variant form = U+1800 + ZWJ
Unicode's Standardized Variants document
(http://www.unicode.org/Public/UNIDATA/StandardizedVariants.html) does not list
any variants for the Mongolian Birga. Moreover, it warns "All combinations not
listed here are unspecified and are reserved for future standardization; no
conformant process may interpret them as standardized variants." This clearly
means that these Birga variants should not currently be recognised. But given
that the Birga does occur in a number of forms, Unicode should either define
standardized variants for them or add some new characters to represent them.
Nevertheless, assuming that Appendix A of "Traditional Mongolian Script" is
correct in providing a mechanism for distinguishing four variant forms of the
Mongolian Birga, is it acceptable to use the ZWJ as a variant selector (as is
the case for the 4th variant Birga) ? Its usage here seems a little suspect to
me.
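
For reference, the four Appendix A sequences can be written out and classified
as in the Python sketch below. The names (APPENDIX_A_BIRGA, selector_kind) are
hypothetical ones of my own, and the sketch does no more than separate the
three sequences that use a Free Variation Selector from the one that uses
ZWJ (U+200D); it says nothing about which of them a conformant process may
recognise.

```python
# Code points as listed above; all names here are hypothetical.
BIRGA = "\u1800"
FVS = {"\u180B": "FVS1", "\u180C": "FVS2", "\u180D": "FVS3"}
ZWJ = "\u200D"

# The four variant sequences as given in Appendix A of the UNU/IIST report.
APPENDIX_A_BIRGA = [
    BIRGA + "\u180B",
    BIRGA + "\u180C",
    BIRGA + "\u180D",
    BIRGA + ZWJ,
]


def selector_kind(sequence):
    """Classify the second character of a base + selector pair."""
    if len(sequence) != 2:
        raise ValueError("expected exactly one base and one selector")
    selector = sequence[1]
    if selector in FVS:
        return FVS[selector]
    if selector == ZWJ:
        return "ZWJ"
    return "other"


for seq in APPENDIX_A_BIRGA:
    print(f"U+{ord(seq[0]):04X} U+{ord(seq[1]):04X}", selector_kind(seq))
```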
Andrew