Forwarded message follows:
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
COMMENTS BY ACIP ON UNICODE / ISO 10646 BMP ENCODING OF TIBETAN
(July 1996)
Prepared by: Robert Chilton, Technical Manager
The Asian Classics Input Project (ACIP)
INTRODUCTION. Since its inception in 1987, ACIP has input over 1000
titles of classical Tibetan literature, currently totaling some 45,000
pages of text. ACIP's computerized Tibetan language database is by far
the largest in the world. ACIP has also created catalogs of Tibetan-
language materials, most notably a catalog of the Russian Academy of
Science's massive collection in St Petersburg--which, at 34,000 entries,
is now about one-fifth completed. ACIP's director, Mr. Michael Roach,
has served as a consultant to the U.S. Library of Congress on matters
concerning Tibetan language materials.
Overall, ACIP is pleased with the Unicode proposal of September 1995
(N1255, PDAM-6). Given the number and scope of changes made and
proposed during the lead-up to the JTC1/SC2/WG2 meeting in June 1995,
congratulations are due to all participants for settling on a thoroughly
reasonable proposal.
ACIP wonders whether this proposal is being rushed. Many of the Tibetan
experts we know have only recently seen the current proposal, and they
have no obvious means by which to make their views known. Some period
of public comment by experts in the field seems appropriate. Perhaps a
second PDAM for Tibetan would be prudent?
That said, ACIP reaffirms our view that Tibetan as presented in document
N1255 is generally adequate for our purposes of encoding and processing
classical Tibetan (Choegay), but with reservations as noted below.
ACIP HAS TWO MAIN CONCERNS REGARDING UNICODE TIBETAN:
1. That the glyph registry be complete enough to encode all of our
database.
2. That the structure of the code table support lexical processing,
e.g., conversion and sorting, of our materials. Although sorting is not
a Unicode concern, it is of vital importance to indexing tools and to
the work of librarians and bibliographers. Where simple steps can be
taken to support sorting, such measures should be adopted.
1. GLYPH REGISTRY.
a. The non-abbreviated forms of subscribed WA, YA, and RA should be
encoded, but separate from the sequence of normal subscribed forms.
Within the sequence of subscribed letters, subscribed WA, YA, and RA
should appear in their normal abbreviated forms (wazur, yata, rata).
Rationale: Both abbreviated and non-abbreviated forms of these
subscribed letters can appear in the same document (ACIP can provide
examples). When subscribed to RA, ACIP encodes these pairs as RVA, RYA,
RRA (abbreviated forms) and RWA, R+YA, R+RA (non-abbreviated forms). We
note that R+Y+YA appears with some frequency.
b. Other glyphs: In comparing the current proposal (PDAM 6, SEPT 95)
with past proposals, it appears that two glyphs were (inadvertently?)
omitted: the triple-x ("TIBETAN SIGN THREE DENA") and the large-X
("TIBETAN MARK KURUKA"). Two additional glyphs are candidates for
encoding: the dachey (crescent moon) and the nada (flame). These are
explained as distinct lexical elements in, for instance, the Mongolian
national symbol / Kalachakra symbol and might well be written separately
in such explanations. These glyphs are well known and thus no
illustration is necessary.
2. LEXICAL PROCESSING CONCERNS. Conversion of Tibetan materials from
existing formats to Unicode Tibetan, and sorting of Tibetan in both
Tibetan sort order and Sanskrit sort order, can be achieved with a
minimum of difficulty if the following provisions are met:
ACIP strongly recommends:
a. A code position must be reserved for the invisible inherent vowel A
just prior to the lengthened vowel A (position 0F70 in the SEPT 95
proposal).
b. One code position each must be reserved in both full and subscribed
letter sequences between JA and NYA (positions 0F48 and 0F98 in the SEPT
95 proposal) for Sanskritic recode from DZHA.
c. The consonant and vowel series should remain in (mostly) Sanskritic
order, as in the SEPT 95 proposal, since such ordering greatly
facilitates sorting in Sanskrit order and has no effect on sorting in
Tibetan. It is very helpful to have most or all non-alphabetic (non-
lexical) glyphs encoded in code positions prior to KA. The vowels
should maintain their current position--following the full letters and
prior to the subscribed letters. A constant offset between the full
letters and their subscribed counterparts should be maintained.
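The constant offset recommended in item c can be sketched as follows.
This is only an illustration: the offset of 0x50 is inferred from the
paired reserved positions 0F48 and 0F98 cited in item b, while the
bounds of the full-letter range and all function names are hypothetical
and do not come from the proposal.

```python
# Offset inferred from the paired reserved positions in item b:
# 0F48 (full letters) and 0F98 (subscribed letters).
SUBSCRIBED_OFFSET = 0x0F98 - 0x0F48

# Hypothetical bounds of the full-letter sequence (assumed, not taken
# from the proposal): KA at 0F40 through the last letter at 0F69.
FULL_FIRST = 0x0F40
FULL_LAST = 0x0F69


def is_full_letter(cp):
    """Return True if cp lies in the assumed full-letter range."""
    return FULL_FIRST <= cp <= FULL_LAST


def to_subscribed(cp):
    """Map a full letter to its subscribed counterpart by adding the offset."""
    if not is_full_letter(cp):
        raise ValueError("not a full letter: U+%04X" % cp)
    return cp + SUBSCRIBED_OFFSET


def to_full(cp):
    """Map a subscribed letter back to its full counterpart."""
    full = cp - SUBSCRIBED_OFFSET
    if not is_full_letter(full):
        raise ValueError("not a subscribed letter: U+%04X" % cp)
    return full
```

With a constant offset, conversion between the two sequences needs no
lookup table, and a sort routine can fold subscribed letters onto their
full forms with a single subtraction.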
ACIP observes and suggests:
d. Lexical processing of Dzongkha (Bhutanese) will be greatly
facilitated by the addition of an invisible TSEG (to mark the end of a
lexical unit).
e. An invisible tag marking the boundary between the lexical prescript
and the lexical root will likely be inserted during lexical processing;
Unicode may wish to define this code position explicitly rather than
leaving each application developer to define it in their own way.
f. Given the likelihood of additional glyphs appearing after adoption of
the current proposal, Unicode may wish to leave empty code positions
prior to the full letter KA (or else prior to the first encoded Tibetan
character). ACIP does not understand why the empty code positions
follow full letter KSHRA since it is not likely that many new quasi-
alphabetical (lexical) glyphs will be proposed for inclusion. It seems
sensible to shift the entire letter sequence of KA through KSHRA down
six code positions, thus freeing up code positions prior to KA.
Similarly, it may be preferable to shift the entire alphabetic (lexical)
section--consisting of the two consonant sequences and the intervening
vowel & sundry sequence--to the end of the reserved code space, thus
freeing up more open code positions prior to the lexical characters.
3. MINOR ISSUES.
a. Some of the character names, such as those of the reversed letters,
need editing. Note that where Wylie transliteration (lowercase) uses
tsa and tsha, ACIP transliteration (uppercase) uses TZA and TSA.
b. For ease of processing during rendering, marks that apply to an
entire syllable such as 0F35, 0F37, 0F86, 0F87 should be grouped
together, in order to support range checking.
c. ACIP does not understand the rationale behind encoding the
precomposed characters at positions 0F00, 0F02, and 0F03.
d. Blank space in Tibetan obeys very different conventions from blank
space in most Roman scripts. ACIP wonders if it would be useful and
appropriate to define a Tibetan version of blank space--which is more
properly BLANK or GAP or HORIZONTAL SEPARATION--since, unlike
conventional <SPACE>, it is neither really unitary nor additive.
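The range checking mentioned in item b can be illustrated with a short
sketch. The four scattered code points are those listed in item b; the
contiguous block 0F84-0F87 is purely hypothetical, chosen only to show
the contrast, and the function names are likewise invented for
illustration.

```python
# Positions as listed in item b: they are not contiguous, so a renderer
# needs an explicit table to recognize syllable-level marks.
SCATTERED_SYLLABLE_MARKS = frozenset({0x0F35, 0x0F37, 0x0F86, 0x0F87})


def is_syllable_mark_scattered(cp):
    """Table lookup, required when the marks are spread across the block."""
    return cp in SCATTERED_SYLLABLE_MARKS


# Hypothetical grouped layout (an assumption, not part of any proposal):
# the same four marks placed in one contiguous run.
GROUP_FIRST = 0x0F84
GROUP_LAST = 0x0F87


def is_syllable_mark_grouped(cp):
    """Two comparisons suffice when the marks occupy one contiguous range."""
    return GROUP_FIRST <= cp <= GROUP_LAST
```

The grouped form needs no table and stays correct if the range is later
extended, which is the processing convenience item b asks for.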
4. APPENDICES (available immediately upon request)
APPENDIX A: COMMENTS ON SORTING UNICODE TIBETAN
Abstract: Candidate sort orders for Tibetan include: Choegay,
Sanskritic, and Dzongkha (the national language of Bhutan). Choegay
sort order can be accomplished for mixed standard and non-standard
orthographies by identifying the prescript, if any, and then applying a
conventional three level sort. The three levels are: alphabetic,
diacritical variations, and case variations.
APPENDIX B: ALGORITHM FOR CONVENTIONAL TIBETAN (CHOEGAY) SORT ORDER
Abstract: A sketch algorithm is presented for ordering both standard
and non-standard (i.e., Sanskritic and other foreign-origin)
orthographies within a single sort sequence that follows Choegay sort
order.
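The conventional three-level sort named in the Appendix A abstract can
be sketched in general terms. Because the appendices are not reproduced
here, the sketch uses Latin-script words as a stand-in and omits the
prescript-identification step entirely; how Tibetan features map onto
the three levels is not stated in this message, and the function name
is hypothetical.

```python
import unicodedata


def three_level_key(word):
    """Build a sort key with alphabetic, diacritic, and case levels."""
    decomposed = unicodedata.normalize("NFD", word)
    # Separate base characters from combining marks.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    marks = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    level1 = base.casefold()                # alphabetic level
    level2 = marks                          # diacritical variations
    level3 = [ch.isupper() for ch in base]  # case: lowercase sorts first
    return (level1, level2, level3)


words = ["resume", "Résumé", "Resume", "résumé", "rest"]
print(sorted(words, key=three_level_key))
```

Tuples compare level by level, so diacritics only break ties among
words with the same letters, and case only breaks ties among words with
the same letters and diacritics.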
------------------------------------------------------------------
Robert R. Chilton, Technical Manager
The Asian Classics Input Project (ACIP)
New York Area Office: 47 East Fifth Street, Howell, NJ 07731
Tel: 908-364-1824 Fax: 908-901-5940 Email: acip@well.com
------------------------------------------------------------------
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
-- Christopher J Fynn <cfynn@sahaja.demon.co.uk>