Indic scripts: Meeting of UK experts active in ISO/TC46/SC2/WG12, 1998-03-12

From: John Clews (10646er@sesame.demon.co.uk)
Date: Mon Mar 16 1998 - 07:41:30 EST


Indic scripts: Meeting of UK experts active in ISO/TC46/SC2/WG12, 1998-03-12

A meeting was held at the Wellcome Institute for the History of
Medicine, London, on 12 March 1998. Present were John Clews
(Chair, ISO/TC46/SC2, and interim convenor of ISO/TC46/SC2/WG12:
Transliteration of Indic scripts), Dr. Anthony Stone (Project
Leader, ISO/TC46/SC2/WG12: Transliteration of Indic scripts),
Dr. Dominik Wujastyk (Wellcome Institute for the History of
Medicine), and Dr. J. D. Smith (Faculty of Oriental Studies,
University of Cambridge). Anshuman Pandey (University of Washington)
sent apologies, but was involved in earlier and subsequent
discussions by email.

The aim was to progress work on ISO NP 15919: Transliteration of
Devanagari and related Indic scripts.

Details of some brief follow-up actions taken the same week are
also included below.

1. Review of minutes of ISO/TC46/SC2/WG12 Electronic meeting, 1997

Resolution 6.1 was approved during the electronic meeting in
July/August 1997:

     ISO/TC46/SC2/WG12 resolves that the transliteration tables in its
     draft standard for transliteration of Indic scripts will
     comprise a main table showing the main characters and their
     transliteration shared across all scripts.

     Tables will also show UCS identifiers (UCS IDs, from ISO/IEC
     10646), and characters from the Indian national standard
     ISCII-1991 as it is in such widespread use...

2. Possibilities for adding CSX coding to ISO NP 15919

During subsequent discussion on the Conv-Dev@elot.gr email
discussion list on transliteration of Indic scripts, it had been
requested that, in addition to the columns showing UCS and ISCII-1991
coding, ISO NP 15919 should include a further column showing the
widely-used CSX transliteration code values, to make the standard
more useful to users and to encourage its use.

In addition, the latest tables of ISO NP 15919 were reviewed to see
whether CSX should be expanded to cover their repertoire (this
expansion being described as CSX+ below).

2.1 Description of CS, CSX, and CSX+

CS (Computer Sanskrit coded character set of transliteration
characters for Sanskrit) had been developed in 1990, as an overlay of
IBM code page 437. CS itself had not been widely used as there was a
need for additional characters. These additions were called CSX
(Computer Sanskrit Extended) and were widely used in international
information interchange, particularly in the academic community.

As a result of discussions regarding ISO NP 15919, some additional
precomposed transliteration characters were required: CSX plus these
additions were therefore called CSX+. The extended-Latin repertoire
of CSX+ and ISO NP 15919 will therefore be identical.

CSX has the advantage that it is already used, with this coding, on a
number of different platforms, including MS-DOS, OS/2, Windows,
Macintosh and Unix boxes.

2.2 Degree of use of CSX

The repertoire of Sanskrit text archives is larger than that for
Ancient Greek or Latin, and large bodies of Sanskrit and other Indic
text were available in CSX coding, for example the CSX versions of
the Mahabharata and Ramayana, available from John Smith's server
<http://bombay.oriental.cam.ac.uk/index.html>; and the text of the
Rgveda-samhita in CSX, accompanying the Harvard Oriental Series
publication Rig Veda: a metrically restored text with an introduction
and notes (edited by Barend A. van Nooten and Gary B. Holland), 1994.

Many academics worked equally with transliteration or the original
script, and the nature of CSX coding meant that 1:1 conversions
between transliterated text and Devanagari text were straightforward:
for example, John Smith has now developed a CSX2ISCII program for
this purpose.
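
The one-to-one property can be illustrated with a simple table
lookup. The following is only a sketch and is not the CSX2ISCII
program mentioned above: it maps a handful of CSX+ code values to
their composed UCS equivalents, using values copied from Annex B
below. The function names are hypothetical, and treating byte values
below 128 as plain ASCII is an assumption based on CS being an
overlay of IBM code page 437.

```python
# CSX+ code value (decimal) -> composed UCS code point.
# Only a few rows from Annex B are included; this is not a full table.
CSX_TO_UCS = {
    164: 0x00F1,  # small n with tilde
    167: 0x1E41,  # small m with dot above
    224: 0x0101,  # small a with macron
    227: 0x012B,  # small i with macron
    229: 0x016B,  # small u with macron
    231: 0x1E5B,  # small r with dot below
    239: 0x1E45,  # small n with dot above
    241: 0x1E6D,  # small t with dot below
    243: 0x1E0D,  # small d with dot below
    245: 0x1E47,  # small n with dot below
    247: 0x015B,  # small s with acute
    249: 0x1E63,  # small s with dot below
    254: 0x1E25,  # small h with dot below
}

# The mapping is 1:1, so it can simply be inverted.
UCS_TO_CSX = {ucs: csx for csx, ucs in CSX_TO_UCS.items()}


def csx_to_text(data: bytes) -> str:
    """Decode CSX-coded bytes into a UCS string (hypothetical helper)."""
    out = []
    for b in data:
        if b < 0x80:  # assumption: lower half is plain ASCII
            out.append(chr(b))
        elif b in CSX_TO_UCS:
            out.append(chr(CSX_TO_UCS[b]))
        else:
            raise ValueError(f"CSX value {b} is not in this partial table")
    return "".join(out)


def text_to_csx(text: str) -> bytes:
    """Encode a UCS string as CSX-coded bytes (hypothetical helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(cp)
        elif cp in UCS_TO_CSX:
            out.append(UCS_TO_CSX[cp])
        else:
            raise ValueError(f"U+{cp:04X} is not in this partial table")
    return bytes(out)
```

Because each code value corresponds to exactly one character,
decoding and re-encoding returns the original bytes unchanged for any
text covered by the table.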

2.3 CSX and ISO coded character sets.

ISO/IEC JTC1/SC2 and ISO/IEC JTC1/SC2/WG3 would meet in Seattle on
23-25 March 1998, and would discuss possibilities of allowing
ISO character set standards/registrations to include graphic
characters in columns 8 and 9 (where previously only control
characters could be used). The outcome of this point in Seattle would
affect how ISO/TC46/SC2/WG12 progressed adding CSX coding, and
how this coding could be referred to in ISO NP 15919.

3. Review of latest transliteration tables for ISO NP 15919.

So that an appropriate draft could be produced in time for the Athens
meeting of ISO/TC46/SC2 in May 1998, there had been lengthy
discussion on the Conv-Dev@elot.gr email discussion list on
transliteration of Indic scripts concerning the optimum individual
transliteration characters, over a period of several weeks.

The consensus that had emerged on that list was discussed at the
meeting as draft version 2.02W. As a result of the meeting's
discussions, Tony Stone would produce a further draft, version 2.03,
with relatively minor changes, on his web page. John Clews would
develop a draft text which would be more concise than earlier drafts
submitted to ISO/TC46/SC2.

The combined text and tables would be submitted to Evangelos
Melagrakis, Secretary of ISO/TC46/SC2, in time for the May 1998 Athens
meeting of ISO/TC46/SC2, as a new draft CD for ISO NP 15919 for
discussion in Athens, by ISO/TC46/SC2/WG12 and ISO/TC46/SC2, and then
for voting by national member bodies of ISO/TC46/SC2.

4. Possibility of including information on transliteration of Urdu

Tony Stone had investigated Devanagari/Urdu correspondences, and had
found that only the use of the following five nukta letters was
stable: ka, kha, ga, ja, pha, all with nukta. Otherwise
correspondences were variable.
This variability meant that this aspect of work would be put on hold
until the May 1998 meeting of ISO/TC46/SC2 in Athens, when
discussions between ISO/TC46/SC2/WG11: Transliteration of
Perso-Arabic script and ISO/TC46/SC2/WG12: Transliteration of Indic
scripts would be useful.

        * * * * * * * *

Annex A: Information on ISCII

There had been discussions during 1997 about whether "ISCII 1997"
should replace the well-used ISCII (Indian Standard IS 13194:1991).
John Smith has found the following text on CDAC's WWW server, and it
is reproduced here for the benefit of any others who have been
unclear about this. It appears to bury "ISCII 1997" fairly
thoroughly. (DoE is the Department of Electronics.)

  The ISCII standard was re-affirmed in 1997 by BIS [the Bureau of
  Indian Standards] after an elapse of 5 years and it being in use
  without any problems. During the same year, DoE [Department of
  Electronics of the Government of India] had set-up a committee to
  look into some difficulties faced by a group of language
  researchers who proposed a change be done to suit nlp processing of
  Indian language texts. A white paper was presented by CDAC on the
  ISCII standard to this committee and is available for reading. This
  committee examined and finally concluded that IS13194:1991 will
  continue to be the Indian standard for all data/information
  interchange in Indian languages/scripts on a variety of operating
  systems and application software. Another layer would then be
  defined for nlp processing activities and will be used by such
  applications requiring pure consonant-vowel separated
  representation. With this it is clear that IS13194:1991 is the code
  to be used for all applications and systems requiring data storage
  and interchange in Indian scripts/languages. It has been used by
  IBM for PC-DOS, Apple for ILK, and several companies are developing
  products and solutions based on this representation.

        * * * * * * * *

Annex B: CSX+ coding for ISO NP 15919 matched against other codes

The following is the basic repertoire of accented characters
required: this table shows the availability of these characters in
existing international and de facto coded character set standards.

______________________________________________________________________
 (1)         (2)       (3)     (4)      (5)    (6)   (7)
 UCS ID(s)   UCS ID    ALA     USMARC/  2.02W  CSX+  ISO/IEC 10646
 decomposed  composed  EBCDIC  Z39.47   REF.   CODE  CHARACTER NAME
----------------------------------------------------------------------

0061+0304; 0101; 8119; 61E5; 1-02; 224; SMALL A WITH MACRON
00E6; 00E6; 8A; B5; 1-03; 145; SMALL AE
00E6+0304; 01E3; 8A19; B5E5; 1-04; ---; SMALL AE WITH MACRON
0063+0302; 0109; 8308; 63E3; 4-09; 206; SMALL C WITH CIRCUMFLEX
0064+0323; 1E0D; 841C; 64F2; 1-37; 243; SMALL D WITH DOT BELOW
0065+0306; 0115; 852C; 65E6; 4-01; 213; SMALL E WITH BREVE
0065+0308; 00EB; 8511; 65E8; 1-14; 137; SMALL E WITH DIAERESIS
0065+0304; 0113; 8519; 65E5; 1-16; 185; SMALL E WITH MACRON
0067+0307; 0121; 870A; 67E7; 4-08; 205; SMALL G WITH DOT ABOVE
0068+032E; 1E2B; 8828; 68F9; 4-17; 220; SMALL H WITH BREVE BELOW
0068+0323; 1E25; 881C; 68F2; 3-02; 254; SMALL H WITH DOT BELOW
0068+0331; ; 886D; 68; 4-16; 219; SMALL H WITH MACRON BELOW
0069+0306; 012D; 892C; 69E6; 4-02; ---; SMALL I WITH BREVE
0069+0304; 012B; 8919; 69E5; 1-06; 227; SMALL I WITH MACRON
006B+0331; ; 926D; 6B; 3-03; 201; SMALL K WITH MACRON BELOW
006C+0323; 1E37; 931C; 6CF2; 2-23; 235; SMALL L WITH DOT BELOW
006C+0325; ; 931D; 6CF4; 1-12; 175; SMALL L WITH RING BELOW
006C+0325+0304; ; 931D19; 6CF4E5; 1-13; 176; SMALL L WITH RING BELOW AND MACRON
006D+0306; ; 942C; 6DE6; 2-40; 200; SMALL M WITH BREVE
006D+0310; ; 9418; 6DEF; 2-34; 193; SMALL M WITH CANDRABINDU
006D+0307; 1E41; 940A; 6DE7; 2-32; 167; SMALL M WITH DOT ABOVE
006E+0306; ; 952C; 6EE6; 2-39; 197; SMALL N WITH BREVE
006E+0302; ; 9508; 6EE3; 2-37; 195; SMALL N WITH CIRCUMFLEX
006E+0323+0306; ; 9528; 6EF9; 2-38; 196; SMALL N WITH DOT BELOW AND BREVE
006E+0308; ; 9511; 6EE8; 2-36; 194; SMALL N WITH DIAERESIS
006E+0307; 1E45; 950A; 6EE7; 1-27; 239; SMALL N WITH DOT ABOVE
006E+0323; 1E47; 951C; 6EF2; 1-41; 245; SMALL N WITH DOT BELOW
006E+0331; ; 956D; 6E--; 2-14; 173; SMALL N WITH MACRON BELOW
006E+0303; 00F1; 9529; 6EE4; 1-33; 164; SMALL N WITH TILDE
006F+0306; 014F; 962C; 6FE6; 4-03; 214; SMALL O WITH BREVE
006F+0308; 00F6; 9611; 6FE8; 1-18; 148; SMALL O WITH DIAERESIS
006F+0304; 014D; 9619; 6FE5; 1-20; 186; SMALL O WITH MACRON
0072+0331; ; 996D; 72--; 2-13; 159; SMALL R WITH MACRON BELOW
0072+0306; ; 992C; 72E6; 2-21; 191; SMALL R WITH BREVE
0072+0324; ; 991B; 72F3; 2-16; 187; SMALL R WITH DIAERESIS BELOW
0072+0323; 1E5B; 991C; 72F2; 1-38; 231; SMALL R WITH DOT BELOW
0072+0325; ; 991D; 72F4; 1-10; 157; SMALL R WITH RING BELOW
0072+0325+0304; ; 99191D; 72E5F4; 1-11; 174; SMALL R WITH RING BELOW AND MACRON
0073+0301; 015B; A20F; 73E2; 2-26; 247; SMALL S WITH ACUTE
0073+0323; 1E63; A21C; 73F2; 2-27; 249; SMALL S WITH DOT BELOW
0074+0323; 1E6D; A31C; 74F2; 1-35; 241; SMALL T WITH DOT BELOW
0075+0306; 016D; A42C; 75E6; 109/404; 155; SMALL U WITH BREVE
0075+0304; 016B; A419; 75E5; 1-08; 229; SMALL U WITH MACRON
0079+0307; 1E8F; A80A; 79E7; 2-19; 188; SMALL Y WITH DOT ABOVE

 (1) Fully decomposed equivalent in UCS (ISO/IEC 10646 and Unicode)
     of the combination (characters in 4-digit hexadecimal separated
     by plus signs).

 (2) Fully composed UCS character equivalent (in 4-digit
      hexadecimal), or null string if there is no such equivalent.

 (3) The original ALA EBCDIC encoding of the characters (in 2-digit
      hexadecimal concatenated): ALA is American Library Association.
      Note that in columns (3) and (4) the accent typically precedes
      the base letter in the data stream: the byte order here is
      arranged to show base-letter coding similarities.

 (4) USMARC extended ASCII encoding of the characters (in 2-digit
      hexadecimal concatenated), largely maintained by the Library of
      Congress, and also closely related to the ANSI Z39.47-1985
      American National Standard for Information Sciences: Extended
      Latin Alphabet Coded Character Set for Bibliographic Use -
      New York: American National Standards Institute, 1985.
      Note that in columns (3) and (4) the accent typically precedes
      the base letter in the data stream: the byte order here is
      arranged to show base-letter coding similarities.

 (5) Provisional ISO NP 15919, version 2.02W reference number (a
      different number will appear in the final version of ISO NP
      15919: Transliteration of Devanagari and related Indic scripts)

 (6) CSX+ proposed code (decimal)

 (7) Name of the character combination. This is based on the name in
      UCS (ISO/IEC 10646 and Unicode) with the words "LATIN" and
      "LETTER" omitted for brevity. Combinations that do not have
      UCS equivalents are named by analogy to the UCS naming system.
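
The relationship between columns (1) and (2) can be checked
mechanically. The following is a minimal sketch, assuming Python's
standard unicodedata module and its NFC normalization as a stand-in
for "fully composed"; the helper names are hypothetical, and only a
few rows from the table are included. A present-day Unicode database
may know composed forms that this table leaves blank, so a mismatch
reported by such a check is a prompt for review rather than proof of
an error.

```python
import unicodedata

# (decomposed UCS IDs, composed UCS ID or "" if none), copied from
# columns (1) and (2) of a few rows in the table above.
ROWS = [
    ("0061+0304", "0101"),  # small a with macron
    ("0064+0323", "1E0D"),  # small d with dot below
    ("006D+0307", "1E41"),  # small m with dot above
    ("0072+0323", "1E5B"),  # small r with dot below
    ("0072+0325", ""),      # small r with ring below: no composed form
    ("006D+0310", ""),      # small m with candrabindu: no composed form
]


def decode_ids(ids: str) -> str:
    """Turn '0061+0304' into the corresponding character sequence."""
    return "".join(chr(int(h, 16)) for h in ids.split("+"))


def composed_id(decomposed: str) -> str:
    """Return the 4-digit hex ID of the single composed character,
    or the null string if NFC leaves more than one code point."""
    nfc = unicodedata.normalize("NFC", decode_ids(decomposed))
    return f"{ord(nfc):04X}" if len(nfc) == 1 else ""


def mismatches(rows):
    """List rows whose column (2) disagrees with NFC composition."""
    result = []
    for decomposed, composed in rows:
        found = composed_id(decomposed)
        if found != composed:
            result.append((decomposed, composed, found))
    return result
```

Calling mismatches(ROWS) returns an empty list when every sampled
row agrees with the composition data in the local Unicode database.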

        * * * * * * * *

John Clews

--
John Clews, SESAME Computer Projects, 8 Avenue Rd, Harrogate, HG2 7PG
Email: Converse@sesame.demon.co.uk;  tel: +44 (0) 1423 888 432
Chairman of ISO/TC46/SC2: Conversion of Written Languages;
Member of ISO/IEC/JTC1/SC22/WG20: Internationalization;
Member of CEN/TC304: Character Set Technology;
Member of ISO/IEC/JTC1/SC2: Character Sets.


