Embedded language ID proposal

From: Mark Leisher (mleisher@crl.nmsu.edu)
Date: Wed Sep 06 1995 - 13:53:45 EDT


Below my signature is the first draft (after a short internal review)
of what might become a proposal to the Unicode Technical Committee.

We are posting it here to solicit comments from the Unicode community
before pursuing it further.
-----------------------------------------------------------------------------
mleisher@crl.nmsu.edu
Mark Leisher                    "The trick is not gaining the knowledge,
Computing Research Lab           but surviving the lessons."
New Mexico State University        -- "Svaha," Charles de Lint
Box 30001, Dept. 3CRL
Las Cruces, NM 88003
-----------------------------------------------------------------------------

         Adding Embedded Language Identifiers to Unicode Text

               Daniel Wood, Mark Davis and Mark Leisher

                          dwood@crl.nmsu.edu
                         madavis@crl.nmsu.edu
                        mleisher@crl.nmsu.edu

                        Computing Research Lab
                     New Mexico State University

DATE: September 1, 1995
CHANGE DATE: September 4, 1995

ABSTRACT:

Although the Unicode standard does not define any kind of language
identification approach, in many areas of multilingual text
processing knowing the language of the text is useful.

If the language of a sequence of text is known, then handling language
changes in the application becomes easier. This allows efficient
implementation of switching needed for activities such as
spell-checking and word segmentation in multilingual documents.

Currently, language identification is done with higher level
protocols, often implemented as markup on the text (e.g. RTF, SGML).
This makes language identification application-specific. Importing
text containing multiple languages from different applications
complicates the language identification process.

In the interest of promoting a common approach to embedding language
identifiers in Unicode text, we offer the following proposal.

BODY:

0. Intro

   To begin, we need an enumeration of human languages to work from--
   one that is available in electronic form, fairly complete (to make
   the researchers happy) and potentially subject to a standardization
   process.

   We first looked to ANSI/NISO Z39.53-1994 (Codes for the
   Representation of Languages for Information Interchange) which
   defines three-letter codes for approximately 400 languages.

   We felt that ANSI/NISO Z39.53-1994 was not sufficient for research
   purposes, and propose that the reference list of languages be based
   on the Ethnologue (World Genetic Tree) provided by the Summer
   Institute of Linguistics (available from
   ftp.sil.org:DSK_PUBLIC:[FILESERV.FTP]). This list provides
   three-letter codes for 6790 languages.

   The sections below provide the technical details of our proposed
   approach.

1. Terminology

   Language identifier - an unsigned 16-bit integer value

2. Define a new Private Use Area Unicode block E100-E10F with sixteen
   codepoints:

   START  END   BLOCK NAME
   -----  ----  ----------------
   E100   E10F  LANGUAGE ID BITS

   The choice of this range of the Private Use Area was based entirely
   on our particular usage of the Private Use Area. The range may
   well conflict with current usage by others.

   If this proposal generates enough interest, to avoid conflicts with
   current usage, we are interested in recommendations to move these
   codepoints to some other area.

3. Define 16 new codepoints in the following fashion:

   CODE NAME
   ---- ----
   E100 LANGUAGE ID BIT ZERO
   E101 LANGUAGE ID BIT ONE
   E102 LANGUAGE ID BIT TWO
   E103 LANGUAGE ID BIT THREE
   E104 LANGUAGE ID BIT FOUR
   E105 LANGUAGE ID BIT FIVE
   E106 LANGUAGE ID BIT SIX
   E107 LANGUAGE ID BIT SEVEN
   E108 LANGUAGE ID BIT EIGHT
   E109 LANGUAGE ID BIT NINE
   E10A LANGUAGE ID BIT TEN
   E10B LANGUAGE ID BIT ELEVEN
   E10C LANGUAGE ID BIT TWELVE
   E10D LANGUAGE ID BIT THIRTEEN
   E10E LANGUAGE ID BIT FOURTEEN
   E10F LANGUAGE ID BIT FIFTEEN

4. Algorithm:

   While scanning text, for each contiguous group of codepoints in the
   "LANGUAGE ID BITS" block, construct the language identifier by
   setting the bit specified by the codepoint in an unsigned 16-bit
   integer variable.

   Sample C pseudo-code:

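     unsigned short c, mask;

     mask = 0;

     /* Fetch the first character; c must be set before it is tested. */
     c = next_character();

     /* Set one bit in the mask per language id bit codepoint. */
     while (c >= 0xe100 && c <= 0xe10f) {
       /* 1U keeps the shift well defined even where int is 16 bits. */
       mask |= (unsigned short) (1U << (c - 0xe100));
       c = next_character();
     }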

     /*
      * If the mask is 0, then no language id bit codepoints
      * were encountered.
      */
     if (mask != 0)
       change_to_language(mask);
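
   The sample above leaves next_character() and change_to_language()
   undefined. As a minimal self-contained sketch of the same scan,
   assuming the text is held in memory as an array of 16-bit code
   units, the routine below reads one run of language id bit
   codepoints. The name read_language_id() and the two macros are
   illustrative assumptions only and are not part of the proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANG_BIT_FIRST 0xE100u  /* LANGUAGE ID BIT ZERO */
#define LANG_BIT_LAST  0xE10Fu  /* LANGUAGE ID BIT FIFTEEN */

/*
 * Read one contiguous run of language id bit codepoints starting at
 * text[*pos] and return the identifier it encodes.  *pos is advanced
 * past the run.  Returns 0 (and leaves *pos alone) if text[*pos] is
 * not a language id bit codepoint.
 */
uint16_t read_language_id(const uint16_t *text, size_t len, size_t *pos)
{
    uint16_t id = 0;

    assert(text != NULL && pos != NULL);

    while (*pos < len &&
           text[*pos] >= LANG_BIT_FIRST && text[*pos] <= LANG_BIT_LAST) {
        /* Each codepoint contributes exactly one bit. */
        id |= (uint16_t)(1u << (text[*pos] - LANG_BIT_FIRST));
        (*pos)++;
    }
    return id;
}
```

   For the sequence E100 E103 0041 0042, a call with *pos equal to 0
   returns 0x0009 and leaves *pos at 2, pointing at the first ordinary
   character.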

5. Technical benefits

   A. Only sixteen codepoints are used.

   B. Construction of a language id from a stream of Unicode text is
      efficient and can be done the same way while scanning text
      forward or backward.

   C. Generation of a language id bit codepoint stream is simply a
      matter of iterating through sixteen bits of an integer.

   D. The ordering of the language id bit codepoints for the Unicode
      Character Equivalency algorithm is implicit.
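
   Points B and C can be illustrated with a short sketch. The
   function names write_language_id() and read_language_id_backward()
   are hypothetical, and the text is again assumed to be an in-memory
   array of 16-bit code units. Generation walks the sixteen bits of
   the identifier; the backward reader uses the same bitwise OR as a
   forward reader, so the scanning direction does not change the
   result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Write one codepoint per non-zero bit of id into out (which must
 * have room for 16 entries) and return how many were written.
 */
size_t write_language_id(uint16_t id, uint16_t out[16])
{
    size_t n = 0;
    unsigned bit;

    assert(out != NULL);

    for (bit = 0; bit < 16; bit++) {
        if (id & (1u << bit))
            out[n++] = (uint16_t)(0xE100u + bit);
    }
    return n;
}

/*
 * Rebuild the identifier by scanning backward from text[end - 1].
 * Each codepoint only ORs in its own bit, so the order in which the
 * codepoints are visited does not matter.
 */
uint16_t read_language_id_backward(const uint16_t *text, size_t end)
{
    uint16_t id = 0;

    while (end > 0 &&
           text[end - 1] >= 0xE100u && text[end - 1] <= 0xE10Fu) {
        id |= (uint16_t)(1u << (text[end - 1] - 0xE100u));
        end--;
    }
    return id;
}
```

   Encoding the identifier 0x8005 this way produces the three
   codepoints E100 E102 E10F, and reading them back from the end
   yields 0x8005 again.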

6. Representation efficiency

   A. Possible bit allocation

      Given that there are currently 6790 languages specified in the
      Ethnologue, we only really need thirteen bits (2^13 = 8192) to
      represent language identifiers. However, since no atomic integer
      types are thirteen bits in size, we went ahead and assumed
      sixteen, so there are up to 65535 usable language identifiers
      (the value zero cannot be embedded, since it has no non-zero
      bits and therefore no codepoints).

      Since there are sixteen of the bit specifiers, it is possible
      that up to sixteen codepoints have to be processed. This can be
      reduced in the vast majority of cases by intelligently assigning
      integer values to languages that frequently appear in on-line
      text.

      The frequently appearing languages would be assigned integer
      values which have a minimum number of non-zero bits, each
      non-zero bit represented by one codepoint.
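
      To make the cost of an assignment concrete, the sketch below
      counts the codepoints a given identifier needs and how many
      identifiers fit within a given codepoint budget. The function
      names codepoints_needed() and ids_within() are hypothetical, and
      the sketch says nothing about which languages should receive
      which values.

```c
#include <assert.h>
#include <stdint.h>

/* Number of codepoints needed to embed id: one per non-zero bit. */
unsigned codepoints_needed(uint16_t id)
{
    unsigned n = 0;

    while (id != 0) {
        id &= (uint16_t)(id - 1);   /* clear the lowest set bit */
        n++;
    }
    return n;
}

/*
 * Number of distinct identifiers that need at most max_bits
 * codepoints.  The value 0 is excluded because it has no non-zero
 * bits and so cannot be embedded at all.
 */
unsigned long ids_within(unsigned max_bits)
{
    unsigned long total = 0;
    unsigned long id;

    assert(max_bits <= 16);

    for (id = 1; id <= 0xFFFFul; id++) {
        if (codepoints_needed((uint16_t)id) <= max_bits)
            total++;
    }
    return total;
}
```

      By this count, sixteen identifiers need a single codepoint, 136
      need at most two, and 6884 need at most five, so every one of
      the 6790 Ethnologue languages could in principle be given an
      identifier with no more than five non-zero bits.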

      In addition, related languages could be clustered, with exemplar
      languages (determined by appearance frequency in common corpora)
      assigned to numbers with mostly zero bits and related languages
      assigned to numbers with increasingly denser quantities of
      non-zero bits. This strategy would both reduce the average
      quantity of language-identifier codes appearing in text and
      maintain similarities between the codes for related languages.

      At this time, we have not done any research on identifying
      "frequently appearing" languages for the purposes of improving
      the efficiency of this approach.

   B. Possible representation problems

      This approach is susceptible to problems when corruption of the
      text occurs. Losing a single language id bit codepoint silently
      yields a different identifier, and reconstruction of missing
      codepoints would be very difficult. But this problem can occur
      in other multi-code approaches as well, and we consider the
      risk acceptable.

      There may be a cost in efficiency for programming languages that
      have little or no bit-level manipulation capability.

7. Application efficiency

   If an application relies exclusively on these embedded codes, a
   cost in efficiency appears when an interactive language switch
   occurs. The application must scan backward to determine the active
   language to compare with the language being requested.

   In cases where a document contains a significant amount of text in
   one language and a small amount of text in another language is
   being inserted, the cost due to the needed scan could be
   prohibitive.

   This potential cost can be avoided by an implementation that
   allows fast lookup of the current language, for example by
   recording the position and identifier of each language change
   separately from the text.
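
   One possible shape for such a lookup, offered only as a sketch, is
   a table with one entry per language change, sorted by text offset
   and searched by bisection. The struct lang_run and the function
   language_at() are hypothetical names; the proposal itself does not
   prescribe any particular data structure, and an application would
   still have to keep the table up to date as the text is edited.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One entry per language change, sorted by ascending start offset. */
struct lang_run {
    size_t   start;   /* offset of the first character in the run */
    uint16_t lang;    /* language identifier active from start on */
};

/*
 * Return the language active at offset, or 0 if offset lies before
 * the first recorded run.  The binary search takes O(log n) steps in
 * the number of language changes, regardless of the text length.
 */
uint16_t language_at(const struct lang_run *runs, size_t nruns, size_t offset)
{
    size_t lo = 0, hi = nruns;   /* search window is [lo, hi) */

    assert(runs != NULL || nruns == 0);

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (runs[mid].start <= offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    /* lo is now the number of runs that start at or before offset. */
    return lo == 0 ? 0 : runs[lo - 1].lang;
}
```

   With runs starting at offsets 0, 100 and 250, a query for offset
   249 returns the identifier of the second run without touching the
   text itself.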

8. Semantics

   As this proposal is at an early stage, no real attempt has been
   made to impose any fixed meaning on the sixteen bits made available
   by this approach. If enough interest is generated, discussion of
   the use of those sixteen bits would be needed.


