Below my signature is the first draft (after a short internal review)
of what might become a proposal to the Unicode Technical Committee.
We are posting it here to solicit comments from the Unicode community
before pursuing it further.
-----------------------------------------------------------------------------
mleisher@crl.nmsu.edu
Mark Leisher "The trick is not gaining the knowledge,
Computing Research Lab but surviving the lessons."
New Mexico State University -- "Svaha," Charles de Lint
Box 30001, Dept. 3CRL
Las Cruces, NM 88003
-----------------------------------------------------------------------------
Adding Embedded Language Identifiers to Unicode Text
Daniel Wood, Mark Davis and Mark Leisher
dwood@crl.nmsu.edu
madavis@crl.nmsu.edu
mleisher@crl.nmsu.edu
Computing Research Lab
New Mexico State University
DATE: September 1, 1995
CHANGE DATE: September 4, 1995
ABSTRACT:
Although the Unicode standard does not define any kind of language
identification approach, in many areas of multilingual text
processing knowing the language of the text is useful.
If the language of a sequence of text is known, then handling language
changes in the application becomes easier. This allows efficient
implementation of switching needed for activities such as
spell-checking and word segmentation in multilingual documents.
Currently, language identification is done with higher level
protocols, often implemented as markup on the text (e.g. RTF, SGML).
This makes language identification application-specific. Importing
text containing multiple languages from different applications
complicates the language identification process.
In the interest of promoting a common approach to embedding language
identifiers in Unicode text, we offer the following proposal.
BODY:
0. Intro
To begin, we need an enumeration of human languages to work from--
one that is available in electronic form, fairly complete (to make
the researchers happy) and potentially subject to a standardization
process.
We first looked to ANSI/NISO Z39.53-1994 (Codes for the
Representation of Languages for Information Interchange), which
defines three-letter codes for approximately 400 languages.
We felt that this coverage was not sufficient for research
purposes, and propose instead that the reference list of languages
be based on the Ethnologue (World Genetic Tree) provided by the
Summer Institute of Linguistics (available from
ftp.sil.org:DSK_PUBLIC:[FILESERV.FTP]). This list provides
three-letter codes for 6790 languages.
The sections below provide the technical details of our proposed
approach.
1. Terminology
Language identifier - an unsigned 16-bit integer value
2. Define a new Private Use Area Unicode block E100-E10F with sixteen
codepoints:
START  END   BLOCK NAME
-----  ----  ----------------
E100   E10F  LANGUAGE ID BITS
The choice of this range of the Private Use Area was based entirely
on our particular usage of the Private Use Area. The range may
well conflict with current usage by others.
If this proposal generates enough interest, we would welcome
recommendations for moving these codepoints to some other area to
avoid such conflicts.
3. Define 16 new codepoints in the following fashion:
CODE NAME
---- ----
E100 LANGUAGE ID BIT ZERO
E101 LANGUAGE ID BIT ONE
E102 LANGUAGE ID BIT TWO
E103 LANGUAGE ID BIT THREE
E104 LANGUAGE ID BIT FOUR
E105 LANGUAGE ID BIT FIVE
E106 LANGUAGE ID BIT SIX
E107 LANGUAGE ID BIT SEVEN
E108 LANGUAGE ID BIT EIGHT
E109 LANGUAGE ID BIT NINE
E10A LANGUAGE ID BIT TEN
E10B LANGUAGE ID BIT ELEVEN
E10C LANGUAGE ID BIT TWELVE
E10D LANGUAGE ID BIT THIRTEEN
E10E LANGUAGE ID BIT FOURTEEN
E10F LANGUAGE ID BIT FIFTEEN
4. Algorithm:
While scanning text, for each contiguous group of codepoints in the
"LANGUAGE ID BITS" block, construct the language identifier by
setting the bit specified by the codepoint in an unsigned 16-bit
integer variable.
Sample C pseudo-code:
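unsigned short c, mask;

/* next_character() and change_to_language() are supplied by the application. */
c = next_character();
mask = 0;
while (c >= 0xe100 && c <= 0xe10f) {
    /* Each codepoint sets the bit given by its offset in the block. */
    mask |= (unsigned short)(1u << (c - 0xe100));
    c = next_character();
}
/*
 * If the mask is 0, then no language id bit codepoints
 * were encountered.
 */
if (mask != 0)
    change_to_language(mask);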
5. Technical benefits
A. Only sixteen codepoints are used.
B. Construction of a language id from a stream of Unicode text is
efficient and can be done the same way while scanning text
forward or backward.
C. Generation of a language id bit codepoint stream is simply a
matter of iterating through sixteen bits of an integer.
D. The ordering of the language id bit codepoints for the Unicode
Character Equivalency algorithm is implicit.
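Benefits B and C can be illustrated with a minimal sketch in C. The
function names encode_language_id and decode_language_id_backward, the
two macros, and the use of a plain array of 16-bit units for the text
are assumptions made for this example only; they are not part of the
proposal.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANG_ID_BASE 0xE100u  /* first codepoint of the proposed block */
#define LANG_ID_LAST 0xE10Fu  /* last codepoint of the proposed block  */

/*
 * Write one codepoint per set bit of `id` into `out`, lowest bit first.
 * `out` must have room for 16 entries.  Returns the number written.
 */
size_t encode_language_id(uint16_t id, uint16_t *out)
{
    size_t n = 0;
    unsigned bit;

    assert(out != NULL);
    for (bit = 0; bit < 16; bit++) {
        if (id & (1u << bit))
            out[n++] = (uint16_t)(LANG_ID_BASE + bit);
    }
    return n;
}

/*
 * Rebuild the identifier from the run of language id bit codepoints
 * that ends just before index `end`, scanning backward.  Each codepoint
 * sets an independent bit, so the result does not depend on the scan
 * direction.  Returns 0 if no such run is present.
 */
uint16_t decode_language_id_backward(const uint16_t *text, size_t end)
{
    uint16_t id = 0;

    while (end > 0 && text[end - 1] >= LANG_ID_BASE
                   && text[end - 1] <= LANG_ID_LAST) {
        id |= (uint16_t)(1u << (text[end - 1] - LANG_ID_BASE));
        end--;
    }
    return id;
}
```

For example, encoding the identifier 0x0005 writes the two codepoints
E100 and E102, and the backward decoder applied to that pair returns
0x0005 again. Note that an identifier of zero produces no codepoints
at all, which matches the sample code in section 4, where a mask of 0
means that no language id bit codepoints were encountered.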
6. Representation efficiency
A. Possible bit allocation
Given that there are currently 6790 languages specified in the
Ethnologue, we only really need thirteen bits (8192 values) to
represent language identifiers. However, since no atomic integer
types are thirteen bits in size, we went ahead and assumed sixteen,
which allows up to 65535 language identifiers (the value zero has
no bits set and so cannot be expressed as a codepoint sequence).
Since there are sixteen of the bit specifiers, it is possible
that up to sixteen codepoints have to be processed. This can be
reduced in the vast majority of cases by intelligently assigning
integer values to languages that frequently appear in on-line
text.
The frequently appearing languages would be assigned integer
values which have a minimum number of non-zero bits, each
non-zero bit represented by one codepoint.
In addition, related languages could be clustered, with exemplar
languages (determined by appearance frequency in common corpora)
assigned to numbers with mostly zero bits and related languages
assigned to numbers with increasingly denser quantities of
non-zero bits. This strategy would both reduce the average
quantity of language-identifier codes appearing in text and
maintain similarities between the codes for related languages.
At this time, we have not done any research on identifying
"frequently appearing" languages for the purposes of improving
the efficiency of this approach.
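The cost argument above can be made concrete with a short sketch in C.
The function names codepoints_needed and identifiers_with_cost are
hypothetical and exist only for this example. The first counts the
non-zero bits of an identifier, which is the number of codepoints it
occupies in text; the second counts how many 16-bit identifiers have
exactly k non-zero bits.

```c
#include <stdint.h>

/* Number of codepoints needed to embed `id`: one per non-zero bit. */
unsigned codepoints_needed(uint16_t id)
{
    unsigned count = 0;

    while (id != 0) {
        id &= (uint16_t)(id - 1);   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Number of 16-bit identifiers with exactly `k` non-zero bits: C(16, k). */
unsigned long identifiers_with_cost(unsigned k)
{
    unsigned long result = 1;
    unsigned i;

    if (k > 16)
        return 0;
    for (i = 1; i <= k; i++)
        result = result * (16 - k + i) / i;   /* exact at every step */
    return result;
}
```

By this count there are 16 identifiers that need one codepoint, 120
that need two, 560 that need three, 1820 that need four and 4368 that
need five, for a total of 6884. If identifiers were assigned with this
in mind, all 6790 Ethnologue languages could in principle be given
identifiers of at most five codepoints, with the shortest ones
reserved for the most frequently appearing languages.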
B. Possible representation problems
This approach is susceptible to problems when corruption of the
text occurs. Reconstruction of missing language id bit
codepoints would be very difficult. But this problem can occur
in other multi-code approaches as well, and we consider the risk
acceptable.
There may be a cost in efficiency for programming languages that
have little or no bit-level manipulation capability.
7. Application efficiency
If an application relies exclusively on these embedded codes, a
cost in efficiency appears when an interactive language switch
occurs. The application must scan backward to determine the active
language to compare with the language being requested.
In cases where a document contains a significant amount of text in
one language and a small amount of text in another language is
being inserted, the cost due to the needed scan could be
prohibitive.
This potential cost can be avoided by an implementation that keeps
track of the active language as the text is edited, so that the
current language can be looked up without rescanning the text.
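One possible way to provide such a lookup, sketched in C, is a table of
language changes kept sorted by text offset and searched with a binary
search. The struct lang_run type and the language_at function are
assumptions made for this example; the proposal does not prescribe any
particular data structure, and an application would also have to keep
the table up to date as text is inserted and deleted.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical index entry: language `id` takes effect at offset `start`. */
struct lang_run {
    size_t   start;
    uint16_t id;
};

/*
 * Return the language in effect at `offset`, given `runs` sorted by
 * ascending `start`.  Returns 0 (no language set) if `offset` precedes
 * every run.  The search takes a number of steps logarithmic in the
 * number of language changes, regardless of how much text lies between
 * them.
 */
uint16_t language_at(const struct lang_run *runs, size_t nruns, size_t offset)
{
    size_t lo = 0, hi = nruns;   /* looking for the last run with start <= offset */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (runs[mid].start <= offset)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo == 0 ? 0 : runs[lo - 1].id;
}
```

With runs starting at offsets 10, 50 and 200, a query at offset 49
returns the identifier of the first run and a query at offset 50
returns that of the second, without touching the text itself.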
8. Semantics
As this proposal is at an early stage, no real attempt has been
made to impose any fixed meaning on the sixteen bits made available
by this approach. If enough interest is generated, discussion of
the use of those sixteen bits would be needed.