Re: 1 in 1000

From: Edward Cherlin
Date: Mon Apr 24 2000 - 18:01:00 EDT

At 6:29 AM -0800 4/24/2000, Elliotte Rusty Harold wrote:
>Is the following statement accurate?
>Probably less than one person in a thousand today speaks a language that
>cannot be reasonably represented in Unicode.
>Can anyone be more accurate than that? If the number is higher than 1 in
>a 1000, what scripts still need to be encoded to get the ratio below 1
>in a 1000? If it's already much less than 1 in a 1000, how low is it
>approximately? 1 in 10,000? 100,000? 1,000,000?
>| Elliotte Rusty Harold | | Writer/Programmer |
>| Java I/O (O'Reilly & Associates, 1999) |
>| Read Cafe au Lait for Java news: |
>| Read Cafe con Leche for XML news: |

We went over that ground last year, and didn't find any clearly true
and clearly understandable statements that we could make about
individuals, publications, languages, or writing systems. I have made
statements about the number of writing systems (including e.g. math
and APL) partially supported in Unicode (none is *completely*
supported for all of its languages and users), and about the fraction
(with large error bars) of commercial, academic, and government use
that this covers, while also noting scripts (200+), languages
(6000+), and uses (not quantifiable) not properly supported yet, or
not included at all. Naturally, all of these statements applied only
to the version of Unicode defined at the time, and to available

I could do the same for Unicode 3.0, if I had the time, which I
don't. I believe it comes to 30+ writing systems, several hundred
languages, nearly all commercial and government publication, nearly
all academic publication not involving non-covered languages, and
about 0% of academic publication using non-covered language material.

The question at hand is how much non-covered material there is,
measured perhaps by source volume or production volume, or compared
with speech in the languages under study. The more interesting
question, to me, is how much there would be if the scripts and
languages *were* covered, not just in the standard, but in operating
systems (Windows 2000, Mac OS X, various UNIX products, Linux, and
others), programming languages (APL2, Java, Perl, C/C++), publishing
applications (FrameMaker doesn't support Unicode at all), and Web
browsers (Alis Tango, Accent, Mozilla/Netscape).

A more useful measure yet might be the number of major literature
publishing projects using Unicode. Thesaurus Linguae Graecae covers
essentially all of Classical Greek, but not in Unicode. Project
Gutenberg can be counted as UTF-8, since it is plain 7-bit ASCII. I
know of projects in European languages, Arabic, Hebrew, Pali
(Devanagari, Sinhala, Thai and Latin scripts), Sanskrit, Tibetan,
Chinese, Japanese, and Korean. None used Unicode the last time I
looked, although I am sure that many of them are looking at Unicode
conversion for the future.
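
The point about Project Gutenberg rests on the fact that UTF-8 encodes
every 7-bit ASCII character as the same single byte, so a pure ASCII
file is already a valid UTF-8 file. A minimal sketch of that check in
Python follows; the sample string is made up for illustration and is
not taken from any actual Project Gutenberg text.

```python
# Hypothetical sample text; any string limited to 7-bit ASCII behaves the same.
sample = "Project Gutenberg etext, plain 7-bit ASCII."

ascii_bytes = sample.encode("ascii")
utf8_bytes = sample.encode("utf-8")

# UTF-8 encodes code points below 0x80 as the identical single byte,
# so the two encodings of an ASCII-only string match byte for byte.
same_bytes = ascii_bytes == utf8_bytes

# Every byte of the ASCII encoding fits in 7 bits.
all_seven_bit = all(b < 0x80 for b in ascii_bytes)

# Reading the ASCII bytes back as UTF-8 recovers the original text.
round_trip = ascii_bytes.decode("utf-8") == sample

print(same_bytes, all_seven_bit, round_trip)
```

The equivalence holds only while the text stays within ASCII. Once a
character outside that range appears, the file has to be genuinely
encoded as UTF-8 (or something else), and the older projects listed
above are in that position.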

The number of organizations committed to creating and/or publishing
documents in Unicode is another potentially useful measure. I know
about the U.S. Defense Language Institute and Voice of America, but I
haven't been researching this question since shortly after they
converted from Xerox Documenters to Unicode software on PCs.

Edward Cherlin
"A knot!" exclaimed Alice. "Oh, do let me help to undo it."
Alice in Wonderland
