L2/02-073
To: UTC
Re: Cambodian
From: Mark Davis
Date: 2002-02-08
We will have the opportunity to meet with Cambodian representatives during the next UTC meeting. Here are some thoughts on background that it would be useful to convey to them.
Process. ISO and Unicode have been working together on the standard
for about 10 years, and there has been opportunity for the Cambodian government
or individuals to be involved in the process. But we can understand that due to
circumstances beyond their control, the Cambodian government may not have been
in a position to be involved during this time, and that they may be unhappy with
the situation. What we have to do is to sit down and discuss how features of the
Cambodian language can be accommodated by appropriate additions to the
Unicode standard.
Size. It has been said that the COENG model results in a 20% increase in
size for Cambodian. This may or may not be the case with plain text; one would have
to see the figures. In any event, the Unicode standard does not attempt to be a
compression standard. Encoding all possible Indic clusters, for example, would
compress text significantly, but that is not done. It is best to look not at
raw byte proportions per character, but at real UTF-8 text with equivalent
translated content. For example, here are a number of translated pages
containing the same textual content, with their byte-counts. (They are all
linked from http://www.unicode.org/unicode/standard/WhatIsUnicode.html)
09,618 s-chinese.html
09,682 t-chinese.html
10,110 esperanto.html
10,279 maltese.html
10,475 icelandic.html
10,632 czech.html
10,660 welsh.html
10,808 danish.html
10,856 swedish.html
10,863 polish.html
10,864 spanish.html
10,955 interlingua.html
11,000 italian.html
11,038 lithuanian.html
11,044 portuguese.html
11,096 romanian.html
11,106 german.html
11,119 arabic.html
11,134 korean.html
11,281 french.html
11,462 japanese.html
13,892 persian.html
14,808 WhatIsUnicode.html (English, but with additional content)
14,028 greek.html
14,632 russian.html
15,218 hindi.html
15,853 deseret.html
16,069 georgian.html
It would be instructive to produce a Unicode Cambodian translation for that page, simply to compare it to the above figures. Yet even the comparisons here are not really representative. The actual size of the representation of text in modern documents is completely swamped by the size of graphics, sound, and structure (e.g. HTML/XML code).
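As a rough illustration of the kind of measurement suggested above, the following Python sketch counts the UTF-8 bytes of a piece of Khmer text and the portion of those bytes taken up by COENG (U+17D2). The function name utf8_size_report and the sample word are assumptions made for this example only; a single word says nothing about the proportion in running text, which would have to be measured on real translated content.

```python
# Minimal sketch: how many UTF-8 bytes does a text take, and how many
# of those bytes are COENG? The helper name is hypothetical.

COENG = "\u17d2"  # KHMER SIGN COENG


def utf8_size_report(text: str) -> dict:
    """Return code point and UTF-8 byte counts for text, plus the COENG share."""
    total_bytes = len(text.encode("utf-8"))
    # Each COENG is one code point in the U+0800..U+FFFF range: 3 bytes in UTF-8.
    coeng_bytes = text.count(COENG) * len(COENG.encode("utf-8"))
    return {
        "code_points": len(text),
        "utf8_bytes": total_bytes,
        "coeng_bytes": coeng_bytes,
        "coeng_share": coeng_bytes / total_bytes if total_bytes else 0.0,
    }


if __name__ == "__main__":
    # Sample word only (KHA, COENG, MO, VOWEL SIGN AE, RO); not a corpus.
    sample = "\u1781\u17d2\u1798\u17c2\u179a"
    print(utf8_size_report(sample))
```

Running the same function over the text of a full translated page, and comparing the result with the size of the page including its markup, would give figures comparable to the list above.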
Constraints. There are a few important constraints on changes in the standard that the Cambodian representatives should be aware of:
National Standard. The Cambodians are free to develop their own
standard and register it with IANA. This would at least provide a well-defined
mechanism for converting Unicode to and from that standard. (Part of the IANA
registration would be supplying the mapping to and from the equivalent Unicode
representation).
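To make the idea of a mapping to and from Unicode concrete, here is a minimal Python sketch of a round-trip conversion. Every byte value in the table is invented for illustration; no actual Cambodian code page is being described, and the names TO_UNICODE, decode, and encode are hypothetical. The sketch assumes a code page that assigns a single code to a subscript consonant, which maps to the two-code-point sequence COENG + consonant in Unicode.

```python
# HYPOTHETICAL byte assignments, invented for this example only.
TO_UNICODE = {
    0xA1: "\u1781",        # KHA
    0xA2: "\u1798",        # MO
    0xA3: "\u179a",        # RO
    0xB2: "\u17d2\u1798",  # subscript MO as one code -> COENG + MO
    0xC1: "\u17c2",        # VOWEL SIGN AE
}

# Reverse table, keyed by the Unicode sequence.
FROM_UNICODE = {seq: byte for byte, seq in TO_UNICODE.items()}
MAX_SEQ_LEN = max(len(seq) for seq in FROM_UNICODE)


def decode(data: bytes) -> str:
    """Convert bytes in the hypothetical code page to a Unicode string."""
    return "".join(TO_UNICODE[b] for b in data)


def encode(text: str) -> bytes:
    """Convert a Unicode string to the hypothetical code page.

    Tries the longest Unicode sequence first, so COENG + MO is matched
    as one unit before any single code point is considered.
    """
    out = bytearray()
    i = 0
    while i < len(text):
        for n in range(MAX_SEQ_LEN, 0, -1):
            chunk = text[i:i + n]
            if chunk in FROM_UNICODE:
                out.append(FROM_UNICODE[chunk])
                i += len(chunk)
                break
        else:
            raise ValueError(f"no mapping for U+{ord(text[i]):04X}")
    return bytes(out)


if __name__ == "__main__":
    word = "\u1781\u17d2\u1798\u17c2\u179a"
    legacy = encode(word)
    assert decode(legacy) == word  # round trip preserves the text
```

The point of the sketch is only that such a mapping has to be defined in both directions and has to round-trip; it does not suggest what a real national standard would contain.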
However, this may be counterproductive. Much of the infrastructure of computing
worldwide is already Unicode, and the rest is moving in that direction. The
Unicode standard is extensively implemented, and forms the basis for all modern
textual representation, for many crucial standards such as XML and Java, and for
major OSs and application suites such as Windows XP and Office XP. For more
information, see:
http://www.unicode.org/unicode/standard/WhatIsUnicode.html
http://www.unicode.org/unicode/standard/where/
While the Cambodians' desire for their own code page is understandable, having a separate code page may actually impede the progress of adapting software to work with Cambodian. We hope that further discussions can help to make it clear whether or not a national code page would be of overall benefit to Cambodia.
Representation. In any event, the bottom line is: Is there any Cambodian text that cannot be represented with the characters and COENG model as expressed in the Unicode standard? To the best of our knowledge, modern text can be represented. If there are missing characters, then those can be added, as long as they do not duplicate a way of representing the text that is already encoded.
As long as text can be represented, there is no barrier to any usage of Cambodian, including rendering, spell-checking, and sorting. Microsoft, for example, demonstrated this to Cambodian representatives at the last SC2 meeting.