L2/02-073

To: UTC
Re: Cambodian
From: Mark Davis
Date: 2002-02-08

We will have the opportunity to meet with Cambodian representatives during the next UTC meeting. Here are some thoughts on background that it would be useful to convey to them.

Process. ISO and Unicode have been working together on the standard for about 10 years, and there has been opportunity for the Cambodian government or individuals to be involved in the process. But we can understand that, due to circumstances beyond their control, the Cambodian government may not have been in a position to be involved during this time, and that they may be unhappy with the situation. What we have to do is sit down and discuss how features of the Cambodian language can be accommodated by appropriate additions to the Unicode standard.

Size. It has been said that the COENG model results in a 20% increase in size for Cambodian. This may or may not be the case with plain text; one would have to see the figures. In any event, the Unicode standard does not attempt to be a compression standard. Encoding all possible Indic clusters, for example, would compress text significantly, but that is not done. It is also best to look not at raw byte proportions per character, but at real UTF-8 text with equivalent translated content. For example, here are a number of translated pages containing the same textual content, with their byte counts. (They are all linked from http://www.unicode.org/unicode/standard/WhatIsUnicode.html.)

09,618            s-chinese.html
09,682            t-chinese.html
10,110            esperanto.html
10,279            maltese.html
10,475            icelandic.html
10,632            czech.html
10,660            welsh.html
10,808            danish.html
10,856            swedish.html
10,863            polish.html
10,864            spanish.html
10,955            interlingua.html
11,000            italian.html
11,038            lithuanian.html
11,044            portuguese.html
11,096            romanian.html
11,106            german.html
11,119            arabic.html
11,134            korean.html
11,281            french.html
11,462            japanese.html
13,892            persian.html
14,808            WhatIsUnicode.html (English, but with additional content)
14,028            greek.html
14,632            russian.html
15,218            hindi.html
15,853            deseret.html
16,069            georgian.html

It would be instructive to produce a Unicode Cambodian translation for that page, simply to compare it to the above figures. Yet even the comparisons here are not really representative. The actual size of the representation of text in modern documents is completely swamped by the size of graphics, sound, and structure (e.g. HTML/XML code).
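As a rough illustration of how such a measurement could be made, the following Python sketch counts the UTF-8 bytes in a piece of text and the share of those bytes taken up by COENG (U+17D2). The function names are hypothetical, chosen only for this example, and the sample is a single word, so the resulting figure says nothing about running Cambodian text; a real comparison would need a full translated page.

```python
# Minimal sketch: measure UTF-8 size and the share of bytes spent on COENG.
# Function names are illustrative assumptions, not part of any standard.

COENG = "\u17D2"  # KHMER SIGN COENG


def utf8_size(text: str) -> int:
    """Return the number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


def coeng_share(text: str) -> float:
    """Return the fraction of the UTF-8 bytes that are COENG characters."""
    total = utf8_size(text)
    if total == 0:
        return 0.0
    return text.count(COENG) * utf8_size(COENG) / total


# The word "Khmer": KHA, COENG, MO, VOWEL SIGN AE, RO
sample = "\u1781\u17D2\u1798\u17C2\u179A"

if __name__ == "__main__":
    print(utf8_size(sample), "bytes in UTF-8")
    print(f"{coeng_share(sample):.0%} of those bytes are COENG")
```

Running the same two functions over a complete translated page, rather than one word, would give a figure that can be set beside the byte counts listed above.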

Constraints. There are a few important constraints on changes in the standard that the Cambodian representatives should be aware of:

National Standard. The Cambodians are free to develop their own standard and register it with IANA. This would at least provide a well-defined mechanism for converting Unicode to and from that standard. (Part of the IANA registration would be supplying the mapping to and from the equivalent Unicode representation.)
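To make the idea of such a mapping concrete, here is a minimal Python sketch of a round-trip conversion between a single-byte code page and Unicode. The byte assignments (0xA1 to 0xA3) are invented purely for illustration and do not correspond to any existing or proposed Cambodian standard; the function names are likewise hypothetical.

```python
# Minimal sketch of a code page <-> Unicode mapping table.
# All byte values below are HYPOTHETICAL; no real standard is implied.

TO_UNICODE = {
    0xA1: "\u1780",  # KHMER LETTER KA
    0xA2: "\u1781",  # KHMER LETTER KHA
    0xA3: "\u17D2",  # KHMER SIGN COENG
}
FROM_UNICODE = {ch: byte for byte, ch in TO_UNICODE.items()}


def decode(data: bytes) -> str:
    """Convert code page bytes to Unicode; ASCII bytes pass through."""
    return "".join(chr(b) if b < 0x80 else TO_UNICODE[b] for b in data)


def encode(text: str) -> bytes:
    """Convert Unicode text to code page bytes; ASCII passes through."""
    return bytes(
        ord(ch) if ord(ch) < 0x80 else FROM_UNICODE[ch] for ch in text
    )


if __name__ == "__main__":
    original = "a\u1780\u17D2\u1781"
    assert decode(encode(original)) == original  # lossless round trip
```

A character with no entry in the table raises a KeyError here, which reflects the point that conversion is only well defined where the registered mapping covers the character.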

However, this may be counterproductive. Much of the infrastructure of computing worldwide is already Unicode, and the rest is moving in that direction. The Unicode standard is extensively implemented, and forms the basis for all modern textual representation, for many crucial standards such as XML and Java, and for major OSs and application suites such as Windows XP and Office XP. For more information, see:

http://www.unicode.org/unicode/standard/WhatIsUnicode.html
http://www.unicode.org/unicode/standard/where/

While the Cambodians' desire for their own code page is understandable, having a separate code page may actually impede the progress of adapting software to work with Cambodian. We hope that further discussions can help to make it clear whether or not a national code page would be of overall benefit to Cambodia.

Representation. In any event, the bottom line is: Is there any Cambodian text that cannot be represented with the characters and COENG model as expressed in the Unicode standard? To the best of our knowledge, modern text can be represented. If there are missing characters, then those can be added, as long as they do not duplicate a way of representing the text that is already encoded.

As long as text can be represented, there is no barrier to any usage of Cambodian, including rendering, spell-checking, and sorting. Microsoft, for example, demonstrated that to Cambodian representatives at the last SC2 meeting.