From: Doug Ewell (dewell@adelphia.net)
Date: Sun Sep 17 2006 - 17:52:45 CDT
This sounds remarkably like the study by Steven Atkin and Ryan
Stansifer, quoted in UTN #14, which attempted to prove that 8-bit legacy
encodings -- optimized for a single language or family of languages --
are superior to Unicode because they encode those languages in fewer
bytes than Unicode, and because a particular compression scheme
(Burrows-Wheeler) compresses all encodings roughly equally.
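As a rough illustration of the kind of comparison described above, the following sketch encodes the same text in a legacy 8-bit charset and in two Unicode encoding forms, then compresses each with Python's bz2 module (a Burrows-Wheeler-based compressor). The sample sentence, the choice of KOI8-R, and the function name sizes are assumptions made for this example; they are not taken from the study, and a short repeated sample is no substitute for a real corpus.

```python
import bz2

# Hypothetical sample text: a Russian pangram repeated so the compressor
# has enough input to work with. Not data from the study.
SAMPLE = "Съешь же ещё этих мягких французских булок, да выпей чаю. " * 200


def sizes(text, encodings=("koi8_r", "utf_8", "utf_16_le")):
    """Return {encoding: (raw_bytes, bz2_bytes)} for the given text."""
    result = {}
    for name in encodings:
        raw = text.encode(name)
        packed = bz2.compress(raw)
        # Sanity check: compression must round-trip to the original text.
        assert bz2.decompress(packed).decode(name) == text
        result[name] = (len(raw), len(packed))
    return result


if __name__ == "__main__":
    for name, (raw, packed) in sizes(SAMPLE).items():
        print(f"{name:10s} raw={raw:7d} bz2={packed:6d}")
```

The raw sizes differ by a fixed factor per encoding, while the compressed sizes depend on the text and the compressor, so any conclusion should be drawn from realistic input rather than this toy sample.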
Better support for SCSU over the past 8 years or so, from Unicode and
from industry, might have been able to put these complaints to rest.
SCSU compresses most non-CJK text to 1 byte per character, and most CJK
text to 2 bytes per character, the same as legacy charsets. Because
SCSU was relegated to the realm of "a higher-level protocol" and Unicode
continued to be represented until 2001 as primarily a 16-bit encoding,
industry support for this
very useful encoding scheme never got off the ground.
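The byte counts mentioned above can be checked for the legacy charsets and the standard Unicode encoding forms with a few lines of Python. This is only a sketch: Python's standard library has no SCSU codec, so SCSU itself is not measured here, and the sample words, the charset chosen for each language, and the helper name bytes_per_char are assumptions made for illustration.

```python
# Hypothetical samples: (text, a legacy charset commonly used for that language).
SAMPLES = {
    "Greek": ("Ελληνικά", "iso8859_7"),
    "Russian": ("Русский", "koi8_r"),
    "Japanese": ("日本語", "shift_jis"),
}


def bytes_per_char(text, encoding):
    """Average number of encoded bytes per character of text."""
    return len(text.encode(encoding)) / len(text)


if __name__ == "__main__":
    for language, (text, legacy) in SAMPLES.items():
        # The "_le" variants are used so that no byte order mark is counted.
        row = {
            legacy: bytes_per_char(text, legacy),
            "utf_8": bytes_per_char(text, "utf_8"),
            "utf_16_le": bytes_per_char(text, "utf_16_le"),
            "utf_32_le": bytes_per_char(text, "utf_32_le"),
        }
        print(language, row)
```

For these samples the legacy charsets come out at 1 byte per character for Greek and Russian and 2 bytes per character for Japanese, which is the baseline the paragraph above says SCSU matches for most text.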
I would add that the heading "English bias" perpetuates a common and
destructive myth. 8-bit legacy encodings exist that support dozens of
languages besides English. To the extent that C and database
development tools exhibit a "bias" (which the passage does not prove),
it is a bias in favor of 8-bit legacy encodings and not the English
language.
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
RFC 4645 * UTN #14

----- Original Message -----
From: Don Osborn
To: unicode@unicode.org
Sent: Sunday, September 17, 2006 10:40
Subject: Unicode & space in programming & l10n

A study published last year* mentioned the impact of Unicode’s space requirements in aspects of programming and localization. How big an issue is the “size” requirement of Unicode for programmers these days, in terms of its wider potential use? (Some short excerpts are appended after the citation).

DZO

Paolillo, John. 2005. “Language Diversity on the Internet.” In Paolillo, John, Daniel Pimienta, Daniel Prado, et al, eds. Measuring linguistic diversity on the Internet. A collection of papers. Montreal: UNESCO. (CI.2005/WS/06)
http://unesdoc.unesco.org/images/0014/001421/142186e.pdf

p. 47 (in the context of bias against localizing in diverse scripts):

Technical bias arises in encoding schemes for text such as Unicode UTF-8, which causes text in a non-roman script to require two to three times more space than comparable text in a roman script. Here, the motivation stems from issues of compatibility between older roman-based systems and more recent Unicode systems.

p. 73 (in discussion of encoding & multilingual ICT)

In its most basic form, UTF-32, Unicode text occupies four times as much space as the same text in ASCII. Many software developers have assumed that users would not want this penalty for multilingual text, particularly if computer use occurs mainly in monolingual contexts.[24] Unicode offers other variable-length encodings that are more efficient, but the space costs are passed on to non-roman scripts which are forced to consume more space. Although data storage costs have dropped considerably in the last decade, enough to make Unicode less of a problem, handling Unicode still substantially complicates the software developer’s task, since most applications require inter-operability with ASCII. In addition, the larger sizes of Unicode documents carry costs for transmission, compression and decompression, and these costs are enough of a penalty to discourage use of Unicode in some contexts.

p. 74 (English bias in markup & programming languages)

Unfortunately, many commonly-used programming languages such as C do not yet offer standard support for Unicode.[25] A growing number of languages designed for Web-based applications do (examples include Java, JavaScript, Perl, PHP, Python, and Ruby, all of which are widely adopted), but other systems, such as database software, vary more in their support for Unicode.

[Footnote 25 The International Components for Unicode website offers an open-source C library that assists in Unicode support (http://oss.software.ibm.com/icu/).]