From: Don Osborn (dzo@bisharat.net)
Date: Sun Sep 17 2006 - 12:40:12 CDT
A study published last year* mentioned the impact of Unicode's space
requirements on programming and localization. How big an issue is
the "size" requirement of Unicode for programmers these days, in terms of
its wider potential use? (Some short excerpts are appended after the
citation.)

DZO
* Paolillo, John. 2005. "Language Diversity on the Internet." In Paolillo,
John, Daniel Pimienta, Daniel Prado, et al., eds. Measuring linguistic
diversity on the Internet. A collection of papers. Montreal: UNESCO.
(CI.2005/WS/06) http://unesdoc.unesco.org/images/0014/001421/142186e.pdf
p. 47 (in the context of bias against localizing in diverse scripts):
Technical bias arises in encoding schemes for text such as Unicode UTF-8,
which causes text in a non-roman script to require two to three times more
space than comparable text in a roman script. Here, the motivation stems
from issues of compatibility between older roman-based systems and more
recent Unicode systems.
p. 73 (in discussion of encoding & multilingual ICT):
In its most basic form, UTF-32, Unicode text occupies four times as much
space as the same text in ASCII. Many software developers have assumed that
users would not want this penalty for multilingual text, particularly if
computer use occurs mainly in monolingual contexts.[24] Unicode offers other
variable-length encodings that are more efficient, but the space costs are
passed on to non-roman scripts which are forced to consume more space.
Although data storage costs have dropped considerably in the last decade,
enough to make Unicode less of a problem, handling Unicode still
substantially complicates the software developer's task, since most
applications require inter-operability with ASCII. In addition, the larger
sizes of Unicode documents carry costs for transmission, compression and
decompression, and these costs are enough of a penalty to discourage use of
Unicode in some contexts.
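
The size figures quoted in the excerpts above can be checked directly. What
follows is a minimal sketch, assuming Python 3; the sample words and the
helper names (SAMPLES, bytes_per_char, size_table) are illustrative
assumptions and do not come from the study. It reports the average number
of bytes per code point for a short word in four scripts under UTF-8,
UTF-16 and UTF-32, without a byte-order mark.

```python
# Sample words are illustrative assumptions, not data from the study.
SAMPLES = {
    "English (Latin)": "language",
    "Russian (Cyrillic)": "язык",
    "Hindi (Devanagari)": "भाषा",
    "Chinese (Han)": "语言",
}


def bytes_per_char(text, encoding):
    """Average encoded bytes per code point (no byte-order mark)."""
    return len(text.encode(encoding)) / len(text)


def size_table(samples=SAMPLES):
    """Map each sample label to its bytes-per-code-point in three encodings."""
    rows = {}
    for label, text in samples.items():
        rows[label] = {
            "utf-8": bytes_per_char(text, "utf-8"),
            # The "-le" variants are used so that no BOM is counted.
            "utf-16": bytes_per_char(text, "utf-16-le"),
            "utf-32": bytes_per_char(text, "utf-32-le"),
        }
    return rows


if __name__ == "__main__":
    for label, row in size_table().items():
        cells = "  ".join(f"{name}={value:.1f}" for name, value in row.items())
        print(f"{label:20s} {cells}")
```

For these particular samples, UTF-8 uses 1 byte per character for the
English word, 2 for the Cyrillic word and 3 for the Devanagari and Han
words, while UTF-32 uses 4 bytes per character in every case. Counting per
character does not by itself compare "comparable text" across languages,
since the number of characters needed to express the same content varies
by script.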
p. 74 (English bias in markup & programming languages):
Unfortunately, many commonly-used programming languages such as C do not yet
offer standard support for Unicode.[25] A growing number of languages designed
for Web-based applications do (examples include Java, JavaScript, Perl, PHP,
Python, and Ruby, all of which are widely adopted), but other systems, such
as database software, vary more in their support for Unicode.
[Footnote 25 The International Components for Unicode website offers an
open-source C library that assists in Unicode support
(http://oss.software.ibm.com/icu/).]