Re: Unicode & space in programming & l10n

From: Mark Davis ([email protected])
Date: Sun Sep 17 2006 - 20:42:23 CDT

Next message: Steve Summit: "Re: Unicode & space in programming & l10n"

Previous message: Doug Ewell: "Re: Unicode & space in programming & l10n"
In reply to: Doug Ewell: "Re: Unicode & space in programming & l10n"
Next in thread: Doug Ewell: "Re: Unicode & space in programming & l10n"
Reply: Doug Ewell: "Re: Unicode & space in programming & l10n"

Frankly, I think the reason why SCSU and BOCU never got a lot of traction is
related to #1 on my list. That is, in the vast majority of cases UTF-16 or
UTF-8 has storage characteristics that are good enough -- it's just not
really worth taking extra steps to squeeze out more. The only small-string
compression scheme to gain fairly wide acceptance, for different reasons, is
Punycode. (All three of them are roughly comparable in compression ratio
over the samples I gave, although they differ in other characteristics.)
Of course, ZIP and related compression schemes do a pretty good job on text
in any of these languages encoded in Unicode, so they can be applied to
reduce sizes for any and all of them, in appropriate circumstances.

Mark
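
To make the size comparison above concrete, here is a rough sketch in
Python. It is an illustration added for the reader, not the measurement
behind the figures in this thread: the sample strings and the function names
small_string_sizes and zlib_sizes are made up for the example, zlib stands in
for "ZIP and related" compression, and SCSU and BOCU are left out because the
Python standard library has no codec for them.

```python
import zlib


def small_string_sizes(text):
    """Byte counts for one short string under three encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),  # -le variant: no byte-order mark
        "punycode": len(text.encode("punycode")),
    }


def zlib_sizes(text):
    """(raw, compressed) byte counts when zlib is fed UTF-8 or UTF-16."""
    out = {}
    for codec in ("utf-8", "utf-16-le"):
        raw = text.encode(codec)
        out[codec] = (len(raw), len(zlib.compress(raw, 9)))
    return out


# Hypothetical short samples, chosen only for illustration.
for word in ("bücher", "привет", "日本語"):
    print(word, small_string_sizes(word))

# Repeating one sentence makes zlib look better than it would on real text;
# substitute a genuine document to get meaningful numbers.
paragraph = "Съешь же ещё этих мягких французских булок, да выпей чаю. " * 20
print(zlib_sizes(paragraph))
```

Very short strings are the case where a general-purpose compressor has
little to work with, which is presumably why a dedicated small-string scheme
found a niche there; the paragraph-sized input is where zlib starts to pay
off for either encoding.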

On 9/17/06, Doug Ewell <[email protected]> wrote:
>
> This sounds remarkably like the study by Steven Atkin and Ryan
> Stansifer, quoted in UTN #14, which attempted to prove 8-bit legacy
> encodings -- optimized for a single language or family of languages --
> are superior to Unicode because they encode those languages in fewer
> bytes than Unicode, and because a particular compression scheme
> (Burrows-Wheeler) compresses all encodings roughly equally.
>
> Better support for SCSU over the past 8 years or so, from Unicode and
> from industry, might have been able to put these complaints to rest.
> SCSU compresses most non-CJK text to 1 byte per character, and most CJK
> text to 2 bytes per character, the same as legacy charsets. Because
> SCSU was relegated to the realm of "a higher-level protocol" and Unicode
> continued to be represented until 2001 as primarily a 16-bit encoding,
> very useful encoding scheme never got off the ground.
>
> I would add that the heading "English bias" perpetuates a common and
> destructive myth. 8-bit legacy encodings exist that support dozens of
> languages besides English. To the extent that C and database
> development tools exhibit a "bias" (which the passage does not prove),
> it is a bias in favor of 8-bit legacy encodings and not the English
> language.
>
> --
> Doug Ewell
> Fullerton, California, USA
> http://users.adelphia.net/~dewell/
> RFC 4645 * UTN #14
>
>
> ----- Original Message -----
> From: Don Osborn
> To: [email protected]
> Sent: Sunday, September 17, 2006 10:40
> Subject: Unicode & space in programming & l10n
>
>
> A study published last year* mentioned the impact of Unicode's space
> requirements in aspects of programming and localization. How big an
> issue is the "size" requirement of Unicode for programmers these days,
> in terms of its wider potential use? (Some short excerpts are appended
> after the citation). DZO
>
>
> Paolillo, John. 2005. "Language Diversity on the Internet." In
> Paolillo, John, Daniel Pimienta, Daniel Prado, et al., eds. Measuring
> linguistic diversity on the Internet. A collection of papers. Montreal:
> UNESCO. (CI.2005/WS/06)
> http://unesdoc.unesco.org/images/0014/001421/142186e.pdf
>
>
> p. 47 (in the context of bias against localizing in diverse scripts):
>
> Technical bias arises in encoding schemes for text such as Unicode
> UTF-8, which causes text in a non-roman script to require two to three
> times more space than comparable text in a roman script. Here, the
> motivation stems from issues of compatibility between older roman-based
> systems and more recent Unicode systems.
>
> p. 73 (in discussion of encoding & multilingual ICT)
>
> In its most basic form, UTF-32, Unicode text occupies four times as much
> space as the same text in ASCII. Many software developers have assumed
> that users would not want this penalty for multilingual text,
> particularly if computer use occurs mainly in monolingual contexts.[24]
> Unicode offers other variable-length encodings that are more efficient,
> but the space costs are passed on to non-roman scripts which are forced
> to consume more space. Although data storage costs have dropped
> considerably in the last decade, enough to make Unicode less of a
> problem, handling Unicode still substantially complicates the software
> developer's task, since most applications require inter-operability with
> ASCII. In addition, the larger sizes of Unicode documents carry costs
> for transmission, compression and decompression, and these costs are
> enough of a penalty to discourage use of Unicode in some contexts.
>
> p. 74 (English bias in markup & programming languages)
>
> Unfortunately, many commonly-used programming languages such as C do not
> yet offer standard support for Unicode.[25] A growing number of languages
> designed for Web-based applications do (examples include Java,
> JavaScript, Perl, PHP, Python, and Ruby, all of which are widely
> adopted), but other systems, such as database software, vary more in
> their support for Unicode.
>
> [Footnote 25: The International Components for Unicode website offers an
> open-source C library that assists in Unicode support
> (http://oss.software.ibm.com/icu/).]
>
>
>
>
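
The per-character figures quoted from pp. 47 and 73 ("two to three times
more space" in UTF-8, "four times" in UTF-32) can be checked with a few
lines of Python. This is only a sketch: the function name bytes_per_char and
the sample words are invented for the illustration, and "character" here
means a Unicode code point, which is not always what a reader would count as
one letter.

```python
def bytes_per_char(text):
    """Average encoded bytes per code point in the three Unicode encoding forms."""
    n = len(text)  # len() counts code points, not user-perceived characters
    return {
        codec: len(text.encode(codec)) / n
        # The -le variants are used so that no byte-order mark is counted.
        for codec in ("utf-8", "utf-16-le", "utf-32-le")
    }


# Hypothetical samples: Latin, Cyrillic, Devanagari, Japanese.
for word in ("hello", "привет", "नमस्ते", "こんにちは"):
    print(word, bytes_per_char(word))
```

For these samples UTF-8 costs 1 byte per code point for the ASCII word, 2 for
Cyrillic and 3 for Devanagari and kana, while UTF-16 stays at 2 and UTF-32 at
4 throughout. That matches the quoted ratios relative to ASCII, though it
says nothing about how the same text would fare in a legacy 8-bit charset or
after compression.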


This archive was generated by hypermail 2.1.5 : Sun Sep 17 2006 - 20:51:22 CDT