From: Steve Summit (scs@eskimo.com)
Date: Sun Sep 17 2006 - 21:16:47 CDT
Pardon me for making what may sound like a cavalier and
irresponsible argument, and for restating several of the same
points Mark Davis made, and for generally proceeding in a manner
that, I know, won't be convincing to the die-hard skeptics, but:
worrying about any alleged space "inefficiency" of Unicode sounds
like the worst kind of false economy. This is not, after all,
1960, or 1972, or even 1990.
Today, hardly anyone does anything with plain text. Everyone
uses HTML, or XML, or Microsoft Word .doc, or PDF. All of these
formats bloat the byte count -- sometimes quite spectacularly --
beyond what a hypothetical flat-ASCII representation would
consume, yet few are worrying about this. (To be sure, there are
some naysayers and handwringers and foot-draggers here, too, but
the marketplace has generally ignored them, and nothing seems to
have come to a screeching halt in the face of all those popular
yet bulkier formats.)
And it's not just that we've moved past plain text to fancy text:
we've moved past text to graphics, and audio, and video. A few
years ago iPods and other MP3 players were storing absurdly large
amounts of music in absurdly small volumes. Today they're
storing video, too. Given a device that's tricked out with enough
storage to hold useful amounts of video, the amount of *text* it
can store is for all intents and purposes infinite. (Last night
I downloaded Wikipedia -- all of its text -- to my laptop.
Hardly made a dent.)
So even if there were no good reason for it, no one would
(or should) be complaining if for one reason or another text is a
mere factor of 2 bigger than it used to be, when everything else
(the aggregate size of the other data we're trying to store, and
the capacities of the devices we're storing it on) is orders and
orders of magnitude bigger than it used to be.
And, of course, it's not at all the case that "there's no good
reason for it". Internationalization is an eminently worthy goal.
The uniform way in which Unicode permits internationalization is
tremendously beneficial. Sweeping away the old biases in favor
of 8-bit Roman text is a very fine thing. (For someone to be
carping that there's still some "bias" towards Roman scripts even
under Unicode is a stunning example of missing the forest for the
trees.)
Yes, it's more work to write software that uses Unicode than it
was with 7-bit ASCII. But (a) 7-bit ASCII just isn't an option any
more (the world expects i18n), and (b) it's a hell of a lot
easier to use Unicode than to use the welter of incompatible
national character sets it replaced, and (c) it's *happening*.
The support is out there: the tools, the libraries, the fonts,
the whole nine yards. It's not pulling teeth to use Unicode
these days; it's almost as easy as using 7-bit ASCII used to be.
Most of the hard work has been done.
We work in a tremendously wasteful industry. Hundreds if
not thousands of man-years are wasted rewriting existing
functionality in new languages du jour. Megabytes and
gigabytes of memory and disk space are wasted on glitzy little
user interface gewgaws that have nothing to do with fundamental
usability or functionality. Modern programming languages and
development environments allow barely-trained, careless
programmers to churn out mountainously complex systems that,
somehow, mostly work, and are not much more than a factor of 10
or 100 times bigger, and a factor of 10 or 100 times slower,
than equivalently-functional hand-crafted microoptimized
assembler would be. All of this waste, and then some,
disappears inside the relentlessly marching maw of Moore's law.
In the face of all that, am I willing to accept a factor-of-two
expansion in raw text encoding, in order to support worldwide i18n?
In a heartbeat.
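(For anyone who would rather measure that factor than take it on
faith, here is a minimal Python sketch. The sample strings and the
helper name encoded_sizes are illustrative assumptions of mine, not
anything taken from the thread; it simply counts the bytes each
encoding needs for the same text.)

```python
# Rough sketch: compare the storage cost of the same text in
# UTF-8 and UTF-16. The sample strings are illustrative only.

def encoded_sizes(text):
    """Return (code points, UTF-8 bytes, UTF-16 bytes) for text."""
    utf8 = len(text.encode("utf-8"))
    # "utf-16-le" is used so that no byte-order mark is counted.
    utf16 = len(text.encode("utf-16-le"))
    return (len(text), utf8, utf16)

samples = {
    "English": "Hello, world",
    "Japanese": "こんにちは世界",
}

for name, text in samples.items():
    chars, utf8, utf16 = encoded_sizes(text)
    print(f"{name}: {chars} chars, {utf8} bytes UTF-8, {utf16} bytes UTF-16")
```

(Pure ASCII text costs nothing extra in UTF-8 and exactly doubles in
UTF-16; for other scripts the comparison depends on the script and
on whichever legacy encoding you measure against.)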
Now, I do understand that there remain a few aberrations. SMS
text messages, as I understand it, are still limited to 160 characters
or some absurd number. (On phones that all have cameras in them
now, and are themselves beginning to support video!) And there
will always be those few naysayers and handwringers and foot-
draggers, beating the dead horse of the factor-of-N text expansion
as if it's some new revelation, or an interesting argument.
(Personally, I suspect their concern about memory usage is just a
smokescreen for various kinds of xenophobia. Either they don't
want to internationalize at all, or they're still harboring one
of those myopic little grudges about some particular aspect of
the way Unicode did it.) But those aberrations are just that:
aberrations.
I'm no scholar on this subject -- as anyone who cares about
citable references has seen, there weren't any next to any of
the pulled-out-of-the-air numbers I've been brandishing in this
message -- but from where I sit, there's really no argument about
Unicode any more. It's basically here, and by all appearances to
stay, and I'm certainly glad to have it that way.