From: John H. Jenkins (jenkins@apple.com)
Date: Thu May 19 2005 - 10:23:33 CDT
On May 18, 2005, at 8:55 PM, Alexander Kh. wrote:
> That I realize. Especially when it is Microsoft who's paying most
> part of the bill - I totally foresee that their systems will be
> based on what they paid for. However, many people still pay for
> traffic, and switching from local encoding to Unicode will mean
> double the traffic right away. However, if using state-machine
> approach, encodings can be changed on-the-fly by using a special
> escape-code. That's one way of getting benefits of both approaches,
> not to mention the fact that local encodings are more well-thought
> in design.
>
What you're talking about already exists. It's called ISO 2022 and
it was (comparatively) a failure.
The advantages you cite for multiple encodings are real, on the
whole. Unicode *does* add to storage and other overhead, it *does*
consume resources, and it *does* add to the complexity of systems.
If the user intends nothing but brain-dead English, then going beyond
ASCII really is unnecessary.
But--
There are also reasons why Unicode has succeeded in ways that ISO
2022 has not.
1) State engines require keeping track of state. Unicode has the
advantage that you can begin parsing text in the middle and be able
to find your way relatively quickly. Encodings where state must be
tracked mean you can't do this; you need to scan all the way back to
the beginning (potentially) to get your state information.
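
As a rough illustration of point 1, here is a minimal Python sketch. It assumes UTF-8 as the Unicode encoding form and uses ISO-2022-JP (one profile of ISO 2022, as shipped in Python's standard codecs) as the stateful encoding. The helper name utf8_resync and the sample string are made up for the example.

```python
def utf8_resync(data: bytes, offset: int) -> int:
    """Return the first index >= offset that begins a UTF-8 character.

    Continuation bytes always look like 0b10xxxxxx, so at most three
    bytes have to be skipped; nothing before `offset` is examined.
    """
    while offset < len(data) and (data[offset] & 0xC0) == 0x80:
        offset += 1
    return offset


text = "Unicode \u6f22\u5b57 text"          # two Han characters in the middle

# UTF-8: land inside the first Han character and recover locally.
utf8 = text.encode("utf-8")
mid = utf8.index("\u6f22".encode("utf-8")) + 1   # second byte of U+6F22
start = utf8_resync(utf8, mid)
tail_utf8 = utf8[start:].decode("utf-8")         # resumes at the next character

# ISO-2022-JP: ESC $ B switches to JIS X 0208. A reader that starts just
# after that escape never sees it and treats the double-byte data as ASCII.
iso = text.encode("iso2022_jp")
after_escape = iso.index(b"\x1b$B") + 3
tail_iso = iso[after_escape:].decode("iso2022_jp")

print(repr(tail_utf8))
print(repr(tail_iso))
```

The UTF-8 reader loses at most the one character it landed inside. The ISO-2022-JP reader silently produces wrong text unless it scans backwards for the most recent escape sequence.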
2) For system and application developers, the complexity does not go
away. My company does business throughout the world, and so we have
to be prepared for our software and the software written for our
system to work with all the writing systems of the world. Indeed,
even if we didn't do business throughout the world, the problem
wouldn't go away. Wandering around Hong Kong, for example, where one
would think
that Han and English were enough, I see signs in Nepalese and Thai.
I don't even care to list the number of languages on signage and in
newspapers in the US.
Having an ISO 2022-type approach means that not only do I have to
keep track of all the complexity that Unicode requires but I must
*also* deal with the additional headache of the bookkeeping
associated with the multiple encodings (converting data back and
forth, among other things) *and* the bookkeeping of maintaining the
state information. If I'm writing a word processor, it means I have
to be prepared for the document to switch character sets halfway
through.
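
To make the bookkeeping concrete, here is a hedged sketch of the very first thing such a word processor would need: splitting a byte stream into runs, each tagged with the character set in force. It handles only the four escape sequences of ISO-2022-JP; full ISO 2022 has many more. The function name split_runs and the charset labels are invented for the example.

```python
# Escape sequences used by ISO-2022-JP and the character set each selects.
ESCAPES = {
    b"\x1b(B": "ascii",
    b"\x1b(J": "jis-x-0201-roman",
    b"\x1b$@": "jis-x-0208-1978",
    b"\x1b$B": "jis-x-0208-1983",
}


def split_runs(data: bytes) -> list[tuple[str, bytes]]:
    """Split a byte stream into (charset, raw bytes) runs.

    The current charset is state that must be carried from the start of
    the stream; it cannot be recovered from the bytes of a run alone.
    """
    runs = []
    state = "ascii"            # ISO-2022-JP text starts in ASCII
    buf = bytearray()
    i = 0
    while i < len(data):
        seq = data[i:i + 3]
        if seq in ESCAPES:
            if buf:            # close the run that used the old charset
                runs.append((state, bytes(buf)))
                buf.clear()
            state = ESCAPES[seq]
            i += 3
        else:
            buf.append(data[i])
            i += 1
    if buf:
        runs.append((state, bytes(buf)))
    return runs


stream = b"Unicode \x1b$B4A;z\x1b(B text"
for charset, raw in split_runs(stream):
    print(charset, raw)
```

Every run still has to be converted through its own table before the text can be searched, sorted, or displayed, and every edit has to keep the escape sequences consistent.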
In other words, you don't save effort at all. A state-based
multiple-encoding world is considerably *more* difficult for the
programmer. All you save is storage space on disk and bandwidth in
transmission, and in today's world, that's really not an enormous
cost anymore.
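
For a sense of scale on the storage question, here is a small sketch comparing encoded sizes. The sample string (Cyrillic mixed with ASCII punctuation and markup) and the choice of KOI8-R as the local encoding are assumptions made for the example, not anything from the original exchange.

```python
# Hypothetical sample: Cyrillic text with ASCII punctuation and markup.
sample = "Привет, мир! <b>Unicode</b>"

sizes = {
    name: len(sample.encode(name))
    for name in ("koi8_r", "utf_8", "utf_16_le")
}

for name, size in sizes.items():
    print(f"{name:10} {size:3d} bytes")
```

Pure Cyrillic text does double in UTF-8, but the ASCII portion stays at one byte per character, so mixed content such as markup or mail headers grows by less than that.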
3) For users of minority and rare languages and scripts, the fact
that there has to be additional effort to create and maintain
software which supports their particular needs means that their needs
are never met. (So far as I know, nobody ever implemented ISO 2022
in all its glory; they just had a specific market they wanted to
focus on and stuck there.) Large companies aren't willing to invest
that effort for small markets, so there isn't support at the system
level, and shoe-horning support into the system by a third party is
difficult if not impossible. (I know whereof I speak, having written
the Deseret Language Kit for Mac OS 8.) With the Unicode approach,
since you get every script and language for free, additional scripts
and languages can be supported via add-ons with minimal effort. Even
third-party add-ons will work in most cases with relatively little
effort.
========
John H. Jenkins
jenkins@apple.com
jhjenkins@mac.com
http://homepage.mac.com/jhjenkins/