From: Kenneth Whistler (kenw@sybase.com)
Date: Thu May 19 2005 - 14:54:03 CDT
Dean Snyder suggested:
> Here, off the top of my head, are some problems with Unicode which,
> cumulatively, could prove its undoing:
>
> Needless complexity
Complex, indubitably.
But would you care to document the claim that the complexity
is "needless"?
> Stateful mechanisms
For bidirectional text, yes.
But all extant schemes for the representation of bidirectional
text involve stateful mechanisms. Would you care to supplant
the last decade's work by the bidirectional committee and
suggest a non-stateful mechanism that meets the same requirements
for the representation of bidirectional text?
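To make "stateful" concrete, here is a minimal Python sketch, assuming only the standard `unicodedata` module. It tracks how the explicit embedding and override controls (LRE, RLE, LRO, RLO) open a level that stays in force until a matching PDF closes it. The helper name `embedding_depths` is hypothetical, and this is a toy illustration of carried state, not an implementation of the Unicode Bidirectional Algorithm.

```python
import unicodedata

# Bidi classes of the explicit controls that open an embedding or override.
OPENERS = {"LRE", "RLE", "LRO", "RLO"}

def embedding_depths(text):
    """Return the explicit nesting depth in force at each character.

    Hypothetical helper: it only counts open/close controls and ignores
    every other rule of the real algorithm.
    """
    depth = 0
    depths = []
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class in OPENERS:
            depth += 1          # state changes here ...
        elif bidi_class == "PDF" and depth > 0:
            depth -= 1          # ... and persists until the matching PDF
        depths.append(depth)
    return depths

# U+202B is RIGHT-TO-LEFT EMBEDDING, U+202C is POP DIRECTIONAL FORMATTING.
sample = "a\u202bb\u202cc"
print(embedding_depths(sample))
```

The treatment of "b" cannot be decided from "b" alone; it depends on a control character that appeared earlier in the stream, which is what makes the mechanism stateful.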
> No support for a clean division between text and meta-text
Would you care to suggest replacements for such widely
implemented W3C standards as HTML and XML?
> Errors in actual content
Well, there's that. But any list longer than 30 items generally
has at least 1 error in it.
Generations of Chinese scholars have spent 2500 years trying
to get "the" list of Chinese characters correct. Never have,
never will.
> Legacy sludge
This is the point on which I (and a number of other Unicode
participants) are most likely to agree with you. The legacy
sludge in Unicode was the cost of doing business, frankly.
Legacy compatibility was what made the standard successful,
because it could and can interoperate with the large number of bizarre
experiments in character encoding which preceded it.
At some point, probably measured more in decades than in years,
the importance of all that legacy sludge will drop to the
level of irrelevance except for dedicated archivists and
digital archaeologists. When that happens, some bright,
young generation is going to say, "Hey, we could clean all
of that sludge out of Unicode and have a much more
consistent and easier to implement character encoding
standard. Whadya think? Should we try making it happen?"
And chances are, they *will* make it happen, eventually.
> Irreversibility
Irreversibility is the nature of standards. Nothing is more
harmful to a standard -- particularly a widely implemented
standard -- than trying to retract things from it that have
already been implemented. That is a fast track to fractionation
into incompatible, non-interworking, de facto variants of the
standard.
> >How will the "something better" solve these problems without
> >introducing new ones?
>
> Subsequent encoding efforts will be better because they will have
> learned from the mistakes of earlier encoders ;-)
Sure, but that doesn't answer Doug's question. You have simply
*assumed* here that subsequent encoding efforts wouldn't end
up introducing new problems.
First of all, it should be obvious that any introduction of a
new universal encoding will result in its own new "legacy"
problem for figuring out how to deal with (by then) multi-petabytes
of Unicode data, and with globally distributed software that manipulates
text encoded in Unicode.
> Probably the single most important, and extremely simple, step to a
> better encoding would be to force all encoded characters to be 4 bytes.
Naive in the extreme. You do realize, of course, that the entire
structure of the internet depends on protocols that manipulate
8-bit characters, with mandated direction to standardize their
Unicode support on UTF-8?
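A minimal sketch of the trade-off, assuming Python 3 and its built-in codecs, with UTF-32 standing in for the proposed fixed 4-byte form. The helper `encoded_sizes` and the sample strings are hypothetical, chosen only for illustration.

```python
# Compare UTF-8 with a fixed 4-byte form (UTF-32) for a few sample strings.
samples = {
    "ascii": "GET /index.html HTTP/1.1",
    "greek": "\u03b1\u03b2\u03b3",
    "cjk": "\u6f22\u5b57",
}

def encoded_sizes(text):
    """Return (utf8_bytes, utf32_bytes) for text, with no byte order mark."""
    return len(text.encode("utf-8")), len(text.encode("utf-32-be"))

for name, text in samples.items():
    utf8_len, utf32_len = encoded_sizes(text)
    print(f"{name}: {utf8_len} bytes in UTF-8, {utf32_len} bytes in UTF-32")

# ASCII text is byte-for-byte identical in UTF-8, so protocols built
# around 8-bit characters see exactly what they saw before.
assert "GET".encode("utf-8") == "GET".encode("ascii")

# The 4-byte form embeds zero bytes in every ASCII character, which
# byte-oriented parsers commonly treat as terminators.
assert b"\x00" in "GET".encode("utf-32-be")
```

The size difference matters less than the last two assertions: a fixed 4-byte form changes the bytes of plain ASCII protocol text, whereas UTF-8 leaves them untouched.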
> >How will it meet the challenge of transcoding untold amounts
> >of "legacy" Unicode data?
>
> Transcoding Unicode data into some new standard could at least be done
> in ways similar to the ways pre-Unicode data is being transcoded into
> Unicode now - an almost trivial pursuit.
An "almost trivial pursuit" that employs hundreds of full-time
programmers, often working on very intractable problems.
And these "trivial" problems don't go away. Every time somebody
else decides that some standard isn't "irreversible" and needs
to be fixed or extended, it creates another class of conversion
problems to be dealt with to keep information technology chugging
away. The latest nightmare has been dealing with GB 18030.
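As a small illustration of what a converter has to handle, here is a sketch assuming Python 3's built-in `gb18030` codec. It only shows that GB 18030 is a variable-width encoding with 1-, 2-, and 4-byte forms; the helper name `gb18030_length` is hypothetical, and nothing here reflects the mapping-table and versioning issues that make real conversion work hard.

```python
def gb18030_length(ch):
    """Return how many bytes one character occupies in GB 18030."""
    return len(ch.encode("gb18030"))

for ch in ["A", "\u4e2d", chr(0x10000)]:
    encoded = ch.encode("gb18030")
    # Round-tripping through the codec must give back the same character.
    assert encoded.decode("gb18030") == ch
    print(f"U+{ord(ch):04X}: {gb18030_length(ch)} byte(s) -> {encoded.hex()}")
```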
> But I do
> believe that hubris, intolerable in such matters, has unfortunately led
> to short-sighted mistakes in both the architecture and content of
> Unicode, mistakes Unicode is saddled with in perpetuity.
Mistakes in content we can argue about, I suppose.
But how has "hubris" led to "short-sighted mistakes in ... the
architecture"?
The most serious mistake I see in the architecture resulted from
the need to assign surrogates at D800..DFFF, instead of F800..FFFF.
But it wasn't "hubris" that led to the prior assignment of
a bunch of compatibility characters at FE30..FFEF -- just a lack
of foresight about the eventual form of the surrogate mechanism.
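For reference, the surrogate mechanism at D800..DFFF works as in the following Python sketch. The function names `to_surrogates` and `from_surrogates` are hypothetical; the arithmetic is the standard UTF-16 mapping, and the last lines cross-check it against Python's own `utf-16-be` codec.

```python
def to_surrogates(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    (high, low) UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits, D800..DBFF
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits, DC00..DFFF
    return high, low

def from_surrogates(high, low):
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x1D11E)         # U+1D11E MUSICAL SYMBOL G CLEF
print(f"{high:04X} {low:04X}")

# Cross-check against the built-in UTF-16 encoder.
expected = chr(0x1D11E).encode("utf-16-be")
assert expected == high.to_bytes(2, "big") + low.to_bytes(2, "big")
```

Because the surrogate range sits in the middle of the 16-bit space, code units above it (E000..FFFF) are still ordinary BMP characters, so UTF-16 code units do not sort in code point order; a range at the very top would have avoided that.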
> As just one example of the kind of architectural change that could drive
> new encoding schemes, one could propose an encoding design that self-
> references its own mutability, thereby redefining "stability" to include
> not only extensibility but also reversibility. This would be
> accomplished by dedicating as version indicators, e.g., 7 of the 32 bits
> in every 4 byte character.
Whew! You started off your list of problems that may prove the undoing
of Unicode with "needless complexity". And the first architectural
change you suggest is putting version indication stamps in 7 bits of
32 bit characters?! Any software engineer I know would hoot such
a proposal off the stage for introducing needless complexity into
string processing. Sorry, but that one is a nonstarter.
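A minimal sketch of what that would mean for string processing, in Python. The proposal does not say which 7 bits carry the version, so the layout here (version in the top 7 bits, code point in the low 25) is an assumption, and every name in the block is hypothetical.

```python
VERSION_BITS = 7
CODE_BITS = 32 - VERSION_BITS            # 25 bits left for the code point
CODE_MASK = (1 << CODE_BITS) - 1

def pack(code_point, version):
    """Build one 32-bit unit under the assumed layout."""
    return (version << CODE_BITS) | code_point

def unpack(unit):
    """Return (code_point, version) for one 32-bit unit."""
    return unit & CODE_MASK, unit >> CODE_BITS

def encode(text, version):
    return [pack(ord(ch), version) for ch in text]

def same_text(a, b):
    """Compare two encoded strings while ignoring the version stamps."""
    return [u & CODE_MASK for u in a] == [u & CODE_MASK for u in b]

v1 = encode("cat", 1)
v2 = encode("cat", 2)

# A plain comparison of the raw units now fails for identical text ...
assert v1 != v2
# ... so every comparison, search, hash, and sort has to mask each unit first.
assert same_text(v1, v2)
```

Even in this toy form, equality stops being a straight memory comparison, and a single string can in principle mix units stamped with different versions.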
--Ken
This archive was generated by hypermail 2.1.5 : Thu May 19 2005 - 14:55:00 CDT