From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Oct 02 2007 - 10:29:53 CST
Mark E. Shoulson wrote:
> And nobody is seriously proposing a
> Next-Generation Unicode in which the so-called "Cleanicode" (Unicode
> where everything is done *right*) is implemented from scratch. Such a
> radical change would not be worth the pain of implementing it.
The same argument used against Dmitry's point about the need to encode
only things that *are* actually used could also be applied to your argument
here. Even though there's still no case for a radical change in the way
Unicode encodes text, there's no guarantee that the Unicode or ISO 10646
standards are meant to be eternal (though most probably any replacement lies
in the very long term...).
Like all other standards, these standards have a lifetime. They will work
and will be used until someone demonstrates that there's a superior, more
consistent way to handle text, *and* those proposing a newer standard have
convinced enough other people to change the way they handle text by adopting
the newer proposal as their core encoding for handling everything.
But trying to convince people to shift to another core standard will face
the same issues: to convince people to make the change, the newer standard
would need to define and implement conversion rules that allow good
interoperability with the *huge* volume of texts and applications that will
still depend on Unicode *only* for a very long time.
So the designer of the new "Unicode II" or "Cleanicode" or "Next-Generation"
standard (whatever its name) will have to face the same problem as Unicode:
handling lots of roundtrip compatibility with the best and most widely used
standards of the past. That means the new standard will also retain various
tricks needed for compatibility (so it too will have things encoded like
"compatibility characters" (or other entities), not recommended for normal
use, but still valid for a long time, until everything works using only the
core standard with its "canonical" strings!).
And they will also have to convince not only the users of the standard for
encoding text, but also the designers of the countless other standards that
have adopted Unicode or ISO 10646 as their core encoding, whose support is
now mandatory (if not the only encoding they support now); plus they will
have to convince software implementers to adapt their software to support
these changes, and convince users to buy, install and use the new software
(and bear the cost of this upgrade).
I can't make any well-defined estimate of the cost of a conversion from
Unicode/ISO 10646 to something else, but it would be tremendous worldwide
(really many, many billions of dollars or euros or pounds or other
currencies) and would affect almost everybody on earth in their daily life,
due to the huge number of applications and objects that now depend on a
correct implementation of Unicode and ISO 10646.
Nothing will prevent a newer text-encoding "standard" from being modelled,
implemented and used, but it will remain confined to small local areas, for
specific needs, within small communities of users having little interaction
with the rest of the world.
The only thing that can seriously happen now is the development of several
alternate encodings for scripts that are rarely used outside of these small
communities of specialists, or for scripts that are still not encoded (and
where Unicode or ISO 10646 should not interfere before there's some wide
agreement among all users in these communities, *and* they request the
encoding within the UCS to facilitate interoperability with applications and
systems made by others outside these communities). In those areas, there may
exist errors that *may* be partially corrected in Unicode.
(Note: when I say "small" communities, I'm not speaking about the number of
people needing the requested encoding, but about the number of people and
applications actually using it. This includes, for example, the Burmese
community, which is quite large but currently has lots of difficulties with
the current encoding of its script, so that the script is still considered
by them as not encoded, even though it is already part of the standard. The
way they perceive it is that some parts of the existing standard may be
kept, but other parts would need to be "deprecated" or "not recommended" for
general use of the script, because they will cause unsolvable
interoperability problems. But anyway, Unicode will not change the existing
encoding; the only thing it will do is to *add* other characters with better
behaviour and properties, where this is needed *and* demonstrated by actual
use in some other non-standard encodings, with good roundtrip compatibility
with the best practices demonstrated in those external encodings.)
For now I see no justification for changing the standard: if it's not
suitable as the core encoding for implementing some text-handling algorithm
in some application, nothing prevents that application from implementing
another core encoding for local use, with conversion routines for its input
and output, if that facilitates the work it needs to do internally. This
includes the possibility of using other character properties internally, or
other decompositions, or another normalization with a different ordering of
the encoded entities, or supplementing the encoded entities with other
entities than just characters.
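For example (just a minimal Python sketch; the internal form chosen here is
an arbitrary illustration of mine, not anything mandated by the standard),
an application can keep its own internal representation and still exchange
plain, normalized Unicode at its boundaries:

    import unicodedata

    # Internal form (an arbitrary choice for this sketch): a plain list of
    # NFD code points, which the application could supplement with its own
    # properties or decompositions without affecting the interchange form.
    def to_internal(text):
        return [ord(c) for c in unicodedata.normalize("NFD", text)]

    # At the input/output boundaries, convert back to a standard
    # normalization form so the rest of the world only ever sees ordinary
    # Unicode text.
    def to_interchange(codepoints):
        return unicodedata.normalize("NFC", "".join(map(chr, codepoints)))

    text = "côté"
    assert to_interchange(to_internal(text)) == unicodedata.normalize("NFC", text)

Only the conversion routines at the boundaries matter for interoperability;
the internal form never leaks outside the application.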
Just look at the UCA algorithm, which makes exactly such internal transforms
to the encoded text by converting characters into collation keys, which are
other entities that do not behave like characters: even though the UCA
algorithm is standardized, the entities it handles are not standardized in a
mandatory way (they are "tailorable" everywhere), and it does not create a
new standard encoding meant for general data interchange (even the
documented collation weights in the DUCET are mutable across Unicode
versions).
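As a small illustration of that point (a sketch using the C library's
collation through Python's standard locale module, not the UCA or the DUCET
themselves; the key bytes are implementation-defined, which is exactly the
point):

    import locale

    # Pick up whatever collation tailoring the environment provides.
    locale.setlocale(locale.LC_COLLATE, "")

    words = ["cote", "coté", "côte", "côté"]

    # strxfrm() turns each string into an opaque sort key: comparing the
    # keys gives the locale's collation order, but the keys themselves are
    # not characters and are not meant to be stored or interchanged as text.
    print(sorted(words, key=locale.strxfrm))

The sort keys differ between platforms, library versions and tailorings,
just as DUCET weights may change between Unicode versions.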