We have had this Latin-1-compatible-UTF discussion several times before;
it is certainly in no way a new idea (check the news:comp.std.internat
archives):
- You will always need new software, no matter whether you use UTF-8 or
any fancy encoding in which ISO 8859-1 files do not have to be recoded.
So don't overestimate the practical advantages of the illusion of
backwards compatibility. The advantage of backwards compatibility with
ASCII in UTF-8 is only important because a number of ASCII characters
such as NUL and SOLIDUS have special functions in software that is
otherwise completely ignorant of the character set. No Latin-1
character has such special semantics in any software I am aware
of (I have yet to see a SHY implementation that can't be deactivated
easily).
- UTF-8 has a large number of very neat properties that you cannot get
with any of the proposals for a Latin-1 compatible encoding,
especially the combination of self-synchronization, compactness
(at most 3 bytes per BMP character) and the preservation of the UCS-4
lexical string order (important for things such as B-trees in DBMSs).
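
To make the UTF-8 properties named above concrete, here is a small
Python sketch. The helper name next_boundary and the sample strings are
made up for illustration; the point is only that byte-wise sorting of
UTF-8 agrees with code point order, and that a reader dropped into the
middle of a character can find the next character start by skipping
continuation bytes (those of the form 10xxxxxx).

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos at which a UTF-8 character starts.

    Continuation bytes always look like 10xxxxxx, so they are skipped.
    """
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# Lexical order: sorting by raw UTF-8 bytes matches sorting by code point.
words = ["z", "\u00e9", "\u20ac", "a", "\U00010348", "\uffff"]
by_code_point = sorted(words, key=lambda w: [ord(c) for c in w])
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# Self-synchronization: start in the middle of the 3-byte EURO SIGN.
data = "a\u20acb".encode("utf-8")      # 61 E2 82 AC 62
resync = next_boundary(data, 2)        # index 2 is a continuation byte
tail = data[resync:].decode("utf-8")   # decodes cleanly from there

# Compactness: the last BMP code point still fits in 3 bytes.
bmp_max_len = len("\uffff".encode("utf-8"))
```

A B-tree that compares keys with plain memcmp() therefore keeps the
same order as one that compares decoded code points.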
If you really need a Latin-1 compatible UTF, then just use UTF-7 but do
not transform the characters in the 0x80-0xff range. This is a
straightforward modification of UTF-7 and it costs you just one or two
bytes to change in a UTF-7 implementation. This technique is so obvious
and trivial that it is not even worth writing a formal specification
for it.
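
For illustration only, the UTF-7 variation described above could be
sketched in Python roughly as follows. This is not a specification and
not a patch to any real UTF-7 implementation: the function names are
invented, and instead of changing a byte or two inside a codec it wraps
Python's stock "utf-7" codec, passing runs of U+0080..U+00FF through as
raw Latin-1 bytes and handing everything else to UTF-7.

```python
import re

# Runs of characters (or bytes) in the 0x80-0xff range pass through as-is.
_HIGH_CHARS = re.compile("([\u0080-\u00ff]+)")
_HIGH_BYTES = re.compile(rb"([\x80-\xff]+)")


def encode_latin1_utf7(text: str) -> bytes:
    """Hypothetical encoder: Latin-1 high bytes raw, the rest as UTF-7."""
    out = bytearray()
    for run in _HIGH_CHARS.split(text):
        if not run:
            continue
        if "\u0080" <= run[0] <= "\u00ff":
            out += run.encode("latin-1")   # untransformed 0x80-0xff
        else:
            out += run.encode("utf-7")     # ASCII and everything above U+00FF
    return bytes(out)


def decode_latin1_utf7(data: bytes) -> str:
    """Hypothetical decoder, the inverse of encode_latin1_utf7."""
    parts = []
    for run in _HIGH_BYTES.split(data):
        if not run:
            continue
        if run[0] >= 0x80:
            parts.append(run.decode("latin-1"))
        else:
            parts.append(run.decode("utf-7"))
    return "".join(parts)
```

A pure ISO 8859-1 file that contains no "+" decodes unchanged under
this scheme, which is the whole of the claimed compatibility; any
character above U+00FF still costs a base64 shift sequence.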
I hope it will not become popular. Another UCS encoding is certainly not
what the world has been waiting for.
Markus
-- Markus G. Kuhn, Security Group, Computer Lab, Cambridge University, UK email: mkuhn at acm.org, home page: <http://www.cl.cam.ac.uk/~mgk25/>