>I recently discovered Unicode and I must say that it is great! I found
>out that the lower 8 bits of Unicode are backwards compatible with
>ISO 8859-1 (Latin-1). Thus, if the high byte is zero, we would not
>really have to transmit it in messages. UTF-8 and UTF-7 do the trick
>for the old 7-bit ASCII set, but they render Latin-1 codes that have
>the high bit set unreadable by non-Unicode-aware presentation
>programs. Also, UTF-8 and UTF-7 require me to change all my ISO Latin-1
>texts to UTF. This is not satisfactory for a European who has produced
>lots of text in Latin-1 and who depends on Latin-1 aware but UTF-7/8
>unaware software. I wonder whether there is an encoding like UTF-7 that
>would allow all lower eight bits to be set.
>I would like to know (1) if others share the concern that one UTF is
>missing, (2) if there are proposals out already, and (3)
>if such a proposal (much like UTF-7) would have a chance to be accepted
>by whoever is in charge of the UTF series (Unicode org? ISO?).
1) Yes. There is no way I can use UTF-8 on my system, in the way UTF-8 is
defined. I have large amounts of text encoded in ISO 8859-1 (which is
equivalent to the first 256 codes of UCS and Unicode) as well as applications
that I expect will not be fixed for a very long time. I see it as very
unfortunate that UTF-8 was defined to be compatible with ASCII, but not with
ISO 8859-1. Both are true subsets of UCS and in very common use. It would
have been better either to define a truly compact byte-sequence encoding
that is not compatible with ASCII at all, or one compatible with the first
256 codes of UCS as 8-bit characters (which covers both ASCII and ISO 8859-1).
I expect many Latin-based languages would get much more compact storage
of text if UTF-8 were ISO 8859-1 compatible.
2, 3) I have not seen any proposals, and the people I have talked to about
doing something about this have avoided the problem.
Probably most do not want one more UTF format, however bad UTF-8 is,
BUT it is entirely possible to define a way of using UTF-8 that is nearly
completely compatible with ISO 8859-1. I call this "adaptive UTF-8".
It ought to be acceptable to most people and is what I think we should
use.
It works like this:
UTF-8 uses sequences of bytes in the range 128-255 to encode UCS characters.
These sequences are defined in such a way that a UTF-8 encoded character
sequence, when viewed as 8-bit ISO 8859-1 bytes, will seldom look like
normal text. This means that a UTF-8 sequence can be recognized as UTF-8
within otherwise normal text, and several people have suggested this as a
way to tell whether a text is UTF-8 or not.
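For instance, a minimal sketch of such a check in Python (the name
looks_like_utf8 is my own, chosen for illustration; this is just the usual
"does it decode as strict UTF-8" test, not anybody's official detector):

    def looks_like_utf8(data: bytes) -> bool:
        # Text that decodes cleanly as strict UTF-8 *and* contains at least
        # one byte above 127 is almost certainly UTF-8 rather than ISO 8859-1.
        try:
            data.decode('utf-8')
        except UnicodeDecodeError:
            return False
        return any(b >= 0x80 for b in data)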
But the same property can also be used to make UTF-8 nearly ISO 8859-1
compatible, by letting the reading/writing routines for UTF-8 be adaptive.
When reading: if a sequence of bytes is a correct UTF-8 encoding, decode it
as UTF-8; if not, use each byte as itself (just as is done for all byte
values below 128).
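Here is a rough sketch of such an adaptive reader in Python; the name
adaptive_decode and the exact structure are my own illustration of the rule
above, not the code actually used in the web server mentioned below:

    def adaptive_decode(data: bytes) -> str:
        # Decode as UTF-8 wherever the bytes form a valid UTF-8 sequence;
        # otherwise take each byte as its own ISO 8859-1 character.
        out = []
        i = 0
        while i < len(data):
            b = data[i]
            if b < 0x80:                    # plain ASCII, same in both encodings
                out.append(chr(b))
                i += 1
                continue
            # How long would a UTF-8 sequence starting with this byte be?
            if 0xC2 <= b <= 0xDF:
                length = 2
            elif 0xE0 <= b <= 0xEF:
                length = 3
            elif 0xF0 <= b <= 0xF4:
                length = 4
            else:
                length = 0                  # can never start a valid sequence
            chunk = data[i:i + length]
            if length and len(chunk) == length:
                try:
                    out.append(chunk.decode('utf-8'))  # valid UTF-8: decode it
                    i += length
                    continue
                except UnicodeDecodeError:
                    pass                    # not valid UTF-8 after all
            out.append(chr(b))              # fall back: the byte is Latin-1
            i += 1
        return ''.join(out)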
When writing: if the code value is below 256 and the resulting byte sequence
does not look like a UTF-8 encoding, write the byte itself; otherwise
encode it using UTF-8.
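And a matching sketch of the adaptive writer, again in Python with names of
my own invention (adaptive_encode, _starts_valid_utf8); the lookahead over
the next few characters is my reading of "does not look like a UTF-8
encoding", not a definitive rule:

    def _starts_valid_utf8(b: bytes) -> bool:
        # True if b begins with a valid multi-byte UTF-8 sequence.
        if not b:
            return False
        lead = b[0]
        if 0xC2 <= lead <= 0xDF:
            length = 2
        elif 0xE0 <= lead <= 0xEF:
            length = 3
        elif 0xF0 <= lead <= 0xF4:
            length = 4
        else:
            return False
        if len(b) < length:
            return False
        try:
            b[:length].decode('utf-8')
            return True
        except UnicodeDecodeError:
            return False

    def adaptive_encode(text: str) -> bytes:
        out = bytearray()
        for i, ch in enumerate(text):
            cp = ord(ch)
            if cp < 0x80:
                out.append(cp)              # ASCII: one byte, never ambiguous
            elif cp < 0x100:
                # Collect the raw bytes this and the next few Latin-1 characters
                # would produce; if they would read as real UTF-8, escape this one.
                run = bytearray([cp])
                for nxt in text[i + 1:i + 4]:
                    if 0x80 <= ord(nxt) < 0x100:
                        run.append(ord(nxt))
                    else:
                        break
                if _starts_valid_utf8(bytes(run)):
                    out += ch.encode('utf-8')   # ambiguous: write real UTF-8
                else:
                    out.append(cp)              # safe: write the Latin-1 byte
            else:
                out += ch.encode('utf-8')       # above U+00FF: always UTF-8
        return bytes(out)

With this pair, adaptive_decode(adaptive_encode(s)) should give back s, while
plain Latin-1 text passes through unchanged except for the rare byte runs
that happen to look like valid UTF-8.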
The code for the above is fairly easy to write. I already use the reading code
in my web server.
-
That is the only solution I can think of that could be acceptable both to those
who only need UTF-8 and to those who need ISO 8859-1 compatibility or a more
compact encoding.
I am willing to write an RFC about how it is done and example code for the
reader/writer.
Dan