Markus Kuhn wrote:
> UTF-8 has a large number of very neat properties that are not
> possible to get with any of the proposals for a Latin-1 compatible
> encoding, especially the combination of self-synchronization, the
> compactness (only up to 3 characters length) and the preservation of
> the UCS-4 lexical string order (important for things such as B-trees
My problem with UTF-8 is less that I have to handle my existing
single-byte Latin-1 characters as exceptions and more that UTF-8 uses
non-Latin-1 values in ISO 6429's C1 control character range 100xxxxx
(0x80-0x9F). It regularly screws up my xterm (so that it displays
boxes instead of letters and the tab stops end up in the weirdest
places) when I dump UTF-8 text onto it. If I display UTF-8 text on a
dumb Latin-1 browser, I cannot cut and paste it into a Unicode window
because the C1 characters get stripped. I have to teach every program
like less to pass the C1 codes through. And I cannot memorize that
=C3=9E stands for Þ the way &copy; stands for © because I don't even
get to see the =9E (it's not printable in CP1252, either).
UTF-8 encodes (payload bits: byte pattern):
11: 110xxxxx 10xxxxxx
16: 1110xxxx 10xxxxxx 10xxxxxx
21: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
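For concreteness, a minimal C sketch of an encoder following that
table (the function name and the 0-on-overflow convention are my own;
it does not reject surrogates or check for overlong inputs):

    #include <stddef.h>
    #include <stdint.h>

    /* Encode one UCS code point per the UTF-8 table above.
       Returns the number of bytes written, or 0 if c needs more
       than 21 bits. */
    static size_t utf8_encode(uint32_t c, unsigned char *out)
    {
        if (c < 0x80) {              /*  7 bits: 0xxxxxxx */
            out[0] = (unsigned char)c;
            return 1;
        } else if (c < 0x800) {      /* 11 bits: 110xxxxx 10xxxxxx */
            out[0] = (unsigned char)(0xC0 | (c >> 6));
            out[1] = (unsigned char)(0x80 | (c & 0x3F));
            return 2;
        } else if (c < 0x10000) {    /* 16 bits: 1110xxxx 10xxxxxx x2 */
            out[0] = (unsigned char)(0xE0 | (c >> 12));
            out[1] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[2] = (unsigned char)(0x80 | (c & 0x3F));
            return 3;
        } else if (c < 0x200000) {   /* 21 bits: 11110xxx 10xxxxxx x3 */
            out[0] = (unsigned char)(0xF0 | (c >> 18));
            out[1] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[2] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[3] = (unsigned char)(0x80 | (c & 0x3F));
            return 4;
        }
        return 0;
    }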
One could have defined a different UTF that was more considerate of
ISO 2022, 4873, 6429 and 8859 systems and avoided all C1 codes, like this:
10: 110xxxxx 101xxxxx
14: 1110xxxx 101xxxxx 101xxxxx
18: 11110xxx 101xxxxx 101xxxxx 101xxxxx
22: 111110xx 101xxxxx 101xxxxx 101xxxxx 101xxxxx
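Here is what an encoder for this C1-free variant could look like, as
a sketch of the hypothetical scheme above (all names mine): the lead
bytes stay UTF-8-like, but each continuation byte 101xxxxx carries
only 5 payload bits, so no output byte ever lands in 0x80-0x9F:

    #include <stddef.h>
    #include <stdint.h>

    static size_t utf_c1free_encode(uint32_t c, unsigned char *out)
    {
        size_t n, i;
        if (c < 0x80) { out[0] = (unsigned char)c; return 1; } /* ASCII */
        else if (c < 0x400)    n = 2;  /* 10 bits: 110xxxxx 101xxxxx    */
        else if (c < 0x4000)   n = 3;  /* 14 bits: 1110xxxx 101xxxxx x2 */
        else if (c < 0x40000)  n = 4;  /* 18 bits: 11110xxx 101xxxxx x3 */
        else if (c < 0x400000) n = 5;  /* 22 bits: 111110xx 101xxxxx x4 */
        else return 0;
        for (i = n - 1; i > 0; i--) {  /* 5 bits per continuation byte */
            out[i] = (unsigned char)(0xA0 | (c & 0x1F));
            c >>= 5;
        }
        /* lead byte prefix: 0xC0, 0xE0, 0xF0, 0xF8 for n = 2..5 */
        out[0] = (unsigned char)((0xFFu << (8 - n)) | c);
        return n;
    }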
Want to trade away the =FF avoidance (0xFF reads as EOF = -1 to naive
C code) for better compaction and Latin-1 visibility? Try this:
10: 1010xxxx 11xxxxxx
15: 10110xxx 11xxxxxx 11xxxxxx
20: 101110xx 11xxxxxx 11xxxxxx 11xxxxxx
Really need the 21st bit for the private plane 16? Alright:
9: 10100xxx 11xxxxxx
15: 10101xxx 11xxxxxx 11xxxxxx
22: 1011xxxx 11xxxxxx 11xxxxxx 11xxxxxx
At most three bytes for the whole BMP, for more equal rights? Sure,
and save a branch:
16: 1010xxxx 11xxxxxx 11xxxxxx
22: 1011xxxx 11xxxxxx 11xxxxxx 11xxxxxx
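Decoding this last scheme shows where the branch is saved: bit 4 of
the lead byte alone selects the length. A sketch (names mine; it
assumes in[0] is a multibyte lead, ASCII bytes being passed through
elsewhere):

    #include <stddef.h>
    #include <stdint.h>

    static uint32_t twobranch_decode(const unsigned char *in, size_t *len)
    {
        size_t n = (in[0] & 0x10) ? 4 : 3;  /* 1011xxxx: 4, 1010xxxx: 3 */
        uint32_t c = in[0] & 0x0F;          /* 4 payload bits in the lead */
        for (size_t i = 1; i < n; i++)
            c = (c << 6) | (in[i] & 0x3F);  /* 6 bits per 11xxxxxx byte */
        *len = n;
        return c;                           /* 16 or 22 bits total */
    }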
You would not sacrifice any UTF-8 conveniences: self-synchronization,
lexical ordering and bit-shift simplicity would all remain. But the C1
avoidance would of course cost an occasional extra byte. I did
consider suggesting one of these as UTF-3, UTF-5, UTF-6 or UTF-9.
However, we are no longer in 1993, when UTF-8 (and UTF-16 with its odd
0x10000 offset that turned Unicode into a 20.1-bit charset with 17
planes) had yet to be coined. So we will rather have to enjoy the
current definitions carved in stone, live with them, and work around
the problems their imperfections bring, i.e., fix xterm & Co. and use
the alternative 7-bit escape sequences (ESC [ instead of the 8-bit
CSI =9B, for example) rather than the C1 controls.
> If you really need a Latin-1 compatible UTF, then just use UTF-7 but
> do not transform the characters in the 0x80-0xff range. This is a
> straightforward modification of UTF-7 and it costs you just one or
> two bytes to change in a UTF-7 implementation. This technique is so
> obvious and trivial that it is not even worth writing a formal
> specification for. I hope it will not become popular. Another
> UCS encoding is certainly not what the world has been waiting for.
Another trivial possibility would be to mix Latin-1 characters with
\u1234 Java escapes or &#12345; numerical character references.
Different strokes for different uses. But UTF-8 is the standard.
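For what it's worth, the writing side of such a mixture is a
three-liner (a sketch under my own conventions; only BMP characters
get the \uXXXX form):

    #include <stdint.h>
    #include <stdio.h>

    /* Emit Latin-1 characters as raw bytes, everything above U+00FF
       as a Java-style \uXXXX escape (BMP only in this sketch). */
    static void put_latin1_or_escape(uint32_t c, FILE *f)
    {
        if (c <= 0xFF)
            fputc((int)c, f);
        else
            fprintf(f, "\\u%04X", (unsigned)c);  /* U+20AC => \u20AC */
    }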
Dan.Oscarsson@trab.se wrote:
> It is fully possible to define a way to use UTF-8 that is nearly
> totally compatible with ISO 8859-1. I call this "adaptive UTF-8".
> This ought to be acceptable to most people and is what I think we
> should use. It works like this: UTF-8 uses sequences of bytes in
> the range 128-255 to encode UCS characters. These sequences are
> defined in a way that a UTF-8 encoded character sequence, when
> looked upon as 8-bit ISO 8859-1 encoded bytes, will seldom look like
> normal text.
Indeed. Viewed as Latin-1, UTF-8 lead bytes read as accented letters
and continuation bytes as symbols or invisible C1 controls: you
usually don't place a Latin-1 symbol after an accented capital letter,
or a whole row of them after an accented small letter.
> This means that a UTF-8 sequence can be identified as UTF-8
> encoding in normal text. This has been suggested by several people
> as a way to identify whether text is UTF-8 or not. But it can also
> be used to make UTF-8 nearly ISO 8859-1 compatible, by letting the
> reading/writing routines for UTF-8 be adaptive. When reading: if a
> sequence is a correct UTF-8 encoding sequence, decode it as UTF-8;
> if not, use the byte as itself (just as is done for all byte values
> below 128). When writing: if the code value is below 256 and the
> resulting byte sequence does not look like a UTF-8 encoding, write
> the byte itself; otherwise encode using UTF-8.
This requires some lookahead but is surely doable. If you really use
this for writing, then you may make a few Latin-1 readers happier, but
you will also upset UTF-8 readers whose software chokes or assumes a
different default (like Latin-0, i.e. ISO 8859-15). When coexisting
with standard UTF-8 files, you will have two representations to grep
for when searching for an accented character. And if you just want to
accept pre-UTF-8 texts liberally, then why not also accept the vast
number of Windows bullets and quotes from code page 1252?
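The reading half of this adaptive scheme is easy to sketch (a
hypothetical helper, all names mine; overlong sequences are not
rejected here, though a stricter version would fall back on those
too):

    #include <stddef.h>
    #include <stdint.h>

    /* Try to parse one well-formed UTF-8 sequence at s; on any
       failure fall back to the Latin-1 value of the single byte. */
    static uint32_t adaptive_getc(const unsigned char *s, size_t avail,
                                  size_t *used)
    {
        unsigned char b = s[0];
        uint32_t c;
        size_t n, i;

        *used = 1;                               /* fallback: one byte */
        if (b < 0x80) return b;                  /* plain ASCII */
        if      ((b & 0xE0) == 0xC0) { n = 2; c = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { n = 3; c = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { n = 4; c = b & 0x07; }
        else return b;                           /* stray byte: Latin-1 */
        if (n > avail) return b;                 /* truncated: Latin-1 */
        for (i = 1; i < n; i++) {
            if ((s[i] & 0xC0) != 0x80) return b; /* bad tail: Latin-1 */
            c = (c << 6) | (s[i] & 0x3F);
        }
        *used = n;
        return c;
    }

This is exactly where the lookahead sits: up to three bytes beyond the
current one must be inspected before the Latin-1 fallback can be ruled
out.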
> The code for the above is fairly easy to write. I already use the
> reading code in my web server. That is the only solution I can
> think of that could be acceptable both to those who only need UTF-8
> and to those who need compatibility with ISO 8859-1 or a more
> compact encoding.
SCSU <http://czyborra.com/scsu/> also accepts ISO 8859-1 transparently
and is definitely more compact, but it is in turn not transparent to
UTF-8 readers.
Murray Sargent <murrays@microsoft.com> wrote:
> Putting a UTF-8 BOM (0xEF 0xBB 0xBF) at the beginning of a UTF-8
> file is a good way to identify the file as UTF-8.
The =EF=BB=BF signature is indeed what the amendment to ISO 10646
suggests, but UTF-8 doesn't really need a byte order mark: its octet
stream has no byte order problems. And besides the hassle that
inserting and stripping signatures creates, many files are not allowed
to start with just any signature because they must begin with #! or
the like.
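Where the signature is wanted, detecting and stripping it is at least
cheap (a sketch):

    #include <string.h>

    /* Return 3 if the buffer starts with the EF BB BF signature,
       else 0, so callers can simply advance by the result. */
    static size_t skip_utf8_bom(const unsigned char *s, size_t len)
    {
        return (len >= 3 && memcmp(s, "\xEF\xBB\xBF", 3) == 0) ? 3 : 0;
    }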
Gunther Schadow had written:
> I recently discovered Unicode and I must say that it is great! I
> found out that the lower 8 bits of Unicode are backwards compatible
> with ISO 8859-1 (Latin-1). Thus, if the high byte is zero, we would
> not really have to transmit it in messages. UTF-8 and UTF-7 do the
> trick for the old 7-bit ASCII set but require me to render Latin-1
> codes that have the high bit set unreadable by non-Unicode-aware
> presentation programs. Also, UTF-8 and UTF-7 require me to change
> all my ISO Latin-1 texts to UTF.
The compatibility does fall short of expectations. But there are
always reasons for it, and ASCII doesn't have it that easy either:
UTF-8 is still bit-stripped by many mail gateways, you have to escape
plus signs even in plain ASCII text to convert it to UTF-7 (a literal
+ becomes +-), and you cannot grep for particular characters in UTF-7.
Cheers, Roman http://czyborra.com/