Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Gunther Schadow (gunther@aurora.rg.iupui.edu)
Date: Tue Aug 25 1998 - 11:19:22 EDT


Kevin,

> Okay, Gunther, here's my take on this. You want to use Unicode, but you
> have lots of Latin-1 text you want to still be able to use. Your idea
> is to define a new UTF-8 variant that states that illegal sequences
> should be interpreted as Latin-1 bytes. You will then declare all your
> data to be "UTF-8x" or whatever you call it.

Well, that's what I understand Dan Oscarson's idea of "adaptive
UTF-8" to be. I am more in favor of a trivial variant of UTF-7 that
drops the (totally unnecessary) restriction to 7-bit US-ASCII and
uses an escape character that is less frequent than the plus sign '+'.
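
To make that a bit more concrete, here is a rough sketch in Java of
what such an encoder could look like. It is purely illustrative: the
'~' escape and the UTF-7-style '-' terminator are placeholders, not a
proposal for the actual characters.

  import java.io.ByteArrayOutputStream;
  import java.util.Base64;

  public class UtfSaneSketch {
      // Placeholder escape character; the whole point is to pick
      // something rarer than UTF-7's '+'.
      static final char ESC = '~';

      // Latin-1 characters pass through as themselves; anything above
      // U+00FF (and the escape character itself) is collected into a
      // run and written as ESC + base64(UTF-16BE code units) + '-'.
      static String encode(String s) {
          StringBuilder out = new StringBuilder();
          int i = 0;
          while (i < s.length()) {
              char c = s.charAt(i);
              if (c <= 0xFF && c != ESC) {
                  out.append(c);
                  i++;
              } else {
                  ByteArrayOutputStream buf = new ByteArrayOutputStream();
                  while (i < s.length()
                          && (s.charAt(i) > 0xFF || s.charAt(i) == ESC)) {
                      char u = s.charAt(i++);
                      buf.write((u >> 8) & 0xFF);
                      buf.write(u & 0xFF);
                  }
                  out.append(ESC)
                     .append(Base64.getEncoder().withoutPadding()
                                   .encodeToString(buf.toByteArray()))
                     .append('-');
              }
          }
          return out.toString();
      }

      public static void main(String[] args) {
          // Latin-1 text survives byte for byte; only the Greek word
          // needs an escape sequence.
          System.out.println(encode("Grüße, \u03ba\u03cc\u03c3\u03bc\u03b5"));
      }
  }

The important property is that a Latin-1-only program sees nothing but
ordinary Latin-1 bytes plus the occasional escape-and-base-64 run that
it can pass through untouched.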

> At first this may seem like a good idea, but it falls down on a lot
> of counts.
>
> 1) UTF-8x is not as neatly synchronising. For example, you can't
> tell the length of a character by looking at its first byte,
> and it's impossible to step backwards one character in the
> stream without rewinding to the start.

I don't think that (re)synchronization is an issue when you design a
UTF. Why? If you have programs that only know ISO Latin-1 (or ASCII),
there is no point in their knowing whether something is a 16-bit
Unicode character; they will treat the 16-bit character codes as
opaque. With UTF-7 or UTF-sane they will simply show me the escape
character and the base-64 sequence.

Those programs that do handle 16-bit Unicode will presumably convert
everything to 16-bit Unicode before doing any work. For instance, a
Java String is an array of 16-bit Unicode characters no matter how
the stream was encoded, isn't that true? Stepping back and forth in a
Java String is not the UTF's business.
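
For example (just a minimal sketch; the file name and the charset are
placeholders), the decoding happens once, at the boundary:

  import java.io.BufferedReader;
  import java.io.FileInputStream;
  import java.io.IOException;
  import java.io.InputStreamReader;

  public class DecodeAtTheBoundary {
      public static void main(String[] args) throws IOException {
          // Decode whatever transfer encoding the stream uses...
          BufferedReader in = new BufferedReader(
                  new InputStreamReader(new FileInputStream("data.txt"),
                                        "UTF-8"));
          String line = in.readLine();
          // ...and from here on the program only sees 16-bit Java chars;
          // stepping backwards or forwards is an operation on the
          // String, not on the transfer encoding.
          System.out.println(line == null ? 0 : line.length());
          in.close();
      }
  }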

> 2) You have duplicate encodings of lots of characters. For example,
> a pound sign can be encoded as A3 or C2 A3. This is going to
> render searches, sorts and comparisons immensely problematical.
        
> Also, looking at a piece of text, how will you be able to tell
> that some pound signs are encoded one way, and some the other?
> This will catch users out.

This applies only to adaptive UTF-8, which is not what I propose.

> 3) You will need to purify your data into either Latin-1 or UTF-8
> proper anyway before passing it on to other people. Other systems
> will not understand UTF-8x.

Well, if I were willing to accept that, I wouldn't cry so loudly on
this list. Of course I want UTF-sane to be a public spec and actually
implemented. Because my use case is for a major international EDI
standards organization, I would make sure that this spec is favored
by our standard.

> Unfortunately, this sort of scheme is encouraging the laziness
> of programming that will probably lead to programs taking this
> UTF-8x data, tagging it as text/utf-8 and sending it off to
> other systems.

No, no, that's not what will happen. But I see you are expressing
concerns about adaptive UTF-8, concerns which I share to some extent.
        
> 4) There are legal Latin-1 sequences that would end up being
> interpreted as UTF-8. For example, you might have "DF AB" in a
> Latin-1 file, representing a German beta-S followed by
> a double angle bracket. UTF-8x would interpret this as U+07EB.

Ditto. Another argument against adaptive UTF-8, isn't that true, Dan?

> Defining an encoding scheme that cannot even guarantee the format of
> a pound sign is not going to win you any friends.

Agreed.

> This sort of "character-by-character" encoding autodetection is
> madness. What you really want to do is what Japanese WWW users
> already do - autodetect of the encoding of a file as a
> whole. Autodetection of the format of a data stream is fairly easy -

> make your applications autodetect
  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> the format of incoming text
> streams and convert everything to some UCS variant.

That's exactly what I want to avoid. I am not talking about *my*
applications, for which *I* have the source code. I am talking about
legacy systems and applications that are no longer maintained, or
that are so heavily based on the old dogma "sizeof char == sizeof
byte" that changing them is practically infeasible. You know, there
are still assembler programs with a 25-year history running in big
organizations.

> This will work reasonably well. Far better would be to move to some sort of
> scheme whereby you can tag data externally as Latin-1 or UTF-8 (or any other
> encoding), just as HTML 4.0 does.

That is just a little wrapper protocol, such as MIME. But I am not
talking about brand-new protocols; I am talking about a compatibility
spec that allows software to become as Unicode-clean as possible
without having to be changed.
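
External tagging looks roughly like this (a naive sketch; the header
value and the parsing are only illustrative), but it presupposes that
the receiving software is changed to look at the tag, which is
exactly what I cannot do:

  import java.nio.charset.Charset;

  public class CharsetFromContentType {
      // Naive illustration only; real MIME header parsing is more
      // involved than splitting on ';'.
      static Charset charsetOf(String contentType) {
          for (String param : contentType.split(";")) {
              String p = param.trim();
              if (p.toLowerCase().startsWith("charset=")) {
                  return Charset.forName(
                          p.substring("charset=".length()).trim());
              }
          }
          return Charset.forName("ISO-8859-1");  // assumed default
      }

      public static void main(String[] args) {
          // The header value is just an example of external tagging.
          System.out.println(charsetOf("text/plain; charset=ISO-8859-1"));
      }
  }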

regards
-Gunther

Gunther Schadow ----------------------------------- http://aurora.rg.iupui.edu
Regenstrief Institute for Health Care
1001 W 10th Street RG5, Indianapolis IN 46202, Phone: (317) 630 7960
schadow@aurora.rg.iupui.edu ---------------------- #include <usual/disclaimer>


