Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Kevin Bracey (kbracey@acorn.com)
Date: Tue Aug 25 1998 - 05:57:26 EDT


In message <9808241740.AA06104@unicode.org>
          Gunther Schadow <gunther@aurora.rg.iupui.edu> wrote:

>
> But may I please ask you (especially the US-residents among the
> fighters for political correctness) at least not to interfere with a
> call for a UTF that is as compatible as Unicode is by itself? I think
> that the issue with UTF-7 and UTF-8 is more about broadening the
> narrow Anglo-American view on the world than to narrow the beautiful
> global view of Unicode towards an Euro-centrism.
>

Okay, Gunther, here's my take on this. You want to use Unicode, but you
have lots of Latin-1 text that you still want to be able to use. Your
idea is to define a new UTF-8 variant in which illegal sequences are
interpreted as Latin-1 bytes. You will then declare all your data to be
"UTF-8x" or whatever you call it.

At first this may seem like a good idea, but it falls down on a lot
of counts.

    1) UTF-8x is not as neatly synchronising. For example, you can't
       tell the length of a character by looking at its first byte,
       and it's impossible to step backwards one character in the
       stream without rewinding to the start.
       
    2) You have duplicate encodings of lots of characters. For example,
       a pound sign can be encoded as A3 or C2 A3. This is going to
       render searches, sorts and comparisons immensely problematical.
       
       Also, looking at a piece of text, how will you be able to tell
       that some pound signs are encoded one way, and some the other?
       This will catch users out.
       
    3) You will need to purify your data into either Latin-1 or UTF-8
       proper anyway before passing it on to other people. Other systems
       will not understand UTF-8x.
       
       Unfortunately, this sort of scheme encourages the kind of lazy
       programming that will probably lead to programs taking this
       UTF-8x data, labelling it as plain UTF-8 and sending it off to
       other systems.
       
    4) There are legal Latin-1 sequences that would end up being
       interpreted as UTF-8. For example, you might have "DF AB" in a
       Latin-1 file, representing a German sharp s (U+00DF) followed by
       a left-pointing double angle quotation mark (U+00AB). UTF-8x
       would interpret this as the single character U+07EB.

Defining an encoding scheme that cannot even guarantee the format of
a pound sign is not going to win you any friends.
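
Points 2 and 4 above can be made concrete with a small sketch. "UTF-8x"
was never specified, so decode_utf8x below is a hypothetical reading of
the proposal: decode a well-formed UTF-8 sequence where there is one, and
otherwise treat the byte as Latin-1. It also assumes the modern 1 to 4
byte definition of UTF-8 as implemented by Python's strict decoder.

```python
def _expected_length(lead):
    """Length of the UTF-8 sequence a lead byte announces, or 0 if none."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or invalid lead byte


def decode_utf8x(data):
    """Hypothetical "UTF-8x" decoder (an assumption, not a real standard)."""
    out = []
    i = 0
    while i < len(data):
        n = _expected_length(data[i])
        chunk = data[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("no UTF-8 sequence starts here")
            out.append(chunk.decode("utf-8"))  # strict decoding
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(chr(data[i]))  # fall back to Latin-1 for one byte
            i += 1
    return "".join(out)


# Point 2: two different byte strings decode to the same pound sign.
print(decode_utf8x(b"\xa3") == decode_utf8x(b"\xc2\xa3"))

# Point 4: legal Latin-1 text is silently read as one UTF-8 character.
print(hex(ord(decode_utf8x(b"\xdf\xab"))))
```

A byte-wise search for A3 in such data would miss every pound sign stored
as C2 A3, which is the comparison problem described in point 2.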

This sort of "character-by-character" encoding autodetection is madness. What
you really want to do is what Japanese WWW users already do: autodetect the
encoding of a file as a whole. Autodetecting the format of a data stream is
fairly easy, so make your applications detect the format of incoming text
streams and convert everything to some UCS variant.
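
As a minimal sketch of whole-stream detection, assuming the only two
candidates are UTF-8 and Latin-1: try a strict UTF-8 decode of the entire
input and fall back to Latin-1 if it fails. The function name
detect_and_decode is made up for illustration. The heuristic can still be
fooled by a short Latin-1 file that happens to be valid UTF-8, but the
decision is made once per stream rather than once per character.

```python
def detect_and_decode(data):
    """Return (text, encoding_name) for a whole byte stream.

    Assumes the input is either UTF-8 or Latin-1; nothing else is tried.
    """
    try:
        # Strict decoding: any ill-formed sequence rejects the whole stream.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Every byte string is valid Latin-1, so this cannot fail.
        return data.decode("latin-1"), "latin-1"


text, encoding = detect_and_decode(b"\xa3" + b"5")
print(encoding)
```

Once decoded, the text is ordinary Unicode and can be written back out
in a single, well-defined encoding.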

This will work reasonably well. Far better would be to move to some sort of
scheme whereby you can tag data externally as Latin-1 or UTF-8 (or any other
encoding), just as HTML 4.0 does.
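
A sketch of the external-tagging approach, assuming the label arrives as
a MIME-style Content-Type value such as "text/plain; charset=ISO-8859-1".
The helper name decode_labelled is hypothetical; the parsing uses the
standard library's email.message.Message.

```python
from email.message import Message


def decode_labelled(body, content_type):
    """Decode bytes using the charset named in an external label."""
    msg = Message()
    msg["Content-Type"] = content_type
    charset = msg.get_content_charset()  # lower-cased name, or None
    if charset is None:
        raise ValueError("no charset label supplied")
    return body.decode(charset)


# The same two bytes mean different things under different labels.
print(decode_labelled(b"\xdf\xab", "text/plain; charset=ISO-8859-1"))
print(decode_labelled(b"\xdf\xab", "text/plain; charset=UTF-8"))
```

With a label there is nothing to guess: "DF AB" is a sharp s plus a
quotation mark when tagged as Latin-1, and U+07EB when tagged as UTF-8.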
    

-- 
Kevin Bracey, Senior Software Engineer
Acorn Computers Ltd                           Tel: +44 (0) 1223 725228
Acorn House, 645 Newmarket Road               Fax: +44 (0) 1223 725328
Cambridge, CB5 8PB, United Kingdom            WWW: http://www.acorn.co.uk/


