Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Chris Wendt (christw@microsoft.com)
Date: Mon Aug 24 1998 - 16:56:00 EDT


From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Date: Monday, August 24, 1998 4:11 AM

>Well, both the characters in iso 8859-1 and the code values for
>those characters are the same as in UCS. That is what I call a
>true subset (though looking at only the characters you can find many
>other character sets that are true subsets too, but only ascii and
>iso 8859-1 have both the same characters and the same code values).
>And 0x0041, 0x41, 0x000000000000041 all represent the decimal code
>value 65. I cannot call that anything but the same. I can use 32 bits
>to store iso 8859-1, it is still the same characters and code values
>as when I use 8 bits.

I understand your definition of "subset" now. You did come from the angle of
document compatibility and there it _does_ matter if A is stored as 0x41 or
0x0041. If you feed your 8-bit application a document containing 0x00410042,
it will most likely not render this as the string "AB".
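
To make the point concrete, here is a minimal Python sketch. It assumes the
four bytes are big-endian UCS-2 and uses Python's "utf-16-be" codec as a
stand-in for UCS-2; neither the language nor the variable names come from
this thread.

```python
# The four bytes 0x00 0x41 0x00 0x42, i.e. "AB" stored as big-endian UCS-2.
data = bytes.fromhex("00410042")

# An 8-bit application treats every byte as one ISO 8859-1 character,
# so the 0x00 bytes show up as NUL characters in front of each letter.
as_latin1 = data.decode("latin-1")

# A 16-bit-aware reader pairs the bytes into the code values 0x0041, 0x0042.
as_ucs2 = data.decode("utf-16-be")

print(repr(as_latin1))  # '\x00A\x00B'
print(repr(as_ucs2))    # 'AB'
```

The characters and code values are the same in both readings; only the
storage width differs, and that is exactly what the 8-bit reader trips over.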

>And it is the fact that both character and code values are the same in
>both UCS and iso 8859-1, that makes it so easy and nice to have a
>UTF-encoding that is iso 8859-1 compatible. No tables are needed!

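You don't need a table to convert from UTF-8 to iso-8859-1 either: use the
(table-less) algorithm from the Unicode Standard, Version 2.0, to convert
UTF-8 to UCS-2 and then strip the leading 00s. Every iso-8859-1 character
has a UCS-2 value of 0x00FF or below, so the leading byte is always 00.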
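
A rough sketch of that two-step conversion in Python follows. It is not the
code from the Unicode Standard; the function names utf8_to_ucs2 and
ucs2_to_latin1 are made up for illustration, and the decoder is deliberately
simplified (1- to 3-byte sequences only, no validation of continuation bytes
or overlong forms).

```python
def utf8_to_ucs2(data):
    """Decode UTF-8 bytes into a list of 16-bit code values.

    Uses bit arithmetic only, no lookup table. Hypothetical helper:
    covers U+0000..U+FFFF and does not validate continuation bytes.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:                        # 0xxxxxxx: one byte
            out.append(b)
            i += 1
        elif (b & 0xE0) == 0xC0:            # 110xxxxx 10xxxxxx
            out.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif (b & 0xF0) == 0xE0:            # 1110xxxx 10xxxxxx 10xxxxxx
            out.append(((b & 0x0F) << 12)
                       | ((data[i + 1] & 0x3F) << 6)
                       | (data[i + 2] & 0x3F))
            i += 3
        else:
            raise ValueError("unsupported lead byte 0x%02X at offset %d" % (b, i))
    return out


def ucs2_to_latin1(code_values):
    """Strip the leading 00 byte; only valid when every value is <= 0xFF."""
    if any(v > 0xFF for v in code_values):
        raise ValueError("character outside ISO 8859-1")
    return bytes(code_values)


# "Smörgåsbord", written with escapes so the source file stays pure ASCII.
utf8_bytes = "Sm\u00f6rg\u00e5sbord".encode("utf-8")
latin1_bytes = ucs2_to_latin1(utf8_to_ucs2(utf8_bytes))
```

Characters above U+00FF have no iso-8859-1 form, so the sketch raises an
error for them instead of silently dropping the high byte.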

>There is no way I can take hundreds of Gbytes and convert them into UCS-2,
>UTF-8 or UCS-4. It is a hopeless task.
>The only way I can change to using more UCS characters than those
>represented by iso 8859-1, is by a slow change where old and new
>applications can live together.

I did not suggest that you convert all your data. I suggested you convert
the applications reading that data to accept iso-8859-1 as well as data in
other encodings. A viable method to achieve this seems to be to make your
app expect UCS-2 and add a converter which translates your data from
iso-8859-1, UTF-8, Big 5 or whatever to UCS-2 before you feed it to the app.
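
Such a converter front end could look roughly like the Python sketch below.
The helper name to_ucs2 is hypothetical, the codec names ("latin-1",
"utf-8", "big5") are those of Python's standard library, and "utf-16-be" is
used as an approximation of UCS-2, with characters above U+FFFF rejected.

```python
def to_ucs2(raw, source_encoding):
    """Transcode stored bytes to big-endian UCS-2 before the app sees them.

    Hypothetical helper. The stored data itself is left untouched; only
    the copy handed to the application is converted.
    """
    text = raw.decode(source_encoding)
    if any(ord(ch) > 0xFFFF for ch in text):
        raise ValueError("character not representable in UCS-2")
    return text.encode("utf-16-be")


# The same two characters (A, a-umlaut) arriving in two different encodings.
from_latin1 = to_ucs2(b"A\xe4", "latin-1")
from_utf8 = to_ucs2(b"A\xc3\xa4", "utf-8")

# A Big 5 document goes through the same entry point; here the codec
# carries the mapping table, not the application.
from_big5 = to_ucs2("\u4e2d".encode("big5"), "big5")
```

The application behind the converter sees identical UCS-2 input whichever
encoding the data was stored in.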

>As I said above, Big 5 is a subset of UCS, in characters, not code values
>and need a table. iso 8859-1 can be handled without any knowledge of
>code values.

UTF-8 can, too, as shown above.

>I care about the fact that UTF-8 is being implemented in some new
>applications in a way that is not compatible with current storage and
>applications. UTF-8 does not pass nicely through 8-bit interfaces.
>UTF-8 encoded Swedish is not readable by my applications used to 8-bit
>iso 8859-1.

Yes, UTF-8 passes nicely through interfaces that choke on reserved 8-bit
values like 0x00, 0x5c, 0x2F or the complete C0 control range, because no
byte of a multi-byte UTF-8 sequence falls below 0x80. I didn't mean to say
that you get the correct characters at the point where they are rendered to
the user. You will get that only by enabling the rendering end of your
application or by converting the data back to a renderable encoding before
display. I just said that if you are employing protocols in between that are
8-bit enabled, but not 16-bit enabled, you can still use them as long as
they don't need to render the characters.
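
A small Python check of the transport claim, offered as an illustration
only: the RESERVED set and the reserved_bytes helper are made up for this
example, and "utf-16-be" again stands in for UCS-2.

```python
# Byte values named above as troublesome for some 8-bit interfaces:
# the C0 control range (0x00..0x1F, which includes NUL), 0x5C and 0x2F.
RESERVED = set(range(0x20)) | {0x5C, 0x2F}


def reserved_bytes(encoded):
    """Return the set of reserved byte values that occur in `encoded`."""
    return {b for b in encoded if b in RESERVED}


# Swedish text plus two CJK characters; the text itself contains no
# control characters, slashes or backslashes.
text = "Sm\u00f6rg\u00e5sbord \u4e2d\u6587"

utf8_hits = reserved_bytes(text.encode("utf-8"))
ucs2_hits = reserved_bytes(text.encode("utf-16-be"))

# Every byte of every multi-byte UTF-8 sequence is 0x80 or above.
multibyte_all_high = all(
    b >= 0x80
    for ch in text if ord(ch) > 0x7F
    for b in ch.encode("utf-8")
)

print(utf8_hits)           # set()
print(0x00 in ucs2_hits)   # True
print(multibyte_all_high)  # True
```

The UTF-8 form carries no reserved byte that the text did not already
contain, while the 16-bit form introduces 0x00 bytes for every Latin-1
character. Whether the receiving end then displays the text correctly is a
separate question.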


