Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Dan Oscarsson (Dan.Oscarsson@trab.se)
Date: Mon Aug 24 1998 - 06:10:23 EDT


From: Chris Wendt

>>JPEG is not related to GIF (except being image formats), but ASCII and
>>ISO 8859-1 are true subsets of UCS. And UTF-8 is a way to encode UCS.

>Mr Tang is right. iso-8859-1 is not a subset of UCS. It is an encoding that
>offers code points for a subset of scripts in UCS. His analogy is correct:
>iso-8859-1 and UTF-8 are two different encodings for the Latin 1 script.
>While UTF-8 can encode a whole lot of other scripts, iso-8859-1 can not.
>
>>No, the only important character set for me is UCS. And currently I use
>>only the first 256 codes of UCS as they are all I need, for the moment.
>>Those codes happen to be the same as ISO 8859-1.
>
>Well, they are not the same in my edition of the Unicode standard. The
>character A is encoded as 0x0041 in UCS-2 and 0x41 in iso-8859-1.

Well, both the characters in iso 8859-1 and the code values for
those characters are the same as in UCS. That is what I call a
true subset (looking at only the characters, you can find many
other character sets that are true subsets too, but only ascii and
iso 8859-1 have both the same characters and the same code values).
And 0x0041, 0x41 and 0x000000000000041 all represent the decimal code
value 65. I cannot call that anything but the same. I can use 32 bits
to store iso 8859-1; it is still the same characters and code values
as when I use 8 bits.

And it is the fact that both character and code values are the same in
both UCS and iso 8859-1, that makes it so easy and nice to have a
UTF-encoding that is iso 8859-1 compatible. No tables are needed!
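
To illustrate the point, here is a minimal Python sketch. The function
name latin1_to_code_points is made up for this example; "latin-1" is
Python's name for its built-in iso 8859-1 codec. It only shows that
widening an iso 8859-1 byte to a UCS code value needs no lookup table.

```python
def latin1_to_code_points(data):
    """Widen each iso 8859-1 byte to a UCS code value without any table."""
    # Iterating over bytes yields integers 0..255; each one is already
    # the UCS code value of the character it stands for.
    return list(data)


# "A", a-ring, a-diaeresis, o-diaeresis as iso 8859-1 bytes.
sample = bytes([0x41, 0xE5, 0xE4, 0xF6])
code_points = latin1_to_code_points(sample)

# Building text from the raw byte values matches the real codec.
text = "".join(chr(cp) for cp in code_points)
matches_codec = text == sample.decode("latin-1")

# The width used to store the number does not change the number.
same_value = 0x0041 == 0x41 == 65
```

The same holds for all 256 byte values, which is why the conversion
can be a plain widening of each byte.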

>
>>UTF-8 will not work unless it can read and write files compatible with what
>>I have today.
>Solve it by modifying your application to process Unicode exclusively and
>convert legacy data to Unicode before feeding it into the application. You
>can do that transparently to the user.

There is no way I can take hundreds of Gbytes and convert them into UCS-2,
UTF-8 or UCS-4. It is a hopeless task.
The only way I can change to using more UCS characters than those
represented by iso 8859-1 is through a slow change where old and new
applications can live together.

>
>>You who use non-latin character will also need something to mix old and
>>new, but your character sets
>>are not true subsets of UCS and cannot be handled as easily as ISO 8859-1.
>
>Not the case. The encodings Mr. Tang mentioned encode scripts that are
>subsets of Unicode, just like Latin 1 is. You can convert Big 5 easily to
>UCS-2, simply by applying a table.

As I said above, Big 5 is a subset of UCS in characters, not in code
values, and needs a table. Iso 8859-1 can be handled without any
knowledge of code values, since they are identical to those in UCS.
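
A small Python sketch of the difference, assuming Python's built-in
"big5" and "latin-1" codecs (which carry the mapping table internally).
The helper byte_value is a name made up for this example. It compares
the number stored in the legacy encoding with the UCS code value.

```python
def byte_value(data):
    """Interpret an encoded byte sequence as one big-endian integer."""
    return int.from_bytes(data, "big")


# U+4E2D, a common CJK ideograph that Big 5 can encode.
cjk = "\u4e2d"
big5_bytes = cjk.encode("big5")
# The Big 5 code value differs from the UCS one, so a table is required.
big5_differs = byte_value(big5_bytes) != ord(cjk)

# U+00E5, a-ring, from the Swedish alphabet.
a_ring = "\u00e5"
latin1_bytes = a_ring.encode("latin-1")
# The iso 8859-1 code value is the UCS code value.
latin1_same = byte_value(latin1_bytes) == ord(a_ring)
```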

>Then I don't understand what this is all about. If you don't care about
>UTF-8, why do you want to change its definition? The whole reasoning behind
>UTF-8 is that it passes nicely through 8-bit interfaces and does not require
>full 16-bit enabling.

I care about the fact that UTF-8 is being implemented in some new
applications in a way that is not compatible with current storage and
applications. UTF-8 does not pass nicely through 8-bit interfaces.
UTF-8 encoded Swedish is not readable by my applications, which expect
8-bit iso 8859-1.
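
A minimal Python sketch of what such an application sees. The sample
word is only an illustration, and decoding with "latin-1" stands in
for an old 8-bit application reading the bytes.

```python
# A Swedish word containing o-diaeresis (U+00F6) and a-ring (U+00E5).
swedish = "sm\u00f6rg\u00e5s"

# A new application stores the text as UTF-8.
utf8_bytes = swedish.encode("utf-8")

# An old application reads the same bytes as iso 8859-1.
# Each non-ASCII letter shows up as two unrelated characters.
misread = utf8_bytes.decode("latin-1")

# Pure ASCII text is unaffected; only the upper half breaks.
ascii_ok = "abc".encode("utf-8") == "abc".encode("latin-1")
```

Here the 7-character word becomes 9 bytes in UTF-8, and the old
application displays 9 characters, with each Swedish letter replaced
by a two-character sequence.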

    Dan


