Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Chris Wendt (christw@microsoft.com)
Date: Fri Aug 21 1998 - 12:55:35 EDT


From: Dan Oscarsson <Dan.Oscarsson@trab.se>
Date: Friday, August 21, 1998 8:00 AM

>JPEG is not related to GIF (except being image formats), but ASCII and
>ISO 8859-1 are true subsets of UCS. And UTF-8 is a way to encode UCS.

Mr. Tang is right. iso-8859-1 is not a subset of UCS. It is an encoding that
offers code points for a subset of the scripts in UCS. His analogy is correct:
iso-8859-1 and UTF-8 are two different encodings for the Latin 1 script.
While UTF-8 can encode a whole lot of other scripts, iso-8859-1 cannot.
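
To make the "two encodings for one script" point concrete, here is a minimal
sketch in Python (my choice of language for illustration, not something from
the thread). The helper name fits_latin1 is made up for this example.

```python
# One character from the Latin 1 repertoire, two different byte encodings.
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

latin1_bytes = ch.encode("iso-8859-1")  # one byte: 0xE9
utf8_bytes = ch.encode("utf-8")         # two bytes: 0xC3 0xA9


def fits_latin1(text: str) -> bool:
    """Return True if every character of text can be encoded in iso-8859-1."""
    try:
        text.encode("iso-8859-1")
        return True
    except UnicodeEncodeError:
        return False


print(latin1_bytes.hex(), utf8_bytes.hex())
print(fits_latin1("caf\u00e9"), fits_latin1("\u4e2d\u6587"))
```

The second call reports that the Chinese sample cannot be expressed in
iso-8859-1 at all, while UTF-8 would encode both strings.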

>No, the only important character set for me is UCS. And currently I use
>only the first 256 codes of UCS as they are all I need, for the moment.
>Those codes happen to be the same as ISO 8859-1.

Well, they are not the same in my edition of the Unicode standard. The
character A is encoded as 0x0041 in UCS-2 and 0x41 in iso-8859-1.
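
A small Python sketch of that difference follows. It assumes that Python's
"utf-16-be" codec is an acceptable stand-in for UCS-2, which holds for
characters in the Basic Multilingual Plane such as A.

```python
ch = "A"

code_value = ord(ch)                    # the number assigned to the character
ucs2_bytes = ch.encode("utf-16-be")     # stand-in for UCS-2: two bytes, 00 41
latin1_bytes = ch.encode("iso-8859-1")  # one byte, 41

print(hex(code_value), ucs2_bytes.hex(), latin1_bytes.hex())
```

The numeric value is the same in both cases, but the encoded form on disk or
on the wire is not, so a program expecting one form cannot read the other
unchanged.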

>To be able to allow other code values from UCS than the first 256, I need
>a way to add those without making all software I have today obsolete and
>the new software must be able to read all existing texts.

You need to have a way to carry meta information about the encoding of your
data, analogous to the <META http-equiv....CHARSET=...> mechanism in HTML. In
your case it seems appropriate to default to iso-8859-1 if this meta
information is missing.
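
As a rough Python sketch of that idea: look for a charset label near the
start of the data and fall back to iso-8859-1 when there is none. The
function names, the 1024-byte window and the deliberately loose pattern are
all assumptions made for this example, not a real HTML parser.

```python
import re

DEFAULT_CHARSET = "iso-8859-1"  # assumed fallback for unlabeled legacy data

# Hypothetical, loose pattern for a "charset=..." label in the raw bytes.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?([A-Za-z0-9._-]+)", re.IGNORECASE)


def declared_charset(raw: bytes) -> str:
    """Return the labeled charset, or the default if no label is found."""
    match = _CHARSET_RE.search(raw[:1024])
    return match.group(1).decode("ascii") if match else DEFAULT_CHARSET


def decode_document(raw: bytes) -> str:
    """Decode raw bytes using the label if present, else iso-8859-1."""
    return raw.decode(declared_charset(raw))
```

A label naming an encoding Python does not know would raise LookupError
here; a real application would have to decide how to handle that case.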

>UTF-8 will not work unless it can read and write files compatible with what
>I have today.

Solve it by modifying your application to process Unicode exclusively and
convert legacy data to Unicode before feeding it into the application. You
can do that transparently to the user.
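
In outline, and again only as a Python sketch with made-up names, the
conversion sits at the input boundary and the application logic never sees
legacy bytes. The shout function merely stands in for whatever the
application really does.

```python
def to_unicode(raw: bytes, legacy_encoding: str = "iso-8859-1") -> str:
    """Convert legacy bytes to Unicode text at the input boundary."""
    return raw.decode(legacy_encoding)


def shout(text: str) -> str:
    """Stand-in for the application logic; it only ever handles Unicode."""
    return text.upper()


def run(raw_in: bytes) -> bytes:
    """Decode legacy input, process it as Unicode, write the result as UTF-8."""
    text = to_unicode(raw_in)
    return shout(text).encode("utf-8")


print(run(b"caf\xe9"))
```

The user hands over the same old iso-8859-1 file as before; only the
boundary code knows which encoding it arrived in.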

>You who use non-latin character will also need something to mix old and
>new, but your character sets are not true subsets of UCS and cannot be
>handled as easily as ISO 8859-1.

Not the case. The encodings Mr. Tang mentioned encode scripts that are
subsets of Unicode, just like Latin 1 is. You can convert Big 5 easily to
UCS-2, simply by applying a table.
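
For illustration, a Python sketch of such a table-driven conversion. It
relies on the mapping table built into Python's "big5" codec and, as an
assumption, uses "utf-16-be" as the UCS-2 output form, rejecting anything
outside the 16-bit range. The function name is invented for the example.

```python
def big5_to_ucs2(raw: bytes) -> bytes:
    """Convert Big 5 bytes to big-endian UCS-2 via the codec's mapping table."""
    text = raw.decode("big5")  # table lookup: Big 5 byte pairs to code values
    if any(ord(c) > 0xFFFF for c in text):
        raise ValueError("character outside the 16-bit range of UCS-2")
    return text.encode("utf-16-be")


sample = "\u4e2d\u6587".encode("big5")  # two Chinese characters in Big 5
print(big5_to_ucs2(sample).hex())
```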

>I doubt UTF-8 is the right choice for Chinese, UCS-2 would be better. And
>for transport between places, UTF-8 would be fine.
>But most tools I have on my computer can only read 8-bit bytes and my
>files are in ISO 8859-1. As UTF-8 is not compatible with current usage on
>my system and I cannot expect software vendors to fix my software any time
>soon, and new software using UTF-8 cannot read my old files, UTF-8 has no
>use on my system.

Then I don't understand what this is all about. If you don't care about
UTF-8, why do you want to change its definition? The whole reasoning behind
UTF-8 is that it passes nicely through 8-bit interfaces and does not require
full 16-bit enabling.
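
A short Python sketch of what "passes nicely through 8-bit interfaces" means
in practice. The helper name is made up, and the absence of zero bytes is
just one convenient property to check, chosen because NUL-terminated string
interfaces are a common kind of 8-bit interface.

```python
def survives_nul_terminated(data: bytes) -> bool:
    """True if data has no 0x00 byte, so a NUL-terminated API won't cut it short."""
    return 0 not in data


text = "na\u00efve caf\u00e9"
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-be")  # 16-bit form, for comparison

# ASCII text is byte-for-byte unchanged in UTF-8.
ascii_text = "plain ASCII"
same_as_ascii = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# Every byte of a non-ASCII character has the high bit set in UTF-8,
# so it can never be mistaken for an ASCII character such as "/" or a quote.
high_bits = all(b >= 0x80 for b in "\u00e9".encode("utf-8"))

print(same_as_ascii, high_bits)
print(survives_nul_terminated(utf8), survives_nul_terminated(utf16))
```

The 16-bit form contains zero bytes for every Latin 1 character, which is
exactly what byte-oriented tools tend to choke on.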

Chris Wendt


