Re: Is there a UTF that allows ISO 8859-1 (latin-1)?

From: Yung-Fong Tang (ftang@netscape.com)
Date: Thu Aug 20 1998 - 18:01:09 EDT


Dan Oscarsson wrote:

> Well, I have programs that use iso 8859-1 characters for special functions.
> But the most important advantages of backwards compatibility, is that
> old programs that do not understand UTF-8 can still work while UTF-8
> programs also work - on the same data!
> It is an impossible situation where some text files are in iso 8859-1 and
> some are in UTF-8. It will be a hopeless mess. And many UTF-8 programs
> die or stop reading a file when it is in iso 8859-1.

What you need is metadata that tells your software whether your data is encoded
in ISO-8859-1 or UTF-8. In HTTP, that metadata is the charset parameter of the
Content-Type header, and in HTML it is in the <META> tag. Your software then
relies on the metadata to decide how to import the data. If the data is in
ISO-8859-1, you convert it from ISO-8859-1 into your internal code
representation. If the metadata says it is in UTF-8, you convert the data using
the UTF-8 rules. Your software probably also needs to remember the original
metadata so that when you write the data out (such as in an HTML form post), you
can convert it back.
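
A minimal sketch of that import/export flow, in Python. The helper names
(charset_from_content_type, import_text, export_text) are made up for
illustration, and falling back to ISO-8859-1 when no charset is given is an
assumption, not something the metadata guarantees.

```python
import re

def charset_from_content_type(content_type, default="iso-8859-1"):
    """Hypothetical helper: pull the charset parameter out of a
    Content-Type header value. The default is an assumption."""
    match = re.search(r'charset\s*=\s*"?([A-Za-z0-9._:-]+)"?',
                      content_type, re.IGNORECASE)
    return match.group(1).lower() if match else default

def import_text(raw, content_type):
    """Decode raw bytes into the internal representation (a Python str).
    The charset is returned too, so it can be remembered for writing."""
    charset = charset_from_content_type(content_type)
    return raw.decode(charset), charset

def export_text(text, charset):
    """Convert the internal representation back to the original charset."""
    return text.encode(charset)

# The same word arrives in two encodings; the metadata says which is which.
latin1_text, latin1_cs = import_text(b"caf\xe9", "text/html; charset=ISO-8859-1")
utf8_text, utf8_cs = import_text(b"caf\xc3\xa9", "text/html; charset=UTF-8")
```

Both calls yield the same internal string, and export_text(latin1_text,
latin1_cs) gives back the original ISO-8859-1 bytes.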

Requesting a UTF that is compatible with ISO-8859-1 is like requesting that the
JPEG working group make a JPEG compatible with GIF, or asking for a VCD
compatible with the LaserDisc format. In the case of VCD and LaserDisc, instead
of making the VCD compatible with the LaserDisc, a vendor may make a
*VCD/LaserDisc player* that can play both discs (the player, not the disc, which
is the data). It is the player designer's job to solve the compatibility issue,
not the disc designer's. The same thing applies here.

One of the reasons you request this is that, in your head, there is only one
important charset: ISO-8859-1. However, my company cares about many charsets:
ISO-8859-1, ISO-8859-2, ISO-8859-5, ISO-8859-7, ISO-8859-9, KOI8-R, Shift_JIS,
Big5, GB2312, EUC-KR, etc. If your request is reasonable, then I would like to
ask someone to design a UTF compatible with Big5 and GB2312, and Shift_JIS, and
KOI8-R. (Just joking.)

> For several languages, the adaptive encoding of UTF-8 I suggested, would be
> more compact than pure UTF-8. And what I can see also self-synchronization
> is retained. It may be that the lexical string order is not fully retained, but
> any program needing that can have the text as pure UTF-8 (or UCS)
> in the program. The important thing is that all normal text in files and other
> simple storage (that is not databases) should be iso 8859-1 compatible.

Why should it be ISO 8859-1 compatible? If it should be ISO 8859-1 compatible,
should it also be Shift_JIS compatible? Should it also be KOI8-R compatible? ISO
8859-1 is designed only for Western European Latin-script languages. There is no
reason that all normal text in files and other simple storage should be ISO
8859-1 compatible but not ISO 8859-2 compatible, or Shift_JIS or Big5 compatible.

> Data storage that needs special software to access the storage device
> (like databases) can have any encoding they like internally, it is always
> accessed through the special software. A normal file can be accessed and
> written by many tools and must then be in a standard format that most programs
> can handle.

And such a *STANDARD FORMAT* cannot be ISO-8859-1. Why? Because ISO-8859-1 cannot
encode Japanese, Korean, Chinese, or even Eastern European languages. That is THE
REASON why people proposed UTF-8. UTF-8 may not be the BEST choice we could
have, but ISO 8859-1 is definitely worse.
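
To make the coverage point concrete, here is a small Python sketch that checks
which sample strings each charset can represent. The sample words and the
can_encode helper are my own illustration, not part of any standard.

```python
def can_encode(text, charset):
    """Return True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except UnicodeEncodeError:
        return False

# Illustrative samples only: one word per language.
samples = {
    "French": "caf\u00e9",                # café
    "Polish": "\u0141\u00f3d\u017a",      # Łódź
    "Japanese": "\u65e5\u672c\u8a9e",     # 日本語
    "Korean": "\ud55c\uad6d\uc5b4",       # 한국어
    "Chinese": "\u4e2d\u6587",            # 中文
}

# For each language, record whether ISO-8859-1 and UTF-8 can hold the word.
coverage = {
    language: (can_encode(word, "iso-8859-1"), can_encode(word, "utf-8"))
    for language, word in samples.items()
}
```

Of these samples, only the French word fits in ISO-8859-1, while UTF-8 encodes
all five.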

>
>
> >
> >If you really need a Latin-1 compatible UTF, then just use UTF-7 but do
> >not transform the characters in the 0x80-0xff range. This is a straight
> >forward modification of UTF-7 and it costs you just one or two bytes to
> >change in an UTF-7 implementation. This technique is so obvious and
> >trivial that it is not even worth to write a formal specification for
> >it.

Please do not create YET ANOTHER encoding scheme for no good reason.

> >
> >I hope it will not become popular. Another UCS encoding is certainly not
> >what the world has been waiting for.
> I agree, UTF-7 could be possible but is not wanted. My adaptive UTF-8 is
> really UTF-8, just that the software accepts not UTF-8 encoding sequences
> when reading and using iso 8859-1, if possible, when writing. Could easily
> be incorporated into existing UTF-8 software.

No, it is NOT. Once you change the meaning of a byte combination, it is no longer
UTF-8.
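
A short Python sketch of why this matters. The adaptive_decode function below is
only my guess at how such a reader might behave (try UTF-8, fall back to
ISO-8859-1). It is an assumption for illustration, not Dan's actual proposal.

```python
def adaptive_decode(raw):
    """Hypothetical 'adaptive' reader: accept strict UTF-8 if the bytes
    happen to be valid UTF-8, otherwise treat them as ISO-8859-1."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")

def strict_utf8_accepts(raw):
    """Return True if a conforming UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A legitimate ISO-8859-1 text: the two characters U+00C3 and U+00A9.
original = "\u00c3\u00a9"
raw = original.encode("iso-8859-1")      # the bytes C3 A9

# Those same bytes are also the valid UTF-8 encoding of U+00E9,
# so the adaptive reader silently returns a different text.
misread = adaptive_decode(raw)

# Plain ISO-8859-1 output such as "caf" + E9 is rejected by a real
# UTF-8 decoder, so the mixed data is not UTF-8 either.
accepted = strict_utf8_accepts(b"caf\xe9")
```

Here misread is the single character U+00E9 rather than the two characters that
were written, and accepted is False. The byte sequence alone cannot say which
reading was intended; only metadata can.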

>
>
> Dan
> --
> Dan Oscarsson
> Telia Prosoft AB Email: Dan.Oscarsson@trab.se
> Box 85
> 201 20 Malmo, Sweden


