Re: CP1252 under Unix

From: Mark Davis (markdavis@ispchannel.com)
Date: Sat Mar 25 2000 - 15:37:59 EST

Next message: Frank da Cruz: "Re: CP1252 under Unix"
Previous message: Frank da Cruz: "Re: CP1252 under Unix"
Maybe in reply to: Markus Kuhn: "Re: CP1252 under Unix"
Next in thread: Keld Jørn Simonsen: "Re: CP1252 under Unix"
Reply: Keld Jørn Simonsen: "Re: CP1252 under Unix"

I agree with you that the overall goal should be to move to UTF-8 for transmission. However, ignoring 1252 and its cousins is both wrong and shortsighted.

1. Let's start with the wrong part. There are already IANA-registered charsets that use the C1 area for graphic characters.

Look at http://www.isi.edu/in-notes/iana/assignments/character-sets and you will find windows-1251.
Look at http://oss.software.ibm.com/icu/charset/ and you will find the HTML code charts, including the one for 1251.

Voila -- a raft of codes in 80..9F

Adding "windows-1252" will not 'sully' the registry any more than it already is.
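The point about 80..9F can be checked mechanically. The following is a minimal sketch in Python, assuming its bundled `cp1251` and `cp1252` codecs reflect the windows-1251 and windows-1252 tables; the helper name `graphic_codes_in_c1` is hypothetical and only for illustration.

```python
import unicodedata

def graphic_codes_in_c1(codec_name):
    """Return the bytes in 0x80..0x9F that the codec decodes to non-control characters."""
    graphic = []
    for byte in range(0x80, 0xA0):
        try:
            char = bytes([byte]).decode(codec_name)
        except UnicodeDecodeError:
            continue  # byte is unassigned in this charset
        # "Cc" is the Unicode general category for control characters
        if unicodedata.category(char) != "Cc":
            graphic.append(byte)
    return graphic

# Compare a charset that keeps C1 as controls with two that do not
for name in ("iso-8859-1", "cp1251", "cp1252"):
    print(name, len(graphic_codes_in_c1(name)))
```

Under these assumptions, iso-8859-1 should report no graphic characters in that range, while both Windows code pages report many.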

2. Now for the shortsighted part. The IANA registry is used for much more than simply interchange on the web. A registry of charset names is needed across all systems and platforms. That way, cross-platform programs can identify the local charsets, and successfully and accurately translate those to and from Unicode/10646 or specific other codesets.

Our goal is to converge towards use of a single character set, but that transition is easier if we can precisely identify those character sets that ARE in use on the Web currently, rather than hiding our heads in the sand and hoping they will go away. There are probably more pages and emails in 1252 than in any other charset. As long as the charset is correctly identified as 1252 (which I agree with you is an important problem right now), a recipient can correctly transform the characters to an encoding that can be handled.
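To illustrate why the label matters, here is a small sketch in Python, assuming its `cp1252` codec matches the windows-1252 table. The sample bytes are made up for the example. The same bytes yield typographic quotes and a euro sign when identified as 1252, but C1 control characters when mislabeled as iso-8859-1.

```python
# Sample bytes as a 1252 system might emit them (illustrative data only)
raw = b"\x93quoted\x94 costs 5\x80"

# Correctly identified: 0x93/0x94 are curly quotes, 0x80 is the euro sign
as_1252 = raw.decode("cp1252")

# Mislabeled as iso-8859-1: the same bytes become C1 control characters
as_latin1 = raw.decode("iso-8859-1")

# Once decoded correctly, the text can be re-encoded for transmission
utf8 = as_1252.encode("utf-8")

print(as_1252)
print(repr(as_latin1))
print(utf8)
```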

* * *

The real problem, in our minds, with the IANA charset registry is that it is sadly lacking in two ways.

A. It is incomplete: it does not include names for all the charsets found in common use.

B. It is imprecise: given a name, you cannot easily (or at all!) find out the precise mapping. Many platforms mean slightly different things by a term like "Shift_JIS". If the transmitter and the recipient do not mean the same thing by the same name, then data transmitted in XML or HTML may suffer data corruption.
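One concrete instance of the "Shift_JIS" ambiguity, sketched in Python under the assumption that CPython's bundled `shift_jis` and `cp932` codecs stand in for two platforms' readings of the same name: the double-byte sequence 81 60 decodes to different Unicode characters depending on which table is used.

```python
# The same two bytes, interpreted under two tables that are both
# commonly called "Shift_JIS" in practice
wave = b"\x81\x60"

from_sjis = wave.decode("shift_jis")   # JIS X 0208-based table
from_cp932 = wave.decode("cp932")      # Microsoft's variant

print(f"shift_jis: U+{ord(from_sjis):04X}")
print(f"cp932:     U+{ord(from_cp932):04X}")
print("same character?", from_sjis == from_cp932)
```

If the transmitter used one table and the recipient the other, the character would silently change in transit, which is the kind of corruption described above.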

That's why we are undertaking the project to come up with unique designations for particular mappings as they *actually* occur on different platforms, and in different versions of those platforms. While we are just beginning this effort (with some initial results at http://oss.software.ibm.com/icu/charset/), completion will give us some hope of relating those mappings back to the IANA names, so that we can make an informed guess as to the precise character set intended.

Mark

P.S. Notice also that even if the one-to-one mappings are the same, the *fallback* mappings may be very different across platforms. For example, if you look at the MS mappings in the code charts for 1252 and 932 (look at the end of each file), you find differences that are not simply explained by the increased number of characters in 932.

Luckily, this is not as big a problem for XML/HTML, since one should always use NCRs for characters that are not in the target set, rather than using fallbacks. For other domains, however, it may be important to provide qualified names for two mappings that differ, even if only in fallback mappings.
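As a brief sketch of the NCR approach in Python: the standard `xmlcharrefreplace` error handler emits a decimal numeric character reference for every character the target charset cannot represent, instead of substituting a fallback character. The sample text is invented for the example.

```python
# Text containing characters outside iso-8859-1 (euro sign, en dash, curly quotes)
text = "Price: 5\u20ac \u2013 \u201cfinal\u201d caf\u00e9"

# Characters in the target set are encoded directly; the rest become &#NNNN;
encoded = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(encoded)
```

The e-acute survives as a single iso-8859-1 byte, while the characters missing from the target set are written as references that any HTML or XML parser can resolve.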

Keld Jørn Simonsen wrote:

> From an Internet IETF point of view, there are only a few
> charsets that are recommended for use with HTTP,
> including the iso-8859 series, UTF-8, and JIS.
> cp1252 is not amongst them, as there is no charset
> registered with IANA under this name, and I doubt that it will ever be
> recommended.
>
> My advice would be for cp1252 pages, that they be either marked as
> iso-8859-1 and then the extra characters be given with their
> decimal &#xxxx; UCS code, or they be encoded in UTF-8.
>
> Just putting cp1252 out on the line, as done by
> major players like MS Word, is against IETF policies
> and recommendations.
>
> Keld


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:00 EDT