Re: CP1252 under Unix

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Sat Mar 25 2000 - 17:22:53 EST


Hi Mark,

I agree with you that we should move to UTF-8, and even that we should
register cp1252.

But we should not use cp1252 on the wire; only a few very well
known charsets should be used here. The MIME RFCs say
only the 8859 series, and the HTTP RFCs say the 8859 series and
then some Unicode, JIS, and Korean charsets. We should not
expand these lists, except for the UTF-8 charset, which is the
general IETF recommendation of RFC 2130.

Kind regards
Keld

On Sat, Mar 25, 2000 at 12:37:59PM -0800, Mark Davis wrote:
> I agree with you that the overall goal should be to move to UTF-8 for transmission. However, ignoring 1252 and its cousins is both wrong and shortsighted.
>
> 1. Let's start with the wrong part. There are already IANA registered charsets that use the C1 area for graphic character sets.
>
> Look at http://www.isi.edu/in-notes/iana/assignments/character-sets and you will find windows-1251.
> Look at http://oss.software.ibm.com/icu/charset/, and you will find the HTML for a code chart listing 1251.
>
> Voila -- a raft of codes in 80..9F
>
> Adding "windows-1252" will not 'sully' the registry any more than it already is.
>
> 2. Now for the shortsighted part. The IANA registry is used for much more than simply interchange on the web. A registry of charset names is needed across all systems and platforms. That way, cross-platform programs can identify the local charsets, and successfully and accurately translate those to and from Unicode/10646 or specific other codesets.
>
> Our goal is to converge towards use of a single character set, but that transition is easier if we can precisely identify those character sets that ARE in use on the Web currently, not hiding our heads in the sand and hoping they will go away. There are probably more pages and emails in 1252 than any other. As long as the charset is correctly identified as 1252 (which I agree with you is an important problem right now), a recipient can correctly transform the characters to an encoding that can be handled.
>
> * * *
>
> The real problem, in our minds, with the IANA charset registry is that it is sadly lacking in two ways.
>
> A. It is incomplete: it does not include names for all the charsets found in common use.
>
> B. It is imprecise: given a name, you cannot easily (or at all!) find out the precise mapping. Many platforms mean slightly different things by a term like "Shift_JIS". If the transmitter and the recipient do not mean the same thing by the same name, then data transmitted in XML or HTML may suffer data corruption.
>
> That's why we are undertaking the project to come up with unique designations for particular mappings as they *actually* occur on different platforms, and in different versions of those platforms. While we are just beginning this effort (with some initial results at http://oss.software.ibm.com/icu/charset/), completion will give us some hope of relating those mappings back to the IANA names, so that we can make an informed guess as to the precise character set intended.
>
> Mark
>
> P.S. Notice also that even if the one-to-one mappings are the same, the *fallback* mappings may be very different cross platform. For example, if you look at the MS mappings in the code charts for 1252 and 932 (look at the end of each file), you find differences that are not simply explained by the increased number of characters in 932.
>
> Luckily, this is not as big a problem for XML/HTML, since one should always use NCRs for characters that are not in the target set, rather than using fallbacks. For other domains, however, it may be important to provide qualified names for two mappings that differ, even if only in fallback mappings.
>
> Keld Jørn Simonsen wrote:
>
> > From an Internet IETF point of view, there are only a few
> > charsets that are recommended for use with HTTP,
> > including the iso-8859 series, UTF-8, and JIS.
> > cp1252 is not amongst them, as there is no charset
> > registered with IANA under this name, and I doubt that it will
> > ever be recommended.
> >
> > My advice would be for cp1252 pages, that they be either marked as
> > iso-8859-1 and then the extra characters be given with their
> > decimal &#xxxx; UCS code, or they be encoded in UTF-8.
> >
> > Just putting cp1252 out on the line, as done by
> > major players like MS Word, is against IETF policies
> > and recommendations.
> >
> > Keld
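
The two options in the advice quoted above (label the page iso-8859-1 and
write the extra characters as decimal numeric character references, or
re-encode as UTF-8) could be sketched roughly as follows. This assumes
Python and its standard codecs, which nobody in the thread mentions; the
function names and the sample bytes are made up for illustration.

```python
# Sketch only: function names and sample data are hypothetical.

def cp1252_to_latin1_with_ncrs(raw: bytes) -> bytes:
    """Re-express cp1252 bytes as iso-8859-1 plus decimal NCRs."""
    # Python's cp1252 codec rejects the five bytes that cp1252 leaves
    # undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) and raises an error.
    text = raw.decode("cp1252")
    # Characters outside iso-8859-1 become &#NNNN; references.
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode cp1252 bytes as UTF-8."""
    return raw.decode("cp1252").encode("utf-8")


# Curly quotes (0x93, 0x94) and the euro sign (0x80) sit in the 80..9F range.
sample = b"\x93caf\xe9\x94 \x80 5"

print(cp1252_to_latin1_with_ncrs(sample))
print(cp1252_to_utf8(sample))
```

Either output is only safe for HTML/XML bodies whose declared charset
matches it; the references mean nothing in plain-text mail.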

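Mark's point B, that one name such as "Shift_JIS" can stand for slightly
different mappings, can be seen with a small sketch. It assumes the codecs
bundled with CPython, where "shift_jis" follows the JIS X 0208 table and
"cp932" follows the Microsoft variant; other platforms may draw the line
differently, and the helper name is hypothetical.

```python
# Sketch only: relies on CPython's bundled "shift_jis" and "cp932" codecs.

def decode_with_each(raw: bytes, names=("shift_jis", "cp932")) -> dict:
    """Decode the same bytes under several charset names."""
    return {name: raw.decode(name) for name in names}


# The same two bytes, read under two closely related charset names.
result = decode_with_each(b"\x81\x60")

for name, text in result.items():
    print(name, [f"U+{ord(ch):04X}" for ch in text])

# If sender and recipient disagree about which table the name means,
# the round trip is not guaranteed to succeed.
try:
    result["cp932"].encode("shift_jis")
    round_trip_ok = True
except UnicodeEncodeError:
    round_trip_ok = False
print("cp932 result re-encodes as shift_jis:", round_trip_ok)
```

The sample bytes are one well-known point of divergence (wave dash versus
fullwidth tilde); a real comparison would walk the full mapping tables.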

