Re: Undefined code positions in 8-bit character sets

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon May 05 2008 - 14:29:32 CDT

Next message: Kenneth Whistler: "Re: Stability Policy Update"

Previous message: Richard Wordingham: "Writing Numbers in Cuneiform"
Maybe in reply to: Andreas Prilop: "Undefined code positions in 8-bit character sets"
Next in thread: David Starner: "Re: Undefined code positions in 8-bit character sets"
Reply: David Starner: "Re: Undefined code positions in 8-bit character sets"
Reply: Doug Ewell: "Re: Undefined code positions in 8-bit character sets"

> Andreas Prilop wrote on Monday, May 05, 2008 4:30 PM
>
> >I refer to
> > http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
> > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
> >
> > In ISO-8859-1, code position 0x90 is mapped to U+0090.
> > In Windows-1252, code position 0x90 is listed as "undefined".
> >
> > Why are they treated differently?

Different theory by the maintainers of the two sets of files.

I am the most recent maintainer of record for the 8859-X mapping
files posted on the Unicode website. For those I follow the
consensus of the UTC that mappings for control code points
in the 8859-X family of ASCII-derived encodings to/from Unicode
are least problematical if 0x00 <--> U+0000, 0x01 <--> U+0001,
etc. This is, in fact, the way that almost all commercial
conversions handle the control code conversions for 8859-X
character sets.
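
A minimal sketch of that identity mapping, in Python. The function
name is hypothetical, and extending the rule past the C0 range to
0x7F and 0x80..0x9F is my reading of "etc." above, not something the
mapping files spell out in these words. The loop cross-checks the
rule against Python's own "latin-1" codec, which is just one
implementation among many.

```python
def iso8859_control_to_unicode(byte: int) -> int:
    """Map an 8859-X control byte to the same-valued Unicode code point.

    Assumption: "control byte" here means C0 (0x00..0x1F), DEL (0x7F)
    and C1 (0x80..0x9F).
    """
    if not (0x00 <= byte <= 0x1F or 0x7F <= byte <= 0x9F):
        raise ValueError(f"0x{byte:02X} is not a control code position")
    return byte  # 0x00 <--> U+0000, 0x01 <--> U+0001, ...


# Cross-check against Python's ISO 8859-1 codec.
for b in [*range(0x00, 0x20), *range(0x7F, 0xA0)]:
    assert chr(iso8859_control_to_unicode(b)) == bytes([b]).decode("latin-1")
```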

Since 8859-1.TXT and the other mapping tables posted on the
Unicode website are intended to provide practical *mapping*
guidelines for implementers, it would be pedantic in
the extreme (and counterproductive) to post them up as
documentation of the 8859-X standards *without* the control
code mappings.

The Microsoft mapping tables are contributed by and maintained
by Microsoft, and follow Microsoft standards practice for
table definition. 0x00..0x1F are mapped through to U+0000..U+001F,
but because most Microsoft code pages contain graphic characters
in the 0x80..0x9F range, those characters are mapped, but
unassigned code points are simply left #UNDEFINED, as is
also the case for Microsoft double-byte code pages. This allows
a distinction to be made between that status and #DBCS LEAD BYTE
values.
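
To make the distinction concrete, here is a rough Python sketch of
reading one line of such a table. The function name and the status
label "MAPPED" are my own, and the sample lines in the usage note are
illustrative of the layout as I understand it (byte value, Unicode
value, then a "#" comment), not copied from the posted files.

```python
def parse_mapping_line(line: str):
    """Parse one mapping-table line into (byte, code_point, status).

    Returns None for blank lines and whole-line comments.
    Assumed layout: "0xNN <whitespace> 0xNNNN <whitespace> #NAME",
    where the Unicode column is empty for unmapped positions and the
    comment then carries the status (UNDEFINED or DBCS LEAD BYTE).
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    fields, _, comment = line.partition("#")
    parts = fields.split()
    byte = int(parts[0], 16)
    if len(parts) >= 2:
        return byte, int(parts[1], 16), "MAPPED"
    # No Unicode column: keep the comment so that UNDEFINED and
    # DBCS LEAD BYTE stay distinguishable.
    return byte, None, comment.strip()
```

For example, parse_mapping_line("0x81\t\t#UNDEFINED") yields
(0x81, None, "UNDEFINED"), while a "#DBCS LEAD BYTE" line yields the
same shape with a different status string.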

In practice, of course, when actually implementing conversion
tables from Microsoft code pages to/from Unicode, nearly all
commercial implementations, including Microsoft's, map undefined
values in the 0x80..0x9F range (for non-DBCS code pages) to
the corresponding Unicode U+0080..U+009F control code character,
rather than to U+FFFD.
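
A sketch of that behavior in Python, for illustration only. Python's
"cp1252" codec happens to follow the table strictly and raises an
error on the undefined positions, so the wrapper below adds the
pass-through. The names UNDEFINED_C1 and decode_cp1252_lenient are
hypothetical, and this is not a claim about how any particular
commercial converter is written.

```python
def _undefined_c1_bytes() -> frozenset:
    """Collect the 0x80..0x9F bytes that Python's cp1252 codec rejects."""
    undefined = set()
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode("cp1252")
        except UnicodeDecodeError:
            undefined.add(b)
    return frozenset(undefined)


UNDEFINED_C1 = _undefined_c1_bytes()


def decode_cp1252_lenient(data: bytes) -> str:
    """Decode Windows-1252, mapping undefined 0x80..0x9F bytes to the
    same-valued C1 control (U+0080..U+009F) instead of failing or
    substituting U+FFFD."""
    return "".join(
        chr(b) if b in UNDEFINED_C1 else bytes([b]).decode("cp1252")
        for b in data
    )
```

With this, the byte 0x80 still decodes to U+20AC EURO SIGN, while
0x90 comes out as U+0090.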

> > International Standard ISO/IEC 8859-1 does *not* define
> > code position 0x90. So it might also be listed as "undefined".
>
> 0x90 is defined in the IANA version of ISO-8859-1, which calls up the
> description in RFC1345. In a web context, I believe the IANA definition
> should take precedence over ISO/IEC.

While I agree with the conclusion that for web usage, using mappings
that map through control codes, rather than treating them as undefined,
is the correct thing to do -- I do so for different reasons.

RFC 1345 is *extremely* dated. It is from 1992, and refers to
prepublication versions of 10646. The first edition of 10646
wasn't even published until 1993, and at that point we are
talking about a Unicode 1.1-level repertoire. The character
mnemonic table in RFC 1345 is full of errors, and the mapping
tables for various charsets at the end of RFC 1345 have not
been updated to track the updates of the 8859 standards nor
the updates in mapping practice for some charsets that resulted
from extensions to 10646.

>
> On the other hand, Windows-1252 might be extended again and assign a meaning
> to 0x90, so it is probably better not to map any Unicode codepoint to that
> value.

I disagree. If you do not map U+0090 to 0x90 for Windows-1252, all
you are doing is ensuring an interoperability bug both
with Windows and with other commercial applications doing
conversions.
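
For the encoding direction, a hedged Python sketch. The helper names
are hypothetical. Python's "cp1252" codec refuses U+0090, so the
wrapper passes a C1 control through only when the same-valued byte is
undefined in the table. That guard matters: U+0080 must not be
written as 0x80, since that byte already means U+20AC.

```python
def is_undefined_in_cp1252(byte: int) -> bool:
    """True if Python's cp1252 codec has no mapping for this byte."""
    try:
        bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        return True
    return False


def encode_cp1252_lenient(text: str) -> bytes:
    """Encode to Windows-1252, writing a C1 control (U+0080..U+009F)
    as the same-valued byte when that byte is otherwise undefined."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            cp = ord(ch)
            if 0x80 <= cp <= 0x9F and is_undefined_in_cp1252(cp):
                out.append(cp)
            else:
                raise  # genuinely unencodable in this code page
    return bytes(out)
```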

--Ken

>
> > Or, for purely practical reasons, 0x90 in Windows-1252 might
> > also be mapped to U+0090.
>
> Which is reported to be what Windows *currently* actually does.
>
> Richard.

