From: Kenneth Whistler (kenw@sybase.com)
Date: Mon May 05 2008 - 20:19:03 CDT
Doug Ewell said:
> >> On the other hand, Windows-1252 might be extended again and assign a
> >> meaning to 0x90, so it is probably better not to map any Unicode
> >> codepoint to that value.
> >
> > I disagree. If you do not map U+0090 to 0x90 for Windows-1252, all you
> > are doing is ensuring an interoperability bug both with Windows and
> > with other commercial applications doing conversions.
>
> If you are working in either ISO 8859-1 or Windows-1252, and encounter
> the byte 0x90, you've got problems already. You might do well to ask
> yourself whether your text is even in one of those encodings, or whether
> it is mislabeled or a bad assumption was made.
Sure, that applies if you're working in the "wild", so to
speak: dealing with conversions of mislabelled documents full
of potential data corruption, having to use heuristics to
determine what the actual encodings are, and sorting good
data from bad data.
But that is another layer up from what I'm talking about.
A basic ISO 8859-1 <--> Unicode converter shouldn't be
stopping on a 0x90 byte, saying "hmmm, I wonder what this
is all about?" and flagging some exception for potentially
endless rumination by a heuristic algorithm before returning
a conversion.
You basically have two choices:
0x90 --> U+0090
or
0x90 --> U+FFFD
The first is what U+0090 was encoded for in the first place,
and it is what most commercial converters do, as far as I know.
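For concreteness, here's a minimal sketch of such a converter in
Python. The function name and the map_c1_to_replacement flag are
just illustrative, not any particular library's API:

    # Minimal sketch of an ISO 8859-1 --> Unicode converter,
    # illustrating the two choices above. Hypothetical names.

    def latin1_to_unicode(data: bytes,
                          map_c1_to_replacement: bool = False) -> str:
        """Convert ISO 8859-1 bytes to a Unicode string.

        ISO 8859-1 maps every byte value NN directly to code point
        U+00NN, so choice 1 turns 0x90 into U+0090 (a C1 control).
        Choice 2 turns the C1 range (0x80-0x9F) into U+FFFD.
        """
        if not map_c1_to_replacement:
            # Choice 1: 0x90 --> U+0090, the identity mapping.
            # (This is what Python's own 'latin-1' codec does.)
            return data.decode('latin-1')
        # Choice 2: 0x90 --> U+FFFD for every C1 byte.
        return ''.join('\uFFFD' if 0x80 <= b <= 0x9F else chr(b)
                       for b in data)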
Then if you want to stop and ask, "Hey! What is this U+0090
(or substituted U+FFFD) doing in my 8859-1 data?! I bet there
is an error here I should check into!", well, that is a
perfectly valid thing to do. But I think it is conceptually
(and software-architecturally) an epiphenomenon of the basic
conversion definition, best done as a separate pass, as
sketched below.
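By way of illustration, that check works fine as a separate pass
over the already-converted text. Again just a sketch, reusing the
hypothetical converter above:

    # Flag C1 controls (or substituted U+FFFD) left behind by the
    # conversion; finding any is a hint that the 8859-1 label, or
    # the data itself, deserves a closer look.

    def flag_suspicious_c1(text: str) -> list:
        return [i for i, ch in enumerate(text)
                if '\u0080' <= ch <= '\u009F' or ch == '\uFFFD']

For example:

    >>> flag_suspicious_c1(latin1_to_unicode(b'abc\x90def'))
    [3]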
--Ken