Re: Undefined code positions in 8-bit character sets

From: David Starner (prosfilaes@gmail.com)
Date: Mon May 05 2008 - 21:02:35 CDT

Next message: Mark Davis: "Re: Undefined code positions in 8-bit character sets"

Previous message: Kenneth Whistler: "Re: Undefined code positions in 8-bit character sets"
In reply to: Kenneth Whistler: "Re: Undefined code positions in 8-bit character sets"
Next in thread: Mark Davis: "Re: Undefined code positions in 8-bit character sets"
Reply: Mark Davis: "Re: Undefined code positions in 8-bit character sets"

On Mon, May 5, 2008 at 9:19 PM, Kenneth Whistler <kenw@sybase.com> wrote:
> A basic ISO 8859-1 <--> Unicode converter shouldn't be
> stopping on an 0x90 byte, saying "hmmm, I wonder what this
> is all about?" and flagging some exception for potentially
> endless rumination by a heuristic algorithm before returning
> a conversion.
>
> You basically have two choices:
>
> 0x90 --> U+0090
>
> or
>
> 0x90 --> U+FFFD
>
> and the first is what U+0090 was encoded for in the first place
> and is what most commercial converters do, as far as I know.

I don't disagree with that. But there's a difference between ISO
8859-1, which leaves 0x80 through 0x9F open essentially for the C1
controls, and Windows-1252, which fills most of that range with an
assortment of graphic characters. In Windows-1252, the remaining
unassigned positions clearly aren't left open for C1 controls and are
unusable as such: U+0090 (DCS), when used as a C1 control, demands
that the data following be terminated by U+009C (ST), and 0x9C in
Windows-1252 is already taken by a letter!
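
To make the difference concrete, here is a small sketch using Python's
built-in codecs. It only shows how one particular converter behaves; the
variable names are mine, and other converters are known to pass 0x90
through as U+0090 instead.

```python
# ISO 8859-1: byte 0x90 maps straight onto the C1 control U+0090.
latin1_result = b"\x90".decode("latin-1")

# Windows-1252: Python's codec treats 0x90 as unassigned, so a strict
# decode raises instead of picking a mapping.
try:
    b"\x90".decode("cp1252")
    cp1252_strict_fails = False
except UnicodeDecodeError:
    cp1252_strict_fails = True

# The second choice from the quoted message: substitute U+FFFD.
cp1252_replaced = b"\x90".decode("cp1252", errors="replace")

# 0x9C in Windows-1252 is a letter (U+0153), not the C1 control U+009C.
cp1252_9c = b"\x9c".decode("cp1252")

print(repr(latin1_result), cp1252_strict_fails,
      repr(cp1252_replaced), repr(cp1252_9c))
```

Whether a strict failure or a substitution is the right default depends
on the application; the sketch just shows that the two character sets
do not treat the byte alike.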

Worse, to convert U+0090 to 0x90 is as wrong as converting 0x90 to
U+0620; it's undefined what 0x90 means in Windows-1252, and what
U+0090 does mean couldn't possibly fit into the Windows-1252 character
set. To convert from Windows-1252 0x90 <-> U+0090 doesn't preserve the
semantics of that codepoint in either character set.
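
The same point in the Unicode-to-bytes direction, again as a rough
sketch against Python's codecs (round_trips is a hypothetical helper of
mine, not part of any library): U+0090 survives a round trip through
ISO 8859-1 but has no home in this Windows-1252 table.

```python
def round_trips(ch: str, codec: str) -> bool:
    """Return True if ch survives encode-then-decode under codec."""
    try:
        return ch.encode(codec).decode(codec) == ch
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The codec has no mapping for this character or byte.
        return False

print(round_trips("\u0090", "latin-1"))  # C1 control in ISO 8859-1
print(round_trips("\u0090", "cp1252"))   # same code point in Windows-1252
print(round_trips("\u0153", "cp1252"))   # the letter that occupies 0x9C
```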

This archive was generated by hypermail 2.1.5 : Mon May 05 2008 - 21:06:11 CDT