From: David Starner (prosfilaes@gmail.com)
Date: Mon May 05 2008 - 21:02:35 CDT
On Mon, May 5, 2008 at 9:19 PM, Kenneth Whistler <kenw@sybase.com> wrote:
> A basic ISO 8859-1 <--> Unicode converter shouldn't be
> stopping on an 0x90 byte, saying "hmmm, I wonder what this
> is all about?" and flagging some exception for potentially
> endless rumination by a heuristic algorithm before returning
> a conversion.
>
> You basically have two choices:
>
> 0x90 --> U+0090
>
> or
>
> 0x90 --> U+FFFD
>
> and the first is what U+0090 was encoded for in the first place
> and is what most commercial converters do, as far as I know.
I don't disagree with that. But there's a difference between ISO
8859-1, which leaves 0x80 through 0x9F open basically for the C1
controls, and Windows-1252, which fills most of that range with a
varied collection of characters. In Windows-1252, the few positions
left unassigned clearly aren't reserved for C1 controls and are
unusable as such; U+0090, when used as a C1 control (DCS), demands
that the data following be terminated by U+009C (ST), which isn't in
Windows-1252, where 0x9C is already taken by a printable character!
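
To make the difference concrete, here is a rough sketch that walks the
0x80..0x9F range through Python's bundled "iso-8859-1" and "cp1252"
codecs. The helper name classify_c1_range is made up for illustration,
and the result reflects Python's own mapping tables; other converters
(as noted above) may map the unassigned Windows-1252 bytes differently.

```python
def classify_c1_range():
    """Decode each byte 0x80..0x9F with both codecs.

    Hypothetical helper: returns two dicts keyed by byte value. For
    cp1252, a value of None means Python's table leaves the byte unmapped.
    """
    latin1 = {}
    cp1252 = {}
    for b in range(0x80, 0xA0):
        raw = bytes([b])
        # ISO 8859-1 maps every byte to the same-numbered code point.
        latin1[b] = raw.decode("iso-8859-1")
        try:
            cp1252[b] = raw.decode("cp1252")
        except UnicodeDecodeError:
            cp1252[b] = None
    return latin1, cp1252


latin1, cp1252 = classify_c1_range()
undefined = sorted(b for b, ch in cp1252.items() if ch is None)
print("unmapped in cp1252:", [hex(b) for b in undefined])
print("0x90 in ISO 8859-1: U+%04X" % ord(latin1[0x90]))
print("0x9C in cp1252:     U+%04X" % ord(cp1252[0x9C]))
```

In this table 0x90 is one of the unmapped bytes, while 0x9C decodes to
a letter rather than to the string terminator.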
Worse, converting U+0090 to 0x90 is as wrong as converting 0x90 to
U+0620; what 0x90 means in Windows-1252 is undefined, and what
U+0090 does mean couldn't possibly fit into the Windows-1252 character
set. Converting Windows-1252 0x90 <-> U+0090 doesn't preserve the
semantics of that code point in either character set.
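
For illustration only, the two choices quoted above can be sketched as
policies on a Windows-1252 decoder. The function decode_1252 and its
policy names are hypothetical, not any library's API; the sketch leans
on Python's "cp1252" codec, which treats 0x90 as unmapped.

```python
def decode_1252(data, policy="replace"):
    """Decode Windows-1252 bytes under one of two policies (assumed names).

    "replace": unmapped bytes become U+FFFD.
    "c1":      unmapped bytes become the same-numbered C1 control.
    """
    if policy == "replace":
        return data.decode("cp1252", errors="replace")
    if policy != "c1":
        raise ValueError("unknown policy: %r" % policy)
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Pass the byte through as a C1 control, e.g. 0x90 -> U+0090.
            out.append(chr(b))
    return "".join(out)


def encodes_to_1252(text):
    """Return True if every character has a cp1252 byte in Python's table."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


print(repr(decode_1252(b"a\x90b", "replace")))
print(repr(decode_1252(b"a\x90b", "c1")))
print(encodes_to_1252("\x90"))
```

Under the "c1" policy the result no longer encodes back through the
strict codec, which is the round-trip problem described above: the
pass-through only works if both directions agree to bend the table.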