Re: Undefined code positions in 8-bit character sets

From: David Starner (prosfilaes@gmail.com)
Date: Mon May 05 2008 - 21:02:35 CDT

    On Mon, May 5, 2008 at 9:19 PM, Kenneth Whistler <kenw@sybase.com> wrote:
    > A basic ISO 8859-1 <--> Unicode converter shouldn't be
    > stopping on an 0x90 byte, saying "hmmm, I wonder what this
    > is all about?" and flagging some exception for potentially
    > endless rumination by a heuristic algorithm before returning
    > a conversion.
    >
    > You basically have two choices:
    >
    > 0x90 --> U+0090
    >
    > or
    >
    > 0x90 --> U+FFFD
    >
    > and the first is what U+0090 was encoded for in the first place
    > and is what most commercial converters do, as far as I know.

    I don't disagree with that. But there's a difference between ISO
    8859-1, which leaves 0x80 through 0x9F open, basically for the C1
    controls, and Windows-1252, which fills most of that range with a
    varied collection of characters. In Windows-1252, the remaining
    gaps clearly aren't left open for C1 controls and are unusable as
    such; U+0090 (DCS), when used as a C1 control, demands that the data
    following be terminated by U+009C (ST), which isn't in Windows-1252,
    since 0x9C there is the letter œ!
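
    The difference between the two ranges can be checked with a small
    sketch. It assumes Python's built-in "latin-1" and "cp1252" codecs,
    which are only one implementation's mapping tables (other converters
    may map the unassigned Windows-1252 bytes differently), and the helper
    name undefined_bytes is made up for illustration.

```python
def undefined_bytes(codec: str) -> list[int]:
    """Return the bytes in 0x80..0x9F that the given codec refuses to decode."""
    missing = []
    for b in range(0x80, 0xA0):
        try:
            bytes([b]).decode(codec)
        except UnicodeDecodeError:
            missing.append(b)
    return missing


# ISO 8859-1: every byte in 0x80..0x9F decodes to the same-numbered C1 control.
c1_text = bytes(range(0x80, 0xA0)).decode("latin-1")
latin1_is_identity = [ord(ch) for ch in c1_text] == list(range(0x80, 0xA0))

# Windows-1252 (as Python's codec defines it): 0x9C is a letter, not ST.
byte_9c = b"\x9c".decode("cp1252")

print(latin1_is_identity)                          # True
print(repr(byte_9c))                               # 'œ'
print([hex(b) for b in undefined_bytes("cp1252")])
```

    In this codec's table only a handful of bytes in the range are left
    unassigned, and 0x90 is one of them; the rest carry printable
    characters.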

    Worse, converting U+0090 to 0x90 is as wrong as converting 0x90 to
    U+0620; what 0x90 means in Windows-1252 is undefined, and what
    U+0090 does mean couldn't possibly fit into the Windows-1252 character
    set. Mapping Windows-1252 0x90 <-> U+0090 doesn't preserve the
    semantics of that code point in either character set.
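
    The two choices from the quoted message can be written out as
    policies in a converter. This is a minimal sketch, not how any
    particular commercial converter works: decode_cp1252 and its policy
    names are hypothetical, and the set of "undefined" bytes is whatever
    Python's cp1252 codec rejects.

```python
def decode_cp1252(data: bytes, policy: str = "replace") -> str:
    """Decode Windows-1252, handling unassigned bytes by an explicit policy.

    policy="c1"      -> map an unassigned byte to the same-numbered C1 code point
    policy="replace" -> map an unassigned byte to U+FFFD
    """
    if policy not in ("c1", "replace"):
        raise ValueError("policy must be 'c1' or 'replace'")
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(b) if policy == "c1" else "\ufffd")
    return "".join(out)


def can_encode_cp1252(text: str) -> bool:
    """Report whether Python's strict cp1252 codec can encode the text."""
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


as_c1 = decode_cp1252(b"a\x90b", policy="c1")
as_replacement = decode_cp1252(b"a\x90b", policy="replace")

print([hex(ord(ch)) for ch in as_c1])           # ['0x61', '0x90', '0x62']
print([hex(ord(ch)) for ch in as_replacement])  # ['0x61', '0xfffd', '0x62']
print(can_encode_cp1252(as_c1))                 # False
```

    Under the "c1" policy the byte survives numerically, but the strict
    codec will not take U+0090 back, which is one way of seeing that the
    pairing 0x90 <-> U+0090 is a convention of the converter rather than
    something the Windows-1252 table itself defines.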


