RE: Undefined code positions in 8-bit character sets

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 10 2008 - 05:24:37 CDT

    > -----Original Message-----
    > From: unicode-bounce@unicode.org
    > [mailto:unicode-bounce@unicode.org] On behalf of Andreas Prilop
    > Sent: Monday, May 5, 2008 17:31
    > To: unicode@unicode.org
    > Subject: Undefined code positions in 8-bit character sets
    >
    > I refer to
    > http://www.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT
    >
    > http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/
    > CP1252.TXT
    >
    > In ISO-8859-1,   code position 0x90 is mapped to U+0090.
    > In Windows-1252, code position 0x90 is listed as "undefined".
    >
    > Why are they treated differently?

    Windows codepages have never defined any C1 control in positions 80-9F.
    These were always reserved in all versions of these codepages for extensions
    to map "graphical" characters; so initially most of them were undefined
    until they were later assigned to characters. If they had been assigned to
    C1 controls, they would no longer be available for these extensions.
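
    As a rough illustration, here is a small Python sketch that probes which
    positions in 80-9F are still unassigned. It assumes that Python's bundled
    "cp1252" codec follows the CP1252.TXT table cited above; the function name
    is only illustrative.

```python
def undefined_positions(codec: str) -> list[int]:
    """Return the byte values in 0x80-0x9F that `codec` cannot decode."""
    missing = []
    for value in range(0x80, 0xA0):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            # The codec has no mapping for this position.
            missing.append(value)
    return missing


if __name__ == "__main__":
    for name in ("cp1252", "latin-1"):
        holes = undefined_positions(name)
        print(name, [f"0x{v:02X}" for v in holes])
```

    With that codec the holes are 0x81, 0x8D, 0x8F, 0x90 and 0x9D, while
    "latin-1" reports none because it maps the whole range to C1 controls.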

    > International Standard ISO/IEC 8859-1 does *not* define code
    > position 0x90. So it might also be listed as "undefined".

    Yes, but ISO 8859 does not define any mapping there for any of its
    variants: this was done to be compatible with other transport or
    presentation protocols. It does not formally define a physical encoding, so
    the ISO 8859 standard may be transported over 7-bit protocols (for example
    using SS2/SS3 control sequences or other ISO 2022 compatible encodings).

    The IANA registration of the "iso-8859-1" encoding in fact defines an
    encoding built from two standards: the ISO/IEC 8859-* code tables and the
    C0 and C1 control sets are merged into a single 8-bit encoding.

    By contrast, the IANA registration of the Windows codepages is exactly the
    same as the Windows codepages themselves: there is no merging, and nothing
    is provided to offer compatibility with other transport or presentation
    protocols, so it supports only the 8-bit serialization.

    There's a difference between the coded character set and the IANA encoding
    which in fact merges several layers: the code mapping and the serialization
    into a stream of bytes.

    > Or, for purely practical reasons, 0x90 in Windows-1252 might
    > also be mapped to U+0090.

    This can only be a fallback mapping, and it is non-standard: the mapping
    may change at any time. Formally the position is still undefined (and
    there is no sign now that it will be assigned later), and applications may
    use other application-specific fallbacks (including U+FFFD, or no mapping
    at all, raising an exception in the decoder).
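
    To make these options concrete, here is a minimal Python sketch of the
    three outcomes for byte 0x90. It assumes Python's built-in "latin-1" and
    "cp1252" codecs; the function name and the dictionary keys are my own.

```python
def decode_variants(raw: bytes) -> dict[str, str]:
    """Decode `raw` under three policies and collect the outcomes."""
    results = {}
    # ISO-8859-1 as registered with IANA: every byte maps, 0x90 -> U+0090.
    results["iso-8859-1"] = raw.decode("latin-1")
    # Windows-1252 with a replacement fallback: undefined bytes -> U+FFFD.
    results["cp1252/replace"] = raw.decode("cp1252", errors="replace")
    # Windows-1252 with no fallback: the decoder reports an error.
    try:
        results["cp1252/strict"] = raw.decode("cp1252")
    except UnicodeDecodeError as exc:
        results["cp1252/strict"] = f"error: {exc.reason}"
    return results


if __name__ == "__main__":
    for policy, outcome in decode_variants(b"\x90").items():
        print(f"{policy}: {outcome!r}")
```

    Which of the three is right depends on the application; the strict variant
    is the only one that does not invent a mapping.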

    For example, the Java "Charset" decoder raises an exception on this byte.
    It is normally preferable to raise an exception or error on this position,
    letting the application choose what to do about this undefined position
    (for example, such a code suggests that the stream is in fact not encoded
    with Windows-1252, and another encoding should be tried).
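
    The "try another encoding" strategy could look like the following Python
    sketch. This is not the Java API; it only illustrates the idea with
    Python's strict decoders, and the candidate list and function name are
    assumptions of mine.

```python
def decode_first_match(raw: bytes,
                       candidates=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) for the first candidate that decodes `raw`."""
    for name in candidates:
        try:
            # Strict decoding: an undefined position raises instead of guessing.
            return name, raw.decode(name)
        except UnicodeDecodeError:
            continue  # this candidate is ruled out, try the next one
    raise ValueError("no candidate encoding accepted the data")


if __name__ == "__main__":
    print(decode_first_match(b"\x80"))  # accepted by cp1252
    print(decode_first_match(b"\x90"))  # rejected by cp1252
```

    Since "latin-1" accepts every byte, it only makes sense as the last
    candidate in such a list.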

    NO standard document can contain any 0x90 byte if it claims to be encoded
    with Windows-1252. By contrast, on the web, the "iso-8859-1" IANA
    registration is standard for the HTTP protocol, and for tagging documents
    like HTML or XML internally in some attribute or in HTTP presentation
    headers, so it effectively maps C1 controls (which is standard there).

    If you look at what Windows actually does, there are two distinct
    implementations. One is found in the Win32 API that performs "ANSI to OEM"
    conversions or the reverse; at the same time the Win32 API provides a way
    to customize the behavior for undefined code positions: the fallback is
    configurable, and it may be a default character such as the question mark
    "?", a "do-nothing" option (leaving the code "unchanged") to avoid
    exceptions, or an error status during the conversion, raising an exception.
    Another conversion API can be used to perform "Multibyte to Unicode"
    conversions (or the reverse). A better conversion API is provided by the
    .Net libraries (which support many more charsets, mappings and
    conversions), in a way that is compatible across versions of Windows (the
    Win32 conversions are much more limited and are not extensible).

    So it is correct to have a mapping for ISO-8859-1 that maps 0x90 to the C1
    control U+0090, and no mapping at all for the Windows codepage, where NO
    provision was made to allow C1 controls, notably when these mappings are
    used in the context of charset identification using IANA-registered codes
    for tagging web content in HTTP headers or in HTML, SGML, XML attributes
    (or the pseudo-attribute of a document declaration tag).

    It is also interesting to look at how the ISO-8859-x and Windows-12xx
    codepages are remapped to EBCDIC-compatible codepages for roundtrip
    reversibility: there exists a full remapping of codes for ISO-8859-x
    (including C0 and C1 controls) to EBCDIC (with full reversibility in both
    directions) but only a partial remapping for Windows-12xx codepages (or
    several EBCDIC variants, treating the Windows-12xx 80-9F range
    differently). Most of these EBCDIC codepages don't have any IANA
    registration with a standard identifier, so they are not intended for data
    interchange in a heterogeneous networking environment. You may have to look
    through a very long list of IBM-defined codepages defined only for local
    compatibility (some of them are installable on Windows using the Regional
    Settings control panel).
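
    A hedged way to see the difference is the Python sketch below, which goes
    through Unicode to an EBCDIC codepage and back. It assumes Python's "cp037"
    codec as the EBCDIC side (one variant among many, and not necessarily the
    remapping tables referred to above); the function name is illustrative.

```python
ALL_BYTES = bytes(range(256))


def roundtrips_via(raw: bytes, source: str, ebcdic: str = "cp037") -> bool:
    """True if `raw` survives source -> Unicode -> EBCDIC -> Unicode -> source."""
    try:
        text = raw.decode(source)
        back = text.encode(ebcdic).decode(ebcdic)
        return back.encode(source) == raw
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Some position has no counterpart on one side of the conversion.
        return False


if __name__ == "__main__":
    print("latin-1, all 256 bytes:", roundtrips_via(ALL_BYTES, "latin-1"))
    print("cp1252, byte 0x80:", roundtrips_via(b"\x80", "cp1252"))
```

    Under these assumptions the full ISO-8859-1 range, controls included,
    survives the trip, while the Windows-1252 euro at 0x80 has no counterpart
    in that EBCDIC codepage.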

    Note also that the .Net conversion libraries allow applications to specify
    the fallback mechanism to use for undefined code positions; but ISO-8859-*
    decoders are guaranteed never to throw any decoder exception, and will
    never return U+FFFD, a fallback "?" character, or any C0 control like SUB
    unless a SUB is actually encoded in the byte stream.
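
    The same guarantee can be checked outside .Net. The Python sketch below
    (an analogue using the built-in "latin-1" codec, not the .Net API; the
    function name is mine) verifies that every byte decodes to the code point
    of the same value and encodes back unchanged.

```python
def latin1_is_total() -> bool:
    """Check that ISO-8859-1 decoding maps all 256 bytes one-to-one."""
    raw = bytes(range(256))
    text = raw.decode("latin-1")  # no fallback is ever needed here
    same_code_points = [ord(ch) for ch in text] == list(range(256))
    return same_code_points and text.encode("latin-1") == raw


if __name__ == "__main__":
    print("total and reversible:", latin1_is_total())
```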

    Note finally that on Windows, Internet Explorer does not decode ISO-8859-1
    using the standard assignment defined by IANA. One (good?) reason is that
    most C0 and C1 controls are illegal in standard HTML/XML documents, even
    when using a charset like "iso-8859-1" that maps them, but Internet Explorer
    will not invalidate the document if instructed not to "guess" another
    encoding; instead it will handle the "ISO-8859-1" tagging as if it were
    "Windows-1252" (meaning in effect that 0x80 will still be rendered as a euro
    symbol, even if the document declares itself as being encoded with the
    "ISO-8859-1" IANA-registered charset).

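    The effect of that substitution can be sketched in Python as follows. The
    override table and function name are hypothetical, written only to mimic
    the behavior described here; they are not taken from any browser source.

```python
# Hypothetical label override, in the spirit of the behavior described above.
LEGACY_OVERRIDES = {"iso-8859-1": "cp1252"}


def decode_like_legacy_browser(raw: bytes, declared: str) -> str:
    """Decode `raw`, silently swapping the declared charset if overridden."""
    codec = LEGACY_OVERRIDES.get(declared.lower(), declared)
    return raw.decode(codec, errors="replace")


if __name__ == "__main__":
    page = b"price: 5 \x80"
    print(repr(page.decode("iso-8859-1")))                        # as declared
    print(repr(decode_like_legacy_browser(page, "ISO-8859-1")))   # as rendered
```

    Decoded as declared, 0x80 is the C1 control U+0080; with the override it
    becomes the euro sign U+20AC.
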
    Some Microsoft tools generate bogus documents, such as web design tools
    (like FrontPage): it allows inserting euro symbols encoded as 0x80 in
    ISO-8859-* charsets, or bullets, without any warning given to the user when
    saving the HTML page. At the least, these tools should offer to switch the
    encoding to Windows-1252 or to Unicode UTF-8, or they should use named or
    numeric character entities. When you edit a standard document declared as
    ISO-8859-* and using the expected and correct named or numeric character
    entities, and then save the edited HTML file, it silently replaces the
    entities with single-byte codes, without changing the declared encoding and
    without prompting the user to do so. If the user keeps the ISO-8859-1
    charset, the euro symbols, bullets, ellipses, curly quotation marks or
    apostrophes should be saved as character entities in the HTML document.
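
    What a well-behaved save could do is sketched below in Python, using the
    standard "xmlcharrefreplace" error handler. The function name is mine, and
    the sketch only handles the character encoding step, not HTML escaping in
    general.

```python
def save_as_latin1_html(text: str) -> bytes:
    """Encode `text` as ISO-8859-1, using numeric character references
    for anything the charset cannot represent directly."""
    return text.encode("latin-1", errors="xmlcharrefreplace")


if __name__ == "__main__":
    sample = "caf\u00e9: 5 \u20ac \u2022 \u2026"
    print(save_as_latin1_html(sample))
```

    Characters inside ISO-8859-1 (like U+00E9) stay single bytes, while the
    euro sign, bullet and ellipsis come out as &#8364;, &#8226; and &#8230;.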

    I think this has always been a severe bug in FrontPage. It has existed and
    persisted for many years now, it also exists in IE itself when reading the
    page's HTML content from the DOM, and it has never been corrected although
    it was reported long ago. It causes severe compatibility problems, except
    with Internet Explorer, which silently but incorrectly interprets a
    specified "ISO-8859-1" charset as if it were "Windows-1252". For this
    reason, it is best to describe the situation by saying that Internet
    Explorer does not correctly support the ISO-8859-* registered charsets.
    This non-standard behavior has however been added to other browsers to
    support the many web pages relying on this IE "quirk" mode. This old IE
    behavior should never be mimicked in standard mode, as it does not respect
    the HTML and XML standards, which clearly indicate that the IANA charset
    registration must be respected. Apparently the bug is in the DOM HTML
    implementation of IE and affects FrontPage directly, as it uses IE's DOM
    engine to perform the actual edits or to save the HTML code of the edited
    pages.


