Re: 32'nd bit & UTF-8

From: Hans Aberg (haberg@math.su.se)
Date: Thu Jan 20 2005 - 06:51:11 CST

    At 20:24 -0800 2005/01/19, Peter Constable wrote:
    >> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]
    >> On Behalf Of Peter Kirk
    >
    >> This is a very significant point. Because a BOM may be used with
    >> UTF-8, UTF-8 is in fact not quite as compatible with ASCII as has
    >> been presumed.
    >
    >If anyone ever assumed UTF-8 is compatible with ASCII, they were
    >mistaken. An ASCII processor can expect to receive octets strictly in
    >the range 0 - 127, period, whereas clearly UTF-8 data can contain octets
    >outside that range.

    This was a problem in the past: ASCII computers and computer programs,
    strictly speaking, processed only 7 bits, reserving the 8th bit for other
    uses (such as parity). Examples are programs like TeX and the UNIX
    operating systems.
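
    To illustrate why such 7-bit processing is a problem for UTF-8, here is a
    minimal Python sketch. The function strip_high_bit is hypothetical; it
    only mimics software that clears the 8th bit of every octet and is not
    taken from TeX or any UNIX program.

```python
# Hypothetical 7-bit processor: clears the 8th bit of every octet,
# roughly what software that reserved that bit for other uses did.
def strip_high_bit(data: bytes) -> bytes:
    return bytes(b & 0x7F for b in data)


ascii_text = "plain".encode("ascii")
utf8_text = "é".encode("utf-8")  # two octets: 0xC3 0xA9

# Pure ASCII survives, since every octet is already in the range 0 - 127.
assert strip_high_bit(ascii_text) == ascii_text

# The UTF-8 octets lose their leading bits and become unrelated ASCII.
mangled = strip_high_bit(utf8_text)
print(mangled)  # b'C)'
```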

    But because of the need for the various ISO Latin and other 8-bit
    encodings, this has changed. These programs now all process not pure
    ASCII but 8-bit bytes, where the values 0 - 127 are usually reserved for
    ASCII. (For example, MIME was originally invented to enable 8-bit content
    to be sent by email, as it proved notoriously difficult to get the
    Internet email forwarding software updated to handle 8-bit bytes
    properly.)

    >ASCII is forward compatible with UTF-8 (a UTF-8 processor can process
    >ASCII data), not the other way around.

    So it is not pure ASCII we are speaking about, but programs that are
    already capable of handling 8-bit bytes, assuming that the ASCII
    characters are the bytes with leading bit 0. Then, without the requirement
    that the BOM be ignored, little or no change to the software is often
    needed.
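
    As a rough sketch of the kind of program meant here, the following Python
    function handles 8-bit bytes, treats only bytes with leading bit 0 as
    ASCII, and passes everything else through untouched. The name
    upcase_ascii and the sample text are assumptions made for illustration
    only.

```python
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8


def upcase_ascii(data: bytes) -> bytes:
    """Uppercase ASCII letters; copy every other byte unchanged."""
    out = bytearray()
    for b in data:
        if 0x61 <= b <= 0x7A:  # 'a'..'z', all with leading bit 0
            out.append(b - 0x20)
        else:  # includes every byte with leading bit 1
            out.append(b)
    return bytes(out)


text = "naïve café".encode("utf-8")

# The multi-byte UTF-8 sequences pass through intact, so the output is
# still valid UTF-8 even though the program knows nothing about UTF-8.
assert upcase_ascii(text).decode("utf-8") == "NAïVE CAFé"

# A leading BOM is three bytes with leading bit 1; it is copied through
# like any other non-ASCII data rather than being ignored.
assert upcase_ascii(UTF8_BOM + text).startswith(UTF8_BOM)
```

    Such a program needs no change to carry UTF-8. Ignoring the BOM, on the
    other hand, would mean adding code that recognises those three bytes.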

      Hans Aberg


