Re: RE: 32'nd bit & UTF-8

From: Philippe VERDY (verdy_p@wanadoo.fr)
Date: Tue Jan 18 2005 - 15:53:59 CST

    > De : "Hans Aberg"
    > > Jon Hanna (jon at hackcraft dot net) wrote:
    > >> 0x00...0x7F: 0xxxxxxx
    > >> 0x80...0x7FF: 110xxxxx 10xxxxxx
    > >> 0x800...0xFFFF: 1110xxxx 10xxxxxx 10xxxxxx
    > >> 0x10000...0x1FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    > >> 0x200000...0x3FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > >> 0x4000000...0x7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > >> 0x80000000...0xFFFFFFFFF: 11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > >> 0x1000000000...0x3FFFFFFFFFF: 11111111 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
    > >
    > > Of course this loses the fact that UTF-8 data will never contain 0xFE or 0xFF
    > > (and so UTF-16 with a BOM will never be confused with UTF-8, a fact that is
    > > important to XML parsers for one application).
    >
    > In , the use of BOM is
    > discouraged for use on UNIX platforms. So if endianness appears to
    > become a problem, it might be better to use UTF-8 externally, and then
    > convert it to UTF-32/H/L internally in the program.

    I have not read any formal description (even an informative one) of a UTF-8-like transformation format that uses the bytes FE and FF.
    So if you really want to use FE and FF to extend the old, deprecated, informative-RFC form of UTF-8 while keeping compatibility with the byte order marks used to autodetect UTF-16 and UTF-32, consider this:
    - if FF is used, it has to be followed by FE to be recognized as a (not recommended) UTF-16 or UTF-32 BOM
    - if FE is used, it has to be followed by FF to be recognized as a (not recommended) UTF-16 or UTF-32 BOM

    So you have options: do not use FE blindly in your extension; make sure that the extension never allows an FF byte to be encoded just after it. The same goes for FF (it can't be followed by FE).

    There are ways to handle this situation, because a leading byte FE or FF for longer sequences will be followed only by trailing bytes, and trailing bytes (10xxxxxx, that is 80 to BF) can't be equal to FE or FF in UTF-8.
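    The byte-level argument above can be checked with a small sketch. The function names here are hypothetical; only the byte values (FE, FF, and the 10xxxxxx trailing-byte pattern) come from the discussion.

```python
# Byte pairs that open a UTF-16 stream with a BOM. FF FE is also the
# start of the UTF-32LE BOM (FF FE 00 00).
UTF16_BE_BOM = b"\xFE\xFF"
UTF16_LE_BOM = b"\xFF\xFE"


def starts_with_utf16_style_bom(data: bytes) -> bool:
    """Return True if the stream opens with FE FF or FF FE."""
    return data[:2] in (UTF16_BE_BOM, UTF16_LE_BOM)


def is_trailing_byte(b: int) -> bool:
    """Trailing bytes have the bit pattern 10xxxxxx, i.e. 0x80..0xBF."""
    return (b & 0xC0) == 0x80


# Neither FE nor FF matches the trailing-byte pattern, so a leading FE or FF
# that is followed by a trailing byte can never be mistaken for a BOM.
assert not is_trailing_byte(0xFE)
assert not is_trailing_byte(0xFF)
assert not starts_with_utf16_style_bom(bytes([0xFE, 0x80]))
assert starts_with_utf16_style_bom(b"\xFF\xFEh\x00")
```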

    In fact you could just as well make the encoding scheme more general, so that it handles integers of arbitrary (unbounded) bit length, while still keeping the restriction that allows detection of UTF-16 and UTF-32.

    For example, you can avoid using FF completely for those extra-large integers. Instead you can say that FE will be followed by another leading byte, and in that case the number of bytes will not be 7, but 7 + the number of bytes indicated in the second leading byte after FE:

    Given:
    0x4000000...0x7FFFFFFF:
    1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    This does not change:
    0x80000000...0xFFFFFFFFF:
    11111110 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

    But the extension comes in here:
    0x1000000000...0x1FFFFFFFFFF:
    11111110 110xxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
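    As a rough illustration, here is a minimal sketch of an encoder that produces only the byte layouts listed in this thread: the 1- to 7-byte forms, plus the single 8-byte form shown just above (FE, then a 110xxxxx leading byte, then six trailing bytes). The name encode_extended is hypothetical, and the sketch does not attempt the general rule for longer sequences.

```python
def encode_extended(value: int) -> bytes:
    """Encode a non-negative integer below 2**41 using the layouts above."""
    if value < 0:
        raise ValueError("value must be non-negative")
    if value < 0x80:
        return bytes([value])  # 0xxxxxxx

    # Forms of n = 2..7 bytes: the leading byte has n one bits, then a zero,
    # then (7 - n) payload bits; each trailing byte carries 6 payload bits.
    for n in range(2, 8):
        payload_bits = (7 - n) + 6 * (n - 1)
        if value < (1 << payload_bits):
            marker = (0xFF << (8 - n)) & 0xFF  # 0xC0, 0xE0, ..., 0xFE
            lead = marker | (value >> (6 * (n - 1)))
            trail = [0x80 | ((value >> (6 * k)) & 0x3F)
                     for k in range(n - 2, -1, -1)]
            return bytes([lead] + trail)

    # The 8-byte form from the example: FE, 110xxxxx, six trailing bytes
    # (5 + 36 = 41 payload bits, so values up to 0x1FFFFFFFFFF).
    if value < (1 << 41):
        second_lead = 0xC0 | (value >> 36)
        trail = [0x80 | ((value >> (6 * k)) & 0x3F)
                 for k in range(5, -1, -1)]
        return bytes([0xFE, second_lead] + trail)

    raise ValueError("value too large for the layouts sketched here")
```

    For values in the Unicode range the output coincides with ordinary UTF-8, and no output of this sketch contains an FF byte.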

    The caveat is that when positioning at a random offset (for example on the second byte above), you'll need to check the previous byte to see whether it is equal to FE, and then adjust the number of trailing bytes expected after it. Normally, when pointing at the second byte 110xxxxx above, you would know that it is followed by a single trailing byte (a 2-byte sequence), but here it is followed by 6 trailing bytes, and there are 2 leading bytes...

    This is not much of a problem, because a random-access algorithm already has to move backward when it points to a trailing byte, in order to find the leading byte. But in this extension, the leading byte it finds is possibly only the last one of a longer sequence of leading bytes, if it is preceded by one or more bytes equal to 11111110.
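    The backward scan described above might look like the following sketch. The name sequence_start is hypothetical, and the code assumes a well-formed buffer, in which an FE byte can only appear at the start of a sequence and never at the end of one.

```python
def sequence_start(buf: bytes, pos: int) -> int:
    """Return the index of the first byte of the sequence containing pos."""
    start = pos
    # Usual UTF-8 resynchronization: step back over trailing bytes
    # (10xxxxxx) until a leading byte or an ASCII byte is reached.
    while start > 0 and (buf[start] & 0xC0) == 0x80:
        start -= 1
    # Extension: the byte found may be only the last of several leading
    # bytes, so absorb any FE bytes that immediately precede it.
    while start > 0 and buf[start - 1] == 0xFE:
        start -= 1
    return start


# 'A', then the 8-byte form FE C1 80 80 80 80 80 80, then 'B'.
sample = b"A" + bytes([0xFE, 0xC1] + [0x80] * 6) + b"B"
assert sequence_start(sample, 5) == 1  # from a trailing byte
assert sequence_start(sample, 2) == 1  # from the second leading byte
assert sequence_start(sample, 9) == 9  # 'B' starts its own sequence
```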

    This extended transformation format is not UTF-8. It is an extension of it. It is not an encoding scheme, because it is not associated with a coded character set, so it is not used for Unicode/ISO/IEC 10646 text. Call it whatever name you want, but not UTF-8...

    But I use it to demonstrate that the transformation format is not closed, and is still extensible. For very long integers, however, it will become very inefficient. But it respects the contract: the number of bits set to 1 before the true encoded bits within the leading byte(s) will be equal to the total length of the sequence in bytes. In addition, it will never generate any FF byte, so it will remain easy to distinguish it from the leading BOMs of UTF-16 or UTF-32...


