Re: Stateful?

From: Marcin ‘Qrczak’ Kowalczyk (qrczak@knm.org.pl)
Date: Wed May 28 2008 - 05:11:19 CDT


    2008/5/28 Kenneth Whistler <kenw@sybase.com>:

    >> UTF-16, after all, is stateful: if you lose the BOM,
    >> things can look very different.
    >
    > That is true of the UTF-16 encoding *scheme*. (See TUS 5.0,
    > D98, p. 106.) That is because in the UTF-16 encoding scheme,
    > an initial BOM is itself a stateful switch for byte order.
    > UTF-16BE and UTF-16LE, on the other hand are not stateful.

    It is a pity that UTF-8 is ambiguous about whether it is stateful.
    UTF-8 with a BOM is stateful: the decoder must remember whether it
    is still at the beginning of the stream (where U+FEFF is a BOM to
    be dropped) or past it (where U+FEFF is ordinary data), and the
    encoder must remember whether it is at the beginning, so that it
    emits U+FEFF twice when the data itself begins with U+FEFF. UTF-8
    without any special treatment of U+FEFF at the beginning is
    stateless. Both variants of UTF-8 are in use. It would be better to
    distinguish them explicitly, the way UTF-16 is distinguished from
    UTF-16BE and UTF-16LE.
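
To make the difference concrete, here is a minimal sketch in Python, whose standard library happens to expose the two variants under separate codec names: "utf-8" (no special treatment of U+FEFF) and "utf-8-sig" (BOM-aware). The variable names are only illustrative, and the codec names are Python's, not labels defined by the Unicode Standard.

```python
import codecs

# Text whose first character really is U+FEFF, followed by "abc".
text = "\ufeffabc"
data = text.encode("utf-8")          # EF BB BF 61 62 63

# Stateless variant: U+FEFF is decoded like any other character.
plain = data.decode("utf-8")

# BOM-aware variant: a leading U+FEFF is taken as a signature and dropped.
sig = data.decode("utf-8-sig")

# The BOM-aware encoder always prepends a signature, so text that itself
# begins with U+FEFF comes out with the byte sequence EF BB BF twice.
enc = text.encode("utf-8-sig")

# The incremental decoder shows the state explicitly: the same three bytes
# are swallowed at the start of the stream and kept once past it.
dec = codecs.getincrementaldecoder("utf-8-sig")()
first = dec.decode(b"\xef\xbb\xbf")   # at the beginning: treated as BOM
second = dec.decode(b"\xef\xbb\xbf")  # past the beginning: real U+FEFF

print(repr(plain), repr(sig), enc, repr(first), repr(second))
```

Only the BOM-aware pair round-trips text that starts with U+FEFF, and it does so precisely because both sides track whether they are at the beginning of the stream.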

    -- 
    Marcin Kowalczyk
    qrczak@knm.org.pl
    http://qrnik.knm.org.pl/~qrczak/
    

