Re: HTML5 encodings (was: Re: BOCU patent)

From: Doug Ewell (doug@ewellic.org)
Date: Sun Dec 27 2009 - 12:44:12 CST

    "verdy_p" <verdy underscore p at wanadoo dot fr> wrote:

    > How could you avoid using any multibyte encoding for transporting
    > characters of the UCS?

    You can't. I didn't say I think that multibyte encodings are stateful,
    only that "some people" do. I've seen this stated on mailing lists from
    time to time.

    > single-byte encodings are dead immediately when you have to manage
    > many encodings (which means even more states just to maintain the
    > long list of encodings to support, some of them with many more
    > ambiguities about their mapping to the UCS than characters actually
    > encoded in the UCS). And if these charsets were not registered
    > internationally (the ISO working group about it has now closed its
    > work, even if there remains an IANA registry open mostly for some large
    > vendors that have mostly stopped developing new 8-bit charsets, or
    > national authorities)

    I think everyone agrees that ISO 2022 is stateful, yes.

    > If I look at UTF-32BE or UTF-32LE, it has only 4 states (you have to
    > merge the final states with the initial state). Mixing them and
    > supporting the optional BOM requires adding 3 other states so you have
    > finally 11 states for UTF-32. With UTF-8 you only have 10 states (if you
    > count them for each possible length, and merge the final states with
    > the initial state), one less than UTF-32. So UTF-8 still wins: it is
    > LESS stateful than UTF-32...

    Usually, at least on this list, the transient information needed while
    parsing multiple bytes into a single code point isn't thought of as
    "state." When you parse multiple bytes into an integer value of some
    sort, and still have to apply additional knowledge to turn THAT into a
    code point (as in ISO 2022 or UTF-16), that is state.
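
    The distinction can be sketched in Python, assuming the standard
    library's "iso2022_jp" and "utf-8" codecs as stand-ins for ISO 2022
    and for a multibyte encoding without state in this sense. The names
    PAIR, decode_iso2022 and UTF8_A are made up for the illustration. In
    the ISO 2022 case the same two bytes mean different things depending
    on an escape sequence seen earlier in the stream; in the UTF-8 case
    the bytes of one character decode the same way wherever they appear.

```python
# The same two bytes, 0x24 0x22, which are '$' and '"' in ASCII.
PAIR = b'$"'


def decode_iso2022(data: bytes) -> str:
    """Decode with Python's ISO-2022-JP codec (an illustrative choice)."""
    return data.decode("iso2022_jp")


# No escape sequence seen: the decoder is in ASCII mode.
ascii_mode = decode_iso2022(PAIR)

# After ESC $ B the decoder is in JIS X 0208 mode, so the same two bytes
# are read as one double-byte character. ESC ( B switches back to ASCII.
jis_mode = decode_iso2022(b"\x1b$B" + PAIR + b"\x1b(B")

# UTF-8: the three bytes for U+3042 need no knowledge of earlier input.
UTF8_A = "\u3042".encode("utf-8")
alone = UTF8_A.decode("utf-8")
embedded = (b"abc" + UTF8_A + b"xyz").decode("utf-8")[3]

print(repr(ascii_mode), repr(jis_mode), alone == embedded)
```

    The byte counter a UTF-8 decoder keeps while reading those three
    bytes is the "transient information" above; the ASCII-versus-JIS mode
    is the kind of thing usually meant by "state."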

    > Clearly, UTF-16BE and UTF-16LE are the simplest encodings, with fewer
    > states; it will probably be more secure and definitely faster to
    > compute for very large volumes at high rates (such as in memory).

    Because of the surrogate mechanism, there is no way I personally would
    consider UTF-16 to be "simpler" than UTF-32. In the best case, it is
    "as simple as" UTF-32. It has other advantages, mostly related to size,
    but simplicity over UTF-32 is not one of them.
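
    To make the comparison concrete, here is a minimal sketch of two
    hand-written big-endian decoders. The function names are
    hypothetical, and neither function validates its input (unpaired
    surrogates and out-of-range values are passed through unchanged). In
    UTF-32 every 4-byte unit already is a code point; in UTF-16 each
    2-byte unit has to be checked for the surrogate ranges and, if
    needed, combined with the following unit.

```python
import struct


def decode_utf32be(data: bytes) -> list:
    """Each 4-byte big-endian unit is taken directly as a code point."""
    count = len(data) // 4
    return list(struct.unpack(f">{count}I", data[: count * 4]))


def decode_utf16be(data: bytes) -> list:
    """Read 2-byte big-endian units, then combine surrogate pairs."""
    count = len(data) // 2
    units = struct.unpack(f">{count}H", data[: count * 2])
    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # The extra step UTF-32 does not need: two units, one code point.
            low = units[i + 1]
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            )
            i += 2
        else:
            # A BMP code point (or an unpaired surrogate, passed through).
            code_points.append(unit)
            i += 1
    return code_points


text = "A\U0001F600"  # one BMP character, one supplementary character
from_utf32 = decode_utf32be(text.encode("utf-32-be"))
from_utf16 = decode_utf16be(text.encode("utf-16-be"))
print([hex(cp) for cp in from_utf32], from_utf32 == from_utf16)
```

    For text confined to the BMP the surrogate branch is never taken,
    which is the "as simple as" best case.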

    --
    Doug Ewell  |  Thornton, Colorado, USA  |  http://www.ewellic.org
    RFC 5645, 4645, UTN #14  |  ietf-languages @ http://is.gd/2kf0s
    

