RE: Least used parts of BMP.

From: Doug Ewell (doug@ewellic.org)
Date: Fri Jun 04 2010 - 11:00:47 CDT

    Mark Davis ☕ <mark at macchiato dot com> replied to Otto Stolz <Otto
    dot Stolz at uni dash konstanz dot de>:

    >> The problem with this encoding is that the trailing bytes
    >> are not clearly marked: they may start with any of
    >> '0', '10', or '110'; only '111' would mark a byte
    >> unambiguously as a trailing one.
    >>
    >> In contrast, in UTF-8 every single byte carries a marker
    >> that unambiguously identifies it as a single ASCII byte,
    >> a starting byte, or a continuation byte; hence you do not have
    >> to go back to the beginning of the whole data stream to
    >> recognize, and decode, a group of bytes.
    >
    > In a compression format, that doesn't matter; you can't expect random
    > access, nor many of the other features of UTF-8.

    That said, if Kannan were to go with the alternative format suggested on
    this list:

    0xxxxxxx
    1xxxxxxx 0yyyyyyy
    1xxxxxxx 1yyyyyyy 0zzzzzzz

    then he would at least have this one feature of UTF-8 (character
    boundaries can be found without going back to the beginning of the
    data stream), at no additional cost in bits compared to the format
    he is using today.
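
    To make the boundary-finding point concrete, here is a minimal Python
    sketch of the three-line layout above. It is only an illustration, not
    Kannan's actual format: the function names are hypothetical, and I am
    assuming that x carries the most significant bits and z the least,
    which the layout itself does not say.

```python
# Sketch of the 7-bits-per-byte layout: the high bit is 1 on every byte
# except the last byte of a character.
# Assumption: x holds the most significant bits, z the least significant.

def encode(cp: int) -> bytes:
    """Encode one code point (0..0x1FFFFF) in 1 to 3 bytes."""
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("code point does not fit in 21 bits")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x4000:
        return bytes([0x80 | (cp >> 7), cp & 0x7F])
    return bytes([0x80 | (cp >> 14), 0x80 | ((cp >> 7) & 0x7F), cp & 0x7F])


def char_start(data: bytes, i: int) -> int:
    """Return the index of the first byte of the character containing data[i]."""
    # A byte begins a character when it sits at offset 0 or when the byte
    # before it has its high bit clear (that byte ended the previous character).
    while i > 0 and data[i - 1] & 0x80:
        i -= 1
    return i


def decode_at(data: bytes, start: int) -> tuple:
    """Decode the character beginning at start; return (code point, next index)."""
    cp = 0
    i = start
    while True:
        b = data[i]
        cp = (cp << 7) | (b & 0x7F)
        i += 1
        if not b & 0x80:  # high bit clear: this was the last byte
            return cp, i


# Four characters of 1, 2, 3 and 3 bytes, nine bytes in total.
data = b"".join(encode(cp) for cp in (0x41, 0xE9, 0x4E2D, 0x1F600))

# Landing in the middle of U+4E2D (offset 5) and backing up to its start:
start = char_start(data, 5)        # 3
decoded = decode_at(data, start)   # (0x4E2D, 6)
```

    The backward scan in char_start never looks at more than a couple of
    bytes, since a character is at most three bytes long. This sketch does
    no validation of truncated or overlong input.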

    Of course, he will not have other UTF-8-like features, such as avoidance
    of ASCII values in the final trail byte, and "fast forward parsing" by
    looking at the first byte. He may not care. One thing I've noted about
    descriptions of UTF-8, in the context of alternative formats for private
    protocols, is that they always assume these features are important to
    everyone, when they may not be.
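
    For anyone who wants to see the two missing features side by side,
    here is a rough Python sketch. As before, encode() is a hypothetical
    helper that assumes x carries the most significant bits; utf8_length()
    is likewise just an illustrative name for the usual lead-byte test.

```python
# Assumption: x holds the most significant bits, z the least significant.
def encode(cp: int) -> bytes:
    """Encode one code point (0..0x1FFFFF) in the 7-bits-per-byte layout."""
    if not 0 <= cp <= 0x1FFFFF:
        raise ValueError("code point does not fit in 21 bits")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x4000:
        return bytes([0x80 | (cp >> 7), cp & 0x7F])
    return bytes([0x80 | (cp >> 14), 0x80 | ((cp >> 7) & 0x7F), cp & 0x7F])


def utf8_length(lead: int) -> int:
    """Sequence length implied by a UTF-8 lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a UTF-8 lead byte")


# 1. The final trail byte lands in the ASCII range.
e_acute = encode(0xE9)                          # bytes 0x81 0x69
ascii_hit = b"i" in e_acute                     # True: 0x69 is ASCII 'i'
utf8_hit = b"i" in "\u00e9".encode("utf-8")     # False: UTF-8 gives 0xC3 0xA9

# 2. The first byte does not reveal the length of the sequence.
two = encode(0xE9)       # 2 bytes, first byte 0x81
three = encode(0x4E2D)   # 3 bytes, first byte 0x81 as well
same_lead = two[0] == three[0]

# UTF-8, by contrast, gives the length from the lead byte alone.
utf8_two = utf8_length("\u00e9".encode("utf-8")[0])     # 2
utf8_three = utf8_length("\u4e2d".encode("utf-8")[0])   # 3
```

    So a naive byte search for ASCII "i" would match inside the encoding
    of U+00E9, and a parser has to read each byte of a character to learn
    where it ends. Whether either matters depends on the application.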

    --
    Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
    RFC 5645, 4645, UTN #14 | ietf-languages: is dot gd slash 2kf0s
    

