RE: Least used parts of BMP.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jun 04 2010 - 13:59:38 CDT


    > Message of 04/06/10 18:30
    > From: "Doug Ewell" <doug@ewellic.org>
    > To: "Mark Davis ☕" <mark@macchiato.com>
    > Cc: unicode@unicode.org, "Otto Stolz" <Otto.Stolz@uni-konstanz.de>
    > Subject: RE: Least used parts of BMP.
    >
    >
    > Mark Davis ☕ <mark at macchiato dot com> replied to Otto Stolz <Otto
    > dot Stolz at uni dash konstanz dot de>:
    >
    > >> The problem with this encoding is that the trailing bytes
    > >> are not clearly marked: they may start with any of
    > >> '0', '10', or '110'; only '111' would mark a byte
    > >> unambiguously as a trailing one.
    > >>
    > >> In contrast, in UTF-8 every single byte carries a marker
    > >> that unambiguously marks it as either a single ASCII byte,
    > >> a starting, or a continuation byte; hence you do not have
    > >> to go back to the beginning of the whole data stream to
    > >> recognize, and decode, a group of bytes.
    > >
    > > In a compression format, that doesn't matter; you can't expect random
    > > access, nor many of the other features of UTF-8.
    >
    > That said, if Kannan were to go with the alternative format suggested on
    > this list:
    >
    > 0xxxxxxx
    > 1xxxxxxx 0yyyyyyy
    > 1xxxxxxx 1yyyyyyy 0zzzzzzz
    >
    > then he would at least have this one feature of UTF-8, at no additional
    > cost in bits compared to the format he is using today.
    >
    > Of course, he will not have other UTF-8-like features, such as avoidance
    > of ASCII values in the final trail byte, and "fast forward parsing" by
    > looking at the first byte.

    The fast-forward feature is certainly not decisive, but random
    accessibility (from any position and in either direction) certainly
    is, and it is a real advantage of UTF-8 over the format proposed
    above, which can only be parsed in the forward direction, even if it
    can be accessed randomly to find the *next* character. To find the
    *previous* one, you have to scan backward until you consume at least
    one byte used to encode the character before it (otherwise you don't
    know whether a 1xxxxxxx byte is the first one in a sequence, even
    though you can tell whether a byte is the last one).
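
    To make that concrete, here is a minimal Python sketch (the
    prev_boundary helper is hypothetical, not from any message in this
    thread) of backward scanning in the format quoted above: the loop
    has to peek at a byte belonging to the *preceding* character before
    it can tell where the current one starts.

      def prev_boundary(data: bytes, i: int) -> int:
          """Start index of the character whose encoding ends just
          before position i, in the '1xxxxxxx ... 0yyyyyyy' format
          quoted above (only the final byte has its high bit clear)."""
          j = i - 1                    # data[j] is the final 0-prefixed byte
          # Walk back over 1-prefixed lead bytes; we only know where the
          # sequence starts once we have looked at a byte of the previous
          # character (its final 0-prefixed byte) or hit the stream start.
          while j > 0 and data[j - 1] & 0x80:
              j -= 1
          return j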

    > He may not care. One thing I've noted about
    > descriptions of UTF-8, in the context of alternative formats for private
    > protocols, is that they always assume these features are important to
    > everyone, when they may not be.

    One decisive factor that has favored UTF-8 is that it is fully
    compatible with ASCII: all ASCII byte values are used exclusively as
    single bytes encoding ASCII characters, never as trail bytes. This
    is what makes UTF-8 compatible with MIME and lets it work exactly
    like the ISO 8859-* series (including when text was converted
    bijectively between ISO 8859-1 and extended EBCDIC, simply ignoring
    which exact ISO 8859 code page was used). One consequence is that
    characters are preserved even when line wraps need to be changed:
    the C0 controls and SPACE are unaffected, and the characters that
    are essential for lots of protocols remain intact, including digits,
    some punctuation, and the lettercase mappings in the ASCII subspace.
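
    As a quick illustration (a short Python check, not from the original
    message): every byte of a multi-byte UTF-8 sequence has its high bit
    set, so no ASCII value can ever appear inside one.

      # Exhaustive check: no multi-byte UTF-8 sequence contains a byte < 0x80.
      for cp in range(0x80, 0x110000):
          if 0xD800 <= cp <= 0xDFFF:       # surrogates are not encodable
              continue
          assert all(b >= 0x80 for b in chr(cp).encode("utf-8"))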

    Another working encoding that would preserve MIME compatibility,
    ASCII, and bidirectional random access would be the following (a
    coding sketch appears after the list):

      - 0zzzzzzz :
        encodes all 2^7 code points,
        from U+0000 to U+007F (with subtracted offset=0)

      - 11yyyyyy 10zzzzzz :
        encodes all 2^12 code points,
        from U+0080 to U+107F (with subtracted offset=0x0080)

      - 11xxxxxx 11yyyyyy 10zzzzzz :
        encodes all 2^18 code points,
        from U+1080 to U+04107F (with subtracted offset=0x1080)

      - 11****vv 11xxxxxx 11yyyyyy 10zzzzzz (where * is an unused bit) :
        encodes the remaining code points, about 76% of the whole
        Unicode code space, using part of the 2^20 theoretically
        addressable values, from U+041080 to U+10FFFF (with subtracted
        offset=0x41080)
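
    A minimal Python sketch of this layout (the function names and the
    exact bit packing into lead bytes are my assumptions; the offsets
    are the ones listed above):

      def encode_char(cp: int) -> bytes:
          """Lead/intermediate bytes are 11xxxxxx, the final byte is
          10zzzzzz, and ASCII stays as a single 0zzzzzzz byte."""
          if cp < 0x80:
              return bytes([cp])
          for nbits, offset in ((12, 0x80), (18, 0x1080), (24, 0x41080)):
              if 0 <= cp - offset < (1 << nbits):
                  v = cp - offset
                  parts = [0x80 | (v & 0x3F)]           # final 10zzzzzz byte
                  for shift in range(6, nbits, 6):
                      parts.append(0xC0 | ((v >> shift) & 0x3F))  # 11xxxxxx
                  return bytes(reversed(parts))
          raise ValueError("code point out of range")

      def decode(data: bytes) -> list:
          out, i = [], 0
          offsets = {1: 0x80, 2: 0x1080, 3: 0x41080}
          while i < len(data):
              if data[i] < 0x80:                 # single ASCII byte
                  out.append(data[i]); i += 1; continue
              v, n = 0, 0
              while data[i] >= 0xC0:             # 11xxxxxx lead bytes
                  v = (v << 6) | (data[i] & 0x3F); i += 1; n += 1
              v = (v << 6) | (data[i] & 0x3F); i += 1   # final 10zzzzzz
              out.append(v + offsets[n])
          return out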

    It would be a little more compact than UTF-8 over a larger subset of
    Unicode code points, and it could be compacted even further than
    UTF-8, either so that each sequence length does not overlap the next
    code space (as shown above), or by also compacting away the
    surrogate code point space. You could also decide to drop
    compatibility with the binary ordering of code points (for example
    by reordering blocks so that the most frequent characters fall into
    the shorter sequences, as discussed below).

    But all these "improvements" will make very little difference
    (compared to existing UTF-8) in terms of compression for the Latin,
    Greek, Cyrillic, Semitic and Indic scripts, and not even for
    ideographs, most of which won't fit in the shorter sequences.

    You could also decide to "compress" the encoding space used by
    Korean, by transforming the large block of precomposed syllables
    into their canonically equivalent jamos (possibly with an internal
    separator/disambiguator used only in the encoded form and not mapped
    to any Unicode character/code point, where needed to unify the
    encoding of leading and trailing consonants), so that Korean would
    be encoded as a simple alphabet in a very small block; you could
    place this small block within the unused code space of surrogates.
    You could also reorder the various blocks (notably hiragana,
    katakana, bopomofo, and the South-East Asian scripts) so that they
    fall into the shorter sequences, and encode the less used characters
    (geometric symbols, block/line drawing, maths symbols, dingbats) at
    higher positions.
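
    For reference, the syllable-to-jamo mapping this relies on is purely
    arithmetic; this Python sketch follows the standard Hangul
    decomposition from the Unicode core specification.

      # Every precomposed syllable S in U+AC00..U+D7A3 satisfies
      # S = 0xAC00 + (L*21 + V)*28 + T.
      def decompose_hangul(cp: int) -> list:
          s = cp - 0xAC00
          if not 0 <= s < 11172:
              raise ValueError("not a precomposed Hangul syllable")
          l, v, t = s // 588, (s % 588) // 28, s % 28   # 588 = 21 * 28
          jamos = [0x1100 + l, 0x1161 + v]              # leading + vowel
          if t:
              jamos.append(0x11A7 + t)                  # optional trailing
          return jamos

      print([hex(j) for j in decompose_hangul(0xD55C)])  # U+D55C HAN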

    You can invent many variants like this, according to your language
    needs, and then you'll have reinvented the various national Asian
    charsets that are compatible with MIME and can be made conforming to
    Unicode (like GB18030, used in P.R. China)...

    So do you really want to do that? Maybe encoding Unicode with
    GB18030 (or the newest versions of the HKSCS, KSC and JIS character
    encoding standards) is your immediate solution, and there's nothing
    to redevelop now, as it is already implemented and widely available
    as an alternative to UTF-8...

    But if your need is to support some non-major scripts (like
    Georgian), consider the fact that supporting these encodings will
    cost you more than simply using UTF-8, which is now supported
    everywhere. Today, storage and even transmission are no longer a
    problem: generic compression algorithms already work very well when
    applied on top of an internal UTF-8 encoding for storage and
    networking, with UTF-16 or UTF-32 in memory for local processing of
    texts up to several megabytes. That extra storage cost for these
    scripts is much less than the cost of adapting and maintaining
    systems that support specialized encodings (even if they are made
    compatible with and conforming to Unicode, so that they can
    represent all valid code points and preserve, at least, all the
    canonical equivalences).
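
    As a rough illustration of that last point (a hypothetical Georgian
    sample, with Python's zlib standing in for any generic compressor):

      import zlib

      # Georgian letters take 3 bytes each in UTF-8, but a generic
      # compressor recovers most of that overhead on realistic text.
      text = "გამარჯობა, მსოფლიო! " * 50       # hypothetical sample
      raw = text.encode("utf-8")
      print(len(text), "chars,", len(raw), "UTF-8 bytes,",
            len(zlib.compress(raw)), "compressed bytes")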


