re: Least used parts of BMP.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jun 02 2010 - 08:49:35 CDT


    > Resending (from Gmail), because the Unicode list rejected the SMTP server of my mail provider (Spamcop is defective).

    Nothing forbids you from creating new serializations of Unicode;
    you may even design one to be a conforming process (meaning that
    it preserves *all* valid Unicode code points, without affecting
    their relative order, except possibly their normalization form,
    as long as you preserve the canonical equivalences).
    If creating a Unicode conforming process is a goal, this means
    that your serialization will also have to preserve all PUAs and
    all valid code points of the BMP, with the only exception of
    isolated/unpaired surrogates.

    Once this condition is met, nothing forbids you from reusing some
    of the binary encoding space in your serialization scheme,
    including the space used by PUAs and surrogates (after all, the
    UTF-16 encoding scheme already does that).
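
    (As an aside, a minimal Python sketch of that reuse: every
    supplementary code point is mapped onto two code units taken from
    the surrogate block. This is the standard UTF-16 algorithm, shown
    only for illustration.)

        # Split a supplementary code point into a UTF-16 surrogate pair.
        def utf16_pair(cp: int) -> tuple[int, int]:
            assert 0x10000 <= cp <= 0x10FFFF
            cp -= 0x10000
            return 0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)

        print([hex(u) for u in utf16_pair(0x1F600)])  # ['0xd83d', '0xde00']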

    According to your description, your scheme will be using 8-bit code
    units, so you don't have to worry about the byte
    order when serializing code units into bytes.

    But one goal will probably not be met by your encoding: there is
    no separation between the byte values and positions that delimit
    each encoded sequence. This has a consequence: your encoding will
    not be processable from an arbitrary position, nor backward, if
    you cannot determine where the sequence that encodes a single
    code point starts.

    This means that fast search algorithms (Boyer-Moore and similar)
    won't be usable for fast plain-text searches, and the whole
    encoded text stream will be processable only from its start and
    only in the forward direction. (There is a similar problem with
    compressed streams that store compressed text as a single stream
    instead of a sequence of blocks with resynchronization points.)

    If you intend to use such an encoding, it may be advisable for it
    to contain an easily recognizable sequence, unique in all
    directions and recognizable from any starting point in the
    stream, that serves as a synchronization point and that itself
    represents *no* code point at all.

    With such a resynchronization point (which may be freely inserted
    between encoded blocks of valid code points), a fast search
    algorithm will still be usable. Additionally, the
    resynchronization sequence may also be used as padding to fill
    the last bytes of data blocks of some fixed size (e.g. blocks of
    4KB, meaning that searches can be initiated at least from any
    position that is a multiple of 4KB).
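
    A minimal sketch of that idea in Python, assuming a hypothetical
    scheme that reserves the byte 0xFF as its resynchronization and
    padding marker (the marker value, the 4KB block size and the
    function names are illustrative, not part of any standard):

        RESYNC = 0xFF  # hypothetical marker: never emitted by the payload
        BLOCK  = 4096  # fixed block size, as in the 4KB example above

        def pad_block(encoded: bytes) -> bytes:
            # Pad one encoded block up to BLOCK bytes with marker bytes.
            assert len(encoded) <= BLOCK
            return encoded + bytes([RESYNC]) * (BLOCK - len(encoded))

        def next_block_start(stream: bytes, pos: int) -> int:
            # From an arbitrary offset, find the start of the next
            # encoded block: scan to the next marker run, then skip it.
            while pos < len(stream) and stream[pos] != RESYNC:
                pos += 1
            while pos < len(stream) and stream[pos] == RESYNC:
                pos += 1
            return pos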

    Currently, UTF-8, UTF-16 and UTF-32 do not contain any
    resynchronization sequence, but the separation of code unit
    values still allows a stream to be scanned from any starting
    position, plus or minus a very small number of code units around
    that position (not exceeding a handful of bytes), to find the
    start of an encoded code point: this means that fast-search
    algorithms can work easily with them, with a very small impact
    on performance.
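
    With UTF-8, for example, the backward scan is trivial because all
    trailing bytes match the bit pattern 10xxxxxx (a rough sketch):

        def utf8_sequence_start(buf: bytes, pos: int) -> int:
            # Back up over continuation bytes (10xxxxxx) to the lead
            # byte; at most 3 steps are ever needed in valid UTF-8.
            while pos > 0 and (buf[pos] & 0xC0) == 0x80:
                pos -= 1
            return pos

        text = "déjà vu \U0001F388".encode("utf-8")
        start = utf8_sequence_start(text, len(text) - 2)  # mid-sequence
        print(text[start:].decode("utf-8"))  # the 4-byte balloon alone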

    If performance is really a target, you also have to consider the
    volume of text to process: if lots of text needs to be processed,
    it is certainly not the encoding itself that will impact
    performance but the overall size of the texts, and a good
    compression scheme will minimize the number of I/O operations
    (or, equivalently, memory page swaps, or misses in fast memory
    caches or disk caches).

    When the texts to process are in fact very small, and just a
    small part of larger data sets, there is absolutely no reason to
    use anything other than the very simple UTFs (UTF-8, UTF-16 or
    UTF-32), even if these schemes are not very space efficient: it
    is not the texts alone that will be compressed, but the whole
    dataset containing them, in order to gain more performance. For
    example, XML documents still use UTF-8 by default, and it is
    still very efficient, even for large datasets, because these
    documents are compressed as a whole for storage or transmission,
    and possibly split into blocks of fixed sizes (in XML it is also
    possible to include padding to make sure that all data blocks
    keep a constant size, but generic compression schemes also often
    contain their own resynchronization bytes for this purpose).

    In other words: there is generally no need at all to use anything
    other than the basic standardized UTFs. Trying to work at the
    codepoint-only level, instead of at the level of larger texts or
    datasets, is not the best approach to improving performance, when
    you can gain a lot more by working on the larger datasets
    containing these texts. And any new encoding scheme will instead
    defeat the many other optimizations that have been built into
    interoperable libraries.

    Unicode does not standardize more than the 3 encoding forms (and
    their byte-order variants, as encoding schemes, for those with
    code units larger than bytes). I think that is quite enough. But
    nothing forbids you from creating others, and Unicode has
    included some informative annexes about such schemes for
    applications needing some compression.

    What is true is that UTF-8 was initially designed to allow the
    encoding of code points up to 31 bits; some inefficiencies remain
    because not all of its encoding space is actually used. We could
    certainly have a UTF that uses byte-sized code units, is fully
    compatible with ASCII (1 byte for those characters), and encodes
    all other valid code points into sequences of 2 or 3 bytes, while
    still making it possible to determine which bytes are start bytes
    and which are trailing bytes (for fast-search compatibility).
    Four-byte sequences could then be avoided; they are still needed
    in UTF-8 only because of its inefficient use of the valid
    encoding space, since the valid code points are now restricted
    to the first 17 planes, i.e. a bit more than 20 bits.
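
    As a back-of-the-envelope capacity check of that claim (a sketch
    only; it assumes ASCII bytes stay transparent, so all non-ASCII
    sequence bytes lie in 0x80..0xFF, and it allows lead and trail
    byte values to overlap; with fully disjoint lead and trail sets,
    3-byte sequences could cover only about 310,000 code points):

        needed     = 0x110000 - 0x800 - 128  # non-ASCII scalars: 1,111,936
        high_bytes = 128                     # 0x80..0xFF
        for l3 in range(high_bytes + 1):     # leads for 3-byte sequences
            l2 = high_bytes - l3             # leads for 2-byte sequences
            if l2 * high_bytes + l3 * high_bytes**2 >= needed:
                print(l3, "three-byte leads suffice;",
                      l2 * high_bytes, "code points fit in 2 bytes")
                break

    This prints "68 three-byte leads suffice; 7680 code points fit in
    2 bytes", so such a layout is at least arithmetically possible.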

    Note that you said you'll need to encode 20 bits: this is wrong
    if you need full Unicode process conformance (the 17 planes,
    minus the surrogate space, still amount to more than 20 bits).
    If such conformance is not met, your scheme will not be usable
    with the many libraries and applications that are designed to
    work only with Unicode-conforming processes (and that also assume
    that all PUAs, in the BMP or in the last 2 planes, can
    effectively be preserved).
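
    The arithmetic, spelled out as a quick Python check (the second
    half restates the 19-bit shortfall from the quoted proposal):

        scalars = 17 * 0x10000 - 0x800  # 17 planes minus surrogates
        print(scalars)                  # 1,112,064
        print(scalars > 2**20)          # True: 2**20 is only 1,048,576

        print(2**11 * 2**8)             # 524,288 three-byte slots (11+8 bits)
        print(16 * 0x10000)             # 1,048,576 non-BMP code points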

    Philippe.

    > Message of 02/06/10 07:47
    > From: "Kannan Goundan"
    > To: unicode@unicode.org
    > Cc:
    > Subject: Least used parts of BMP.
    >
    >
    > I'm trying to come up with a compact encoding for Unicode strings for
    > data serialization purposes. The goals are fast read/write and small
    > size.
    >
    > The plan:
    > 1. BMP code points are encoded as two bytes (0x0000-0xFFFF, minus surrogates).
    > 2. Non-BMP code points are encoded as three bytes
    > - The first two bytes are code points from the BMP's UTF-16 surrogate
    > range (11 bits of data)
    > - The next byte provides an additional 8 bits of data.
    >
    > Unfortunately, this doesn't quite work because it only gives me 19
    > bits to encode non-BMP code points, but I need 20 bits. To solve this
    > problem, I'm planning on stealing a bit of code space from the BMP's
    > private-use area. If I did, then:
    > - I'd get the bits needed to encode the non-BMP code points in 3 bytes.
    > - The stolen code points of the private-use area would now have to be
    > encoded using 3 bytes.
    >
    > I chose the private-use area because I assumed it would be the least
    > commonly used, so having these code points require 3 bytes instead of
    > 2 bytes wasn't that big a deal. Does this sound reasonable? Do
    > people suggest a different section of the BMP to steal from, or a
    > different encoding altogether?
    >
    > Thanks for reading.
    > -- Kannan
    >
    > P.S. I actually have two encodings. One is similar to UTF-8 in that
    > it's ASCII-biased. The encoding described above is intended for
    > non-ASCII-biased data. The programmer selects which encoding to use
    > based on what he thinks the data looks like.
    >
    >


