RE: Least used parts of BMP.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Jun 04 2010 - 13:59:38 CDT


    > Message of 04/06/10 18:30
    > From: "Doug Ewell" <doug@ewellic.org>
    > To: "Mark Davis ☕" <mark@macchiato.com>
    > Cc: unicode@unicode.org, "Otto Stolz" <Otto.Stolz@uni-konstanz.de>
    > Subject: RE: Least used parts of BMP.
    >
    >
    > Mark Davis ☕ <mark at macchiato dot com> replied to Otto Stolz <Otto
    > dot Stolz at uni dash konstanz dot de>:
    >
    > >> The problem with this encoding is that the trailing bytes
    > >> are not clearly marked: they may start with any of
    > >> '0', '10', or '110'; only '111' would mark a byte
    > >> unambiguously as a trailing one.
    > >>
    > >> In contrast, in UTF-8 every single byte carries a marker
    > >> that unambiguously marks it as either a single ASCII byte,
    > >> a starting, or a continuation byte; hence you do not have
    > >> to go back to the beginning of the whole data stream to
    > >> recognize, and decode, a group of bytes.
    > >
    > > In a compression format, that doesn't matter; you can't expect random
    > > access, nor many of the other features of UTF-8.
    >
    > That said, if Kannan were to go with the alternative format suggested on
    > this list:
    >
    > 0xxxxxxx
    > 1xxxxxxx 0yyyyyyy
    > 1xxxxxxx 1yyyyyyy 0zzzzzzz
    >
    > then he would at least have this one feature of UTF-8, at no additional
    > cost in bits compared to the format he is using today.
    >
    > Of course, he will not have other UTF-8-like features, such as avoidance
    > of ASCII values in the final trail byte, and "fast forward parsing" by
    > looking at the first byte.

    The fast-forward feature is certainly not decisive, but random
    accessibility (from any position and in either direction) certainly
    is, and it is a real advantage of UTF-8 over the format proposed
    above, which can only be parsed in the forward direction, even if it
    can be accessed randomly to find the *next* character. To find the
    *previous* one, you have to scan backward until you consume at least
    one byte used to encode the character before it (otherwise you don't
    know whether a 1xxxxxxx byte is the first one in a sequence, even
    though you can tell whether a byte is the last one).
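
    To make that concrete, here is a minimal Python sketch (the
    prev_boundary helper is hypothetical, not from any message in this
    thread) of backward scanning in the format quoted above: the loop
    has to peek at a byte belonging to the *preceding* character before
    it can tell where the current one starts.

      def prev_boundary(data: bytes, i: int) -> int:
          """Start index of the character whose encoding ends just
          before position i, in the '1xxxxxxx ... 0yyyyyyy' format
          quoted above (only the final byte has its high bit clear)."""
          j = i - 1                    # data[j] is the final 0-prefixed byte
          # Walk back over 1-prefixed lead bytes; we only know where the
          # sequence starts once we have looked at a byte of the previous
          # character (its final 0-prefixed byte) or hit the stream start.
          while j > 0 and data[j - 1] & 0x80:
              j -= 1
          return j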

    > He may not care. One thing I've noted about
    > descriptions of UTF-8, in the context of alternative formats for private
    > protocols, is that they always assume these features are important to
    > everyone, when they may not be.

    One decisive factor that has favored UTF-8 is that it is fully
    compatible with ASCII: all ASCII byte values are used exclusively as
    single bytes encoding ASCII characters, never as trail bytes. This
    is what makes UTF-8 compatible with MIME and lets it work exactly
    like the ISO 8859-* series (including when text was converted
    bijectively between ISO 8859-1 and extended EBCDIC, simply ignoring
    which exact ISO 8859 code page was used). One consequence is that
    characters are preserved even when line wraps need to be changed:
    the C0 controls and SPACE are unaffected, and the characters that
    are essential for lots of protocols remain intact, including digits,
    some punctuation, and the lettercase mappings in the ASCII subspace.
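
    As a quick illustration (a short Python check, not from the original
    message): every byte of a multi-byte UTF-8 sequence has its high bit
    set, so no ASCII value can ever appear inside one.

      # Exhaustive check: no multi-byte UTF-8 sequence contains a byte < 0x80.
      for cp in range(0x80, 0x110000):
          if 0xD800 <= cp <= 0xDFFF:       # surrogates are not encodable
              continue
          assert all(b >= 0x80 for b in chr(cp).encode("utf-8"))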

    Another working encoding that would preserve MIME compatibility,
    ASCII, and bidirectional random access would be the following (a
    coding sketch appears after the list):

      - 0zzzzzzz :
        encodes all 2^7 code points,
        from U+0000 to U+007F (with subtracted offset=0)

      - 11yyyyyy 10zzzzzz :
        encodes all 2^12 code points,
        from U+0080 to U+107F (with subtracted offset=0x0080)

      - 11xxxxxx 11yyyyyy 10zzzzzz :
        encodes all 2^18 code points,
        from U+1080 to U+04107F (with subtracted offset=0x1080)

      - 11****vv 11xxxxxx 11yyyyyy 10zzzzzz (where * is an unused bit) :
        encodes the remaining code points, about 76% of the whole
        Unicode code space, using part of the 2^20 theoretically
        addressable values, from U+041080 to U+10FFFF (with subtracted
        offset=0x41080)
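
    A minimal Python sketch of this layout (the function names and the
    exact bit packing into lead bytes are my assumptions; the offsets
    are the ones listed above):

      def encode_char(cp: int) -> bytes:
          """Lead/intermediate bytes are 11xxxxxx, the final byte is
          10zzzzzz, and ASCII stays as a single 0zzzzzzz byte."""
          if cp < 0x80:
              return bytes([cp])
          for nbits, offset in ((12, 0x80), (18, 0x1080), (24, 0x41080)):
              if 0 <= cp - offset < (1 << nbits):
                  v = cp - offset
                  parts = [0x80 | (v & 0x3F)]           # final 10zzzzzz byte
                  for shift in range(6, nbits, 6):
                      parts.append(0xC0 | ((v >> shift) & 0x3F))  # 11xxxxxx
                  return bytes(reversed(parts))
          raise ValueError("code point out of range")

      def decode(data: bytes) -> list:
          out, i = [], 0
          offsets = {1: 0x80, 2: 0x1080, 3: 0x41080}
          while i < len(data):
              if data[i] < 0x80:                 # single ASCII byte
                  out.append(data[i]); i += 1; continue
              v, n = 0, 0
              while data[i] >= 0xC0:             # 11xxxxxx lead bytes
                  v = (v << 6) | (data[i] & 0x3F); i += 1; n += 1
              v = (v << 6) | (data[i] & 0x3F); i += 1   # final 10zzzzzz
              out.append(v + offsets[n])
          return out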

    It would be a little more compact than UTF-8 over a larger subset of
    Unicode code points, and it could be compacted even further than
    UTF-8, either so that each sequence length does not overlap the next
    code space (as shown above), or by also compacting away the
    surrogate code point space. You could also decide to drop
    compatibility with the binary ordering of code points (for example
    by reordering blocks so that the most frequent characters fall into
    the shorter sequences, as discussed below).

    But all these "improvements" will make very little difference
    (compared to existing UTF-8) in terms of compression for the Latin,
    Greek, Cyrillic, Semitic and Indic scripts, and not even for
    ideographs, most of which won't fit in the shorter sequences.

    You could also decide to "compress" the encoding space used by
    Korean, by transforming the large block of precomposed syllables
    into their canonically equivalent jamos (possibly with an internal
    separator/disambiguator used only in the encoded form and not mapped
    to any Unicode character/code point, where needed to unify the
    encoding of leading and trailing consonants), so that Korean would
    be encoded as a simple alphabet in a very small block; you could
    place this small block within the unused code space of surrogates.
    You could also reorder the various blocks (notably hiragana,
    katakana, bopomofo, and the South-East Asian scripts) so that they
    fall into the shorter sequences, and encode the less used characters
    (geometric symbols, block/line drawing, maths symbols, dingbats) at
    higher positions.
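
    For reference, the syllable-to-jamo mapping this relies on is purely
    arithmetic; this Python sketch follows the standard Hangul
    decomposition from the Unicode core specification.

      # Every precomposed syllable S in U+AC00..U+D7A3 satisfies
      # S = 0xAC00 + (L*21 + V)*28 + T.
      def decompose_hangul(cp: int) -> list:
          s = cp - 0xAC00
          if not 0 <= s < 11172:
              raise ValueError("not a precomposed Hangul syllable")
          l, v, t = s // 588, (s % 588) // 28, s % 28   # 588 = 21 * 28
          jamos = [0x1100 + l, 0x1161 + v]              # leading + vowel
          if t:
              jamos.append(0x11A7 + t)                  # optional trailing
          return jamos

      print([hex(j) for j in decompose_hangul(0xD55C)])  # U+D55C HAN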

    You can invent many variants like this, according to your language
    needs, and then you'll have reinvented the various national Asian
    charsets that are compatible with MIME and can be made conforming to
    Unicode (like GB18030, used in P.R. China)...

    So do you really want to do that? Maybe encoding Unicode with
    GB18030 (or the newest versions of the HKSCS, KSC and JIS character
    encoding standards) is your immediate solution, and there's nothing
    to redevelop now, as it is already implemented and widely available
    as an alternative to UTF-8...

    But if your need is to support some non-major scripts (like
    Georgian), consider the fact that supporting these encodings will
    cost you more than simply using UTF-8, which is now supported
    everywhere. Today, storage and even transmission are no longer a
    problem: generic compression algorithms already work very well when
    applied on top of an internal UTF-8 encoding for storage and
    networking, with UTF-16 or UTF-32 in memory for local processing of
    texts up to several megabytes. That extra storage cost for these
    scripts is much less than the cost of adapting and maintaining
    systems that support specialized encodings (even if they are made
    compatible with and conforming to Unicode, so that they can
    represent all valid code points and preserve, at least, all the
    canonical equivalences).
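
    As a rough illustration of that last point (a hypothetical Georgian
    sample, with Python's zlib standing in for any generic compressor):

      import zlib

      # Georgian letters take 3 bytes each in UTF-8, but a generic
      # compressor recovers most of that overhead on realistic text.
      text = "გამარჯობა, მსოფლიო! " * 50       # hypothetical sample
      raw = text.encode("utf-8")
      print(len(text), "chars,", len(raw), "UTF-8 bytes,",
            len(zlib.compress(raw)), "compressed bytes")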


