RE: explicit 20 bit Unicode range limit (was: UTF-20 etc.)

From: Hohberger, Clive P. (CPHOHBER@zebra.com)
Date: Tue Jan 26 1999 - 17:21:26 EST


UTF-16 is quite adequate for practical input/output use.
But a transformation of UTF-16 can be far more space efficient
for actually storing data characters from across 17 code planes.

For example, the new Ultracode barcode symbology is
Unicode-based. Internally, however, it uses Mod-47 encoding, because
the symbology has only 47 codewords.

Here we use UTF-16 as the default input, and convert the
21-bit internal address (which covers all 17 Planes, 0-16) into a series
of pointers for a Mod-47 encoding engine:
        p (plane)
        h (Codebook, consisting of 4 preselected Rows in plane p)
        i (one of 4 pre-selected Rows in Codebook h)
        j (one of 8 32-character strips in a Row @i)
        k (one of 32 characters in strip j)

So, even though we default to UTF-16 for barcode input and output,
internally we only specify the pointers that change from character
to succeeding character. This means we have a range of 17 Planes
but a minimal specification of each selected character in the barcode.
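
A rough sketch of the addressing idea above, in Python. This is not the
Ultracode algorithm itself: the codebook pointers h and i depend on the
preselected Rows, which are not given in this message, so the sketch only
models plane, row, strip and character, and assumes the usual 256-character
row viewed as 8 strips of 32. The names split_code_point and
changed_pointers, and the all-zero starting state, are made up for
illustration.

```python
def split_code_point(cp):
    """Split a code point into (plane, row, strip, char) indices.

    Assumption for illustration: a row holds 256 characters, viewed as
    8 strips of 32. The codebook pointers h and i are not modelled.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point outside Planes 0-16")
    plane = cp >> 16          # p: 0..16
    row = (cp >> 8) & 0xFF    # row within the plane: 0..255
    strip = (cp >> 5) & 0x07  # j: 0..7
    char = cp & 0x1F          # k: 0..31
    return plane, row, strip, char


def changed_pointers(text):
    """For each character, keep only the pointers that differ from the previous one."""
    names = ("plane", "row", "strip", "char")
    previous = (0, 0, 0, 0)   # hypothetical starting state
    result = []
    for ch in text:
        current = split_code_point(ord(ch))
        result.append({n: c for n, p, c in zip(names, previous, current) if p != c})
        previous = current
    return result


print(changed_pointers("abc"))
# [{'strip': 3, 'char': 1}, {'char': 2}, {'char': 3}]
```

Within a run of text from one script, only the low-order pointers change,
which is where the saving comes from.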

Think of the Ultracode codewords as a special mod-47 file compression.
If minimal file length is your goal, you ought to look at this approach
to address compression of UTF-16 for ideas as to what you could do
using mod 2^n for n = 5, 6, 7 or 8.

Converting mod 47 codewords to "bits", Ultracode encoding averages
5-7 bits for alphabetic languages and 15-17 bits for ideographic languages,
but encodes the BMP plus all of Planes 1-16. It is vastly more file-size
efficient than UTF-16 or UTF-20 and still covers all 17 planes.
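
To make the codeword-to-bits conversion concrete: one mod-47 codeword
carries log2(47) bits of information. The short Python sketch below works
that out. Reading the 5-7 bit figure as roughly one codeword per character
and the 15-17 bit figure as roughly three is only an inference from the
arithmetic, not something taken from the spec; the helper name bits_for is
illustrative.

```python
import math

BITS_PER_CODEWORD = math.log2(47)   # information carried by one mod-47 codeword


def bits_for(codewords):
    """Equivalent number of bits for a given count of mod-47 codewords."""
    return codewords * BITS_PER_CODEWORD


for n in (1, 3):
    print(n, "codeword(s) ~", round(bits_for(n), 2), "bits")
# 1 codeword(s) ~ 5.55 bits
# 3 codeword(s) ~ 16.66 bits
```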

I'll e-mail the latest Ultracode spec version on request. Final specs
will be released about mid year.

Clive Hohberger
Chairman, AIM Technical Symbology Committee

> -----Original Message-----
> From: schererm@us.ibm.com [SMTP:schererm@us.ibm.com]
> Sent: Tuesday, January 26, 1999 1:08 PM
> To: Unicode List
> Subject: explicit 20 bit Unicode range limit (was: UTF-20 etc.)
>
>
>
> Otto Stolz:
> "Hence, this proposal should be termed UCS-2.5 rather than UTF-20."
>
> Paul Keinanen:
> "Oh no, yet an other Unicode transfer format"
> "I do not think storage economy is of main importance for storing plain
> text today."
>
> Edwin Hart:
> "it appears that you would like to see this proposal supercede and make
> other UTFs obsolete."
> "I would like to see this proposal restructured into an outline something
> like the following."
>
>
> Well taken. I have to try to make myself clear.
> First of all, I am not trying to revolutionize Unicode/10646 - it is too
> good to be tampered with.
>
> All I am trying to do is to think about what happens in Unicode
> implementations once characters with code points/scalar values above U+ffff
> are actually used, which I expect to happen more and more relatively soon.
> There are private use planes, and the Internet Mail Consortium is pushing
> the not-yet-standard plane 14 tag characters (see TR 7 and www.imc.org).
>
> I recognize UTF-8 and UTF-16 as excellent formats that should be promoted,
> not replaced.
> Please note that there are practical limitations on the use of UTF-8,
> namely to use only what UTF-16 or even UCS-2 can cover, limiting it from 6
> to 4 or 3 octets maximum.
> Please note, too, that SCSU (TR 6) effectively establishes another UTF.
>
> The "UCS-2.5" that I was proposing is only one possible outcome of what I
> would really like to see.
> I think I should just get back to that point:
>
>
> * Title:
> Recommended range for Unicode abstract character values/UCS code points
>
>
> * Requirement Statement:
> It is desirable to have fixed-length formats to represent characters and
> character strings. The format unit length should be sufficient but
> minimal.
> "Minimal" depends on how many character code points are supposed to be
> used. Therefore, their number should not exceed a convenient range.
>
>
> * Problem Statement:
> Many implementations assume that Unicode/ISO-10646 is a pure 16-bit
> encoding. This assumption results in inconveniences when non-BMP
> characters are used, which is expected to be more frequent in the future.
>
> For example, Java .properties and .java files that use a traditional
> source encoding make use of an escape sequence \uxxxx to represent Unicode
> characters, using 4 hexadecimal digits. Non-BMP characters have to be
> written as surrogate pairs, although the source encoding is not UTF-16.
> (U-000e 0061 -> \udb40\udc61)
> Anyone who writes a .properties or .java file with a non-BMP character has
> to know or be able to calculate the UTF-16 form.
> It seems desirable to have a new escape sequence format that allows
> writing any character as a scalar value. If this is done, the format may
> take advantage of the recommended range of characters to limit the fixed
> length to less than 8, which would cover the full UCS-4 range.
> (e.g. U-000e 0061 -> \q0e0061 or \qe0061 etc.)
>
> Currently, the recommended range is implicitly defined as what can be
> encoded with UTF-16/surrogate pairs and, for UCS, as what is defined as
> private use planes and what is under consideration for future assignments.
> This range is U-0000 0000 to U-0010 ffff, which makes 21 bits or 6
> hexadecimal digits necessary. The range is purposefully excessive.
>
>
> * Justification:
> Possible gains from explicitly limiting the character value range to
> somewhat less than the current implicit limitation may be small.
> However, the expected cost of doing so seems even smaller, suggesting that
> it is outweighed by the convenience of doing so.
>
> Gains are in storage units, lengths of hexadecimal representations,
> convenience/ease of use, and aesthetics ("cosmetic gains" to use a
> condemning expression).
> Costs are in adding or changing a couple of paragraphs in forthcoming
> editions of the Unicode Standard and of ISO-10646.
>
>
> * Suggested Solution:
> Explicitly limit the recommended range for Unicode abstract character
> scalar values to fit into 20 bits, i.e., to the values 0000 0000 to
> 000f ffff. For ISO-10646, this is a recommended limitation to the planes
> 0 to 15.
> This is a reduction of the current implicit range by 1/17th.
> It also means that the current assignment of plane 16 as private use plane
> should be changed to "reserved" and/or that the use of any plane above 15
> should generally be discouraged - there are more private use planes and
> groups above plane 16.
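
The numbers in the quoted proposal are easy to check. The Python sketch
below compares the current implicit limit (U-0010 ffff) with the proposed
one (U-000f ffff); the helper name describe is illustrative only.

```python
from fractions import Fraction

PLANE_SIZE = 0x10000  # code points per plane


def describe(max_cp):
    """Return (planes, bits, hex digits) needed for the range 0..max_cp."""
    planes = (max_cp + 1) // PLANE_SIZE
    bits = max_cp.bit_length()
    hex_digits = len(format(max_cp, "x"))
    return planes, bits, hex_digits


current = describe(0x10FFFF)
proposed = describe(0xFFFFF)
reduction = Fraction(current[0] - proposed[0], current[0])

print(current)    # (17, 21, 6)
print(proposed)   # (16, 20, 5)
print(reduction)  # 1/17
```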
>
>
> * Possible Applications:
>
> - Suggested Java ASCII escape sequence for characters, using a
> fixed-length 5-hex-digit format: \qxxxxx for U-000x xxxx.
> - Similar syntax for C/C++ etc.
> - More "natural" bit field definitions of 20 instead of 21 bits.
> - Possible specification of a "UCS-2.5" as a minimal indexable format
> encoding scalar values.
> - Easier range check for "acceptable" characters encoded in UTF-8:
> "if(octet[0]<0xf4)" instead of "if(octet[0]<0xf4 || (octet[0]==0xf4 &&
> octet[1]<0x90))"
> - HTML and SGML (XML?) already use variable-length sequences "&#xuuuu;"
> for character entities. I hope that the layout engines handle more than 4
> digits. This proposal would prevent 6-digit sequences.
> - More "natural" range of values, 2^n
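
The UTF-8 range check in the list above can be tried out directly. The
Python sketch below transcribes both conditions from the quoted message and
runs them on the boundary values; like the originals, they assume
well-formed UTF-8 and look only at the first one or two octets. The
function names are made up for illustration.

```python
def lead_only_check(octets):
    """Check sufficient under a 20-bit limit: look at the lead octet only."""
    return octets[0] < 0xF4


def two_octet_check(octets):
    """Check needed for the current limit of U-0010 ffff."""
    return octets[0] < 0xF4 or (octets[0] == 0xF4 and octets[1] < 0x90)


for cp in (0xFFFF, 0xFFFFF, 0x100000, 0x10FFFF):
    octets = chr(cp).encode("utf-8")
    print("%06x" % cp, octets.hex(), lead_only_check(octets), two_octet_check(octets))
# 00ffff efbfbf True True
# 0fffff f3bfbfbf True True
# 100000 f4808080 False True
# 10ffff f48fbfbf False True
```

Plane 16 is exactly the set of code points whose lead octet is 0xf4, which
is why dropping it reduces the test to a single comparison.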
>
>
> Sincerely,
>
> markus
>
>
> PS: This contribution does not necessarily reflect corporate IBM opinions.
>
>
> Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
> schererm@us.ibm.com
>


