explicit 20 bit Unicode range limit (was: UTF-20 etc.)

From: schererm@us.ibm.com
Date: Tue Jan 26 1999 - 13:56:58 EST


Otto Stolz:
"Hence, this proposal should be termed UCS-2.5 rather than UTF-20."

Paul Keinanen:
"Oh no, yet an other Unicode transfer format"
"I do not think storage economy is of main importance for storing plain
text today."

Edwin Hart:
"it appears that you would like to see this proposal supercede and make
other UTFs obsolete."
"I would like to see this proposal restructured into an outline something
like the following."

Well taken. I have to try to make myself clear.
First of all, I am not trying to revolutionize Unicode/10646 - it is too
good to be tampered with.

All I am trying to do is to think about what happens in Unicode
implementations once characters with code points/scalar values above U+ffff
are actually used, which I expect to happen more and more relatively soon.
There are private use planes, and the Internet Mail Consortium is pushing
the not-yet-standard plane 14 tag characters (see TR 7 and www.imc.org).

I recognize UTF-8 and UTF-16 as excellent formats that should be promoted,
not replaced.
Please note that UTF-8 is limited in practice: it is used only for what
UTF-16 or even UCS-2 can cover, which reduces its maximum sequence length
from 6 octets to 4 or 3.
Please note, too, that SCSU (TR 6) effectively establishes another UTF.
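
To make the octet limits above concrete, here is a minimal sketch in Python
(chosen only for brevity). It assumes the original UTF-8 definition with
sequences of up to 6 octets covering the 31-bit UCS-4 range; the function
name utf8_length is made up for this illustration.

```python
def utf8_length(cp: int) -> int:
    # Number of octets needed for code point cp in the original
    # UTF-8 scheme, which allows sequences of 1 to 6 octets.
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("outside the 31-bit UCS-4 range")
    # Exclusive upper bounds for 1-, 2-, ..., 6-octet sequences.
    limits = (0x80, 0x800, 0x10000, 0x200000, 0x4000000, 0x80000000)
    for length, limit in enumerate(limits, start=1):
        if cp < limit:
            return length


print(utf8_length(0xFFFF))      # highest UCS-2 value
print(utf8_length(0x10FFFF))    # highest value reachable with UTF-16
print(utf8_length(0x7FFFFFFF))  # highest UCS-4 value
```

The three calls correspond to the 3-, 4- and 6-octet maxima mentioned above.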

The "UCS-2.5" that I was proposing is only one possible outcome of what I
would really like to see.
I think I should just get back to that point:

* Title:
Recommended range for Unicode abstract character values/UCS code points

* Requirement Statement:
It is desirable to have fixed-length formats to represent characters and
character strings. The format unit length should be sufficient but minimal.
"Minimal" depends on how many character code points are supposed to be
used. Therefore, their number should not exceed a convenient range.

* Problem Statement:
Many implementations assume that Unicode/ISO-10646 is a pure 16-bit
encoding. This assumption causes inconveniences when non-BMP characters are
used, which is expected to happen more frequently in the future.

For example, Java .properties and .java files that use a traditional source
encoding make use of an escape sequence \uxxxx to represent Unicode
characters, using 4 hexadecimal digits. Non-BMP characters have to be
written as surrogate pairs, although the source encoding is not UTF-16.
(U-000e 0061 -> \udb40\udc61)
Anyone who writes a .properties or .java file with a non-BMP character has
to know or be able to calculate the UTF-16 form.
It seems desirable to have a new escape sequence format that allows any
character to be written as a scalar value. If this is done, the format may
take advantage of the recommended range of characters to limit the fixed
length to fewer than the 8 digits that would cover the full UCS-4 range
(e.g. U-000e 0061 -> \q0e0061 or \qe0061 etc.).
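
The calculation that a .properties author has to do by hand today can be
sketched as follows, in Python for brevity. The function names are made up
for this illustration, and the \q format is only the hypothetical 5-digit
escape suggested here, not anything Java supports.

```python
def java_escape(cp: int) -> str:
    # Existing Java-style escape: BMP characters use one 4-digit
    # escape, everything above U+ffff becomes a surrogate pair.
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not representable in UTF-16")
    if cp < 0x10000:
        return f"\\u{cp:04x}"
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # upper 10 bits
    low = 0xDC00 + (v & 0x3FF)   # lower 10 bits
    return f"\\u{high:04x}\\u{low:04x}"


def q_escape(cp: int) -> str:
    # Hypothetical fixed-length escape: 5 hex digits, which is
    # enough only if scalar values are limited to 20 bits.
    if not 0 <= cp <= 0xFFFFF:
        raise ValueError("does not fit into 20 bits")
    return f"\\q{cp:05x}"


print(java_escape(0xE0061))  # \udb40\udc61
print(q_escape(0xE0061))     # \qe0061
```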

Currently, the recommended range is implicitly defined as what can be
encoded with UTF-16/surrogate pairs and, for UCS, as what is defined as
private use planes and what is under consideration for future assignments.
This range is U-0000 0000 to U-0010 ffff, which makes 21 bits or 6
hexadecimal digits necessary. The range is purposefully excessive.

* Justification:
The possible gains from explicitly limiting the character value range to
somewhat less than the current implicit limit may be small.
However, the expected cost of doing so seems even smaller, so the
convenience gained should outweigh it.

Gains are in storage units, lengths of hexadecimal representations,
convenience/ease of use, and aesthetics ("cosmetic gains", to use a
condemning expression).
Costs are in adding or changing a couple of paragraphs in forthcoming
editions of the Unicode Standard and of ISO-10646.

* Suggested Solution:
Explicitly limit the recommended range for Unicode abstract character
scalar values to fit into 20 bits, i.e., to the values 0000 0000 to 000f
ffff. For ISO-10646, this is a recommended limitation to the planes 0 to
15.
This is a reduction of the current implicit range by 1/17th.
It also means that the current assignment of plane 16 as a private use
plane should be changed to "reserved", and/or that the use of any plane
above 15 should generally be discouraged - there are more private use
planes and groups above plane 16.
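
The numbers in this section can be checked with a few lines of Python. This
is only a sketch; the constant and function names are made up for it.

```python
from fractions import Fraction

CURRENT_MAX = 0x10FFFF   # implicit limit: what UTF-16 can reach
PROPOSED_MAX = 0xFFFFF   # suggested explicit limit: 20 bits


def plane(cp: int) -> int:
    # A plane is a block of 0x10000 code points.
    return cp >> 16


current_planes = plane(CURRENT_MAX) + 1    # planes 0 to 16
proposed_planes = plane(PROPOSED_MAX) + 1  # planes 0 to 15
reduction = Fraction(current_planes - proposed_planes, current_planes)

print(CURRENT_MAX.bit_length(), PROPOSED_MAX.bit_length())  # 21 20
print(current_planes, proposed_planes)                      # 17 16
print(reduction)                                            # 1/17
```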

* Possible Applications:

- Suggested Java ASCII escape sequence for characters, using a fixed-length
5-hex-digit format: \qxxxxx for U-000x xxxx.
- Similar syntax for C/C++ etc.
- More "natural" bit field definitions of 20 instead of 21 bits.
- Possible specification of a "UCS-2.5" as a minimal indexable format
encoding scalar values.
- Easier range check for "acceptable" characters encoded in UTF-8:
"if(octet[0]<0xf4)" instead of "if(octet[0]<0xf4 || (octet[0]==0xf4 &&
octet[1]<0x90))"
- HTML and SGML (XML?) already use variable-length sequences "&#xuuuu;" for
numeric character references. I hope that the layout engines handle more
than 4 digits. With this proposal, such sequences would never need 6 digits.
- More "natural" range of values, 2^n
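
The two UTF-8 range checks quoted in the list above can be compared in a
small Python sketch. It assumes that the octets are the start of an
otherwise well-formed UTF-8 sequence; the function names are made up for
this illustration.

```python
def within_current_range(octets: bytes) -> bool:
    # Current implicit limit U-0010 ffff: a lead octet of 0xf4
    # needs a look at the second octet as well.
    return octets[0] < 0xF4 or (octets[0] == 0xF4 and octets[1] < 0x90)


def within_20_bit_range(octets: bytes) -> bool:
    # Suggested limit U-000f ffff encodes as f3 bf bf bf, so the
    # lead octet alone decides.
    return octets[0] < 0xF4


top_of_plane_15 = chr(0xFFFFF).encode("utf-8")    # f3 bf bf bf
top_of_plane_16 = chr(0x10FFFF).encode("utf-8")   # f4 8f bf bf

print(within_current_range(top_of_plane_15), within_20_bit_range(top_of_plane_15))
print(within_current_range(top_of_plane_16), within_20_bit_range(top_of_plane_16))
```

Only plane 16 characters are treated differently by the two checks.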

Sincerely,

markus

PS: This contribution does not necessarily reflect corporate IBM opinions.

Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
schererm@us.ibm.com


