Re: explicit 20 bit Unicode range limit

From: schererm@us.ibm.com
Date: Wed Jan 27 1999 - 10:21:36 EST


In my last email, I tried to focus the discussion on using 16 planes instead
of 17, for practical reasons.
I realize that something like "UCS-2.5" doesn't have any friends, and I
don't have a "business case" for it myself.

I still think that for hex-digit notations and other implementation details
it would be more convenient and "natural" to not use plane 16 (and up).
That should be possible just with some text changes, without test
implementations and verification thereof.

Of course, this would have to go through the approval processes like every
other proposal.
This looked worthy enough to me to be discussed, and I put it onto this
discussion forum.

Michael Everson:
"One would rather see implementations than more theory."
Me, too. For example, I would like to suggest using escape sequences with
more than 4 digits, and I need to know whether I should suggest 5 or 6
digits.
(I would also like to suggest using ints instead of shorts wherever I
see APIs that take single characters and assume 16 bits are enough.)
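
To illustrate the point about 16-bit character parameters, here is a minimal
Python sketch (not taken from any real API). It uses the standard struct
module to mimic a 16-bit "short" slot and a 32-bit "int" slot; the helper name
fits_in_slot and the sample code point U+1D11E are my own choices for
illustration.

```python
import struct

def fits_in_slot(code_point, fmt):
    """Return True if code_point can be stored in the fixed-width slot fmt.

    fmt is a struct format string: "<H" is an unsigned 16-bit slot,
    "<i" is a signed 32-bit slot. Both are stand-ins for API parameter
    types, chosen here only for illustration.
    """
    try:
        struct.pack(fmt, code_point)
        return True
    except struct.error:
        # struct refuses values that do not fit the requested width.
        return False

bmp_char = 0x0041          # fits in 16 bits
beyond_bmp = 0x1D11E       # hypothetical sample from plane 1

print(fits_in_slot(bmp_char, "<H"))    # 16-bit slot, BMP character
print(fits_in_slot(beyond_bmp, "<H"))  # 16-bit slot, plane 1 code point
print(fits_in_slot(beyond_bmp, "<i"))  # 32-bit slot, plane 1 code point
```

Any code point above U+FFFF fails the 16-bit check, which is why an int-sized
parameter is the safer choice for a single-character API.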

Paul Dempsey:
"What you'd like is for Java escapes to specify the UCS-4 code point, and
generate the appropriate representation in the underlying encoding."
Yes, that is what I want.
"You don't need a 20-bit encoding to achieve this."
Correct.
"It might be appropriate for the Unicode standard to recommend that
software interpret escape codes and hex sequences as UCS-4 code points and
free the user from knowing the details of the encoding."
Well said.
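
As a rough sketch of what Paul describes, an escape processor could accept
the code point as a plain number and produce the UTF-16 code units itself.
The function name to_utf16_units is hypothetical, and the upper limit of
0x10FFFF assumes the 17-plane range under discussion in this thread.

```python
def to_utf16_units(code_point):
    """Return the UTF-16 code units (as a list of ints) for one code point.

    Assumes the 17-plane range 0..0x10FFFF; surrogate values themselves
    are rejected because they are not characters.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the UTF-16 range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values cannot be encoded")
    if code_point < 0x10000:
        # Plane 0: one 16-bit unit, identical to the code point.
        return [code_point]
    # Planes 1 and up: split the 20-bit offset into two 10-bit halves.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]

for cp in (0x0041, 0x10000, 0x1D11E, 0x10FFFF):
    units = to_utf16_units(cp)
    print(format(cp, "X"), [format(u, "04X") for u in units])
```

The user of the escape only ever writes the code point; whether it ends up
as one or two 16-bit units is decided by the encoder.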

I would like to know how we arrived at using 17 planes, and why the 3
highest planes are to be used for proposed characters (plane 14) and as a
private use area (planes 15 & 16) before any of the planes number 3 to 13
are considered for anything.
This must have been discussed when UTF-16 was born, probably 5 or 6 years
ago.

The 17 planes are surely a compromise between the implementer's dream of a
pure 16b character world and the ISO-10646 approach of using 32b words for
the more than 64k characters that our planet came up with to this day.

Are 16 planes not enough as a compromise?
That's still 1024k = 1,048,576 code points (or 1,046,528 after subtracting
the 2,048 code points that the UTF-16 encoding reserves for surrogates).
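
The arithmetic can be checked with a few lines of Python. This is only a
sketch; the helper names are mine, and the script also counts how many hex
digits the highest code point needs for a given number of planes, which is
what matters for the hex-digit notation question.

```python
PLANE_SIZE = 0x10000                  # 65,536 code points per plane
SURROGATE_COUNT = 0xE000 - 0xD800     # 2,048 values reserved by UTF-16

def usable_code_points(planes):
    """Code points in the given number of planes, minus the surrogates."""
    return planes * PLANE_SIZE - SURROGATE_COUNT

def hex_digits_needed(planes):
    """Hex digits required to write the highest code point in the range."""
    highest = planes * PLANE_SIZE - 1
    return len(format(highest, "X"))

for planes in (16, 17):
    print(planes, "planes:",
          usable_code_points(planes), "usable code points,",
          hex_digits_needed(planes), "hex digits")
```

With 16 planes the highest value is FFFFF, so 5 hex digits are always
sufficient; with 17 planes the highest value is 10FFFF, which needs 6.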

(I can see the pattern where about the top eighth of each of the ranges for
UTF-16, group 00, and UCS-4 is assigned as a private use area [actually, one
quarter of UCS-4, because bit 31 is unused].)

Best regards,

markus

Markus Scherer IBM RTP +1 919 486 1135 Dept. Fax +1 919 254 6430
schererm@us.ibm.com
