Well, first, it is 17 planes (or have we switched to using hexadecimal
numbers on the Unicode list already?).
Second, of course this is in connection with UTF-16. I wasn't involved
when UTF-16 was created, but it must have become clear that 2^16 (^
denotes exponentiation ("to the power of")) codepoints (UCS-2) wasn't
going to be sufficient. Assuming a surrogate-like extension mechanism,
with high surrogates and low surrogates separated for easier
synchronization, one needs
2 * 2^n
surrogate-like codepoints to create
2^(2*n)
new codepoints.
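As a rough illustration of that mechanism, here is a minimal Python sketch using the actual UTF-16 values (n = 10, high surrogates starting at U+D800, low surrogates at U+DC00). The function names to_surrogates and from_surrogates are made up for this example; they are not from any library.

```python
# Sketch of the UTF-16 surrogate mechanism with n = 10 payload bits per half.
# Function names are hypothetical, chosen only for this illustration.
HIGH_BASE = 0xD800   # first high surrogate
LOW_BASE = 0xDC00    # first low surrogate
N = 10               # payload bits carried by each surrogate
FIRST_SUPPLEMENTARY = 0x10000  # first codepoint beyond the BMP


def to_surrogates(cp):
    """Split a supplementary codepoint into a (high, low) surrogate pair."""
    offset = cp - FIRST_SUPPLEMENTARY
    high = HIGH_BASE + (offset >> N)              # upper n bits
    low = LOW_BASE + (offset & ((1 << N) - 1))    # lower n bits
    return high, low


def from_surrogates(high, low):
    """Recombine a (high, low) surrogate pair into a codepoint."""
    return FIRST_SUPPLEMENTARY + ((high - HIGH_BASE) << N) + (low - LOW_BASE)


high, low = to_surrogates(0x1F600)
print(f"U+1F600 -> {high:04X} {low:04X}")
print(f"DBFF DFFF -> U+{from_surrogates(0xDBFF, 0xDFFF):X}")
```

The last high/low pair recombines to 10FFFF, which is where the upper limit comes from.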
For doubling the number of codepoints (i.e. a total of 2 planes), one
would use n=8, and so one needs 512 surrogate-like codepoints. With n=9,
one gets 4 more planes for a total of 5 planes, and needs 1024
surrogate-like codepoints. With n=10, one gets 16 more planes (for the
current total of 17), but needs 2048 surrogate codepoints. With n=11,
one would get 64 more planes for a total of 65 planes, but would need
4096 surrogate codepoints. And so on.
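The arithmetic can be checked with a few lines of Python, applying the two formulas 2 * 2^n and 2^(2*n) directly. This is only a sketch; the helper name surrogate_cost is made up for this example.

```python
# Cost/benefit of a surrogate-like scheme with n payload bits per half.
# surrogate_cost is a hypothetical helper name used only for this sketch.
PLANE = 2 ** 16  # codepoints per plane


def surrogate_cost(n):
    """Return (surrogates needed, extra planes gained, total planes)."""
    surrogates = 2 * 2 ** n          # high block plus low block
    new_codepoints = 2 ** (2 * n)    # every high/low combination
    extra_planes = new_codepoints // PLANE
    return surrogates, extra_planes, extra_planes + 1  # +1 for the BMP


for n in range(8, 12):
    surrogates, extra, total = surrogate_cost(n)
    print(f"n={n}: {surrogates} surrogates, {extra} more planes, {total} total")
```

For n=10 this gives 2048 surrogates and 16 more planes, i.e. the 17 planes we have today.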
My guess is that when this was considered, 1,048,576 additional
codepoints was thought to be more than enough, and giving up the 4096
codepoints in the BMP that n=11 would need (2 * 2^11) was no longer
possible. As an additional benefit, the 17 planes fit nicely into 4
bytes in UTF-8.
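The UTF-8 point can be checked the same way. A 4-byte UTF-8 sequence has the bit layout 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, so it carries 21 payload bits. The sketch below assumes that standard layout; the helper name utf8_payload_bits is made up for this example.

```python
# Check that the highest codepoint, U+10FFFF, fits into a 4-byte UTF-8 sequence.
# utf8_payload_bits is a hypothetical helper name used only for this sketch.
MAX_CODEPOINT = 0x10FFFF


def utf8_payload_bits(length):
    """Payload bits in a UTF-8 sequence of the given length (1 to 4 bytes)."""
    if length == 1:
        return 7                               # 0xxxxxxx
    # The lead byte keeps (7 - length) bits, each trail byte carries 6.
    return (7 - length) + 6 * (length - 1)


bits = utf8_payload_bits(4)
encoded = chr(MAX_CODEPOINT).encode("utf-8")
print(f"4-byte payload: {bits} bits, room for {2 ** bits // 2 ** 16} planes")
print(f"U+10FFFF encodes as {encoded.hex(' ')} ({len(encoded)} bytes)")
```

So 21 bits would have room for 32 planes, and 17 planes fit with space to spare.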
Regards, Martin.
On 2012/11/26 19:47, Shriramana Sharma wrote:
> I'm sorry if this info is already in the Unicode website or book, but
> I searched and couldn't find it in a hurry.
>
> When extending beyond the BMP and the maximum range of 16-bit
> codepoints, why was it chosen to go up to 10FFFF and not any more or
> less? Wouldn't FFFFF have been the next logical stop beyond FFFF, even
> if FFFFFF (or FFFFFFFF) is considered too big? (I mean, I'm not sure
> how that extra 64Ki chars [10FFFF minus FFFFF] could be important...)
>
> Thanks.
>
Received on Tue Nov 27 2012 - 02:37:34 CST