From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Oct 16 2003 - 11:51:45 CST
----- Original Message -----
From: "Jill Ramonsky" <Jill.Ramonsky@Aculab.com>
To: <unicode@unicode.org>
Sent: Thursday, October 16, 2003 4:35 PM
Subject: UTF-16 Beyond U+10FFFF (was: Java char and Unicode 3.0+)
>
> Here's an alternative idea.
>
> In UTF-16, as it's currently defined, codepoints in the range U+010000
> to U+10FFFF are represented as some High Surrogate (HS) followed by some
> Low Surrogate (LS). Also, as currently defined, any HS not followed by
> an LS, or an LS not preceded by an HS, is illegal.
>
> So, to create even higher codepoints still, all you have to do is use
> some currently illegal sequences. For example:
>
> HS + LS => 10 bits from HS plus 10 bits from LS (as now)
> [This gives a range of 0x00000 to 0xFFFFF, to which we add 0x10000
> giving an actual range of U+10000 to U+10FFFF]
>
> HS + HS + LS => 10 bits from first HS plus 10 bits from second HS plus
> 10 bits from LS
> [This gives a range of 0x00000000 to 0x3FFFFFFF, to which we can add
> 0x110000 giving an actual range of U+110000 to U+4010FFFF]
>
> HS + HS + HS + LS => 10 bits from first HS plus 10 bits from second HS
> plus 10 bits from third HS plus 10 bits from LS
> [This gives a range of 0x0000000000 to 0xFFFFFFFFFF, to which we can add
> 0x40110000 giving an actual range of U+40110000 to U+1004010FFFF]
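
The ranges in the quoted proposal can be checked with a short sketch. This is only an illustration of the arithmetic as quoted, not part of any standard: the function name decode_extended and the OFFSETS table are hypothetical, and the sketch assumes it is given exactly one sequence of one to three high surrogates followed by one low surrogate.

```python
HS_MIN, HS_MAX = 0xD800, 0xDBFF  # high surrogate range in UTF-16
LS_MIN, LS_MAX = 0xDC00, 0xDFFF  # low surrogate range in UTF-16

# Offsets taken from the quoted message, keyed by the number of
# high surrogates in the sequence (hypothetical table name).
OFFSETS = {1: 0x10000, 2: 0x110000, 3: 0x40110000}


def decode_extended(units):
    """Decode one HS+ LS sequence under the proposed (non-standard) scheme."""
    *highs, low = units
    if len(highs) not in OFFSETS:
        raise ValueError("expected 1 to 3 high surrogates")
    if not all(HS_MIN <= u <= HS_MAX for u in highs):
        raise ValueError("leading units must be high surrogates")
    if not LS_MIN <= low <= LS_MAX:
        raise ValueError("last unit must be a low surrogate")
    value = 0
    for u in highs:
        # Each high surrogate contributes 10 payload bits.
        value = (value << 10) | (u - HS_MIN)
    # The low surrogate contributes the final 10 bits.
    value = (value << 10) | (low - LS_MIN)
    return value + OFFSETS[len(highs)]


if __name__ == "__main__":
    print(hex(decode_extended([0xDBFF, 0xDBFF, 0xDFFF])))
```

With one high surrogate this reduces to ordinary UTF-16 surrogate-pair decoding; the two longer forms are the currently illegal sequences the proposal would reuse.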
I don't like this idea: there is a performance penalty when parsing from
a random position that lands on an HS code unit, because one has to scan
backward to find the start of the sequence. (This is already the case with
UTF-8, but not with UTF-16, where a single read indicates the position of
the first code unit of the encoding sequence.)
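
To make the cost concrete, here is a minimal sketch comparing the two cases. It assumes well-formed input, and the function names start_utf16 and start_extended are hypothetical. In current UTF-16, looking at the unit at the given index is enough to know where the sequence starts; in the proposed scheme, one has to step back over a run of high surrogates.

```python
def is_high(u):
    return 0xD800 <= u <= 0xDBFF


def is_low(u):
    return 0xDC00 <= u <= 0xDFFF


def start_utf16(units, i):
    """Current UTF-16, well-formed input: one read of units[i] decides.

    A low surrogate is always the second unit of a pair; anything else
    (a BMP code unit or a high surrogate) starts its own sequence.
    """
    return i - 1 if is_low(units[i]) else i


def start_extended(units, i):
    """Proposed scheme: a surrogate may be preceded by more high surrogates."""
    if not (is_high(units[i]) or is_low(units[i])):
        return i
    j = i
    # Scan backward over the run of high surrogates before position i.
    while j > 0 and is_high(units[j - 1]):
        j -= 1
    return j


if __name__ == "__main__":
    seq = [0x0041, 0xD800, 0xD801, 0xD802, 0xDC00, 0x0042]
    print(start_extended(seq, 3))
```

The backward loop is bounded by the maximum sequence length, but it still needs extra reads that current UTF-16 avoids.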
Frankly, I would prefer a solution based on "hyper-surrogates" allocated
outside the BMP, with each hyper-surrogate encoded by a pair of existing
UTF-16 surrogates (reserved, for example, in the special plane 14).