Re: UTF-16 Beyond U+10FFFF (was: Java char and Unicode 3.0+)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Oct 16 2003 - 11:51:45 CST

Next message: Rick McGowan: "Re: Java char and Unicode 3.0+ (was:Canonical equivalence in rendering: mandatory or recommended?)"
Previous message: Philippe Verdy: "Re: Beyond 17 planes, was: Java char and Unicode 3.0+"
In reply to: Jill Ramonsky: "UTF-16 Beyond U+10FFFF (was: Java char and Unicode 3.0+)"

----- Original Message -----
From: "Jill Ramonsky" <Jill.Ramonsky@Aculab.com>
To: <unicode@unicode.org>
Sent: Thursday, October 16, 2003 4:35 PM
Subject: UTF-16 Beyond U+10FFFF (was: Java char and Unicode 3.0+)

>
> Here's an alternative idea.
>
> In UTF-16, as it's currently defined, codepoints in the range U+010000
> to U+10FFFF are represented as some High Surrogate (HS) followed by some
> Low Surrogate (LS). Also, as currently defined, any HS not followed by
> an LS, or an LS not preceded by an HS, is illegal.
>
> So, to create even higher codepoints still, all you have to do is use
> some currently illegal sequences. For example:
>
> HS + LS => 10 bits from HS plus 10 bits from LS (as now)
> [This gives a range of 0x00000 to 0xFFFFF, to which we add 0x10000
> giving an actual range of U+10000 to U+10FFFF]
>
> HS + HS + LS => 10 bits from first HS plus 10 bits from second HS plus
> 10 bits from LS
> [This gives a range of 0x00000000 to 0x3FFFFFFF, to which we can add
> 0x110000 giving an actual range of U+110000 to U+4010FFFF]
>
> HS + HS + HS + LS => 10 bits from first HS plus 10 bits from second HS
> plus 10 bits from third HS plus 10 bits from LS
> [This gives a range of 0x0000000000 to 0xFFFFFFFFFF, to which we can add
> 0x40110000 giving an actual range of U+40110000 to U+1004010FFFF]
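
The arithmetic of the quoted proposal can be checked with a minimal sketch. This is only an illustration of the scheme as described above, not part of any standard; the function name `decode_extended` and the `OFFSETS` table are hypothetical, and the offsets are the ones quoted in the proposal.

```python
HS_MIN, HS_MAX = 0xD800, 0xDBFF   # high surrogate range
LS_MIN, LS_MAX = 0xDC00, 0xDFFF   # low surrogate range

# Offsets quoted in the proposal, keyed by the number of leading high surrogates.
OFFSETS = {1: 0x10000, 2: 0x110000, 3: 0x40110000}


def decode_extended(units):
    """Decode 16-bit code units under the proposed (non-standard) extension."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        if not (HS_MIN <= u <= LS_MAX):
            # Ordinary BMP code unit: it stands for itself.
            out.append(u)
            i += 1
            continue
        # Accumulate 10 bits from each high surrogate in the run.
        value = 0
        count = 0
        while i < len(units) and HS_MIN <= units[i] <= HS_MAX:
            value = (value << 10) | (units[i] - HS_MIN)
            count += 1
            i += 1
        # The run must hold 1 to 3 high surrogates and end with a low surrogate.
        if count not in OFFSETS or i >= len(units) or not (LS_MIN <= units[i] <= LS_MAX):
            raise ValueError("ill-formed sequence")
        value = (value << 10) | (units[i] - LS_MIN)
        i += 1
        out.append(value + OFFSETS[count])
    return out
```

With this sketch, `[0xDBFF, 0xDBFF, 0xDFFF]` decodes to 0x4010FFFF and `[0xDBFF, 0xDBFF, 0xDBFF, 0xDFFF]` to 0x1004010FFFF, matching the upper bounds given in the proposal.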

I don't like this idea: there's a performance penalty when parsing from a
random position that happens to point to an HS code unit, because one has
to scan backward to find the start of the sequence. (This is already the
case with UTF-8, but not with current UTF-16, where a single read is enough
to locate the first code unit of the encoding sequence.)
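
The difference can be sketched as follows. This is an illustration only, assuming well-formed input; the helper names `start_standard` and `start_extended` are hypothetical, and `start_extended` models the HS + HS + ... + LS proposal quoted above.

```python
def is_hs(u):
    """True if u is a high surrogate code unit."""
    return 0xD800 <= u <= 0xDBFF


def is_ls(u):
    """True if u is a low surrogate code unit."""
    return 0xDC00 <= u <= 0xDFFF


def start_standard(units, i):
    """Current UTF-16: reading units[i] alone locates the sequence start.

    Assumes well-formed input, so a low surrogate always follows its
    high surrogate and the sequence starts exactly one unit earlier.
    """
    return i - 1 if is_ls(units[i]) else i


def start_extended(units, i):
    """Proposed scheme: step backward over the whole run of high surrogates.

    Assumes well-formed input under the proposal (1 to 3 high surrogates
    followed by one low surrogate).
    """
    if is_ls(units[i]):
        i -= 1  # move from the low surrogate to the last high surrogate
    # Each iteration costs an extra read of the preceding code unit.
    while i > 0 and is_hs(units[i]) and is_hs(units[i - 1]):
        i -= 1
    return i
```

For example, with `units = [0x41, 0xD800, 0xD801, 0xD802, 0xDC00, 0x42]`, calling `start_extended(units, 3)` has to look back over two further code units before returning index 1, whereas `start_standard` never reads anything other than `units[i]`.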

Frankly, I would prefer the solution based on "hyper-surrogates" allocated
outside the BMP (reserved, for example, in the special plane 14), with a
pair of existing UTF-16 surrogates encoding each hyper-surrogate.


This archive was generated by hypermail 2.1.5 : Thu Jan 18 2007 - 15:54:24 CST