Re: Code pages and Unicode

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Mon, 22 Aug 2011 23:15:11 +0100

On Mon, 22 Aug 2011 14:06:00 +0100 (BST)
William_J_G Overington <wjgo_10009_at_btinternet.com> wrote:

> On Monday 22 August 2011, Andrew West <andrewcwest_at_gmail.com> wrote:
>
> > Can anyone think of a way to extend UTF-16 without adding new
> > surrogates or inventing a new general category?
> >
> > Andrew
>
> How about a triple sequence of two high surrogates followed by one
> low surrogate?

The problem is that a search for the character represented by the code
unit sequence (H2,L3) would also pick up the sequence (H1,H2,L3).
While there is no ambiguity, it does make searching more complicated
to code. The same issue applies to the suggestion of using
(H1,H2,L3,L4) sequences.
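
To make the search problem concrete, here is a minimal sketch in Python.
It assumes the hypothetical (H1,H2,L3) scheme proposed above, treats text
as a list of 16-bit code units, and assumes the input is well-formed. The
function names are invented for illustration and are not from any library.

```python
# Hypothetical scheme from this thread: (H1, H2, L3) encodes one character,
# alongside ordinary (H, L) surrogate pairs. Input is assumed well-formed.

def is_high(unit):
    """True if the code unit is a high (leading) surrogate."""
    return 0xD800 <= unit <= 0xDBFF

def naive_find(units, pattern):
    """Return every index where pattern occurs as a run of code units."""
    n = len(pattern)
    return [i for i in range(len(units) - n + 1) if units[i:i + n] == pattern]

def boundary_aware_find(units, pattern):
    """Like naive_find, but reject hits that start inside a triple.

    In well-formed text under the assumed scheme, a high surrogate that is
    preceded by another high surrogate is the middle unit of a triple.
    """
    hits = []
    for i in naive_find(units, pattern):
        inside_triple = (
            i > 0 and is_high(units[i - 1]) and is_high(pattern[0])
        )
        if not inside_triple:
            hits.append(i)
    return hits

H1, H2, L3 = 0xD800, 0xD801, 0xDC00
text = [0x41, H1, H2, L3, 0x42, H2, L3]   # 'A', triple, 'B', ordinary pair
pattern = [H2, L3]                        # the ordinary pair being searched

print(naive_find(text, pattern))           # [2, 5]: index 2 is a false hit
print(boundary_aware_find(text, pattern))  # [5]
```

The extra look-behind at the preceding code unit is the complication: with
today's UTF-16 a plain code unit comparison never lands inside another
character.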

Now, we could use (H1,H2,L3,L4) sequences and never assign the (H2,L3)
combinations. They would therefore have general category Cn, which
currently covers both unassigned code points and noncharacters.
However, I can't help feeling that they'd be almost a sort of
surrogate. It's slightly more efficient to replace L3 by a single BMP
character.

Practically, I think that if we can change the semantics of the Myanmar
script, our descendants can go back on the guarantee of no more
surrogates.

Richard.
Received on Mon Aug 22 2011 - 17:19:02 CDT
