Re: Code pages and Unicode

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 25 Aug 2011 03:45:56 +0100

On Wed, 24 Aug 2011 17:07:03 -0700
Ken Whistler <kenw_at_sybase.com> wrote:

> > <Snip> The
> > BMP is littered with concessions to the limitations of rendering
> > systems - precomposed characters, Hangul syllables and Arabic
> > presentation forms are the most significant.
 
> Those are not concessions to "the limitations of rendering systems"
> -- they are concessions to the need to stay compatible with the
> character encodings of legacy systems, which had limitations for
> their rendering systems.

Which earlier coding system supported Welsh? (I'm thinking of 'W WITH
CIRCUMFLEX', U+0174 and U+0175.) How was the use of the canonical
decompositions incompatible with the character encodings of legacy
systems? Unicode's first 256 code points match ISO-8859-1 (Latin-1), but
that's as far as sharing codes goes. Was the use of combining jamo
incompatible with legacy Hangul encodings?

> >>> > > I think, however, that <high><high><rare
> >>> > > BMP code><low> offers a legitimate extension mechanism
> >> > One could argue about the description as "legitimate". It is
> >> > clearly not conformant,

> In whichever encoding form you choose to specify, the sequence
> <high><high> is non-conformant. Not merely a possibly new type of
> code unit sequence.
>
> <D800 D800> is non-conformant UTF-16
>
> <0000D800 0000D800> is non-conformant UTF-32
>
> <ED A0 80 ED A0 80> is non-conformant UTF-8

<high><low> is also non-conformant UTF-8 and UTF-32.
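A strict decoder gives the same verdicts. The following is only a quick
check, assuming Python 3.4 or later (earlier versions were laxer about
surrogates in UTF-32); `is_conformant` is a name made up for this
illustration, and <high><low> is taken to be <D800 DC00>.

```python
def is_conformant(data: bytes, encoding: str) -> bool:
    """Return True if a strict decoder accepts `data` as `encoding`."""
    try:
        data.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Big-endian byte images of the code unit sequences discussed above.
checks = [
    ("UTF-16 <D800 D800>", "D800D800", "utf-16-be"),
    ("UTF-32 <0000D800 0000D800>", "0000D8000000D800", "utf-32-be"),
    ("UTF-8  <ED A0 80 ED A0 80>", "EDA080EDA080", "utf-8"),
    # <high><low>: a well-formed pair in UTF-16 only.
    ("UTF-16 <D800 DC00>", "D800DC00", "utf-16-be"),
    ("UTF-32 <0000D800 0000DC00>", "0000D8000000DC00", "utf-32-be"),
    ("UTF-8  <ED A0 80 ED B0 80>", "EDA080EDB080", "utf-8"),
]

for label, hex_bytes, encoding in checks:
    ok = is_conformant(bytes.fromhex(hex_bytes), encoding)
    print(f"{label}: {'accepted' if ok else 'rejected'}")
```

Only the UTF-16 <D800 DC00> case should be accepted; it decodes to
U+10000.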

Obviously <D800 D800 000E DC00> is non-conformant with current UTF-16.
Remembering that there is a guarantee that there will be no more
surrogate points, an extension form has to be non-conformant with
current UTF-16!

> >> > I see no chance of that happening for either the Unicode
> >> > Standard or 10646.
> > It will only happen when the need becomes obvious, which may be
> > never, or may be 30 years hence. It's even conceivable that UTF-16
> > will drop out of use.
>
> Could happen. It still doesn't matter, because such a proposal also
> breaks UTF-8 and UTF-32.

Everyone should know how to extend UTF-8 and UTF-32 to cover the 31-bit
range. Just go back to the old ISO 10646 definitions! UTF-16 is the
problem.
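For reference, a minimal sketch of the old 31-bit forms, assuming the
original UTF-8 layout of up to six bytes (lead bytes C0, E0, F0, F8 and
FC) and plain 4-byte UCS-4. The function names are invented here, and
the encoder deliberately does not reject surrogates or values above
U+10FFFF, so it is not a conformant UTF-8 encoder by today's rules.

```python
def encode_utf8_31bit(value: int) -> bytes:
    """Encode a 31-bit value using the original 1-to-6-byte UTF-8 layout."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range")
    if value < 0x80:
        return bytes([value])
    # (exclusive upper limit, lead byte marker, number of continuation bytes)
    for limit, lead, ncont in (
        (0x800, 0xC0, 1),
        (0x10000, 0xE0, 2),
        (0x200000, 0xF0, 3),
        (0x4000000, 0xF8, 4),
        (0x80000000, 0xFC, 5),
    ):
        if value < limit:
            break
    tail = []
    for _ in range(ncont):
        tail.append(0x80 | (value & 0x3F))  # low six bits per continuation byte
        value >>= 6
    return bytes([lead | value] + tail[::-1])


def encode_ucs4_be(value: int) -> bytes:
    """Encode a 31-bit value as four big-endian bytes (old UCS-4)."""
    if not 0 <= value <= 0x7FFFFFFF:
        raise ValueError("value outside the 31-bit range")
    return value.to_bytes(4, "big")


print(encode_utf8_31bit(0x110000).hex())    # first value past U+10FFFF
print(encode_utf8_31bit(0x7FFFFFFF).hex())  # largest 31-bit value
print(encode_ucs4_be(0x7FFFFFFF).hex())
```

Within today's codespace (surrogates aside) the output matches ordinary
UTF-8; the 5- and 6-byte forms only appear from 0x200000 upwards.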

Past suggestions have included a new set of surrogate points, which
would restrict the numbers that could represent characters. One might,
for instance, allocate U+B0000 to U+BFFDD as 'high extended surrogates'
and U+C0000 to U+C7FFF as 'low extended surrogates'. That's a lot of
codepoints to spend just so that a 31-bit number can be expressed in 64
bits, and the scheme could easily be rendered impossible by a few random
assignments in those ranges.
(Using three surrogates would be more economical in codepoints - one
could even do <high1><high2><low3><low4> with high2 having a restricted
range taking out just 2^11 codepoints from the supplementary planes.)
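The arithmetic behind the two-range example can be checked with a short
sketch. The ranges are the ones given above; the particular mapping
(offset from 0x110000, split by divmod) and all the names are
assumptions made for illustration, not part of any proposal.

```python
HIGH_FIRST, HIGH_LAST = 0xB0000, 0xBFFDD   # 'high extended surrogates'
LOW_FIRST, LOW_LAST = 0xC0000, 0xC7FFF     # 'low extended surrogates'
BASE = 0x110000                            # first value past U+10FFFF
TOP = 0x7FFFFFFF                           # largest 31-bit value

N_HIGH = HIGH_LAST - HIGH_FIRST + 1        # 0xFFDE
N_LOW = LOW_LAST - LOW_FIRST + 1           # 0x8000


def to_extended_pair(value: int) -> tuple[int, int]:
    """Map a value in BASE..TOP to a (high, low) extended surrogate pair."""
    if not BASE <= value <= TOP:
        raise ValueError("value not in the extended range")
    hi, lo = divmod(value - BASE, N_LOW)
    return HIGH_FIRST + hi, LOW_FIRST + lo


def from_extended_pair(high: int, low: int) -> int:
    """Inverse of to_extended_pair."""
    if not (HIGH_FIRST <= high <= HIGH_LAST and LOW_FIRST <= low <= LOW_LAST):
        raise ValueError("not an extended surrogate pair")
    return BASE + (high - HIGH_FIRST) * N_LOW + (low - LOW_FIRST)


def utf16_units(code_point: int) -> list[int]:
    """Ordinary UTF-16 surrogate pair for a supplementary code point."""
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


# The two ranges pair up to cover exactly the values 0x110000..0x7FFFFFFF.
print(N_HIGH * N_LOW == TOP - BASE + 1)

high, low = to_extended_pair(TOP)
units = utf16_units(high) + utf16_units(low)
print([f"{u:04X}" for u in units], len(units) * 16, "bits")
```

Under this mapping the pairs exhaust the two ranges with nothing to
spare, which is why a single stray assignment inside either range would
break the scheme.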

Andrew reasonably asked whether an extension *could* be done without
creating more surrogates. All the solutions we've thought of
affect searching for a single character - using an ISO 2022 escape code
is probably the worst of them from this point of view.

Richard.
Received on Wed Aug 24 2011 - 21:48:21 CDT
