Re: Code pages and Unicode (wasn't really: RE: Endangered Alphabets)

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Fri, 19 Aug 2011 17:29:54 -0700

On 8/19/2011 3:24 PM, Ken Whistler wrote:
> On 8/19/2011 2:07 PM, Doug Ewell wrote:
>> Technically, I think 10646 was always limited to 32,768 planes so that
>> one could always address a code point with a 32-bit signed integer (a
>> nod to the Java fans).
>
> Well, yes, but it didn't really have anything to do with Java.
> Remember that Java wasn't released until 1995, while the 10646
> architecture dates back to circa 1986.

Yep.

> So more likely it was a nod to C implementations, which, it was
> supposed, would have implemented the 2-, 3-, or 4-octet forms of 10646
> with a wchar_t, and which would have wanted a signed 32-bit type to
> work. I suspect, by the way, that that limitation was originally
> brought to WG2 by the U.S. national body, as they would have been the
> ones most worried about the C implementations of 10646 multi-octet
> forms.

No, it was the Japanese NB, as represented by the individual from Toppan
Printing.

This limitation was insisted upon in 1991, after the accord on the
merger between Unicode and 10646, when 10646 was changed to use a
"flat" codespace rather than the ISO 2022-like scheme.
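
For anyone who hasn't run into the signed-range issue: it falls out of
how the four octets would be packed into a 32-bit wide-character value.
Below is a minimal sketch in C, assuming a straightforward
most-significant-octet-first packing of Group/Plane/Row/Cell into one
integer; the packing and the helper are my reconstruction for
illustration, not anything specified in an actual draft.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical packing of the four octets -- Group, Plane, Row,
 * Cell -- into one 32-bit value, most significant octet first.
 * The cast back to int32_t assumes two's complement, as C
 * implementations of the era (and C23) provide. */
static int32_t pack_gprc(uint8_t g, uint8_t p, uint8_t r, uint8_t c)
{
    uint32_t v = ((uint32_t)g << 24) | ((uint32_t)p << 16)
               | ((uint32_t)r << 8)  |  (uint32_t)c;
    return (int32_t)v;
}

int main(void)
{
    /* Group restricted to the low 94-unit set (0x21..0x7E): the
     * sign bit is never set, so the value stays non-negative. */
    assert(pack_gprc(0x7E, 0xFE, 0xFE, 0xFE) > 0);

    /* Had groups from the high set (0xA1..0xFE) been allowed, a
     * signed 32-bit wchar_t would have gone negative. */
    assert(pack_gprc(0xA1, 0x21, 0x21, 0x21) < 0);

    puts("signed-range checks pass");
    return 0;
}

With the group octet held to the low set, the sign bit can never be
set, so even a signed 32-bit wchar_t can hold every code value.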

>
> And the original architecture was also not really a full 32K planes in
> the sense that we now understand planes for Unicode and 10646. The
> original design for 10646 was for a 1- to 4-octet encoding, with all
> octets conforming to the ISO 2022 specification. It used the option
> that the "working sets" for the encoding octets would be the 94-unit
> ranges: for G0, 0x21..0x7E, and for G1, 0xA1..0xFE. The other bytes
> (C0, 0x20, 0x7F, C1, 0xA0, 0xFF) were not used except in the
> single-octet form, as in the 2022-conformant schemes still used today
> for some East Asian character encodings.
>
> And the octets were then designated G (group), P (plane), R (row), and C (cell).
>
> The 1-octet form thus allowed 95 + 96 = 191 code positions.
>
> The 2-octet form thus allowed (94 + 94)^2 = 35,344 code positions
>
> The 3-octet form thus allowed (94 + 94)^3 = 6,644,672 code positions
>
> The Group octet was constrained to the low set of 94. (This is the
> origin of the constraint to half the planes, which would keep wchar_t
> implementations out of negative signed range.)
>
> The 4-octet form thus allowed 94 * (94 + 94)^3 = 624,599,168 code positions
>
> The grand total for all possible forms was the sum of those values, or:
>
> *631,279,375* code positions
>
> (before various *other* set-asides for "plane swapping" and private
> use start getting taken into account)
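
Ken's totals are easy to re-derive, by the way. Here is a quick sketch
in C that just redoes the arithmetic from the ranges he describes; the
constants come straight from his description above, and nothing else is
assumed.

#include <inttypes.h>
#include <stdio.h>

int main(void)
{
    /* Multi-octet forms: two 94-unit working sets per octet,
     * G0 (0x21..0x7E) and G1 (0xA1..0xFE), so 188 usable values. */
    const uint64_t per_octet = 94 + 94;

    const uint64_t one_octet   = 95 + 96;                /* single-octet form: 191 */
    const uint64_t two_octet   = per_octet * per_octet;  /* 35,344 */
    const uint64_t three_octet = two_octet * per_octet;  /* 6,644,672 */
    /* Group octet confined to the low 94-unit set. */
    const uint64_t four_octet  = 94 * three_octet;       /* 624,599,168 */

    printf("grand total: %" PRIu64 "\n",
           one_octet + two_octet + three_octet + four_octet);
    return 0;
}

This prints 631,279,375, matching the grand total above.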

This was so mind-bogglingly complicated that it was a deal breaker for
many companies. Unicode's more restrictive concept of a character, its
combining technology, and its many other innovations weren't initially
seen as its primary benefits by the people faced with evaluating the
differences between the formal ISO-backed project and the de facto
industry collaboration forming around Apple and Xerox. But the flat
code space: now you were talking.
>
>> Of course, 2.1 billion characters is also overkill, but the advent of
>> UTF-16 was how we ended up with 17 planes.
>
> So a lot less than 2.1 billion characters. But I think Doug's point is
> still valid: 631-million-plus code points was still overkill for the
> problem to be addressed.
>
> And I think that we can thank our lucky stars that it isn't *that*
> architecture for a universal character encoding that we would now be
> implementing and debating on the alternative-universe version of this
> email list. ;-)

Even remembering it makes my head hurt.

A./