Re: Unicode code page and ?.net

From: Asmus Freytag <asmusf_at_ix.netcom.com>
Date: Tue, 30 Jul 2013 15:27:55 -0700

On 7/30/2013 2:15 PM, Doug Ewell wrote:
> Asmus Freytag <asmusf at ix dot netcom dot com> wrote:
>
>>> A code page is not, in general,
>>> the same as an encoding scheme.
>> What is, then, the proper definition of a "code page"?
> I might not be able to do better than Potter Stewart here. I think of a
> code page as a deliberately targeted subset of all encodable characters,
> such that different "pages" make up the whole "book." The Unicode
> Glossary uses the example of MS-DOS code page 437; the concept wouldn't
> apply unless other pages existed, covering different repertoires.
I'm not privy to the thinking behind the actual origin of the term, but
I always assumed that the term "page" was chosen in analogy to the way
one speaks of a "page" of memory - something that can be swapped in and out.

So, by selecting a different code page, one would swap the definition of
the bytes in a byte stream, such that they result in different
displayed characters (and correspond to different keystrokes).

The early code pages were small and fixed width (code unit == code
point), so this kind of image makes sense.
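That swapping can be made concrete with a small sketch (Python's codec
names are used here purely for illustration; the choice of byte 0x82 is my
own example, not something from the discussion above):

```python
# The same single byte "means" a different character depending on
# which code page has been swapped in.
b = b'\x82'

# Under MS-DOS code page 437, byte 0x82 is LATIN SMALL LETTER E WITH ACUTE.
print(b.decode('cp437'))   # 'é' (U+00E9)

# Under Windows code page 1252, the same byte is a low quotation mark.
print(b.decode('cp1252'))  # '‚' (U+201A, SINGLE LOW-9 QUOTATION MARK)
```

One byte stream, two code pages, two different displayed characters.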

Later, the concept was effectively generalized, first to any kind of
character set, and then to any kind of encoding scheme. East Asian
character sets could exist in multiple encoding schemes that (with some
limited differences related to ASCII characters) encoded the same
repertoire, but with different byte sequences.
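For example (a hedged sketch; the pairing of Shift_JIS and EUC-JP is my
choice of two well-known Japanese encoding schemes covering the same
repertoire, not something asserted above):

```python
# One repertoire, two encoding schemes: the same text maps to
# different byte sequences under Shift_JIS and EUC-JP, yet both
# round-trip to the identical characters.
text = '漢字'

sjis = text.encode('shift_jis')
euc = text.encode('euc_jp')

print(sjis.hex(), euc.hex())  # different byte sequences for the same text
assert sjis != euc
assert sjis.decode('shift_jis') == euc.decode('euc_jp') == text
```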

Because Unicode was created (by design, if not always 100% in actuality)
to allow lossless mappings from pre-existing character sets, the code page
identifier doubled as a mapping identifier, and it is very widely used for
that purpose.
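A minimal sketch of a code page ID doubling as a mapping identifier (the
numeric values are the established Windows code page numbers, e.g. 65001
for UTF-8 and 1200 for little-endian UTF-16; the lookup table and the
helper function are illustrative constructions of mine, not a real API):

```python
# Map numeric code page identifiers to the codec (mapping) they name.
CODE_PAGE_TO_CODEC = {
    437:   'cp437',      # original IBM PC code page
    1252:  'cp1252',     # Windows Latin-1
    65001: 'utf-8',      # Windows code page ID for UTF-8
    1200:  'utf-16-le',  # Windows code page ID for UTF-16 (little-endian)
}

def decode_with_code_page(data: bytes, cp_id: int) -> str:
    """Interpret raw bytes according to a numeric code page identifier."""
    return data.decode(CODE_PAGE_TO_CODEC[cp_id])

print(decode_with_code_page(b'\xc3\xa9', 65001))  # 'é' via UTF-8
print(decode_with_code_page(b'\xe9', 1252))       # 'é' via Windows-1252
```

The same numeric registry thus identifies both classic code pages and
Unicode encoding schemes, which is exactly how the ID came to double as a
mapping identifier.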

That doesn't mean that mappings are code pages.

Whether Unicode is a "code page" is something that you can argue up and
down. In the original scheme, as extended, it can very naturally be just
another code page; in the architectures that support it, it occupies a
different place, due to its nature as the universal mapping target.

In the end, the universal nature of Unicode means that all sorts of
architectures that depended on swapping character sets (code pages) in
mid-stream are no longer viable; they have been replaced by this single
superset. Code pages live on only to describe data and devices that are
stuck in a particular past (even if some of them, such as ISO 8859-1 or
Windows-1252, are still relatively alive and kicking).

I'm happy to think of Unicode as something outside the old code page
definition, but also as the "code page to end all code pages". Both work
for me, so seeing code page IDs defined for all the encoding schemes
doesn't worry me.

A./
>
>> Later, it was realized that in order to specify what encoding data
>> were in or, for example, to specify a conversion from UTF-7 and UTF-8
>> to UTF-16 (native encoding scheme) one needed some suitable ID number
>> to identify the mapping. Well, extending the code page id was the most
>> natural way to do that, because, on several platforms, the use of a
>> numerical ID from the IBM code page registry was established practice.
> I don't think the existence of numeric identifiers for Unicode encoding
> schemes suffices to make them "code pages."
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
>
>
Received on Tue Jul 30 2013 - 17:32:04 CDT

This archive was generated by hypermail 2.2.0 : Tue Jul 30 2013 - 17:32:06 CDT