Re: Chapter on character sets

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Fri Jun 16 2000 - 06:11:13 EDT


Keld Jørn Simonsen wrote:
>
> On Thu, Jun 15, 2000 at 09:49:14AM -0800, Mike Brown wrote:
> >
> > The character set defined by the ISO 646-US standard is now known as
> > "US-ASCII" due to its IANA registration for use on the Internet. It defines
> > hex position 23 to be # and 24 to be $. It is a subset of all character
> > encodings except IBM's EBCDIC, which is an encoding for mainframes that was
> > supposedly easier to read on punch cards.
>
> About the subset, this is not true. There are charsets in use today,
> like the national 646 variants, that differ (in the 12 unassigned
> positions). Not much used, but I get some emails in these encodings still.

Which character set does the Minitel use? I am not sure here;
Keld probably knows better. (The Minitel is the French take on the Internet
from 20 years ago; it is still widely used in France.)

> Also Japanese and Chinese 14-bit encodings have discrepancies (mostly the ¥)

Korean encodings also have the won sign where the \ usually lies.

 
> > "ISO-8859-1" (note the extra hyphen) is the IANA-registered character set
> > that covers hex positions 00 to FF, subsetting US-ASCII and the C1 control
> > set (80 to 9F).
>
> It is only C0 and C1 which are added. (and 7F)

Huh? What C1 content is defined for ISO-8859-1?
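
One way to see how an implementation treats positions 80 to 9F is to decode
them and look at what comes out. The sketch below uses Python's "latin-1"
codec, which is my choice for illustration and not something cited in this
thread; it shows only what that codec does, not what the standard says.

```python
import unicodedata

# Decode the 32 bytes 0x80..0x9F with Python's latin-1 codec
# (an assumption chosen for illustration).
c1 = bytes(range(0x80, 0xA0)).decode("latin-1")

# Each byte maps to the code point with the same number ...
identity = all(ord(ch) == 0x80 + i for i, ch in enumerate(c1))

# ... and every one of those code points is a control character
# (Unicode general category "Cc"), not a graphic character.
all_controls = all(unicodedata.category(ch) == "Cc" for ch in c1)

print(identity, all_controls)
```

So in that codec the range is filled with control code points only; it
assigns no printable characters there.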

> There are a number of character sets, quite old, that had 2 bytes
> per character, as in east-asian charsets,and a family of 8/16 bit
> charsets, like ISO 6937, T.61 and some bibliographic charsets.
> RFC 1345 has a number of these.

There are even coded character sets with 3 bytes per character!

 
> > It would help people to understand character
> > encoding issues better if every UTF-8 sequence were always gibberish.

It would completely defeat the purpose.

> > This is actually the case for the people who don't use ASCII at all.

Actually, space is encoded as hex 20 for them too.
And if they use , . or -, these are encoded the ASCII way too. The same
goes for the digits 0-9.
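
A small sketch of the point, in Python (my choice of language, and the
sample string is made up): the space, digits and punctuation inside
otherwise non-ASCII text keep their ASCII byte values in UTF-8.

```python
# Greek word followed by a space, digits and punctuation (made-up sample).
text = "Ελλάδα 2000, 16-06."

utf8 = text.encode("utf-8")

# Bytes below 0x80 in UTF-8 are always plain ASCII characters;
# every byte of a multi-byte sequence is 0x80 or above.
ascii_part = bytes(b for b in utf8 if b < 0x80)

print(utf8)
print(ascii_part)
```

The Greek letters become multi-byte sequences, but the filtered bytes spell
out " 2000, 16-06." exactly as they would in an ASCII file.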

If you want to make things complete gibberish, please use EBCDIC.
Or better yet, view non-EBCDIC text on an EBCDIC terminal (because then,
space is transformed to a control character! Very funny)
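
To illustrate the space example, here is a rough sketch in Python. It
assumes code page 037 ("cp037"), one common EBCDIC variant; other EBCDIC
code pages may differ in detail.

```python
import unicodedata

# cp037 is one EBCDIC code page (an assumption chosen for illustration).
EBCDIC = "cp037"

# The ASCII space byte, 0x20, as an EBCDIC terminal would interpret it.
ascii_space_seen_as_ebcdic = b"\x20".decode(EBCDIC)
is_control = unicodedata.category(ascii_space_seen_as_ebcdic) == "Cc"

# For comparison, the byte that EBCDIC itself uses for a space.
ebcdic_space = " ".encode(EBCDIC)

print(repr(ascii_space_seen_as_ebcdic), is_control, ebcdic_space.hex())
```

The ASCII space byte decodes to a control character, while the real EBCDIC
space sits at hex 40.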
 

Antoine


