RE: Chapter on character sets

From: Mike Brown (mbrown@corp.webb.net)
Date: Thu Jun 15 2000 - 13:59:03 EDT


Nice work, Lars.

A few months ago I started writing an XML tutorial for my coworkers, but got
so bogged down in understanding these issues that I decided it was better to
write it as "a reintroduction to XML with an emphasis on character
encoding." I still haven't finished it, but with a lot of help from the
Unicode list I got it to a point that I think is about 98% accurate.

http://www.skew.org/xml/tutorial/ is where it lives. Use it, digest it,
regurgitate it, and please, if there are any inaccuracies whatsoever, report
them to me!

I think I still need to clean up some ambiguities between IANA-registered
character set names and the names of the standards they are based on
("US-ASCII" vs ISO 646-US and ANSI X3.4 for example).

On that note, here is something I wrote for someone just a few minutes ago.
Can someone review the statements I make below regarding ASCII?

Thanks.

---

The ANSI X3.4 "ASCII" standard from 1968 defines character assignments for hex numbers 00 through 7F: control characters at 00 through 1F and 7F (delete), and printable characters at 20 through 7E. The printable range is pretty much just the things you see on American keyboards, minus the control functions like Shift, Enter, etc. In an 8-bit encoding scheme, the byte sequences used to represent the ASCII numbers are single bytes with the same value as the numbers themselves.
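A quick sketch of that last point, in Python (a hypothetical example, not part of any standard): in an 8-bit scheme, each ASCII character's byte value is the same as its character number.

```python
# In an 8-bit encoding scheme, the bytes of an ASCII-encoded string
# have the same values as the characters' assigned numbers.
s = "A~ "  # 'A' = 0x41, '~' = 0x7E, space = 0x20

byte_values = list(s.encode("ascii"))   # byte values of the encoded string
char_numbers = [ord(c) for c in s]      # character-number assignments

print(byte_values)    # [65, 126, 32]
print(char_numbers)   # [65, 126, 32] -- identical
```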

The ISO 646 standard was formalized in 1972, and provided variants of ASCII for different countries (ISO 646-XX, where XX is one of about a dozen country codes). In addition to the 20 through 7E range, it also includes the C0 control set for non-displayable characters assigned to 00 through 1F, and the delete character at 7F. If the ECMA-6 standard is as equivalent to ISO 646 as I am led to believe, then some leeway is allowed for currency symbols: hex position 23 can be # or £, and 24 can be $ or ¤.

The character set defined by the ISO 646-US standard is now known as "US-ASCII" due to its IANA registration for use on the Internet. It defines hex position 23 to be # and 24 to be $. It is a subset of nearly every character encoding in common use; the main exception is IBM's EBCDIC, an encoding for mainframes whose layout was supposedly easier to read on punch cards.

The ISO/IEC 8859-1 "Latin-1" standard defines character assignments for hex numbers A0 through FF, covering the characters used in the major (Western) European languages that are not already covered by ASCII, plus a few international symbols. These include characters with diacritical marks/accents, «French quotation marks», the non-breaking space, the copyright symbol, etc.

"ISO-8859-1" (note the extra hyphen) is the IANA-registered character set that covers hex positions 00 to FF, subsetting US-ASCII and the C1 control set (80 to 9F).

US-ASCII is by definition a 7-bit character set, although 8-bit bytes are commonly used nowadays to transmit US-ASCII encoded character sequences. ISO-8859-1 requires 8-bit bytes. In either case, you have 1 byte per character.
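To make the one-byte-per-character point concrete, here is a small Python sketch (my own illustration, not from either standard):

```python
# Both US-ASCII and ISO-8859-1 map each character to exactly one byte.
print(len("hello".encode("ascii")))        # 5 characters -> 5 bytes

# Latin-1 characters beyond the ASCII range still take one byte each:
latin1 = "héllo".encode("iso-8859-1")
print(len(latin1))        # 5 bytes
print(hex(latin1[1]))     # 0xe9 -- the single byte for é
```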

The ISO/IEC 10646-1 "Universal Character Set" standard, which from a user's standpoint is equivalent to The Unicode Standard, defines character assignments for hex numbers 00 through 10FFFF, although not in a completely contiguous range. Since the range goes beyond FF, it cannot simply imply 1 byte per character like its predecessors. Thus, among other things, it introduces a distinction between the assignment of characters to numbers, and the conversion of numbers to sequences of bytes or other fixed-bit-width code values.
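That distinction is easy to see in code. A Python sketch (my example; the encoding names are Python's codec names): one character number, but different byte sequences depending on the serialization chosen.

```python
# The assignment of a character to a number is separate from the
# conversion of that number to bytes.
cp = ord("é")                    # character-to-number assignment
print(hex(cp))                   # 0xe9

print("é".encode("utf-8"))       # number-to-bytes, one way:  b'\xc3\xa9'
print("é".encode("utf-16-be"))   # same number, another way:  b'\x00\xe9'
```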

The UTF-8 amendment to the ISO/IEC 10646-1 standard defines an algorithm for converting the ISO 10646-1 character numbers to sequences of one to four 8-bit bytes. (Strictly speaking, the form in RFC 2279 allows sequences of up to six bytes in order to cover the full 31-bit UCS range, but characters through 10FFFF need at most four.) It has also been formalized in the IETF's RFC 2279, and "UTF-8" is an IANA-registered character set.
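You can watch the sequence length grow with the character number. A Python sketch (my own example characters):

```python
# UTF-8 sequence length grows with the character number:
# ASCII -> 1 byte, Latin-1 range -> 2, BMP symbols -> 3, beyond FFFF -> 4.
for ch in ["A", "é", "€", "\U00010000"]:
    print(hex(ord(ch)), len(ch.encode("utf-8")))
# 0x41 1
# 0xe9 2
# 0x20ac 3
# 0x10000 4
```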

In some ways, UTF-8 is nice, because if you are dealing mostly with ASCII-range characters, the mapping is 1-to-1 (US-ASCII 00-7F = UTF-8 00-7F) and you can use your favorite text editor, terminal display, or web browser with it, without caring whether the application is aware that it's dealing with UTF-8 and not ISO-8859-1 or an OS-specific encoding like Windows' CP-1252.

In other ways, UTF-8 is problematic, because most people aren't aware that ISO 8859-1 range characters don't enjoy the same 1-to-1 byte mapping, and they end up having problems when they try to work with those characters and their ISO-8859-1 byte values. People would understand character encoding issues better if every UTF-8 byte sequence always showed up as obvious gibberish in applications that aren't UTF-8 aware — which is already the situation for people who don't use ASCII-range characters at all.
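Here is exactly the mismatch I mean, sketched in Python (my example): the Latin-1 byte for é is not what UTF-8 produces, and reading UTF-8 bytes as if they were Latin-1 yields the familiar two-character garbage.

```python
# The same character, serialized two different ways:
latin1_bytes = "é".encode("iso-8859-1")   # b'\xe9'      -- one byte
utf8_bytes = "é".encode("utf-8")          # b'\xc3\xa9'  -- two bytes
print(latin1_bytes, utf8_bytes)

# Misinterpreting the UTF-8 bytes as ISO-8859-1 produces gibberish:
print(utf8_bytes.decode("iso-8859-1"))    # Ã©
```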

[I must credit Roman Czyborra for the ASCII and EBCDIC information that I gleaned from czyborra.com.]

- Mike
____________________________________________________________________
Mike J. Brown, software engineer at webb.net in Denver, Colorado, USA
My XML/XSL resources: http://www.skew.org/xml/



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:03 EDT