From: Jungshik Shin (jshin@mailaps.org)
Date: Fri Aug 20 2004 - 22:33:30 CDT
John Tisdale wrote:
> Unicode Fundamentals
> Early character sets were very limited in scope. ASCII required only 7 bits
> to represent its repertoire of 128 characters. ANSI pushed this scope 8 bits
> which represented 256 characters while providing backward compatibility with
> ASCII. Countless other character sets emerged that represented the
As is often the case, Unicode experts are not necessarily experts on
'legacy' character sets and encodings. The 'official' name of 'ASCII' is
ANSI X3.4-1968 or ISO 646 (US). While dispelling myths about Unicode,
I'm afraid you're spreading misinformation about what came before it.
The sentence that 'ANSI pushed this scope ... represents 256 characters'
is misleading. ANSI has nothing to do with the various single-, double-, and
triple-byte character sets that make up single- and multibyte character
encodings. They were devised and published by national and international
standards organizations as well as various vendors. Perhaps you'd better
just get rid of the sentence 'ANSI pushed ... providing backward
compatibility with ASCII'.
> characters needed by various languages and language groups. The growing
> complexities of managing numerous international character sets escalated the
numerous national and vendor character sets that are specific to a
small subset of scripts/characters in use (or that can cover only a
small subset of ....)
> Two standards emerged about the same time to address this demand. The
> Unicode Consortium published the Unicode Standard and the International
> Organization for Standardization (ISO) offered the ISO/IEF 10646 standard.
A typo: it's ISO/IEC, not ISO/IEF. Perhaps it's not a typo; you
consistently used ISO/IEF in place of ISO/IEC ;-)
> Fortunately, these two standards bodies synchronized their character sets
> some years ago and continue to do so as new characters are added.
> Yet, although the character sets are mapped identically, the standards for
> encoding them vary in many ways (which are beyond the scope of this
> article).
I'm afraid that 'Yet ...' can give the false impression that the Unicode
Consortium and ISO/IEC have some differences in their encoding standards,
especially considering that the sentence begins with 'although .... identically'.
> Coded Character Sets
> A coded character set (sometimes called a character repertoire) is a mapping
> from a set of abstract characters to a set of nonnegative, noncontiguous
> integers (between 0 and 1,114,111, called code points).
A 'character repertoire' is different from a coded character set in
that it's more like a set of abstract characters **without** numbers
associated with them. (Needless to say, a 'coded character set' is a set
of character-integer pairs.)
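For instance, in Python (just a quick sketch; the characters are only
examples), ord() exposes exactly that character-to-integer pairing:

    # A coded character set pairs abstract characters with code points;
    # a bare repertoire would be the characters alone, with no numbers.
    print(ord(u'A'))        # 65     (U+0041 LATIN CAPITAL LETTER A)
    print(ord(u'\u00e9'))   # 233    (U+00E9 LATIN SMALL LETTER E WITH ACUTE)
    print(ord(u'\uac00'))   # 44032  (U+AC00 HANGUL SYLLABLE GA)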
> Character Encoding Forms
> The second component in Unicode is character encoding forms. Their purpose
I'm not sure whether 'component' is the best word to use here.
> The Unicode Standard provides three forms for encoding its repertoire
> (UTF-8, UTF-16 and UTF-32).
Note that ISO 10646:2003 also defines all three of them exactly the same
way Unicode does.
> You will often find references to USC-2 and
> USC-4. These are competing encoding forms offered by ISO/IEF 10646 (USC-2 is
> equivalent to UTF-16 and USC-4 to UTF-32). I will not discuss the
UCS-2 IS different from UTF-16. UCS-2 can only represent a subset of
characters in Unicode/ISO 10646 (namely, those in the BMP). BTW, it's not
USC but UCS. Also note that UTF in UTF-16/UTF-32/UTF-8 stands for either
'UCS Transformation Format' (UCS stands for Universal Character Set, ISO
10646) or 'Unicode Transformation Format'.
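To see the difference concretely (a small Python sketch; any build that
can handle non-BMP characters will do): a character outside the BMP needs
a surrogate pair in UTF-16, which UCS-2 has no way to express at all.

    c = u'\U0001D11E'   # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP
    print(repr(c.encode('utf-16-be')))
    # -> the surrogate pair D834 DD1E (4 bytes, i.e. 2 UTF-16 code units);
    # UCS-2 stops at U+FFFF and simply cannot represent this character.
    print(repr(u'\uac00'.encode('utf-16-be')))
    # -> AC 00: for BMP characters, UCS-2 and UTF-16 happen to coincide.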
> significant enough to limit its implementation (as at least half of the 32
> bits will contain zeros in the majority of applications). Except in some
> UNIX operating systems and specialized applications with specific needs,
Note that ISO C 9x specifies that wchar_t be UTF-32/UCS-4 when
__STDC_ISO_10646__ is defined. Recent versions of Python can also use
UTF-32 internally.
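(If you're curious which internal form your own Python was built with,
sys.maxunicode gives a hint; this is only a rough check and build options
vary:)

    import sys
    # A 'wide' (UCS-4/UTF-32) build reports the full code point range;
    # a 'narrow' build stops at the BMP.
    if sys.maxunicode == 0x10FFFF:
        print('wide build: 32-bit internal representation')
    else:
        print('narrow build: sys.maxunicode == 0x%X' % sys.maxunicode)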
> UTF-32 is seldom implemented as an end-to-end solution (yet it does have its
> strengths in certain applications).
> UTF-16 is the default means of encoding the Unicode character repertoire
> (which has perhaps played a role in the misnomer that Unicode is a 16-bit
> character set).
I would not say UTF-16 is the default means of encoding. It's
probably the most widely used, but that's different from being the
default ... unless you're talking specifically about Win32 APIs (you're
not in this paragraph, right?)
> UTF-8 is a variable-width encoding form based on byte-sized code units
> (ranging between 1 and 4 bytes per code unit).
The code unit of UTF-8 is an 8-bit byte, just as the code units of
UTF-16 and UTF-32 are a 16-bit 'half-word' and a 32-bit 'word',
respectively. A single Unicode character is represented with 1 to 4 code
units (bytes), depending on which code point it is assigned in Unicode.
Please see p. 73 of the Unicode Standard 4.0.
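A quick way to see that (again a Python sketch; the characters are just
examples) is to count the bytes each character takes in UTF-8:

    # One code point each, but 1 to 4 UTF-8 code units (bytes),
    # depending on where the code point sits in the code space.
    for c in (u'A', u'\u00e9', u'\uac00', u'\U0001D11E'):
        print(len(c.encode('utf-8')))
    # -> 1, 2, 3, 4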
> In UTF-8, the high bits of each byte are reserved to indicate where in the
> unit code sequence that byte belongs. A range of 8-bit code unit values are
where in the code unit sequence that byte belongs.
> reserved to indicate the leading byte and the trailing byte in the sequence.
> By sequencing four bytes to represent a code unit, UTF-8 is able to
> represent the entire Unicode character repertoire.
By using one to four code units (bytes) to represent a character
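To make the lead/trailing byte point concrete (still just a sketch):

    # In UTF-8 the high bits of each byte say what role it plays:
    #   0xxxxxxx                        single-byte (ASCII) character
    #   110xxxxx / 1110xxxx / 11110xxx  lead byte of a 2/3/4-byte sequence
    #   10xxxxxx                        trailing byte
    for b in bytearray(u'\uac00'.encode('utf-8')):   # U+AC00 -> EA B0 80
        print(format(b, '08b'))
    # 11101010  lead byte of a 3-byte sequence
    # 10110000  trailing byte
    # 10000000  trailing byte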
> Character Encoding Schemes
> method. This issue is not relevant with UTF-8 because it utilizes individual
> bytes that are encapsulated with the sequencing data (with bounded look
> ahead).
'because ....' reads as too cryptic. Why not just say that 'the
code unit in UTF-8 is a byte, so there's no need for serialization'
(i.e. sequences of code units in UTF-8 are identical to sequences of
bytes in UTF-8)?
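Put differently (a small sketch): UTF-16 code units have to be serialized
into bytes in some order, while a UTF-8 code unit already *is* a byte, so
there is nothing left to decide:

    u = u'\uac00'                       # U+AC00
    print(repr(u.encode('utf-16-be')))  # AC 00  (big-endian scheme)
    print(repr(u.encode('utf-16-le')))  # 00 AC  (little-endian scheme)
    # One encoding form, two byte serializations, hence the need for a BOM
    # or an out-of-band label. UTF-8 has exactly one serialization:
    print(repr(u.encode('utf-8')))      # EA B0 80 on every machine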
> Choosing an Encoding Solution
> high value in the multi-platform world of the Web. As such, HTML and current
> versions of Internet Explorer running on Windows 2000 or later use the UTF-8
> encoding form. If you try to force UTF-16 encoding on IE, you will encounter
> an error or it will default to UTF-8 anyway.
I'm not sure what you're trying to say here, although I can't agree with
you more that UTF-8 is the most sensible choice for transmitting information
(**serving** documents) over 'mostly' byte-oriented protocols/media such
as internet mail and HTML/XML (HTML/XML can be in UTF-16/UTF-32 as
well). As a web user agent/**client**, MS IE can (and must) render
documents in UTF-16 just as well as documents in UTF-8 and many other
character encodings. It even supports UTF-7.
> valuable asset. The extra code and processor bandwidth required to
> accommodate variable-width code units can outweigh the cost of using 32-bits
> to represent each code unit.
You keep misusing 'code unit'. Code units cannot be variable-width;
the width is fixed in each encoding form: 8 bits in UTF-8, 16 bits in
UTF-16, and 32 bits in UTF-32. The last sentence should end with
'to represent each character'.
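In other words (sketch): the code unit size is a constant of each encoding
form; what varies is how many code units a character needs.

    c = u'\U0001D11E'                       # one character, outside the BMP
    print(len(c.encode('utf-8')))           # 4 code units of  8 bits
    print(len(c.encode('utf-16-be')) // 2)  # 2 code units of 16 bits
    print(len(c.encode('utf-32-be')) // 4)  # 1 code unit  of 32 bits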
> In such cases, the internal processing can be done using UTF-32 encoding and
> the results can be transmitted or stored in UTF-16
can be transmitted or stored in UTF-16 or UTF-8.
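Something like this (a sketch; the text is just an example): decode the
incoming bytes to fixed-width code points for processing, then re-encode
to a byte-oriented form for storage or transmission.

    incoming = u'\uc720\ub2c8\ucf54\ub4dc'.encode('utf-8')  # 'Unicode' in Korean, as UTF-8 bytes
    text = incoming.decode('utf-8')       # fixed-width code points internally
    processed = text[::-1]                # stand-in for the real processing
    outgoing = processed.encode('utf-8')  # or .encode('utf-16') for storage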
Hope this helps,
Jungshik