From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Aug 12 2005 - 14:56:06 CDT
From: "Michael (michka) Kaplan" <michka@trigeminal.com>
> Markets and software products and keyboards and fonts work to define
> characters to use. Unicode does not really do this.
If you mean the joint Unicode/ISO 10646 standard here, you're right: there's
only one encoding.
However the Unicode Consortium hosts the CLDR registry which tries to define
such minimal subsets for supporting each language. This registry is not a
standard for now, but a joint effort to harmonize locale data across
systems/platforms, something needed to build such portable keyboards, fonts,
applications and so on. So the CLDR project will help increase
interoperability of systems designed to support well-categorized families of
languages.
What is important is not to mix the various weak definitions of the
"charset" term. In legacy applications, the term is shorthand for the
three-way association of
- a set of abstract characters (preferably mapped one-to-one into the
Unicode/ISO 10646 standard repertoire, but this is not an obligation
observed by many legacy or application-specific new charsets),
- with a binary encoding to represent them with code positions,
- and with a serialization scheme to build and interpret encoded streams of
bytes as code positions.
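To make the three layers concrete, here is a small Python sketch. It uses Unicode itself as the example, with UTF-8 and UTF-16 as the serialization schemes; the sample characters and variable names are my own choice for illustration only.

```python
# The three layers of a "charset", illustrated with Unicode.

# 1. Repertoire: an unordered set of abstract characters.
repertoire = {"A", "\u00e9", "\u20ac"}  # A, e with acute, euro sign

# 2. Encoding: each abstract character gets a numeric code position
#    (called a "code point" in Unicode/ISO 10646).
code_points = {ch: ord(ch) for ch in repertoire}

# 3. Serialization: code positions become a stream of bytes. The same
#    code points give different byte streams under each scheme.
text = "A\u00e9\u20ac"
utf8 = text.encode("utf-8")
utf16_be = text.encode("utf-16-be")
utf16_le = text.encode("utf-16-le")

for ch in text:
    print(f"U+{ord(ch):04X}")
print(utf8.hex(" "), utf16_be.hex(" "), utf16_le.hex(" "), sep="\n")
```

The repertoire and the code points stay the same in all three cases; only the last layer changes.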
In Unicode/ISO 10646 the code positions are preferably called "code points",
because Unicode/ISO 10646 is now used as the internal codification to map
almost all other charsets (so when studying these charsets, we need two
terms to make the distinction between their intrinsic "code positions", and
the represented Unicode "code points" to which they are mapped).
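As an illustration of the distinction, the same abstract character can sit at different code positions in different legacy charsets while mapping to a single Unicode code point. The sketch below relies on the codecs shipped with Python; the helper name code_position is hypothetical.

```python
def code_position(char: str, charset: str) -> int:
    """Return the code position of one character in a single-byte charset."""
    encoded = char.encode(charset)
    if len(encoded) != 1:
        raise ValueError(f"{charset} does not encode {char!r} in a single byte")
    return encoded[0]

E_ACUTE = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE, code point U+00E9

# One code point, several intrinsic code positions.
positions = {
    name: code_position(E_ACUTE, name)
    for name in ("latin-1", "mac_roman", "cp437")
}

for name, pos in positions.items():
    print(f"{name}: code position 0x{pos:02X} -> code point U+{ord(E_ACUTE):04X}")
```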
"charset" must not be confused with "character set" which refers only to a
set of abstract characters (this set is called a "repertoire"),
independently of its encoding, and independently of the fact that this
repertoire *may* contain abstract characters absent from the standard
Unicode/ISO 10646 repertoire.
The Unicode/ISO 10646 repertoire aims to contain almost all
other repertoires, provided that these repertoires refer to abstract
characters that are not specific to a private application (for example the
legacy MacOS Roman repertoire contains an abstract character which represents
the Apple logo, an abstract character which is absent from the Unicode/ISO
10646 repertoire).
For these last "missing" characters, Unicode/ISO 10646 offers a way to
"map" them to code points, using a private agreement (which can be formulated
as a character mapping table) that maps these characters to code points
in the "Private Use Area", where Unicode/ISO 10646 defines no semantics
and where no standard abstract character will ever be
encoded. This way, ISO 10646 effectively offers a way to map all other
legacy repertoires, including those that contain abstract characters absent
from the standard ISO 10646 repertoire.
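Such a private agreement could be sketched as follows. The table PRIVATE_MAP and the function names are hypothetical; the choice of U+F8FF for code position 0xF0 follows Apple's own convention for the logo, and all other bytes are handed to Python's standard mac_roman codec.

```python
import unicodedata

# Private agreement: MacOS Roman code position -> Private Use Area code point.
PRIVATE_MAP = {0xF0: "\uf8ff"}  # Apple logo, by Apple's convention

def decode_mac_roman(data: bytes) -> str:
    """Decode MacOS Roman bytes, applying the private mapping table first."""
    out = []
    for byte in data:
        if byte in PRIVATE_MAP:
            out.append(PRIVATE_MAP[byte])
        else:
            out.append(bytes([byte]).decode("mac_roman"))
    return "".join(out)

def is_private_use(ch: str) -> bool:
    """True if the code point has general category Co (private use)."""
    return unicodedata.category(ch) == "Co"

decoded = decode_mac_roman(b"Caf\x8e \xf0")
print([f"U+{ord(ch):04X}" for ch in decoded])
```

Only the parties sharing the mapping table will read U+F8FF as the Apple logo; for anyone else it is a code point with no defined meaning.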
The use of nonstandard characters is not recommended in applications built
and tested to work with the ISO 10646 repertoire only. But within this
limitation, the legacy charsets that contain these characters can be used
and interchanged safely (for example it's safe to interchange text data
encoded with MacOS Roman, provided that it does not contain the Apple logo
character).
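One way an application might enforce this limitation is to scan already decoded text for private-use code points before interchange. This is only a sketch under that assumption; the function names are hypothetical, and the test is simply the Unicode general category Co.

```python
import unicodedata

def private_use_positions(text: str) -> list:
    """Return the indexes of all private-use code points in the text."""
    return [i for i, ch in enumerate(text) if unicodedata.category(ch) == "Co"]

def safe_to_interchange(text: str) -> bool:
    """True if the text contains no private-use code points."""
    return not private_use_positions(text)

print(safe_to_interchange("Caf\u00e9"))          # standard characters only
print(safe_to_interchange("Caf\u00e9 \uf8ff"))   # contains a private-use code point
```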