In this mail, I'm trying to deal with inter-related issues relevant to
three mailing lists: xml, html and unicode. First an extract from the
HTML 4.0 draft spec:
10.1.2 The SGML Declaration
<!SGML "ISO 8879:1986"
--
SGML Declaration for HyperText Markup Language version 4.0
With support for Unicode UCS-4 and increased limits
for tag and literal lengths etc.
--
CHARSET
    BASESET  "ISO Registration Number 177//CHARSET
              ISO/IEC 10646-1:1993 UCS-4 with
              implementation level 3//ESC 2/5 2/15 4/6"
    DESCSET  0     9            UNUSED
             9     2            9
             11    2            UNUSED
             13    1            13
             14    18           UNUSED
             32    95           32
             127   1            UNUSED
             128   32           UNUSED
             160   2147483486   160
--
In ISO 10646, the positions with hexadecimal
values 0000D800 - 0000DFFF, used in the UTF-16
encoding of UCS-4, are reserved, as well as the last
two code values in each plane of UCS-4, i.e. all
values of the hexadecimal form xxxxFFFE or xxxxFFFF.
These code values or the corresponding numeric
character references must not be included when
generating a new HTML document, and they should be
ignored if encountered when processing a HTML
document.
--
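The restriction stated in the closing comment of the extract can be sketched as a small check. This is only an illustration of the rule as quoted; the function name `is_excluded_code` is my own and does not come from the HTML 4.0 draft.

```python
def is_excluded_code(code: int) -> bool:
    """Return True for code values the quoted comment says must not be used."""
    # Positions 0000D800 - 0000DFFF are reserved for the UTF-16 encoding.
    if 0xD800 <= code <= 0xDFFF:
        return True
    # The last two code values of each plane have the form xxxxFFFE or xxxxFFFF.
    return (code & 0xFFFF) in (0xFFFE, 0xFFFF)


print(is_excluded_code(0xD800))   # reserved for UTF-16
print(is_excluded_code(0x1FFFE))  # last-but-one value of a plane
print(is_excluded_code(0x41))     # ordinary character
```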
Each row of the DESCSET has three columns (let us call them A, B and C).
Their meaning is as follows (if you are an SGML expert, please feel free
to correct me): the B characters starting at position A in the document
character set are defined by the B characters starting at position C in
the base character set. Where column C reads UNUSED, those B positions
of the document character set are assigned no character.
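To make the column semantics concrete, here is a minimal sketch that looks up a document character number in the HTML 4.0 DESCSET quoted above. It assumes my reading of the columns is right; the names `HTML40_DESCSET` and `base_position` are made up for this illustration, and `None` stands in for UNUSED.

```python
from typing import Optional

# Rows (A, B, C) copied from the HTML 4.0 DESCSET; None stands for UNUSED.
HTML40_DESCSET = [
    (0, 9, None),
    (9, 2, 9),
    (11, 2, None),
    (13, 1, 13),
    (14, 18, None),
    (32, 95, 32),
    (127, 1, None),
    (128, 32, None),
    (160, 2147483486, 160),
]


def base_position(descset, doc_code: int) -> Optional[int]:
    """Map a document character number to its base-set number, or None."""
    for a, b, c in descset:
        # The row covers the B positions A, A+1, ..., A+B-1.
        if a <= doc_code < a + b:
            return None if c is None else c + (doc_code - a)
    return None  # not described by any row


print(base_position(HTML40_DESCSET, 65))   # inside the row "32 95 32"
print(base_position(HTML40_DESCSET, 128))  # inside the row "128 32 UNUSED"
```

Because every used row of this DESCSET has A equal to C, each described position maps to the same number in the base set.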
In the case of HTML 4.0, both the document character set and the base
character set are ISO 10646. The XML spec is confused in that it refers
to UCS-2 as the BASESET, yet speaks of ISO 10646 planes beyond the BMP.
Further confusion arises when the following two concepts are not kept
apart:
3.2.1: Coded Character Set
A Coded Character Set (CCS) is a mapping from a set of abstract
characters to a set of integers. Examples of coded character sets
are ISO 10646 [ISO-10646], US-ASCII [ASCII], and ISO-8859 series
[ISO-8859].
3.2.2: Character Encoding Scheme
A Character Encoding Scheme (CES) is a mapping from a Coded Character
Set or several coded character sets to a set of octets. Examples of
Character Encoding Schemes are ISO 2022 [ISO-2022] and UTF-8 [UTF-8].
A given CES is typically associated with a single CCS; for example,
UTF-8 applies only to ISO 10646.
The above quote is taken from RFC 2130, "The Report of the IAB Character
Set Workshop held 29 February - 1 March, 1996".
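The distinction between the two definitions can be shown in a few lines. This sketch assumes Python's `str` type as a stand-in for ISO 10646 as a Coded Character Set, and uses Python's `utf-32-be` codec as a stand-in for big-endian UCS-4; neither choice comes from RFC 2130.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

# Coded Character Set: abstract character -> integer.
code = ord(ch)

# Character Encoding Schemes: the same integer -> different octet sequences.
as_utf8 = ch.encode("utf-8")
as_utf16 = ch.encode("utf-16-be")
as_ucs4 = ch.encode("utf-32-be")  # stand-in for big-endian UCS-4

print(code)
print(as_utf8.hex(), as_utf16.hex(), as_ucs4.hex())
```

The integer stays the same in all three cases; only the octets differ.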
The BASESET should logically be a Coded Character Set, not a Character
Encoding Scheme. The HTML 2.0 spec contains an example of this:
CHARSET
    BASESET  "ISO 646:1983//CHARSET
              International Reference Version
              (IRV)//ESC 2/5 4/0"
    DESCSET  0     9     UNUSED
             9     2     9
             11    2     UNUSED
             13    1     13
             14    18    UNUSED
             32    95    32
             127   1     UNUSED
    BASESET  "ISO Registration Number 100//CHARSET
              ECMA-94 Right Part of
              Latin Alphabet Nr. 1//ESC 2/13 4/1"
    DESCSET  128   32    UNUSED
             160   96    32
The second BASESET above is clearly a Coded Character Set, not a
Character Encoding Scheme. The characters in this Coded Character Set
are numbered from 32 (decimal). When this Coded Character Set is made
into a Character Encoding Scheme, character 32 is typically encoded as
the octet 160 (decimal), as it is in 8-bit ISO 8859-1.
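The row "160 96 32" of the second DESCSET can be worked through as follows. This is a sketch under my reading of the DESCSET columns; the constant and function names are invented for the example, and Python's `latin-1` codec is used only to confirm which character sits at octet 160.

```python
# Second BASESET of the HTML 2.0 declaration: DESCSET 160 96 32
DOC_START, COUNT, BASE_START = 160, 96, 32


def right_part_number(doc_code: int) -> int:
    """Number of a document character within the 96-character right part."""
    if not DOC_START <= doc_code < DOC_START + COUNT:
        raise ValueError("not in the range described by this DESCSET row")
    return BASE_START + (doc_code - DOC_START)


# Document character 160 is character 32 of the Coded Character Set ...
print(right_part_number(160))
# ... and octet 160 in the 8-bit encoding is NO-BREAK SPACE (U+00A0).
print(bytes([160]).decode("latin-1") == "\u00a0")
```

The numbers 32 and 160 differ by 128, which is the difference between the number in the Coded Character Set and the octet value in the 8-bit encoding.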
At the moment, both HTML 4.0 and XML name Character Encoding Schemes in
their BASESET declarations: HTML 4.0 names UCS-4 and XML names UCS-2. I
am working to change this by getting a new registration into the
International Register, one which:
1. corresponds to ISO 10646/Unicode as a Coded Character Set, not
to any particular Character Encoding Scheme, and
2. corresponds to ISO 10646/Unicode after Amendments 1-7 and
includes all future Amendments which add characters but do not
change, move or remove them.
Finally, an extract from ISO 2375, which governs the International
Register. It sheds light on the possibility of getting an open-ended
registration accepted:
8 Revision procedure
8.1 In general no changes to registrations are permitted, ...
8.2 The Registration Authority may exceptionally grant a waiver to
international, governmental organisations issuing
internationally recognised and world-wide implemented standards.
However, the possibility that a registration may be modified in
future without allocation of a new escape sequence shall be
mentioned in the first application papers and in the register.
------------------------------------------------------------------------
Misha Wolf Email: misha.wolf@reuters.com 85 Fleet Street
Standards Manager Voice: +44 171 542 6722 London EC4P 4AJ
Reuters Limited Fax : +44 171 542 8314 UK
------------------------------------------------------------------------
Eleventh International Unicode Conference, Sep 2-5 1997, www.unicode.org
------------------------------------------------------------------------
Any views expressed in this message are those of the individual sender,
except where the sender specifically states them to be the views of
Reuters Ltd.