* Mark Davis
|
| The HTML spec depends on the SGML spec for a characterization of
| allowable characters. The latter, unfortunately, disallows some
| valid Unicode characters (most C0 controls), but inconsistently
| allows other similar characters (C1 controls).
SGML itself is silent on which characters are allowed. That is decided
by the SGML declaration each application uses, and you can easily write
an SGML declaration that allows every Unicode character.
To wit:
<!SGML "ISO 8879:1986 (WWW)"
    CHARSET
         BASESET  "ISO Registration Number 177//CHARSET
                   ISO/IEC 10646-1:1993 UCS-4 with
                   implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET  0       55296   0
                  55296   2048    UNUSED  -- SURROGATES --
                  57344   1056768 57344

    CAPACITY SGMLREF
             TOTALCAP 150000
             GRPCAP   150000
             ENTCAP   150000

    SCOPE    DOCUMENT
    SYNTAX
         SHUNCHAR NONE
         BASESET  "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET  0 128 0

         FUNCTION
                  RE           13
                  RS           10
                  SPACE        32
                  TAB SEPCHAR   9

         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  HCRO     "&#38;#x" -- 38 is the number for ampersand --
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  ATTCNT   60      -- increased --
                  ATTSPLEN 65536   -- These are the largest values --
                  LITLEN   65536   -- permitted in the declaration --
                  NAMELEN  65536   -- Avoid fixed limits in actual --
                  PILEN    65536   -- implementations of HTML UA's --
                  TAGLVL   100
                  TAGLEN   65536
                  GRPGTCNT 150
                  GRPCNT   64

    FEATURES
      MINIMIZE
        DATATAG  NO
        OMITTAG  YES
        RANK     NO
        SHORTTAG YES
      LINK
        SIMPLE   NO
        IMPLICIT NO
        EXPLICIT NO
      OTHER
        CONCUR   NO
        SUBDOC   NO
        FORMAL   YES
      APPINFO    NONE
>
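
The arithmetic in the DESCSET rows above can be checked mechanically. The
following is only a rough sketch, not anything an SGML parser does: it
copies the three rows into a Python list (the names DESCSET, is_allowed
and first_uncovered are made up for this illustration) and confirms that
the rows are contiguous, stop at 1114112 (hex 110000), and leave only the
surrogate block unused.

```python
# Each row mirrors one DESCSET line: (start, count, target).
# A target of None stands for UNUSED.
DESCSET = [
    (0, 55296, 0),
    (55296, 2048, None),      # surrogates
    (57344, 1056768, 57344),
]


def is_allowed(code_point, rows=DESCSET):
    """Return True if some row maps code_point to a character."""
    for start, count, target in rows:
        if start <= code_point < start + count:
            return target is not None
    return False  # outside every row


def first_uncovered(rows=DESCSET):
    """Return the first code point past the rows; raise if they have gaps."""
    expected = 0
    for start, count, _target in rows:
        if start != expected:
            raise ValueError(f"gap or overlap at {expected}")
        expected = start + count
    return expected


if __name__ == "__main__":
    print(hex(first_uncovered()))
    print(is_allowed(0x01), is_allowed(0x85), is_allowed(0xD800))
```

With SHUNCHAR NONE as well, this declaration puts no restriction on C0 or
C1 control characters; only the surrogate range is excluded.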
| That means that it is not possible in HTML (or more importantly, in
| XML) to represent all valid Unicode characters in data fields.
What would you want to use control characters for in an XML document?
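
For reference, the asymmetry Mark describes can be seen in the Char
production of XML 1.0 as I read it. This is a small sketch only: the
ranges are transcribed by hand from that production, and the names
XML10_CHAR_RANGES, is_xml10_char and excluded_c0_controls are invented
for the example.

```python
# Code point ranges (inclusive) of the XML 1.0 Char production,
# transcribed by hand for this illustration.
XML10_CHAR_RANGES = [
    (0x9, 0xA),
    (0xD, 0xD),
    (0x20, 0xD7FF),
    (0xE000, 0xFFFD),
    (0x10000, 0x10FFFF),
]


def is_xml10_char(code_point):
    """Return True if code_point falls in one of the ranges above."""
    return any(lo <= code_point <= hi for lo, hi in XML10_CHAR_RANGES)


def excluded_c0_controls():
    """List the C0 controls (0x00-0x1F) that the ranges leave out."""
    return [cp for cp in range(0x20) if not is_xml10_char(cp)]


if __name__ == "__main__":
    print(len(excluded_c0_controls()))
    print(all(is_xml10_char(cp) for cp in range(0x80, 0xA0)))
```

So every C0 control except tab, line feed and carriage return is left
out, while the C1 controls (0x80 to 0x9F) all fall inside an allowed
range.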
--Lars M.