Re: Is there Unicode mail out there?

From: Lars Marius Garshol (larsga@garshol.priv.no)
Date: Tue Jul 17 2001 - 05:28:00 EDT


* Mark Davis
|
| The HTML spec depends on the SGML spec for a characterization of
| allowable characters. The latter, unfortunately, disallows some
| valid Unicode characters (most C0 controls), but inconsistently
| allows other similar characters (C1 controls).

SGML is silent on the issue of what characters are allowed. It is the
SGML declaration used by each application which decides this, and you
can easily make an SGML declaration which allows every Unicode
character.

To wit:

<!SGML "ISO 8879:1986 (WWW)"
     CHARSET
          BASESET "ISO Registration Number 177//CHARSET
                    ISO/IEC 10646-1:1993 UCS-4 with
                    implementation level 3//ESC 2/5 2/15 4/6"
         DESCSET 0 55296 0
                 55296 2048 UNUSED -- SURROGATES --
                 57344 1056768 57344

CAPACITY SGMLREF
                TOTALCAP 150000
                GRPCAP 150000
                ENTCAP 150000

SCOPE DOCUMENT
SYNTAX
         SHUNCHAR NONE
         BASESET "ISO 646IRV:1991//CHARSET
                   International Reference Version
                   (IRV)//ESC 2/8 4/2"
         DESCSET 0 128 0

         FUNCTION
                  RE 13
                  RS 10
                  SPACE 32
                  TAB SEPCHAR 9

         NAMING LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-_:"
                  UCNMCHAR ".-_:"
                  NAMECASE GENERAL YES
                           ENTITY NO

         DELIM GENERAL SGMLREF
                  HCRO "&#38;#x" -- 38 is the number for ampersand --
                  SHORTREF SGMLREF
         NAMES SGMLREF
         QUANTITY SGMLREF
                  ATTCNT 60 -- increased --
                  ATTSPLEN 65536 -- These are the largest values --
                  LITLEN 65536 -- permitted in the declaration --
                  NAMELEN 65536 -- Avoid fixed limits in actual --
                  PILEN 65536 -- implementations of HTML UA's --
                  TAGLVL 100
                  TAGLEN 65536
                  GRPGTCNT 150
                  GRPCNT 64

FEATURES
  MINIMIZE
    DATATAG NO
    OMITTAG YES
    RANK NO
    SHORTTAG YES
  LINK
    SIMPLE NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR NO
    SUBDOC NO
    FORMAL YES
  APPINFO NONE
>
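
For anyone who wants to check the arithmetic in the CHARSET DESCSET above,
here is a rough Python sketch. It is not part of the declaration; the names
DESCSET, is_declared and covered_span are made up for illustration. It reads
each entry as (first number, count, base-set number or UNUSED) and looks only
at the document character set, not at anything else in the declaration.

```python
# Each entry mirrors one DESCSET line of the CHARSET section:
# (first described number, count, base-set number), None meaning UNUSED.
DESCSET = [
    (0, 55296, 0),            # U+0000..U+D7FF, including C0 and C1 controls
    (55296, 2048, None),      # U+D800..U+DFFF, surrogates, UNUSED
    (57344, 1056768, 57344),  # U+E000..U+10FFFF
]


def is_declared(code_point, descset=DESCSET):
    """Return True if code_point lies in a range mapped onto the base set."""
    for start, count, base in descset:
        if start <= code_point < start + count:
            return base is not None
    return False


def covered_span(descset=DESCSET):
    """Return (lowest, highest) number described by the entries, used or not."""
    lowest = min(start for start, _, _ in descset)
    highest = max(start + count for start, count, _ in descset) - 1
    return lowest, highest


if __name__ == "__main__":
    for cp in (0x01, 0x85, 0xD800, 0xE000, 0x10FFFF):
        print(f"U+{cp:04X} declared: {is_declared(cp)}")
    print("span:", [hex(n) for n in covered_span()])
```

Under this reading the three entries run from 0 through 0x10FFFF, and only
the surrogate block is left out.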

| That means that it is not possible in HTML (or more importantly, in
| XML) to represent all valid Unicode characters in data fields.

What would you want to use control characters for in an XML document?

--Lars M.


