Re: Is there Unicode mail out there?

From: Mark Davis (mark@macchiato.com)
Date: Tue Jul 17 2001 - 10:25:53 EDT


I had been told by the W3C people that the reason for forbidding control
characters in XML and HTML was for compatibility with SGML. I've never
checked it, since unfortunately the SGML standard is not online. If not
true, that's very interesting.

When you think of XML as a general transmission mechanism for data (not
just a text document), the need becomes clear. Suppose that you have a
database, of any sort. Some fields may contain control characters, since
control characters are perfectly legal in many if not
all databases. You want to query that database and get a selection, packaged
as XML.
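
To make the scenario concrete, here is a minimal Python sketch using only the
standard library. The field value, the element name "field", and the helper
parses_as_xml are made up for illustration. It only shows that one common
XML 1.0 parser rejects a packaged field containing a C0 control, whether the
character is written raw or as a numeric character reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical database field containing a C0 control character (U+0001).
field_value = "part A\x01part B"

# ElementTree serializes the text without checking XML's character rules.
record = ET.Element("field")
record.text = field_value
packaged = ET.tostring(record, encoding="unicode")


def parses_as_xml(document: str) -> bool:
    """Return True if the standard-library parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print(parses_as_xml(packaged))               # raw control character
print(parses_as_xml("<field>&#1;</field>"))  # numeric character reference
print(parses_as_xml("<field>&#9;</field>"))  # TAB, one of the permitted controls
```

The serializer writes the document without complaint; the failure only shows
up on the receiving side, when the document is parsed.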

Unfortunately, you have to invent your own home-brew quoting mechanism for
the control characters, since standard XML does not permit you to
represent all of the -- perfectly valid -- characters in that database. And
such a home-brew mechanism will not interoperate with anything else.

Alternatively, you could filter out the control characters. That, of course,
would corrupt the data. Generally considered a bad thing.
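
A small illustration of that corruption, again in Python and with made-up
field values: once the controls are stripped, two values that were distinct
in the database can no longer be told apart. The helper strip_controls is
hypothetical.

```python
import re

# C0 controls that XML 1.0 does not allow (TAB, LF, and CR are permitted).
_FORBIDDEN = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_controls(text: str) -> str:
    """Drop the forbidden control characters outright."""
    return _FORBIDDEN.sub("", text)


# Two hypothetical field values that differ only in where U+001F falls.
first = "AB\x1fC"
second = "A\x1fBC"

print(first == second)                                  # False
print(strip_controls(first) == strip_controls(second))  # True
```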

Mark

—————

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

----- Original Message -----
From: "Lars Marius Garshol" <larsga@garshol.priv.no>
To: <unicode@unicode.org>
Sent: Tuesday, July 17, 2001 02:28
Subject: Re: Is there Unicode mail out there?

>
> * Mark Davis
> |
> | The HTML spec depends on the SGML spec for a characterization of
> | allowable characters. The latter, unfortunately, disallows some
> | valid Unicode characters (most C0 controls), but inconsistently
> | allows other similar characters (C1 controls).
>
> SGML is silent on the issue of what characters are allowed. It is the
> SGML declaration used by each application which decides this, and you
> can easily make an SGML declaration which allows every Unicode
> character.
>
> To wit:
>
> <!SGML "ISO 8879:1986 (WWW)"
> CHARSET
> BASESET "ISO Registration Number 177//CHARSET
> ISO/IEC 10646-1:1993 UCS-4 with
> implementation level 3//ESC 2/5 2/15 4/6"
> DESCSET 0 55296 0
> 55296 2048 UNUSED -- SURROGATES --
> 57344 1056768 57344
>
> CAPACITY SGMLREF
> TOTALCAP 150000
> GRPCAP 150000
> ENTCAP 150000
>
> SCOPE DOCUMENT
> SYNTAX
> SHUNCHAR NONE
> BASESET "ISO 646IRV:1991//CHARSET
> International Reference Version
> (IRV)//ESC 2/8 4/2"
> DESCSET 0 128 0
> FUNCTION
> RE 13
> RS 10
> SPACE 32
> TAB SEPCHAR 9
>
> NAMING LCNMSTRT ""
> UCNMSTRT ""
> LCNMCHAR ".-_:"
> UCNMCHAR ".-_:"
> NAMECASE GENERAL YES
> ENTITY NO
>
> DELIM GENERAL SGMLREF
> HCRO "&#38;#x" -- 38 is the number for ampersand --
> SHORTREF SGMLREF
> NAMES SGMLREF
> QUANTITY SGMLREF
> ATTCNT 60 -- increased --
> ATTSPLEN 65536 -- These are the largest values --
> LITLEN 65536 -- permitted in the declaration --
> NAMELEN 65536 -- Avoid fixed limits in actual --
> PILEN 65536 -- implementations of HTML UA's --
> TAGLVL 100
> TAGLEN 65536
> GRPGTCNT 150
> GRPCNT 64
>
> FEATURES
> MINIMIZE
> DATATAG NO
> OMITTAG YES
> RANK NO
> SHORTTAG YES
> LINK
> SIMPLE NO
> IMPLICIT NO
> EXPLICIT NO
> OTHER
> CONCUR NO
> SUBDOC NO
> FORMAL YES
> APPINFO NONE
> >
>
> | That means that it is not possible in HTML (or more importantly, in
> | XML) to represent all valid Unicode characters in data fields.
>
> What would you want to use control characters for in an XML document?
>
> --Lars M.
>
>
>


