Re: Is there Unicode mail out there?

From: Mark Davis (mark@macchiato.com)
Date: Tue Jul 17 2001 - 17:30:32 EDT


> In that case the content of the field is not text but an octet string,
> and you need to do something different, like base64-ing it.

The content in the database is not an octet string: it is a text field that
happens to have a control code -- a legitimate character code -- in it.
Practically every database allows control codes in text fields. (And why does
XML allow C1 controls, then? After all, they are even less frequent than C0
controls.)

Your task is to design an XML DTD to represent a selection from a database.
The database is nothing fancy: Latin-1 encoded. It is conceivable that a
control character is in one of the hundreds of thousands of records. Not
likely, but conceivable. You must guarantee no loss of data in the XML
representation of the data.

If XML could represent all control characters, then an instance of a
selection in XML might be as simple as the following.

<record>
  <firstname>John</firstname>
  <lastname>Smith</lastname>
  <birthdate>1950-10-10</birthdate>
...
</record>

The DTD would also be simple. Now change the DTD (*and* the program that
interprets it) so that each and every text field could be base64 instead.
Very ugly. You don't want to simply change all the fields to base64, since
that would (a) bulk them up and (b) make them unreadable for debugging. So
you end up giving each field two alternate representations: your parser has
to be prepared for either, and your generator has to pick between them.
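
To make the ugliness concrete, here is a rough sketch in Python of what the
generator and parser end up doing. The "enc" attribute and the function names
are my own invention for illustration, not part of any real DTD, and the
character check only covers the Latin-1 range assumed above. CR is treated as
unsafe as well, because XML parsers normalize it to LF on input.

```python
import base64
import re
import xml.etree.ElementTree as ET

# C0 controls that XML 1.0 cannot carry (everything below 0x20 except TAB
# and LF), plus CR, which is legal but gets normalized to LF by parsers.
_UNSAFE = re.compile(r"[\x00-\x08\x0b-\x1f]")


def make_field(name, value):
    """Generator side: pick between plain text and base64 per field."""
    elem = ET.Element(name)
    if _UNSAFE.search(value):
        # Hypothetical marker attribute telling the reader to decode.
        elem.set("enc", "base64")
        raw = value.encode("latin-1")
        elem.text = base64.b64encode(raw).decode("ascii")
    else:
        elem.text = value
    return elem


def read_field(elem):
    """Parser side: be prepared for either representation."""
    text = elem.text or ""
    if elem.get("enc") == "base64":
        return base64.b64decode(text).decode("latin-1")
    return text
```

Every field in the DTD then needs the optional attribute, and every consumer
has to go through something like read_field() instead of just taking the
element content.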

Notice that for *any* database that allows control codes, you would have to
do such ugliness in any XML representation to avoid data corruption. Of
course, nobody does it, which means that there is always the opportunity for
data corruption. Then again, one might just not care -- after all, it would
be rare for this to cause a problem.
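
For what it is worth, the failure is easy to reproduce. The following is only
a sketch using Python's standard ElementTree module; survives_xml() is a
made-up helper name. The serializer writes the control code out without
complaint, and the parser on the receiving end then rejects the document.

```python
import xml.etree.ElementTree as ET


def survives_xml(value):
    """Return True if value comes back unchanged from a naive XML round trip."""
    elem = ET.Element("firstname")
    elem.text = value
    # The serializer does not check for characters that XML 1.0 forbids.
    xml = ET.tostring(elem, encoding="unicode")
    try:
        return ET.fromstring(xml).text == value
    except ET.ParseError:
        # A conforming parser refuses the C0 control character.
        return False


print(survives_xml("John"))      # ordinary text
print(survives_xml("Jo\x01hn"))  # one stray control code in the record
```

The first call reports success and the second does not, so one such record
among hundreds of thousands is enough to make the whole selection unreadable.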

Mark

—————

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

----- Original Message -----
From: "John Cowan" <jcowan@reutershealth.com>
To: "Mark Davis" <mark@macchiato.com>
Cc: <unicode@unicode.org>; "Lars Marius Garshol" <larsga@garshol.priv.no>;
"Martin Duerst" <duerst@w3.org>
Sent: Tuesday, July 17, 2001 11:10
Subject: Re: Is there Unicode mail out there?

> Mark Davis wrote:
>
> > I had been told by the W3C people that the reason for forbidding control
> > characters in XML and HTML was for compatibility with SGML.
>
>
> More accurately, with the SGML default syntax, which is used in HTML
> and (with a few modifications) in XML.
>
>
> > When you are thinking of XML as a general transmission mechanism for data
> > (not just a text document) it becomes clear. Suppose that you have a
> > database, of any sort. Some fields may or may not contain control
> > characters -- since control characters are perfectly legal in many if not
> > all databases. You want to query that database and get a selection, packaged
> > as XML.
>
>
> In that case the content of the field is not text but an octet string,
> and you need to do something different, like base64-ing it.
>
> --
> There is / one art || John Cowan <jcowan@reutershealth.com>
> no more / no less || http://www.reutershealth.com
> to do / all things || http://www.ccil.org/~cowan
> with art- / lessness \\ -- Piet Hein
>
>


