RE: Is there Unicode mail out there?

From: Shigemichi Yazawa (yazawa@globalsight.com)
Date: Thu Jul 19 2001 - 03:03:55 EDT


At Wed, 18 Jul 2001 14:21:35 -0500,
Ayers, Mike <Mike_Ayers@bmc.com> wrote:
> So why not use tagged data to represent C0 and C1 characters? That
> is what XML is made of. As far as why control characters are not permitted,
> it seems to me that this is so that XML documents can be passed around
> easily, through HTTP, email, FTP and so on, without loss of data. Protocols
> abound which interpret control characters, so XML files which contain data
> may get mangled or may mangle the systems which pass them. However, if that
> data is included as tagged hex digits, no problem will occur either way.

The XML spec states "Its goal is to enable generic SGML to be served,
received, and processed on the Web in the way that is now possible
with HTML." But, in my opinion, XML has grown far beyond that original
goal. It seems to be used in every aspect of software engineering
these days.

Tagging disallowed characters is one way to work around the
problem. But I don't buy this solution for two reasons.

1. Markup is for describing a document's structure. Section 1
   (Introduction) of the spec says "Markup encodes a description of
   the document's storage layout and logical structure." You could do
   something like <charEscape codepoint="000c" />. This doesn't
   express any structure of the document, though. Using markup merely
   to escape a character is too much of a hack, in my opinion.

2. This is a proprietary solution. To get the original character, the
   application needs to know the semantics of the markup and needs to
   know how to decode the data appropriately. If it were a standard
   encoding like an NCR (numeric character reference), that would be
   fine because everybody knows how to deal with it. But the tagging
   is specific to a DTD, which makes it difficult to interchange the
   data.
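
To illustrate point 2, here is a minimal Python sketch. The charEscape
element is the hypothetical one from point 1, and the decode() helper
is my own invention for this example, not part of any standard. A
generic consumer that just collects the text never sees the form feed;
only code that knows this particular convention can restore it.

```python
import xml.etree.ElementTree as ET

# Hypothetical DTD-specific escape element, as in point 1 above.
doc = '<para>page one<charEscape codepoint="000c" />page two</para>'

def decode(elem):
    """Rebuild the text, turning each charEscape child back into a character.

    Assumes charEscape elements are direct children of elem; this is only
    a sketch of what every consumer of such a DTD would have to write.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag == "charEscape":
            parts.append(chr(int(child.get("codepoint"), 16)))
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(doc)
generic = "".join(root.itertext())  # what a generic consumer sees
specific = decode(root)             # what a convention-aware consumer sees
```

Here generic comes out without the form feed, while specific contains
it between the two pieces of text.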

This character restriction in XML makes creating an XML document
difficult. Say you have some data you want to wrap in XML. You don't
know much about the content of the data. All you know is its character
encoding and that it is textual data. That should be enough, because
you just want to wrap it in XML. You would check whether it contains
"<" or "&" and convert them to entity references. Or, if you are lazy,
you would put it in a CDATA section without checking it at all. The
problem is that it might contain C0 control codes, which are legal
characters in most encodings. Unless you are absolutely sure that the
data doesn't contain any control codes, you have to check every
character to make sure that you don't produce an ill-formed XML
document. Even if you find a control code, there isn't a standard way
to treat it. You end up deleting it or escaping it in a proprietary
way.
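
The check described above can be sketched in Python as follows. The
character class is my transcription of the Char production in XML 1.0;
the names illegal_positions() and wrap() and the default tag name
"data" are made up for this example. Note that in XML 1.0 a numeric
character reference such as &#12; is no way out either, as far as I
can tell, because the referenced character must itself be a legal
character.

```python
import re
from xml.sax.saxutils import escape

# Matches anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def illegal_positions(text):
    """Return (index, code point) pairs for characters XML 1.0 disallows."""
    return [(m.start(), hex(ord(m.group()))) for m in _ILLEGAL.finditer(text)]

def wrap(text, tag="data"):
    """Wrap text in an element, refusing input that cannot be represented."""
    bad = illegal_positions(text)
    if bad:
        # No standard treatment exists, so this sketch refuses rather
        # than silently deleting or inventing an escape.
        raise ValueError(f"not representable in XML 1.0: {bad}")
    return f"<{tag}>{escape(text)}</{tag}>"
```

For example, wrap("a < b & c") produces the element with "<" and "&"
turned into entity references, while wrap("page one\x0cpage two")
raises ValueError because of the form feed at index 8. Raising is just
one choice; deleting or escaping the character would be equally
non-standard.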

-----------------
Shigemichi Yazawa
yazawa@globalsight.com


