Re: Is there Unicode mail out there?

From: Andy Heninger (andyh@jtcsv.com)
Date: Thu Jul 19 2001 - 12:53:08 EDT


I agree with the overall sentiment here, but here's one nit:

> Or you are so lazy that you want to put it [your data] in a CDATA
> section without checking it at all.

CDATA sections have a severe problem: there is no way to escape
otherwise legal XML characters that can't be represented in the
chosen document encoding, because numeric character references
are not recognized inside a CDATA section.

The best bet is to avoid CDATA sections altogether.
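
To make the limitation concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. It assumes a document declared as ISO-8859-1 and uses U+3042 (HIRAGANA LETTER A) as an example of a character that encoding cannot represent; the element name "t" is made up for illustration. In ordinary character data a numeric character reference gets the character through, while inside a CDATA section the same reference is just literal text.

```python
import xml.etree.ElementTree as ET

# Assumed for illustration: a Latin-1 document and a made-up element <t>.
PROLOG = b'<?xml version="1.0" encoding="ISO-8859-1"?>'

# In ordinary content, a numeric character reference (NCR) is expanded.
in_text = PROLOG + b"<t>&#x3042;</t>"

# Inside CDATA, "&" is not markup, so the NCR is kept verbatim.
in_cdata = PROLOG + b"<t><![CDATA[&#x3042;]]></t>"

text_value = ET.fromstring(in_text).text
cdata_value = ET.fromstring(in_cdata).text

print(repr(text_value))   # the single character U+3042
print(repr(cdata_value))  # the eight literal characters of the reference
```

So a CDATA section can only carry characters that the document encoding itself can represent.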

Andy Heninger
IBM, Cupertino, CA
heninger@us.ibm.com

----- Original Message -----
From: "Shigemichi Yazawa" <yazawa@globalsight.com>
To: <unicode@unicode.org>
Sent: Thursday, July 19, 2001 12:03 AM
Subject: RE: Is there Unicode mail out there?

> At Wed, 18 Jul 2001 14:21:35 -0500,
> Ayers, Mike <Mike_Ayers@bmc.com> wrote:
> > So why not use tagged data to represent C0 and C1 characters? That
> > is what XML is made of. As for why control characters are not
> > permitted, it seems to me that this is so that XML documents can be
> > passed around easily, through HTTP, email, FTP and so on, without
> > loss of data. Protocols abound which interpret control characters,
> > so XML files which contain data may get mangled or may mangle the
> > systems which pass them. However, if that data is included as
> > tagged hex digits, no problem will occur either way.
>
> The XML specification states "Its goal is to enable generic SGML to
> be served, received, and processed on the Web in the way that is now
> possible with HTML." But, in my opinion, XML has grown far beyond
> its original goal. XML seems to be used in every aspect of software
> engineering these days.
>
> Tagging disallowed characters is one way to work around the
> problem. But I don't buy this solution for two reasons.
>
> 1. Markup is for describing a document's structure. Section 1,
> Introduction, says "Markup encodes a description of the document's
> storage layout and logical structure." You could do something like
> <charEscape codepoint="000c" />. This doesn't express any structure
> of the document, though. Using markup merely to escape a character
> is too hacky, in my opinion.
>
> 2. This is a proprietary solution. To get the original character, the
> application needs to know the semantics of the markup and needs to
> know how to decode the data appropriately. If it's a standard
> encoding like NCR, that's fine because everybody knows how to deal
> with it. But the tagging is specific to a DTD. That makes it
> difficult to interchange the data.
>
> This character restriction in XML makes creating an XML document
> difficult. Say you have some data you want to wrap in XML. You don't
> know much about the content of the data. What you know about it is its
> character encoding and that it is textual data. That's fine because
> you just want to wrap it in XML. You would check if it contains "<"
> or "&" and convert them to entity references. Or you are so lazy that
> you want to put it in a CDATA section without checking it at all. The
> problem is that it might contain C0 control codes, which are legal
> characters in most encodings. Unless you are absolutely sure
> that the data doesn't contain any control codes, you have to check
> every character to make sure that you don't produce an ill-formed XML
> document. Even if you find a control code, there isn't a standard way
> to treat it. You end up deleting it or escaping it in a proprietary way.
>
> -----------------
> Shigemichi Yazawa
> yazawa@globalsight.com
>
>
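
The per-character check described in the quoted message can be sketched as follows. This is only an illustration under stated assumptions: it is written in Python, the helper names find_disallowed, wrap and parses and the element name "data" are hypothetical, and the character ranges are those of the "Char" production in XML 1.0. It reports disallowed characters rather than deleting or escaping them, since the message's point is that there is no standard way to treat them.

```python
import re
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_DISALLOWED = re.compile(
    r"[^\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_disallowed(text):
    """Return (offset, code point) pairs for characters XML 1.0 forbids."""
    return [(m.start(), ord(m.group())) for m in _DISALLOWED.finditer(text)]


def wrap(text, tag="data"):
    """Wrap text in one element, refusing input that cannot be well-formed."""
    bad = find_disallowed(text)
    if bad:
        raise ValueError("characters not allowed in XML 1.0: %r" % bad)
    # escape() converts "&", "<" and ">" to entity references.
    return "<%s>%s</%s>" % (tag, escape(text), tag)


def parses(xml_text):
    """True if the standard-library parser accepts xml_text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(wrap("a < b & c"))
print(find_disallowed("page one\x0cpage two"))
print(parses("<data>page one\x0cpage two</data>"))
```

The last line feeds a raw form feed (U+000C) to the parser to show that the unchecked document is rejected as not well-formed.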


