Re: Is there Unicode mail out there?

From: Mark Davis (mark@macchiato.com)
Date: Mon Jul 16 2001 - 16:09:07 EDT


The HTML spec depends on the SGML spec for a characterization of allowable
characters. The latter, unfortunately, disallows some valid Unicode
characters (most C0 controls), but inconsistently allows other similar
characters (C1 controls). That means that it is not possible in HTML (or
more importantly, in XML) to represent all valid Unicode characters in data
fields.
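
To make the asymmetry concrete, here is a minimal sketch that hard-codes the ranges of the XML 1.0 Char production (quoted in the message below) and checks code points against them. The names XML10_CHAR_RANGES and is_xml10_char are made up for this illustration and do not come from any library.

```python
# Ranges copied from production [2] Char in XML 1.0, section 2.2.
# XML10_CHAR_RANGES and is_xml10_char are illustrative names only.
XML10_CHAR_RANGES = (
    (0x9, 0xA),           # tab, line feed
    (0xD, 0xD),           # carriage return
    (0x20, 0xD7FF),       # includes all C1 controls (U+0080..U+009F)
    (0xE000, 0xFFFD),     # skips the surrogates, FFFE and FFFF
    (0x10000, 0x10FFFF),  # supplementary planes
)


def is_xml10_char(code_point: int) -> bool:
    """Return True if the code point matches the XML 1.0 Char production."""
    return any(low <= code_point <= high for low, high in XML10_CHAR_RANGES)


# C0 controls (U+0000..U+001F) and C1 controls (U+0080..U+009F) that pass.
c0_allowed = [cp for cp in range(0x00, 0x20) if is_xml10_char(cp)]
c1_allowed = [cp for cp in range(0x80, 0xA0) if is_xml10_char(cp)]

print("U+0007 allowed:", is_xml10_char(0x07))
print("U+0087 allowed:", is_xml10_char(0x87))
print("C0 controls allowed:", [f"U+{cp:04X}" for cp in c0_allowed])
print("C1 controls allowed:", len(c1_allowed), "of 32")
```

Under these ranges only three of the 32 C0 controls pass (tab, line feed, carriage return), while every C1 control passes.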

Mark

—————

πάντων μέτρον ἄνθρωπος — Πρωταγόρας
[http://www.macchiato.com]

----- Original Message -----
From: "Shigemichi Yazawa" <yazawa@globalsight.com>
To: <mark@macchiato.com>
Cc: <unicode@unicode.org>; <everson@indigo.ie>
Sent: Monday, July 16, 2001 12:12
Subject: Re: Is there Unicode mail out there?

> At Sat, 14 Jul 2001 09:49:30 -0700,
> Mark Davis <mark@macchiato.com> wrote:
> >
> > No, but it is for the vast majority.
> >
> > Some have to be written specially, e.g. &lt;
>
> I looked at the XML 1.0 spec, and section 2.4, Character Data and
> Markup, says:
>
> "If they are needed elsewhere, they must be escaped using either
> numeric character references or the strings "&amp;" and "&lt;"
> respectively."
>
> I also looked at the HTML 4.01 spec, and section 5.3.2, Character
> entity references, does not say that &#60; cannot be used to
> represent "<".
>
> > Some cannot be written at all, e.g. U+0007 (but U+0087 can be!)
>
> This is true for XML, but I couldn't find any statement in the HTML
> 4.01 spec that restricts the use of U+0007 in an HTML document.
>
> By the way, I have been pondering why, in XML, all the C1 control
> characters are legal but some of the C0 control characters are
> not. Section 2.2, Characters, says that "Legal characters are tab,
> carriage return, line feed, and the legal characters of Unicode and
> ISO/IEC 10646." The BNF for Char is:
>
> [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF]
>            | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
>     /* any Unicode character, excluding the surrogate blocks,
>        FFFE, and FFFF. */
>
> Does this mean C0 controls are not legal Unicode characters?
>
> -------------------
> Shigemichi Yazawa
> yazawa@globalsight.com
>
>
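
On the escaping question in the quoted message, a small sketch using only the Python standard library shows that the named reference and the numeric references for "<" decode to the same character. This illustrates how one decoder behaves; it is not a statement of what either spec requires.

```python
import html
from xml.sax.saxutils import escape

# The named and numeric references all decode to the same character.
named = html.unescape("&lt;")
numeric_decimal = html.unescape("&#60;")
numeric_hex = html.unescape("&#x3C;")

# Going the other way, escape() rewrites "&", "<" and ">" for use in
# XML character data.
escaped = escape("a < b & c")

print(named, numeric_decimal, numeric_hex)
print(escaped)
```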


