Re: XML Blueberry Requirements

From: Elliotte Rusty Harold (elharo@metalab.unc.edu)
Date: Thu Jun 21 2001 - 09:37:59 EDT


This is going out to three mailing lists. I'd like to add a fourth
and suggest that future discussion take place on xml-dev, which
probably has the broadest reach of interested parties.

Beginning with Unicode 3.0, a number of new characters have been added,
both for scripts that were previously unencoded, such as Amharic and
Cherokee, and for scripts whose coverage was incomplete, such as
Chinese. The concern is that since XML 1.0 is based on Unicode 2.0,
"fully native-language XML markup is not possible in at least the
following languages: Amharic, Burmese, Canadian aboriginal languages,
Cantonese (Bopomofo script), Cherokee, Dhivehi, Khmer, Mongolian
(traditional script), Oromo, Syriac, Tigre, Yi. In addition, Chinese,
Japanese, Korean (Hangul script), and Vietnamese can make use of only a
limited subset of their complete character repertoires."

If this were true, it would be a very serious criticism of XML 1.0.
Fortunately, however, the claim is not nearly as dire as the proposal
makes out. Indeed the proposal substantially overstates the need for any
changes. The XML 1.0 BNF productions do not allow these newly defined
characters to be used in element, attribute, and entity names. However,
they can be used in the text of element content and attribute values.
This means that XML is fully adequate for literature and data in
Amharic, Burmese, Canadian aboriginal languages, Cantonese, Cherokee,
Dhivehi, Khmer, Mongolian, Oromo, Syriac, Tigre, Yi, Mandarin, Japanese,
Korean, and Vietnamese. Only the markup, that is, the tags, would have
to be written in another script. Given that there aren't even localized
operating systems in most of these languages, and that today's software
effectively requires users to have a solid knowledge of at least the
ASCII characters, I don't think the need to write markup (as opposed to
text) in Cherokee justifies breaking backwards compatibility.
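
To make the markup/text distinction concrete, here is a rough Python
sketch. The Char ranges are the ones from the XML 1.0 Char production;
NAME_LETTER_EXCERPT is only a hand-picked excerpt of the name-letter
ranges, not the full table, and in_ranges is a helper name I made up
for illustration. It shows Cherokee surviving as element content while
falling outside the excerpted name ranges.

```python
import xml.etree.ElementTree as ET

# XML 1.0 Char production: characters legal anywhere in a document.
CHAR_RANGES = [(0x9, 0x9), (0xA, 0xA), (0xD, 0xD),
               (0x20, 0xD7FF), (0xE000, 0xFFFD), (0x10000, 0x10FFFF)]

# Illustrative excerpt of the XML 1.0 name-letter ranges (NOT the full table).
NAME_LETTER_EXCERPT = [
    (0x0041, 0x005A), (0x0061, 0x007A),   # Basic Latin letters
    (0x10A0, 0x10C5), (0x10D0, 0x10F6),   # Georgian
    (0x4E00, 0x9FA5),                     # Han ideographs
]

def in_ranges(text, ranges):
    """True if every character of text falls inside one of the ranges."""
    return all(any(lo <= ord(ch) <= hi for lo, hi in ranges) for ch in text)

# Three Cherokee syllables (the Cherokee block is U+13A0..U+13F4).
cherokee = "\u13E3\u13B3\u13A9"

# Cherokee as element content, with the tag itself written in ASCII.
root = ET.fromstring("<language>" + cherokee + "</language>")

print(root.text == cherokee)                      # does the content round-trip?
print(in_ranges(cherokee, CHAR_RANGES))           # legal as document text?
print(in_ranges(cherokee, NAME_LETTER_EXCERPT))   # covered by the name excerpt?
```

Because the excerpt is partial, a False on the last line only shows
that these ranges do not cover Cherokee; the complete productions in the
XML 1.0 specification are the authority.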

But wait! It's not even that bad. Several of the languages listed are
total red herrings. You most certainly can write markup in Cantonese,
Japanese, Korean, Mandarin, and Vietnamese today. The new characters
Unicode has added to these scripts are very obscure. In fact, experts
often disagree over whether some of them exist at all, or are merely
typographical variations of existing characters. Since the 1700s
Vietnamese has been written in a Latin-based alphabet that is fully
available in XML and that can write any Vietnamese word. Vietnamese only
uses the Han ideographs for classical documents and occasional signage
or decoration, and it seems very unlikely that a Vietnamese speaker
would write their markup using Han ideographs. Japanese has not one but
two phonetic alphabets that can write any Japanese word if the right Han
ideograph character is not encoded. Chinese speakers can use either
Latin characters or the native Bopomofo phonetic system for the very
rare cases where a character they need is not encoded. The fact is most
native speakers of Chinese, Japanese, Korean and Vietnamese do not
recognize the vast majority of these new characters, and the need for
them in markup (again, as opposed to text) is non-existent.
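
As a quick sanity check on the claim that these languages can already be
used in markup, the following Python sketch tests a few sample names
against an excerpt of the XML 1.0 name-letter ranges. The table is an
assumption-laden subset I picked for illustration (the real productions
list many more ranges and also permit digits and some punctuation after
the first character), and covered_by_excerpt is a hypothetical helper,
not part of any library.

```python
# Excerpt of XML 1.0 name-letter ranges, keyed by an informal script label.
# Illustrative subset only; the specification's tables are the authority.
NAME_LETTER_EXCERPT = {
    "Basic Latin": [(0x0041, 0x005A), (0x0061, 0x007A)],
    "Latin Extended Additional": [(0x1EA0, 0x1EF9)],  # Vietnamese letters
    "Hiragana": [(0x3041, 0x3094)],
    "Katakana": [(0x30A1, 0x30FA)],
    "Bopomofo": [(0x3105, 0x312C)],
    "Hangul syllables": [(0xAC00, 0xD7A3)],
    "Han ideographs": [(0x4E00, 0x9FA5)],
}

def covered_by_excerpt(name):
    """True if every character of name falls in one of the excerpted ranges."""
    return all(
        any(lo <= ord(ch) <= hi
            for ranges in NAME_LETTER_EXCERPT.values()
            for lo, hi in ranges)
        for ch in name
    )

samples = {
    "Han": "\u540D\u524D",             # two common Han ideographs
    "Hiragana": "\u306A\u307E\u3048",  # three hiragana
    "Hangul": "\uC774\uB984",          # two Hangul syllables
    "Bopomofo": "\u3107\u3127\u3125",  # three Bopomofo letters
    "Vietnamese": "h\u1ECD",           # Latin 'h' plus U+1ECD (o with dot below)
    "Han Extension A": "\u3400",       # ideograph first encoded in Unicode 3.0
}

for label, name in samples.items():
    print(label, covered_by_excerpt(name))
```

Only the last sample, a character from the block added in Unicode 3.0,
falls outside the excerpt; the everyday characters are all covered.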

There are a few good points in this proposal. I'm sure there's an
occasional need for writing markup in Amharic, Burmese, Khmer,
Mongolian, Yi, and a few of the other languages the proposal lists. But
I don't believe there's enough of a need to justify breaking
compatibility with existing XML parsers, software, and systems. The XML
Blueberry Requirements vastly overstate the case by ignoring the
difference between markup and text in XML documents. I'd be willing to
break backwards compatibility to allow text in these languages if we had
to, but we don't. Text is already adequately handled by XML 1.0. All
we're arguing about now are the tags, and that's just not a strong
enough reason to break backwards compatibility.

-- 

+-----------------------+------------------------+-------------------+
| Elliotte Rusty Harold | elharo@metalab.unc.edu | Writer/Programmer |
+-----------------------+------------------------+-------------------+
|                  The XML Bible (IDG Books, 1999)                   |
|              http://metalab.unc.edu/xml/books/bible/               |
|   http://www.amazon.com/exec/obidos/ISBN=0764532367/cafeaulaitA/   |
+----------------------------------+---------------------------------+
|  Read Cafe au Lait for Java news:  http://metalab.unc.edu/javafaq/ |
|  Read Cafe con Leche for XML news: http://metalab.unc.edu/xml/     |
+----------------------------------+---------------------------------+



This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT