Validation of Character Repertoires for XML Documents
Intended Audience: Software Engineers, Web Administrators, XML Users
Session Level: Intermediate
XML is based on Unicode, and therefore XML documents may use the
full Unicode character repertoire. However, XML-based applications
often use XML interfaces to legacy software which in many cases is
not capable of dealing with the full Unicode character repertoire.
We therefore propose a schema language for XML which is capable of
limiting the character repertoire of XML documents.

This schema language, called Character Repertoire Validation for
XML (CRVX), has features to permit or disallow character repertoire
subsets in specific parts of an XML document, for example only in
element and attribute names. CRVX uses information from the Unicode
Character Database (UCD) to make character repertoire specification
as easy as possible.

CRVX is not intended to be the only schema language in an XML
application scenario, but it provides useful additional
schema-based validation to protect applications from unsupported
characters. XML applications typically combine different schema
languages before processing XML documents, and CRVX is intended to
complement other schema languages such as grammar-based languages
(DTD, XML Schema) or rule-based languages (Schematron).

CRVX can be implemented in various ways. One simple solution is
to use XSLT to transform a CRVX schema into an XSLT program, which
is then used to validate XML documents. We briefly describe such an
implementation. Other (and more efficient) implementations could be
based on DOM or SAX parsers.
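
To make the idea of restricting the repertoire in only part of a document concrete, the following is a minimal sketch of a SAX-based check, written in Python. It is not CRVX syntax and not the XSLT-based implementation mentioned above. The names `in_repertoire`, `NameRepertoireChecker`, `validate_names`, and `BASIC_LATIN` are hypothetical, and the permitted repertoire is assumed to be given as plain code point ranges rather than derived from the UCD.

```python
import unicodedata
import xml.sax


def in_repertoire(ch, ranges):
    """Return True if the code point of ch lies in one of the (low, high) ranges."""
    cp = ord(ch)
    return any(low <= cp <= high for low, high in ranges)


class NameRepertoireChecker(xml.sax.ContentHandler):
    """Records characters in element and attribute names outside the allowed ranges.

    Character data and attribute values are deliberately not checked, to
    illustrate a restriction that applies to names only.
    """

    def __init__(self, ranges):
        super().__init__()
        self.ranges = ranges
        self.violations = []  # tuples of (kind, name, offending character)

    def _check(self, kind, name):
        for ch in name:
            if not in_repertoire(ch, self.ranges):
                self.violations.append((kind, name, ch))

    def startElement(self, name, attrs):
        self._check("element", name)
        for attr_name in attrs.getNames():
            self._check("attribute", attr_name)


def validate_names(xml_text, ranges):
    """Parse xml_text and return the list of name violations (empty if valid)."""
    handler = NameRepertoireChecker(ranges)
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.violations


# Assumed repertoire for this example: the Basic Latin block only.
BASIC_LATIN = [(0x0000, 0x007F)]

# The element name contains U+00E9; the text content does too, but is not checked.
doc = '<?xml version="1.0" encoding="UTF-8"?><caf\u00e9 prix="3">th\u00e9</caf\u00e9>'

for kind, name, ch in validate_names(doc, BASIC_LATIN):
    print(kind, name, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
```

In this sketch the document is rejected because of its element name, while the same character in the text content is accepted. A schema-driven implementation would presumably read the ranges and the parts of the document they apply to from the schema instead of hard-coding them.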