RE: Non-ascii string processing?

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Oct 07 2003 - 10:15:23 CST


Elliotte Rusty Harold wrote:
> A W3C XML Schema Language validator needs a character based API to
> correctly implement the minLength and maxLength facets on xsd:string

As far as I understand, xsd:string is a list of "Character"-s, and a
"Character" is an integer which can hold any valid Unicode code point.

In other words, xsd:string is necessarily in UTF-32 (or something close to
it): it cannot be in UTF-8 or UTF-16.

The numbers returned by length, minLength and maxLength are the actual,
minimum and maximum number of *list elements* contained in the list; i.e.,
in the case of xsd:string, the *size* of the string in *encoding units*.

The fact that, in UTF-32, the *size* of the string in encoding units
corresponds to the number of "characters" is coincidental.

In any case, the useful information is always the *size* of the string in
encoding units (octets for UTF-8, 16-bit units for UTF-16, etc.), not the
number of "characters" it contains.
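
To make the distinction concrete, here is a minimal Python sketch that counts
the encoding units of one string in each encoding form. The function name
encoding_unit_counts and the sample string are made up for illustration; this
is only a rough demonstration, not anything taken from the XML Schema spec.

```python
def encoding_unit_counts(text: str) -> dict:
    """Return the size of `text` in encoding units for each encoding form.

    Hypothetical helper for illustration only. The "-le" codecs are used
    so that no byte order mark is added to the encoded output.
    """
    return {
        "utf-8": len(text.encode("utf-8")),            # 8-bit units (octets)
        "utf-16": len(text.encode("utf-16-le")) // 2,  # 16-bit units
        "utf-32": len(text.encode("utf-32-le")) // 4,  # 32-bit units
        "code points": len(text),                      # Python 3 str length
    }


# Sample: U+0061, U+00E9, U+20AC, U+1D11E (the last is outside the BMP).
sample = "a\u00e9\u20ac\U0001D11E"
counts = encoding_unit_counts(sample)
print(counts)
```

Only the UTF-32 count matches the number of code points here; the UTF-8 and
UTF-16 counts are larger because some code points need several units.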

_ Marco


