Re: Rationale wanted for Unicode identifier rules

From: John Cowan (jcowan@reutershealth.com)
Date: Fri Mar 10 2000 - 17:47:43 EST


Mark Davis wrote:

> <&#x30CD;&#x30E0;>name</&#x30CD;&#x30E0;>
> is a valid segment of an XML 1.0 file, with English data, and the name of the data field is Japanese Katakana "NEMU".

In fact this is not well-formed XML (in XMLspeak, "valid" means something different),
because &#xNNNN; constructs are not allowed *within* identifiers. The characters
of a name have to appear literally, so what is well-formed and what is not, in
this particular case, depends on the charset. In UTF-8 and UTF-16, which all XML
parsers must understand, any legal XML identifier can be used, but in ASCII,
only the ASCII subset is usable.
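
A minimal Python sketch of the point above, assuming the standard library's
expat-based xml.etree.ElementTree parser as the well-formedness checker. The
helper name is_well_formed and the two sample strings are illustrative only.
The form with character references in the tag name should be rejected, while
the same name written with the literal Katakana characters should parse.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document):
    """Return True if the parser accepts the document, False on a parse error."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Mark's example: character references used inside the element name.
with_references = "<&#x30CD;&#x30E0;>name</&#x30CD;&#x30E0;>"

# The same name spelled with the literal characters U+30CD U+30E0.
with_literals = "<\u30CD\u30E0>name</\u30CD\u30E0>"

print(is_well_formed(with_references))
print(is_well_formed(with_literals))
print(is_well_formed(with_literals.encode("utf-8")))
```

The last call passes the document as UTF-8 bytes, which is the case where any
legal XML identifier can be written directly.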
 
> 1. Status quo.
> Never accept characters outside of Unicode 2.0 in identifiers. Downside: new
> scripts, and additions (e.g. CJK ideographs) to existing scripts are disadvantaged
> -- forever.

I think this is unacceptable, and others agree.

> 2. Successive upgrades.
> Revise XML with each version of Unicode. This means you will have XML
> 1.0-compliant parsers, XML 1.1-compliant parsers, etc. Disadvantage:
> it takes years for compliant parsers to be fully spread across the world.
> During that time, data interchange between different versions of parsers
> cannot be guaranteed. I believe this will be unacceptable to the XML community.

I think so too.

> 3. Open Season.
> Define identifiers to be *fixed* as of Unicode 3.0, but to also include
> unassigned characters (Cn) as of that version.

I think a modified version of this is the best option, somewhat as
follows:

        Processes that accept XML MUST accept characters
                they believe to be unassigned.
        Processes that generate XML MUST NOT generate characters
                they believe to be unassigned.

With this scheme, a Unicode 3.0 acceptor will accept letters and digits
assigned in Unicode 4.0 and generated by a Unicode 4.0 process, because it
believes them to be unassigned. It would equally accept symbols assigned in
Unicode 4.0, but a Unicode 4.0 generator knows they are symbols and will
never put them in an identifier.
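
The two rules can be sketched in Python roughly as follows. This is only an
illustration under loud assumptions: IDENT_CATEGORIES is a simplified stand-in
for the real XML name classes (it ignores the difference between name-start
and name-continuation characters), the names may_accept, may_generate, and
old_view are hypothetical, and old_view fakes an older process by pretending
that everything above U+FFFF is still unassigned.

```python
import unicodedata

# Simplified stand-in for "letters and digits"; the real XML name
# productions are more detailed than this set of general categories.
IDENT_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl", "Mn", "Mc", "Nd"}


def may_accept(ch, category=unicodedata.category):
    """Acceptor rule: allow identifier characters plus anything believed unassigned."""
    cat = category(ch)
    return cat in IDENT_CATEGORIES or cat == "Cn"


def may_generate(ch, category=unicodedata.category):
    """Generator rule: emit only characters known to be identifier characters."""
    return category(ch) in IDENT_CATEGORIES


def old_view(ch):
    """Hypothetical older process: believes everything above U+FFFF is unassigned."""
    return "Cn" if ord(ch) > 0xFFFF else unicodedata.category(ch)


new_letter = "\U00010400"  # DESERET CAPITAL LETTER LONG I, a letter (Lu)
new_symbol = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, a symbol (So)

# The newer generator emits the letter, and the older acceptor lets it through.
print(may_generate(new_letter), may_accept(new_letter, old_view))

# The older acceptor would also let the symbol through,
# but the newer generator never emits it in the first place.
print(may_generate(new_symbol), may_accept(new_symbol, old_view))
```

The asymmetry is the whole design: acceptors are liberal about what they do
not know, and generators are strict about what they do know.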

> 4. Restricted Open Season.
> The Unicode consortium divides up the unassigned space in more detail,
> and specifies that <excluded> characters and <identifier_extend> characters
> can only be allocated in the future within certain blocks. This has the
> effect of dividing Cn into subcategories: Cni, Cne, and Cnx. While
> characters will change from each of these to other properties over time
> as characters become assigned, the three relevant categories will remain unchanged.

Nice if you can get it, but probably not available, given the tendency to
allocate script-specific puncts and symbols next to the letters and digits.

-- 

Schlingt dreifach einen Kreis vom dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)


