L2/01-213
Suggestion regarding identifiers in XML (Name and Nmtoken) for the next version of XML
2001-05-21
Kent Karlsson
Syntax for XML 1.0 identifiers
Name ::= (Letter | '_' | ':') (NameChar)*
Nmtoken ::= (NameChar)+
NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
/* the short dash there is probably meant to be HYPHEN-MINUS... */
In annex B:
Letter ::= BaseChar | Ideographic [intent: {Lu}, {Ll}, {Lt}, {Lo}, {Nl}?]
BaseChar ::= /* long list in annex B */
Ideographic ::= /* list of ranges in annex B */
Digit ::= /* list of ranges in annex B */ [intent: {Nd}, missing {No}]
CombiningChar ::= /* list in annex B */ [includes {Mc}, {Mn}, {Me}(?)]
Extender ::= /* list in annex B */ [includes {Lm}?]
[I have not analysed the lists in annex B; that is saved for a revised
version of this document, but there is commentary in that annex about intent.]
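As a rough illustration of the productions above, the following Python sketch checks Name and Nmtoken using general categories in place of the annex B lists. It is an approximation only: the category sets simply follow the bracketed "intent" annotations, the real annex B lists are explicit code point ranges that differ in detail, and the function names are hypothetical.

```python
import unicodedata

# Category sets taken from the "intent" annotations; NOT the real annex B lists.
LETTER = {"Lu", "Ll", "Lt", "Lo", "Nl"}
DIGIT = {"Nd"}
COMBINING = {"Mc", "Mn", "Me"}
EXTENDER = {"Lm"}
NAME_CHAR_CATS = LETTER | DIGIT | COMBINING | EXTENDER


def is_name_start(ch: str) -> bool:
    """Letter | '_' | ':' (approximated by general category)."""
    return ch in ("_", ":") or unicodedata.category(ch) in LETTER


def is_name_char(ch: str) -> bool:
    """Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender."""
    return ch in (".", "-", "_", ":") or unicodedata.category(ch) in NAME_CHAR_CATS


def is_xml10_name_approx(s: str) -> bool:
    """Name ::= (Letter | '_' | ':') (NameChar)*"""
    return bool(s) and is_name_start(s[0]) and all(is_name_char(c) for c in s[1:])


def is_xml10_nmtoken_approx(s: str) -> bool:
    """Nmtoken ::= (NameChar)+"""
    return bool(s) and all(is_name_char(c) for c in s)
```

Note that this syntax accepts strings such as "a--b" as a Name and "-foo" as an Nmtoken.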
Notes:
Unicode identifier syntax recommendations:
Identifier ::= IdentifierStart (IdentifierStart | IdentifierExtend)*
IdentifierStart ::= {Lu} | {Ll} | {Lt} | {Lm} | {Lo} | {Nl}
IdentifierExtend ::= {Mn} | {Mc} | {Nd} | {Pc} | {Cf}
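The three productions above translate almost directly into a check on general categories. The following is a minimal Python sketch, assuming the general categories reported by the standard unicodedata module (which reflect whatever Unicode version the Python build ships with); the function name is hypothetical.

```python
import unicodedata

ID_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}
ID_CONTINUE = ID_START | ID_EXTEND


def is_unicode_identifier(s: str) -> bool:
    """Identifier ::= IdentifierStart (IdentifierStart | IdentifierExtend)*"""
    if not s or unicodedata.category(s[0]) not in ID_START:
        return False
    return all(unicodedata.category(c) in ID_CONTINUE for c in s[1:])
```

Under this syntax LOW LINE is {Pc} and so may continue an identifier but not start one, and HYPHEN-MINUS ({Pd}) is not allowed at all.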
Notes:
Suggestion for new syntax for XML (v. 2.0?) identifiers
Assume that XML (v. 2.0(?) or later) documents declare the Unicode version
with something like
<?xml version="2.0" unicode-version="3.2"?>
and that the identifier syntax (for XML v. 2.0) is tied to Unicode character
property values, rather than listing code points directly in the syntax for
XML (v. 2.0?).
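To make the idea concrete, the following Python sketch reads such a declaration and checks whether the declared Unicode version is covered by the character database available to the processor. The unicode-version pseudo-attribute exists only in this suggestion, the regular expression is deliberately simplified (double quotes only, fixed attribute order), and the function names are hypothetical.

```python
import re
import unicodedata

# Simplified pattern for the suggested declaration; not a full XML parser.
_DECL = re.compile(
    r'<\?xml\s+version="([^"]+)"\s+unicode-version="([^"]+)"\s*\?>'
)


def parse_declaration(text: str):
    """Return (xml_version, unicode_version), or None if the text does not match."""
    m = _DECL.match(text)
    if m is None:
        return None
    return m.group(1), m.group(2)


def _as_tuple(version: str):
    return tuple(int(part) for part in version.split("."))


def unicode_version_supported(declared: str) -> bool:
    """True if the declared version is not newer than the local character database."""
    return _as_tuple(declared) <= _as_tuple(unicodedata.unidata_version)
```

A processor could then look up property values for the declared version, or reject documents that declare a version newer than it knows.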
Identifier (Name and Nmtoken) identity should be based on compatibility
equivalence (see NFKC or NFKD), with the following additions: all {Pd}
characters are equivalent to one another, all {Pc} characters are equivalent
to one another, and all full stops are equivalent to one another. (If a
subset of {Cf} characters is allowed, those characters should be ignored for
identifier name equality.)
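One way to realise this identity rule is to map every identifier to a comparison key and compare the keys. The Python sketch below does that under several assumptions that are not part of the suggestion itself: NFKC is used rather than NFKD, HYPHEN-MINUS, LOW LINE and FULL STOP are chosen as class representatives, and "full stops" is taken to mean FULL STOP and IDEOGRAPHIC FULL STOP. The function name is hypothetical.

```python
import unicodedata

# Assumed set of "full stops": FULL STOP and IDEOGRAPHIC FULL STOP.
FULL_STOPS = {".", "\u3002"}


def identity_key(name: str) -> str:
    """Map an identifier to a key; two identifiers are 'the same' if keys match."""
    out = []
    for ch in unicodedata.normalize("NFKC", name):
        cat = unicodedata.category(ch)
        if cat == "Cf":
            continue            # format characters are ignored for equality
        if cat == "Pd":
            out.append("-")     # all dashes fold to HYPHEN-MINUS
        elif cat == "Pc":
            out.append("_")     # all connectors fold to LOW LINE
        elif ch in FULL_STOPS:
            out.append(".")     # all full stops fold to FULL STOP
        else:
            out.append(ch)
    return "".join(out)
```

For example, identity_key("foo\u2010bar") and identity_key("foo-bar") give the same key, because U+2010 HYPHEN is a {Pd} character. Case is not folded, so "ab" and "aB" remain distinct.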
This proposal makes no great effort to be backward compatible with the
strange edge cases that XML 1.0 allows. Hopefully they have never been used
anyway...
Name ::= NameStart (Connect? NmtokenStart)*
Nmtoken ::= NmtokenStart (Connect? NmtokenStart)*
NameStart ::= Letter | ':' | {Pc}
/* [{Pc} generalises LOW LINE; move ':' and {Pc} to Connect!] */
NmtokenStart ::= NameStart | {Nd} | {No}
Letter ::= ({Ll} | {Lu} | {Lt} | {Lo} | {Lm} | {Nl}[??]) ({Mc} | {Mn})*
/* [{Nl}s and other letters with compatibility decompositions will be
NFK*ed away...; IDSes?; Hangul Jamo?; language independent grapheme
syntax?] */
Connect ::= {Pd} | '.' | {Ideographic full stop}
/* [{Pd} generalises HYPHEN-MINUS; true apostrophe?; middle dot?] */
/* [move ':' and {Pc} here?] */
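The suggested productions can be tried out with a small hand-written matcher. The Python sketch below follows the grammar exactly as written above (with ':' and {Pc} still in NameStart, not in Connect) and uses the general categories from the standard unicodedata module; it does not apply NFKC first, and all function names are hypothetical.

```python
import unicodedata

LETTER_BASE = {"Ll", "Lu", "Lt", "Lo", "Lm", "Nl"}
MARKS = {"Mc", "Mn"}
IDEOGRAPHIC_FULL_STOP = "\u3002"


def _cat(ch: str) -> str:
    return unicodedata.category(ch)


def _name_start(s: str, i: int) -> int:
    """Index just past one NameStart at s[i:], or -1 if none matches."""
    if i >= len(s):
        return -1
    ch = s[i]
    if ch == ":" or _cat(ch) == "Pc":
        return i + 1
    if _cat(ch) in LETTER_BASE:
        i += 1
        while i < len(s) and _cat(s[i]) in MARKS:
            i += 1              # Letter carries its trailing ({Mc} | {Mn})*
        return i
    return -1


def _nmtoken_start(s: str, i: int) -> int:
    """Index just past one NmtokenStart at s[i:], or -1 if none matches."""
    j = _name_start(s, i)
    if j != -1:
        return j
    if i < len(s) and _cat(s[i]) in ("Nd", "No"):
        return i + 1
    return -1


def _is_connect(ch: str) -> bool:
    return ch in (".", IDEOGRAPHIC_FULL_STOP) or _cat(ch) == "Pd"


def _tail(s: str, i: int) -> bool:
    """Match (Connect? NmtokenStart)* against all of s[i:]."""
    while i < len(s):
        if _is_connect(s[i]):
            i += 1              # at most one Connect before each NmtokenStart
        i = _nmtoken_start(s, i)
        if i == -1:
            return False
    return True


def is_name(s: str) -> bool:
    j = _name_start(s, 0)
    return j != -1 and _tail(s, j)


def is_nmtoken(s: str) -> bool:
    j = _nmtoken_start(s, 0)
    return j != -1 and _tail(s, j)
```

With this grammar "foo-bar" and "a.1" are Names, while "foo--bar", "foo-" and "-foo" are not, since a Connect must sit between two non-Connect characters. These are among the XML 1.0 edge cases that the suggestion drops.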
This is just a first suggestion, and comments are welcome.