Re: Rationale wanted for Unicode identifier rules

From: Mark Davis ([email protected])
Date: Mon Mar 06 2000 - 10:52:03 EST

Next message: Peter Constable: "Re: U+xxxx, U-xxxxxx, and the basics"
Previous message: Dan Oscarsson: "Re: U+xxxx, U-xxxxxx, and the basics"
In reply to: Tex Texin: "Re: Rationale wanted for Unicode identifier rules"
Next in thread: Tex Texin: "Re: Rationale wanted for Unicode identifier rules"
Reply: Tex Texin: "Re: Rationale wanted for Unicode identifier rules"

1. There are a few differences between XML and PDF. XML is a meta-data structure, and is intended to be stable across the whole world of different data applications. With PDF, we are typically talking about a single source vendor. Making a conformance change in XML affects hundreds of conformant parsers, plus the tens of thousands of different products that those parsers will be used in, in many complex business-to-business applications. You don't want one link in the chain to be broken just because of a difference in identifier syntax.

2. For those unfamiliar with XML, the identifiers in question are part of the structure of the data file, not part of the actual data. The data itself can be Unicode 3.0 (or 4.0, or whatever).

For data, see:
http://www.w3.org/TR/REC-xml#NT-Char

For identifiers, see:
http://www.w3.org/TR/REC-xml#NT-NameChar
which depend on:
http://www.w3.org/TR/REC-xml#NT-Letter
http://www.w3.org/TR/REC-xml#NT-Digit
http://www.w3.org/TR/REC-xml#NT-CombiningChar
http://www.w3.org/TR/REC-xml#NT-Extender

For example:

<name>&#xA000;&#xA001;</name>
is a valid segment of an XML 1.0 file, with a Yi data field, where the name of the data field is the English word "name". (The Yi is escaped in this case, since I am using an ASCII encoding. If we were exchanging UTF-8 it wouldn't be.)

<ネム>name</ネム>
is a valid segment of an XML 1.0 file, with English data, and the name of the data field is Japanese Katakana "NEMU".

<ꀀꀁ>name</ꀀꀁ>
is invalid XML 1.0. It cannot be accepted by any conformant XML parser. The reason is that it uses Yi letters in a field name, which is not allowed.
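
To make the three cases above concrete, here is a minimal Python sketch of a name check. It assumes a deliberately tiny subset of the XML 1.0 Letter production (ASCII letters, Hiragana, Katakana, the main CJK ideograph range, and Hangul syllables), and it ignores the digits, combining characters, and extenders that the real NameChar production also permits. The table and the helper names are illustrative only and are not taken from any parser; the full productions are at the URLs above.

```python
# Partial, illustrative subset of the XML 1.0 Letter production.
# The real production lists many more ranges (see NT-Letter above).
LETTER_RANGES_SUBSET = [
    (0x0041, 0x005A),  # A-Z
    (0x0061, 0x007A),  # a-z
    (0x3041, 0x3094),  # Hiragana
    (0x30A1, 0x30FA),  # Katakana
    (0x4E00, 0x9FA5),  # CJK ideographs
    (0xAC00, 0xD7A3),  # Hangul syllables
]


def is_letter_subset(ch):
    """True if ch falls in one of the ranges listed above."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in LETTER_RANGES_SUBSET)


def is_name_subset(name):
    """Hypothetical helper: accept a name only if it is non-empty and
    every character is a letter from the subset table."""
    return bool(name) and all(is_letter_subset(ch) for ch in name)


# "name", Katakana NEMU, and the two Yi syllables U+A000 U+A001.
for candidate in ["name", "\u30CD\u30E0", "\uA000\uA001"]:
    print(ascii(candidate), is_name_subset(candidate))
```

The Yi syllables fail simply because no range in the fixed table covers them, which is the situation a conformant XML 1.0 parser is in.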

3. For me, the question is the lesser of different evils. Allowing some future symbols into identifiers seems much, much less problematic than rendering XML parsers periodically non-conformant. Ideally (from my point of view) the UTC would allocate blocks, as I discussed in my previous message, and avoid the problem. However, the issue is not cut and dried; reasonable people can disagree on this, of course.
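
The trade-off can be sketched in Python using the category lists from my earlier message (quoted below). This is only an illustration under stated assumptions: Python's standard library ships a frozen Unicode 3.2.0 database as unicodedata.ucd_3_2_0, and I use it here as a stand-in for "as of Unicode 3.0"; the classify function and its open_season flag are hypothetical names, not part of any XML parser.

```python
import unicodedata

# Assumption: the frozen Unicode 3.2.0 database stands in for Unicode 3.0.
FROZEN = unicodedata.ucd_3_2_0

START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}   # <identifier_start>
EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}        # <identifier_extend>


def classify(ch, open_season):
    """Classify ch against the frozen database.

    With open_season=True, characters unassigned (Cn) in the frozen
    version are accepted. Cn is in both the start and extend lists, so
    "start" is returned, meaning the character is allowed anywhere.
    """
    cat = FROZEN.category(ch)
    if cat == "Cn" and open_season:
        return "start"
    if cat in START:
        return "start"
    if cat in EXTEND:
        return "extend"
    return "excluded"


# U+1F600 was unassigned in 3.2 and was later assigned as a symbol.
for ch in ["a", "1", "$", "\U0001F600"]:
    print(ascii(ch), classify(ch, open_season=False), classify(ch, open_season=True))
```

With the flag off, the later-assigned character is rejected forever; with it on, it is accepted forever, even though it turned out to be a symbol rather than a letter. That is exactly the lesser-evil choice described in point 3.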

Mark

Tex Texin wrote:

> Mark,
> I understood your concern for data and interoperability. I don't
> understand why XML is more of a concern than other widely used
> file formats.
>
> 1) all files do not use all characters
> 2) all files are not read by all users
>
> So the upgrade requirements can be attributed to the users that need to
> read the files with the new characters. It does not need to be a
> constraint on the file producers (although producers will probably
> identify and recognize their target market's needs). Version numbering
> simplifies determining whether the file will be readable.
>
> This situation occurs analogously for many other file formats and
> products and seems to work itself out. Initially, producers don't use
> the new version unless it provides functionality needed for the
> particular task. At some point the number of documents using the new
> version is large enough that the majority of users are forced to upgrade.
>
> PDF went from 3 to 4, Word for Windows changes formats with new
> versions... These are also widely distributed file formats that have
> the same interoperability issues. I don't see that XML is in a special
> class, where we need to constrain producers. We can let the market
> sort it out and I don't think it will be that destructive. (Maybe that
> is the point I am missing.)
>
> At least we have a great advantage today, over previous file formats, in
> that Unicode 3.0 certainly accommodates a greater collection of
> characters than the world has had before, especially relative to past
> requirements for multi-codepage implementations. Once XML moves to
> Unicode 3.0, there subsequently won't be as many files needing post-3.0
> characters. (Certainly, there will be need for new characters, but not
> as many files will need these characters...)
>
> tex
>
> Mark Davis wrote:
> >
> > Because XML files will be distributed broadly, as broadly as HTML, that means that in effect nobody can use the new characters, because you can't be sure that parsers on the other side will accept them. This is different from the compiler case, where you have much more control over the versioning. Since this is data, it can potentially end up anywhere.
> >
> > Your suggestion of adding the additional identifier characters (categorized into 2 groups) is a possibility. If restricted to the actual characters required by the document it might not be too bad. I'd have to ask the parser people whether this would be easy or hard (it would definitely impact performance, I suspect).
> >
> > Mark
> >
> > Tex Texin wrote:
> >
> > > On successive upgrades, #2-
> > >
> > > No language can guarantee downward compatibility (without
> > > remaining static itself).
> > >
> > > As long as character properties, once defined, do not change, then
> > > at least XML can have upward compatibility, so something defined
> > > with XML 1.0 should continue to work with successive upgrades.
> > >
> > > I don't see why it would be unacceptable to have the scenario
> > > where XML 1.0 files continue to work, but something that
> > > takes advantage of functionality in a later version requires
> > > that version or later. It's pretty much the way of the world.
> > >
> > > There is perhaps one other scenario-
> > > If there is a way for an XML file to optionally carry a definition of
> > > character properties with it, then it can be downward compatible.
> > > Of course you wouldn't want to define all characters, maybe just those
> > > defined later than some version. Then it would be able to be parsed
> > > by versions down to whatever version was desired.
> > >
> > > (Perhaps you would want to validate that no character already defined
> > > in the parser was having its properties changed or overridden.
> > > I am not sure if this is needed or not yet.)
> > >
> > > It would mean a parser would have to be able to append to its
> > > character property table for the duration of the processing of the
> > > file and then return to its original state for the next file.
> > >
> > > tex
> > >
> > > Mark Davis wrote:
> > > >
> > > > In general, I agree with the discussion here: identifiers should be chosen on the basis of character properties. As new characters are assigned, they are given appropriate properties, and the class of possible identifiers grows. There are, however, difficulties with this approach in certain contexts.
> > > >
> > > > Take, for example, XML identifiers. The difference in this case is that the identifiers occur in structured data, not program text. This data will live for years. The conformance requirements for XML identifiers are very strict. This is absolutely correct, since it guarantees compatibility around the world. But what this means is that the current, conformant XML parsers cannot accept new Unicode 3.0 letters in identifiers. There are a few main approaches to identifiers in XML, listed below.
> > > >
> > > > [One note that is relevant to all of these: while <identifier_extend> includes Cf, these characters should be filtered when composing identifiers, so there are actually 4 relevant categories for parsing identifiers. However, there are reserved blocks (2060-206F and E0000-E1000) now for Cf characters, so they should not present a problem.]
> > > >
> > > > 1. Status quo.
> > > > Never accept characters outside of Unicode 2.0 in identifiers. Downside: new scripts, and additions (e.g. CJK ideographs) to existing scripts are disadvantaged -- forever.
> > > >
> > > > <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl -- as of Unicode 2.0
> > > > <identifier_extend> := Mn, Mc, Nd, Pc, Cf -- as of Unicode 2.0
> > > > <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 2.0
> > > >
> > > > 2. Successive upgrades.
> > > > Revise XML with each version of Unicode. This means you will have XML 1.0-compliant parsers, XML 1.1-compliant parsers, etc. Disadvantage: it takes years for compliant parsers to be fully spread across the world. During that time, data interchange between different versions of parsers cannot be guaranteed. I believe this will be unacceptable to the XML community.
> > > >
> > > > 3. Open Season.
> > > > Define identifiers to be *fixed* as of Unicode 3.0, but to also include unassigned characters (Cn) as of that version.
> > > >
> > > > Identifiers are thus fixed for all time. They include all new letters that will be defined. Disadvantage: they will also include new punctuation, symbols, etc. defined post-3.0.
> > > >
> > > > <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cn -- as of Unicode 3.0
> > > > <identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cn -- as of Unicode 3.0
> > > > <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 3.0
> > > >
> > > > 4. Restricted Open Season.
> > > > The Unicode consortium divides up the unassigned space in more detail, and specifies that <excluded> characters and <identifier_extend> characters can only be allocated in the future within certain blocks. This has the effect of dividing Cn into subcategories: Cni, Cne, and Cnx. While characters will change from each of these to other properties over time as characters become assigned, the three relevant categories will remain unchanged.
> > > >
> > > > Since future allocations will not disturb the identifier syntax, identifiers are thus fixed for all time. Disadvantage: the consortium as a whole has resisted such assignment of blocks for unassigned characters in the past (except Cf).
> > > >
> > > > <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cni
> > > > <identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cne
> > > > <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So, AND Cnx
> > > >
> > > > Mark
> > >
> > > --
> > > Progress is a proud sponsor of the 16th International Unicode Conference
> > > March 27-30, 2000 in Amsterdam, Holland
> > > http://www.unicode.org/iuc/iuc16/index.html
> > > See our panel on Open Source Approaches to Unicode Libraries
> > > http://www.unicode.org/iuc/iuc16/a206.html
> > > ------------------------------------------------------------------------------------------------
> > > Tex Texin Director, International Products
> > >
> > > Progress Software Corp. +1-781-280-4271
> > > 14 Oak Park +1-781-280-4655 (Fax)
> > > Bedford, MA 01730 USA [email protected]
> > >
> > > http://www.progress.com The #1 Embedded Database
> > > http://www.SonicMQ.com JMS Compliant Messaging- Best Middleware
> > > Award
> > > http://www.aspconnections.com Leading provider in the ASP marketplace
> > >
> > > Progress Globalization Program
> > > http://www.progress.com/services/partners/globalization/index.htm
> > > ------------------------------------------------------------------------------------------------
> > > Spanish Proverb: Don't speak unless you can improve on the silence.
> > > Tex's Proverb: Don't email unless you can improve on the screen saver.
>


This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:59 EDT