Re: Rationale wanted for Unicode identifier rules

From: Tex Texin (texin@progress.com)
Date: Mon Mar 06 2000 - 00:49:40 EST


Mark,
I understood your concern for data and interoperability. I don't
understand why XML is more of a concern than other widely used
file formats.

1) not all files use all characters
2) not all files are read by all users

So the upgrade requirement can be attributed to the users who need to
read the files containing the new characters. It does not need to be a
constraint on the file producers (although producers will probably
identify and recognize their target market's needs). Version numbering
makes it simple to determine whether a file will be readable.
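
As a rough sketch of that determination (an illustration only, not
anything defined by XML): a reader can ask whether every character in a
name was already assigned in the Unicode version its parser knows. The
example assumes Python's unicodedata module. Its ucd_3_2_0 object
stands in for an "older parser" table, since 3.2.0 is the only older
snapshot Python ships (not 2.0 or 3.0), and the function names are
made up for the example.

```python
import unicodedata

def unassigned_chars(text, ucd=unicodedata):
    """Return the characters of text that are unassigned (Cn) in ucd."""
    return [ch for ch in text if ucd.category(ch) == "Cn"]

def readable_by(text, ucd=unicodedata):
    """True if every character of text is assigned in the given table."""
    return not unassigned_chars(text, ucd)

# Stand-in for a parser that was built against an older Unicode version.
OLD = unicodedata.ucd_3_2_0

if __name__ == "__main__":
    # U+1E9E was added after Unicode 3.2, so only the newer table knows it.
    for name in ("element", "\u1e9e"):
        print(repr(name), readable_by(name, OLD), readable_by(name))
```

A file that uses only long-established characters passes against both
tables, so only the readers of files with new characters need the
newer version.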

This situation occurs analogously for many other file formats and
products, and seems to work itself out. Initially, producers don't use
the new version unless it provides functionality needed for the
particular task. At some point the number of documents using the new
version is large enough that the majority of users are forced to
upgrade.

PDF went from 3 to 4, Word for Windows changes formats with new
versions... These are also widely distributed file formats that have
the same interoperability issues. I don't see that XML is in a special
class, where we need to constrain producers. We can let the market sort
it out, and I don't think it will be that destructive. (Maybe that is
the point I am missing.)

At least we have a great advantage today over previous file formats, in
that Unicode 3.0 certainly accommodates a greater collection of
characters than the world has had before, especially relative to past
requirements for multi-codepage implementations. Once XML moves to
Unicode 3.0, there won't subsequently be as many files needing post-3.0
characters. (Certainly, there will be a need for new characters, but
not as many files will need these characters...)

tex

Mark Davis wrote:
>
> Because XML files will be distributed broadly, as broadly as HTML, that means that in effect nobody can use the new characters, because you can't be sure that parsers on the other side will accept them. This is different than the compiler case, where you have much more control over the versioning. Since this is data, it can potentially end up anywhere.
>
> Your suggestion of adding the additional identifier characters (categorized into 2 groups) is a possibility. If restricted to the actual characters required by the document it might not be too bad. I'd have to ask the parser people whether this would be easy or hard (it would definitely impact performance, I suspect).
>
> Mark
>
> Tex Texin wrote:
>
> > On successive upgrades, #2-
> >
> > No language can guarantee downward compatibility (without
> > remaining static itself).
> >
> > As long as character properties, once defined, do not change, then
> > at least XML can have upward compatibility, so something defined
> > with XML 1.0 should continue to work with successive upgrades.
> >
> > I don't see why it would be unacceptable to have the scenario
> > where XML 1.0 files continue to work, but something that
> > takes advantage of functionality in a later version requires
> > that version or later. It's pretty much the way of the world.
> >
> > There is perhaps one other scenario-
> > If there is a way for an XML file to optionally carry a definition of
> > character properties with it, then it can be downward compatible.
> > Of course you wouldn't want to define all characters, maybe just those
> > defined later than some version. Then it would be able to be parsed
> > by versions down to whatever version was desired.
> >
> > (Perhaps you would want to validate that no character already defined
> > in the parser was having its properties changed or overridden.
> > I am not sure if this is needed or not yet.)
> >
> > It would mean a parser would have to be able to append to its
> > character property table for the duration of the processing of the
> > file and then return to its original state for the next file.
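
A minimal sketch of that append-then-restore behaviour, under stated
assumptions: the PropertyTable class, its extended() method, and the
idea that a file's declarations arrive as a simple character-to-category
mapping are all hypothetical. Nothing like this exists in XML 1.0. The
sketch also applies the validation mentioned above by refusing to
override a character the parser already knows.

```python
from contextlib import contextmanager

class PropertyTable:
    """Toy character-property table: character -> general category."""

    def __init__(self, base):
        self._base = dict(base)   # properties built into the parser
        self._extra = {}          # properties declared by the current file

    def category(self, ch):
        if ch in self._base:
            return self._base[ch]
        return self._extra.get(ch, "Cn")  # unknown means unassigned

    @contextmanager
    def extended(self, declared):
        """Add file-declared properties for the duration of one file."""
        clash = sorted(ch for ch in declared if ch in self._base)
        if clash:
            raise ValueError("file redefines known characters: %r" % clash)
        saved = self._extra
        self._extra = dict(declared)
        try:
            yield self
        finally:
            self._extra = saved   # back to the original state for the next file

if __name__ == "__main__":
    table = PropertyTable({"a": "Ll", "1": "Nd"})
    with table.extended({"\u1e9e": "Lu"}) as t:
        print(t.category("\u1e9e"))   # declared by the file
    print(table.category("\u1e9e"))   # forgotten once the file is done
```

Keeping the file's declarations in a separate dictionary means the
built-in table is never modified, so restoring the original state is
just a matter of dropping the extra entries.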
> >
> > tex
> >
> > Mark Davis wrote:
> > >
> > > In general, I agree with the discussion here: identifiers should be chosen on the basis of character properties. As new characters are assigned, they are given appropriate properties, and the class of possible identifiers grows. There are, however, difficulties with this approach in certain contexts.
> > >
> > > Take, for example, XML identifiers. The difference in this case is that the identifiers occur in structured data, not program text. This data will live for years. The conformance requirements for XML identifiers are very strict. This is absolutely correct, since it guarantees compatibility around the world. But what this means is that the current, conformant XML parsers cannot accept new Unicode 3.0 letters in identifiers. There are a few main approaches to identifiers in XML, listed below.
> > >
> > > [One note that is relevant to all of these: while <identifier_extend> includes Cf, these characters should be filtered when composing identifiers, so there are actually 4 relevant categories for parsing identifiers. However, there are reserved blocks (2060-206F and E0000-E1000) now for Cf characters, so they should not present a problem.]
> > >
> > > 1. Status quo.
> > > Never accept characters outside of Unicode 2.0 in identifiers. Downside: new scripts, and additions (e.g. CJK ideographs) to existing scripts are disadvantaged -- forever.
> > >
> > > <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl -- as of Unicode 2.0
> > > <identifier_extend> := Mn, Mc, Nd, Pc, Cf -- as of Unicode 2.0
> > > <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 2.0
> > >
> > > 2. Successive upgrades.
> > > Revise XML with each version of Unicode. This means you will have XML 1.0-compliant parsers, XML 1.1-compliant parsers, etc. Disadvantage: it takes years for compliant parsers to be fully spread across the world. During that time, data interchange between different versions of parsers cannot be guaranteed. I believe this will be unacceptable to the XML community.
> > >
> > > 3. Open Season.
> > > Define identifiers to be *fixed* as of Unicode 3.0, but to also include unassigned characters (Cn) as of that version.
> > >
> > > Identifiers are thus fixed for all time. They include all new letters that will be defined. Disadvantage: they will also include new punctuation, symbols, etc. defined post-3.0.
> > >
> > > <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cn -- as of Unicode 3.0
> > > <identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cn -- as of Unicode 3.0
> > > <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So -- as of Unicode 3.0
> > >
> > > 4. Restricted Open Season.
> > > The Unicode consortium divides up the unassigned space in more detail, and specifies that <excluded> characters and <identifier_extend> characters can only be allocated in the future within certain blocks. This has the effect of dividing Cn into subcategories: Cni, Cne, and Cnx. While characters will change from each of these to other properties over time as characters become assigned, the three relevant categories will remain unchanged.
> > >
> > > Since future allocations will not disturb the identifier syntax, identifiers are thus fixed for all time. Disadvantage: the consortium as a whole has resisted such assignment of blocks for unassigned characters in the past (except Cf).
> > >
> > > <identifier_start> := Lu, Ll, Lt, Lm, Lo, Nl, AND Cni
> > > <identifier_extend> := Mn, Mc, Nd, Pc, Cf, AND Cne
> > > <excluded> := Me, No, Zs, Zl, Zp, Cc, Pd, Ps, Pe, Pi, Pf, Po, Sm, Sc, Sk, So, AND Cnx
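
To make the difference between approaches 1 and 3 concrete, here is a
small sketch of the category sets listed above. It is an illustration
under assumptions, not the XML name rules: it uses Python's unicodedata
module, takes its ucd_3_2_0 table as the "fixed" version (Python ships
no 2.0 or 3.0 snapshot), assumes that start characters may also appear
after the first position, and skips the Cf filtering mentioned in the
note. The is_identifier name and the open_season flag are made up.

```python
import unicodedata

IDENTIFIER_START = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
IDENTIFIER_EXTEND = {"Mn", "Mc", "Nd", "Pc", "Cf"}

def is_identifier(name, ucd=unicodedata, open_season=False):
    """Check name against the category sets, using the table in ucd.

    With open_season=True, unassigned characters (Cn) are accepted in
    both positions, as in approach 3.
    """
    start = IDENTIFIER_START | ({"Cn"} if open_season else set())
    rest = start | IDENTIFIER_EXTEND
    if not name:
        return False
    cats = [ucd.category(ch) for ch in name]
    return cats[0] in start and all(c in rest for c in cats[1:])

# Stand-in for the version at which identifiers are frozen.
FIXED = unicodedata.ucd_3_2_0

if __name__ == "__main__":
    letter = "\u1e9e"   # a letter assigned after Unicode 3.2
    symbol = "\u20b9"   # a currency symbol assigned after Unicode 3.2
    for ch in (letter, symbol):
        print(hex(ord(ch)),
              is_identifier(ch, FIXED),
              is_identifier(ch, FIXED, open_season=True))
```

Against the frozen table, the status quo rejects both the later letter
and the later symbol, while open season accepts both. That is the
stated disadvantage of approach 3: the parser cannot tell a future
letter from a future symbol.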
> > >
> > > Mark
> >
> > --
> > Progress is a proud sponsor of the 16th International Unicode Conference
> > March 27-30, 2000 in Amsterdam, Holland
> > http://www.unicode.org/iuc/iuc16/index.html
> > See our panel on Open Source Approaches to Unicode Libraries
> > http://www.unicode.org/iuc/iuc16/a206.html
> > ------------------------------------------------------------------------------------------------
> > Tex Texin Director, International Products
> >
> > Progress Software Corp. +1-781-280-4271
> > 14 Oak Park +1-781-280-4655 (Fax)
> > Bedford, MA 01730 USA texin@bedford.progress.com
> >
> > http://www.progress.com The #1 Embedded Database
> > http://www.SonicMQ.com JMS Compliant Messaging- Best Middleware
> > Award
> > http://www.aspconnections.com Leading provider in the ASP marketplace
> >
> > Progress Globalization Program
> > http://www.progress.com/services/partners/globalization/index.htm
> > ------------------------------------------------------------------------------------------------
> > Spanish Proverb: Don't speak unless you can improve on the silence.
> > Tex's Proverb: Don't email unless you can improve on the screen saver.



