Re: Wanted: synonyms for Age

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jul 27 2009 - 15:08:10 CDT

Next message: karl williamson: "Re: Wanted: synonyms for Age"

Previous message: Kenneth Whistler: "Re: Wanted: synonyms for Age"
Maybe in reply to: karl williamson: "Wanted: synonyms for Age"
Next in thread: karl williamson: "Re: Wanted: synonyms for Age"
Reply: karl williamson: "Re: Wanted: synonyms for Age"

Karl Williamson noted:

> Apparently that is what Asmus and others think as well,

Add me to that list.

> and it certainly
> is the data that comes in DerivedAge.txt,

And in the XML data derived from it, as well -- which is Eric's
point, I think.

> and if that were truly the
> case, I wouldn't have any problem with the term "Age".

Well, then you're all set! ;-)

> But let me quote
> from the header of that file:
> # Caution: When using the Age *property*, all assigned code points
> # in each version are included, not just the newly assigned code points.
> # For more information, see http://www.unicode.org/reports/tr18/
>
> And, if you look at tr18, it says:
>
> "
> Caution: The DerivedAge data file in the UCD provides the deltas between
> versions, for compactness. However, when using the property all
> characters included in that version are included. Thus \p{age=3.0}
> includes the letter a, which was included in Unicode 1.0. To get
> characters that are new in a particular version, subtract off the
> previous version as described in 1.3 Subtraction and Intersection. For
> example: [\p{age=3.1} -- \p{age=3.0}]
> "
>
> So either you guys are wrong, or the documentation is wrong in at least
> two places.

The documentation is wrong in two places -- or at least
misleading. Note that it doesn't actually say the property
is *defined* thus and such, but rather that "when using the
property all characters included in that version are included."
That amounts to a pocket definition of a new derived property
(or actually set of properties) based on the use of the Age property
per se.

This is one of those cases where an insufficiently carefully
documented property is trying to have it both ways.

Age is an enumerated property in the UCD. Among other things, that
means that its values constitute a codespace partition. Each
code point has one and only one value of the property. Both
the values in DerivedAge.txt and in the XML data files reflect
that interpretation.

The property defined that way is not, however, as useful as the
property described the way it is used for regex matches in UTS #18,
because it is far more useful for regex matches to know if a
character is included in Unicode Version X (or any *earlier*
version), rather than to know if it was encoded exactly in
Version X. So the usage of the Age property in UTS #18 just
blithely assumes that interpretation, and the caution at the
top of DerivedAge.txt reflects that interpretation, even though
it is in direct contradiction with the data itself.

Note that there are no character properties in the UCD actually
defined the way the Caution at the top of DerivedAge.txt currently
implies Age is interpreted. If you think this through, for
example, interpreted that way, U+0041 would have multiple
Age property values: 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, 4.0,
4.1, 5.0, 5.1, and soon, 5.2, because it would match a
\p{age=n.n} expression for any of those values. Every character
would continue to accumulate new Age values as future versions
of the standard are published.
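
To make the contrast concrete, here is a minimal Python sketch of the
two readings. The AGE table is a tiny hand-picked sample for
illustration, not the UCD data, and the function names age_exact,
included_in, and matching_versions are hypothetical.

```python
# Published versions as listed in this message, oldest first.
VERSIONS = ["1.1", "2.0", "2.1", "3.0", "3.1", "3.2", "4.0", "4.1", "5.0", "5.1"]

# Illustrative sample only: each code point has exactly one Age value,
# which is the partition reading reflected in DerivedAge.txt.
AGE = {0x0041: "1.1", 0x20AC: "2.1", 0x10400: "3.1"}


def age_exact(cp, version):
    """Partition reading: true only for the version that first encoded cp."""
    return AGE.get(cp) == version


def included_in(cp, version):
    """Cumulative reading: true if cp was encoded in `version` or earlier."""
    age = AGE.get(cp)
    if age is None:
        return False
    return VERSIONS.index(age) <= VERSIONS.index(version)


def matching_versions(cp):
    """All versions that cp would match under the cumulative reading."""
    return [v for v in VERSIONS if included_in(cp, v)]


if __name__ == "__main__":
    # Under the cumulative reading, U+0041 matches every version listed.
    print(matching_versions(0x0041))
    # Under the partition reading, it matches only its single Age value.
    print([v for v in VERSIONS if age_exact(0x0041, v)])
```

Under the cumulative reading the list for U+0041 grows with every
release, which is why that reading cannot be a single enumerated
property value.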

> I have to assume that the documentation is right until
> shown otherwise; and if it is correct, I think that proves my point. If
> experienced people who work with Unicode all the time don't understand
> what this property is, then something is wrong, and at a minimum a new
> alias is needed to clarify things.

There is definitely need for clarification here.

> I also don't think that in these days of abundant cheap storage that the
> Consortium should be worrying about compactness.

Compactness is not the primary concern driving maintenance of
UCD properties (and files) by the way.

> I believe every
> property that is exposed in the UCD should have a fully derived version
> available, probably in the extracted directory. In 5.2 Beta, the only
> properties and property values that the user has to derive (except for
> defaults) are Age, gc=LC, gc=C, gc=L gc=M, gc=N, gc=P, gc=S, and gc=Z.

However, none of those are actually property values per se. They
are certainly not *extracted* values.

Each of those is a different kind of derived property value.

So gc=L (which I assume you meant, rather than "gc=LC") is actually
not a value of General_Category proper at all, but rather the
union of the set of characters with five different values:

(gc=Lu) | (gc=Ll) | (gc=Lt) | (gc=Lo) | (gc=Lm)

While it is certainly easy to derive such sets from the data, it
is also perfectly reasonable to ask for pre-derived listings of
such derived values in the UCD. It would be up to the UTC to
decide whether the extra work to maintain additional derived
values for each release is worth the benefit in such cases. Note
that ICU provides a generic Unicode set notation that makes it
trivial to construct such sets.
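
As an illustration of how small that derivation is, here is a sketch
using Python's standard unicodedata module rather than ICU. The helper
name is_gc_L is hypothetical, and the results reflect whichever Unicode
version the running Python build ships with.

```python
import unicodedata

# The five General_Category values whose union makes up gc=L.
LETTER_SUBCATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Lm"}


def is_gc_L(ch):
    """Return True if the character's General_Category is one of the L values."""
    return unicodedata.category(ch) in LETTER_SUBCATEGORIES


if __name__ == "__main__":
    for ch in ["A", "a", "\u01C5", "\u05D0", "\u02B0", "1", " "]:
        print(f"U+{ord(ch):04X}", unicodedata.category(ch), is_gc_L(ch))
```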

Also, regarding "Age", what you are asking in this case would be
not *one* derived property, but rather a distinct derived
binary property for *each* Unicode version. I.e.:

Included_In_Version_1_1 --> (Age=1.1)

Included_In_Version_2_0 --> (Age=1.1) | (Age=2.0)

Included_In_Version_2_1 --> (Age=1.1) | (Age=2.0) | (Age=2.1)

Included_In_Version_3_0 --> (Age=1.1) | (Age=2.0) | (Age=2.1) | (Age=3.0)

etc., etc., for each succeeding version.

IMO, it isn't actually worth the effort to define and maintain
such a list of derived property values (or equivalently, just
the sets of characters, without actually *naming* the properties
they assume), when the derivations are so trivial based on
the existing DerivedAge.txt file. This is especially true for
that particular file, because it is ordered by version: all you
have to do is delete all the entries that follow the section for
the Age of concern, and the remaining entries define your set
in question. No programming necessary. :-)
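
For anyone who would rather script it than edit the file by hand, the
same derivation can be sketched in a few lines of Python. The SAMPLE
text only imitates the "range ; age # comment" layout of DerivedAge.txt
and is not the real file; parse_age_lines, version_key, and
included_in_version are hypothetical names.

```python
# Illustrative lines in the DerivedAge.txt layout (not the real file).
SAMPLE = """\
# sample data for illustration only
0000..01F5    ; 1.1 # sample range
20AC          ; 2.1 # sample single code point
10400..10425  ; 3.1 # sample range
"""


def parse_age_lines(text):
    """Parse 'XXXX[..YYYY] ; age # comment' lines into (start, end, age)."""
    entries = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        cps, age = (field.strip() for field in data.split(";"))
        first, _, last = cps.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        entries.append((start, end, age))
    return entries


def version_key(version):
    """Turn '3.1' into (3, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def included_in_version(entries, version):
    """Ranges whose Age is `version` or earlier (the cumulative set)."""
    limit = version_key(version)
    return [(s, e) for s, e, age in entries if version_key(age) <= limit]


if __name__ == "__main__":
    entries = parse_age_lines(SAMPLE)
    for start, end in included_in_version(entries, "2.1"):
        print(f"{start:04X}..{end:04X}")
```

Comparing versions as integer tuples avoids relying on the order of
lines in the file, at the cost of assuming every Age value has the
form "major.minor".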

> There should be files in the extracted directory that show the derived
> values for all of them. There are bound to be mistakes made when
> programmers re-derive them; and there is duplicated work as well. This
> Age property is a case in point. I wonder how many implementations
> there are out there that have it wrong.

Not too many, I would wager -- since most of them would be using
one or the other of the two interpretations, and would have picked
the one they wanted to accomplish what they were after. It is
rather unlikely that there are many applications out there using
an interpretation "all characters included in Version 3.0", but
which are then blindly using Age=3.0 values from DerivedAge.txt,
ignoring all the characters with Age=1.1, 2.0, or 2.1, for example.

--Ken


This archive was generated by hypermail 2.1.5 : Mon Jul 27 2009 - 15:10:31 CDT