Re: Wanted: synonyms for Age

From: karl williamson (public@khwilliamson.com)
Date: Mon Jul 27 2009 - 16:33:02 CDT


    First, thanks for the clarification.

    Kenneth Whistler wrote:
    > Karl Williamson noted:
    >
    >> Apparently that is what Asmus and others think as well,
    >
    > Add me to that list.
    >
    >> and it certainly
    >> is the data that comes in DerivedAge.txt,
    >
    > And in the XML data derived from it, as well -- which is Eric's
    > point, I think.
    >
    >> and if that were truly the
    >> case, I wouldn't have any problem with the term "Age".
    >
    > Well, then you're all set! ;-)
    >
    >> But let me quote
    >> from the header of that file:
    >> # Caution: When using the Age *property*, all assigned code points
    >> # in each version are included, not just the newly assigned code points.
    >> # For more information, see http://www.unicode.org/reports/tr18/
    >>
    >> And, if you look at tr18, it says:
    >>
    >> "
    >> Caution: The DerivedAge data file in the UCD provides the deltas between
    >> versions, for compactness. However, when using the property all
    >> characters included in that version are included. Thus \p{age=3.0}
    >> includes the letter a, which was included in Unicode 1.0. To get
    >> characters that are new in a particular version, subtract off the
    >> previous version as described in 1.3 Subtraction and Intersection. For
    >> example: [\p{age=3.1} -- \p{age=3.0}]
    >> "
    >>
    >> So either you guys are wrong, or the documentation is wrong in at least
    >> two places.
    >
    > The documentation is wrong in two places -- or at least
    > misleading. Note that it doesn't actually say the property
    > is *defined* thus and such, but rather that "when using the
    > property all characters included in that version are included."
    > That amounts to a pocket definition of a new derived property
    > (or actually set of properties) based on the use of the Age property
    > per se.
    >
    > This is one of these cases where an insufficiently carefully
    > documented property is trying to have it both ways.
    >

    Is there some way I could find out what other things might be like this
    that I've overlooked in trying to learn Unicode?

    > Age is an enumerated property in the UCD. Among other things, that
    > means that its values constitute a codespace partition. Each
    > code point has one and only one value of the property. Both
    > the values in DerivedAge.txt and in the XML data files reflect
    > that interpretation.
    >
    > The property defined that way is not, however, as useful as the
    > property described the way it is used for regex matches in UTS #18,
    > because it is far more useful for regex matches to know if a
    > character is included in Unicode Version X (or any *earlier*
    > version), rather than to know if it was encoded exactly in
    > Version X. So the usage of the Age property in UTS #18 just
    > blithely assumes that interpretation, and the caution at the
    > top of DerivedAge.txt reflects that interpretation, even though
    > it is in direct contradiction with the data itself.
    >
    > Note that there are no character properties in the UCD actually
    > defined the way the Caution at the top of DerivedAge.txt currently
    > implies Age is interpreted. If you think this through, for
    > example, interpreted that way, U+0041 would have multiple
    > Age property values: 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, 4.0,
    > 4.1, 5.0, 5.1, and soon, 5.2, because it would match a
    > \p{age=n.n} expression for any of those values. Every character
    > would continue to accumulate new Age values as future versions
    > of the standard are published.
    >
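
    To make the two readings concrete, here is a minimal sketch. The
    table AGE and the helper names are hypothetical, and the three code
    points are only a hand-picked sample, not data read from the UCD.
    Under the enumerated reading a code point matches exactly one
    version; under the UTS #18 reading it matches its own version and
    every later one.

```python
# Hypothetical sample data: each code point maps to exactly one Age value,
# as in the enumerated-property reading. Versions are (major, minor) tuples.
AGE = {0x0041: (1, 1), 0x20AC: (2, 1), 0x10300: (3, 1)}

# The versions listed in the message above (5.2 was still in beta).
VERSIONS = [(1, 1), (2, 0), (2, 1), (3, 0), (3, 1),
            (3, 2), (4, 0), (4, 1), (5, 0), (5, 1)]


def age_equals(cp, version):
    """Enumerated reading: true only for the version that first encoded cp."""
    return AGE.get(cp) == version


def included_in(cp, version):
    """UTS #18 reading: true if cp was encoded in `version` or any earlier one."""
    age = AGE.get(cp)
    return age is not None and age <= version


def matching_versions(cp, predicate):
    """List every known version for which the predicate accepts cp."""
    return [v for v in VERSIONS if predicate(cp, v)]


print(matching_versions(0x0041, age_equals))        # [(1, 1)]
print(len(matching_versions(0x0041, included_in)))  # 10
```

    With the second predicate, U+0041 matches all ten listed versions,
    which is the accumulation of values described above.
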
    >> I have to assume that the documentation is right until
    >> shown otherwise; and if it is correct, I think that proves my point. If
    >> experienced people who work with Unicode all the time don't understand
    >> what this property is, then something is wrong, and at a minimum a new
    >> alias is needed to clarify things.
    >
    > There is definitely need for clarification here.
    >

    Does that mean someone will look into it, or do I need to submit a
    formal request?

    >> I also don't think that in these days of abundant cheap storage that the
    >> Consortium should be worrying about compactness.
    >
    > Compactness is not the primary concern driving maintenance of
    > UCD properties (and files) by the way.
    >
    >> I believe every
    >> property that is exposed in the UCD should have a fully derived version
    >> available, probably in the extracted directory. In 5.2 Beta, the only
    >> properties and property values that the user has to derive (except for
    >> defaults) are Age, gc=LC, gc=C, gc=L, gc=M, gc=N, gc=P, gc=S, and gc=Z.
    >
    > However, none of those are actually property values per se. They
    > are certainly not *extracted* values.
    >

    I guess they would go in the DerivedCoreProperties.txt or some new file,
    should the UTC agree.

    > Each of those is a different kind of derived property value.
    >
    > So gc=L (which I assume you meant, rather than "gc=LC") is actually

    Actually I did mean gc=LC. Here's the entry from PropertyValueAliases.txt:
    gc ; LC ; Cased_Letter # Ll | Lt | Lu

    > not a value of General_Category proper at all, but rather the
    > union of the set of characters with five different values:
    >
    > (gc=Lu) | (gc=Ll) | (gc=Lt) | (gc=Lo) | (gc=Lm)
    >
    > While it is certainly easy to derive such sets from the data, it
    > is also perfectly reasonable to ask for pre-derived listings of
    > such derived values in the UCD. It would be up to the UTC to
    > decide whether the extra work to maintain additional derived
    > values for each release is worth the benefit in such cases. Note
    > that ICU provides a generic Unicode set notation that makes it
    > trivial to construct such sets.
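
    As an illustration of how small the derivation is, here is a sketch
    using Python's standard unicodedata module rather than ICU. The
    GROUPS table and the helper names are hypothetical; only the two
    letter groupings discussed above are shown, and the results reflect
    whichever Unicode version the Python interpreter bundles.

```python
import sys
import unicodedata

# Grouped General_Category values expressed as unions of real gc values.
GROUPS = {
    "L": {"Lu", "Ll", "Lt", "Lm", "Lo"},  # Letter
    "LC": {"Lu", "Ll", "Lt"},             # Cased_Letter
}


def in_group(ch, group):
    """True if the character's General_Category belongs to the named union."""
    return unicodedata.category(ch) in GROUPS[group]


def derive(group):
    """Return every code point in the union, scanning the whole codespace."""
    return [cp for cp in range(sys.maxunicode + 1) if in_group(chr(cp), group)]


if __name__ == "__main__":
    print(in_group("A", "LC"))       # Lu, so a cased letter
    print(in_group("\u02b0", "LC"))  # Lm, so a letter but not a cased one
    print(len(derive("LC")) <= len(derive("L")))
```
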

    And yes, they are trivial to derive, and at this stage probably not
    likely to change, but it's still work that has to be done, and it seems
    to me to be better done once, centrally, than many times. So I will
    submit a proposal for that. My request came about because I'm
    maintaining some code that hadn't kept up with the changing definitions
    of Case_Ignorable over the years; 5.2 Beta has that derived for us,
    which made me wonder why I should have to derive anything at all.
    >
    > Also, regarding "Age", what you are asking in this case would be
    > not *one* derived property, but rather a distinct derived
    > binary property for *each* Unicode version. I.e.:
    >
    > Included_In_Version_1_1 --> (Age=1.1)
    >
    > Included_In_Version_2_0 --> (Age=1.1) | (Age=2.0)
    >
    > Included_In_Version_2_1 --> (Age=1.1) | (Age=2.0) | (Age=2.1)
    >
    > Included_In_Version_3_0 --> (Age=1.1) | (Age=2.0) | (Age=2.1) | (Age=3.0)
    >
    > etc., etc., for each succeeding version.
    >
    > IMO, it isn't actually worth the effort to define and maintain
    > such a list of derived property values (or equivalently, just
    > the sets of characters, without actually *naming* the properties
    > they assume), when the derivations are so trivial based on
    > the existing DerivedAge.txt file. This is especially true for
    > that particular file, because all you have to do is delete
    > all the entries below the Age of concern, and the entries
    > above it define your set in question. No programming necessary. :-)
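
    For anyone who would rather script it than edit the file by hand,
    here is a minimal sketch. SAMPLE is a short illustrative excerpt
    written in the DerivedAge.txt line format, not the real file, and
    the function names are hypothetical. It builds the "included in
    version X" set by keeping every entry whose Age is at most X, and
    the UTS #18 subtraction then gives the characters new in a version.

```python
import re

# Illustrative excerpt in the DerivedAge.txt line format (not the real file).
SAMPLE = """\
# Age=V1_1
0041..005A    ; 1.1
# Age=V2_0
0F00..0F47    ; 2.0
# Age=V2_1
20AC          ; 2.1
# Age=V3_0
01F6..01F9    ; 3.0
# Age=V3_1
10300..1031E  ; 3.1
"""

# "XXXX ; n.n" or "XXXX..YYYY ; n.n"; comment and blank lines do not match.
LINE = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\d+)\.(\d+)")


def parse(text):
    """Yield (first, last, (major, minor)) for each data line."""
    for line in text.splitlines():
        m = LINE.match(line)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2), 16) if m.group(2) else first
            yield first, last, (int(m.group(3)), int(m.group(4)))


def included_in_version(text, version):
    """Set of code points whose Age is `version` or any earlier version."""
    result = set()
    for first, last, age in parse(text):
        if age <= version:
            result.update(range(first, last + 1))
    return result


# Equivalent of [\p{age=3.1} -- \p{age=3.0}] under the UTS #18 reading.
new_in_3_1 = included_in_version(SAMPLE, (3, 1)) - included_in_version(SAMPLE, (3, 0))
print(len(new_in_3_1))
```

    Comparing versions as integer tuples avoids the string-comparison
    trap where "10.0" would sort before "2.0".
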

    I suppose that a "pseudo-property" without a 1-1 correspondence, as
    implied in these two places in the documentation, is out of the question?
    >
    >> There should be files in the extracted directory that show the derived
    >> values for all of them. There are bound to be mistakes made when
    >> programmers re-derive them; and there is duplicated work as well. This
    >> Age property is a case in point. I wonder how many implementations
    >> there are out there that have it wrong.
    >
    > Not too many, I would wager -- since most of them would be using
    > one or the other of the two interpretations, and would have picked
    > the one they wanted to accomplish what they were after. It is
    > rather unlikely that there are many applications out there using
    > an interpretation "all characters included in Version 3.0", but
    > which are then blindly using Age=3.0 values from DerivedAge.txt,
    > ignoring all the characters with Age=1.1, 2.0, or 2.1, for example.
    >
    > --Ken
    >
    >


