Re: Wanted: synonyms for Age

From: karl williamson (public@khwilliamson.com)
Date: Thu Aug 06 2009 - 16:29:14 CDT


    Since my last post (reproduced below) seems unlikely to be responded to,
    I thought I should add some things I've been thinking about, to
    make sure I understand. Feel free to correct me.

    Each Unicode property is defined on a subset of the Unicode code points.
    Many are defined on the complete set, but some are not. Name is one
    example: surrogates and private use code points have no name.
    It's unclear to me what value, if any, the Script property gave, in
    releases before the Unknown value was added, to code points that had
    none of the other Script property values (and similarly for a number
    of other catalog properties).

    A property is a mapping from single code points to values. (Named
    sequences and Standardized Variants are anomalous, and I don't know
    about the Unihan ones.) Each code point that the property is defined for
    has a single value.

    This means that properties are true functions in the strict mathematical
    sense, because each code point maps to exactly one value.

    However, when using a property as part of a regular expression pattern,
    what is desired is essentially the reverse mapping, or 'inverse
    relation' in mathematical terminology. For example (using Perl syntax),
    'A' =~ /\p{age=3.2}/ has us start with the property value 3.2, not the
    letter 'A', and then see if age('A') is 3.2. This inverse mapping is
    not necessarily a function; just a relation. For example, the property
    value '3.2' can map to many code points, not just 'A'.
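
    To make the function-versus-relation distinction concrete, here is a
    minimal Python sketch. The table AGE is a hand-picked toy excerpt
    standing in for the real data, and the names age and inverse are my
    own illustrations, not anything defined by the UCD.

```python
from collections import defaultdict

# Toy excerpt of the Age property (illustrative only, not the full UCD).
AGE = {
    0x0041: "1.1",  # LATIN CAPITAL LETTER A
    0x0042: "1.1",  # LATIN CAPITAL LETTER B
    0x20AC: "2.1",  # EURO SIGN
    0x0220: "3.2",  # LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
}

def age(cp):
    """The property itself: a function, one value per code point."""
    return AGE[cp]

def inverse(prop):
    """The inverse relation: each value maps to a *set* of code points."""
    rel = defaultdict(set)
    for cp, value in prop.items():
        rel[value].add(cp)
    return dict(rel)

AGE_INVERSE = inverse(AGE)
```

    Here age(0x41) yields a single value, while AGE_INVERSE["1.1"] yields
    a set with more than one member, which is why the inverse is only a
    relation.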

    This distinction between the property mapping and the inverse mapping
    was lost on me until this issue came up.

    TR18 appears to be requiring that regular expressions not use the true
    inverse relation of Age, but a different one, one which makes more sense
    for real-world applications. If one were to accurately name that
    inverse relation, it wouldn't be 'Age', but something more like
    'Designated_As_Of'.
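
    The difference between the true inverse of Age and the relation TR18
    seems to want can be sketched like this. It assumes the versions
    released through 5.1 and a toy table of a few code points; exact_age
    and designated_as_of are hypothetical names of mine, not UCD or Perl
    identifiers.

```python
# Unicode versions in release order, through 5.1.
VERSIONS = ["1.1", "2.0", "2.1", "3.0", "3.1", "3.2", "4.0", "4.1", "5.0", "5.1"]

# Toy excerpt of the Age property (illustrative only, not the full UCD).
AGE = {
    0x0041: "1.1",  # LATIN CAPITAL LETTER A
    0x20AC: "2.1",  # EURO SIGN
    0x0220: "3.2",  # LATIN CAPITAL LETTER N WITH LONG RIGHT LEG
}

def exact_age(version):
    """True inverse of Age: code points first assigned in exactly this version."""
    return {cp for cp, v in AGE.items() if v == version}

def designated_as_of(version):
    """Cumulative reading: code points assigned in this version or any earlier one."""
    cutoff = VERSIONS.index(version)
    return {cp for cp, v in AGE.items() if VERSIONS.index(v) <= cutoff}

def new_in(version):
    """Recover the delta by subtracting the previous version's cumulative set."""
    i = VERSIONS.index(version)
    previous = designated_as_of(VERSIONS[i - 1]) if i > 0 else set()
    return designated_as_of(version) - previous
```

    Under the cumulative reading 'A' is in the set for 3.2, while under
    the exact reading it is not; subtracting the previous version's set
    gets back to the exact one.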

    karl williamson wrote:
    > First, thanks for the clarification.
    >
    > Kenneth Whistler wrote:
    >> Karl Williamson noted:
    >>
    >>> Apparently that is what Asmus and others think as well,
    >>
    >> Add me to that list.
    >>
    >>> and it certainly is the data that comes in DerivedAge.txt,
    >>
    >> And in the XML data derived from it, as well -- which is Eric's
    >> point, I think.
    >>
    >>> and if that were truly the case, I wouldn't have any problem with the
    >>> term "Age".
    >>
    >> Well, then you're all set! ;-)
    >>
    >>> But let me quote from the header of that file:
    >>> # Caution: When using the Age *property*, all assigned code points
    >>> # in each version are included, not just the newly assigned code points.
    >>> # For more information, see http://www.unicode.org/reports/tr18/
    >>>
    >>> And, if you look at tr18, it says:
    >>>
    >>> "
    >>> Caution: The DerivedAge data file in the UCD provides the deltas
    >>> between versions, for compactness. However, when using the property
    >>> all characters included in that version are included. Thus
    >>> \p{age=3.0} includes the letter a, which was included in Unicode 1.0.
    >>> To get characters that are new in a particular version, subtract off
    >>> the previous version as described in 1.3 Subtraction and
    >>> Intersection. For example: [\p{age=3.1} -- \p{age=3.0}]
    >>> "
    >>>
    >>> So either you guys are wrong, or the documentation is wrong in at
    >>> least two places.
    >>
    >> The documentation is wrong in two places -- or at least
    >> misleading. Note that it doesn't actually say the property
    >> is *defined* thus and such, but rather that "when using the
    >> property all characters included in that version are included."
    >> That amounts to a pocket definition of a new derived property
    >> (or actually set of properties) based on the use of the Age property
    >> per se.
    >>
    >> This is one of these cases where an insufficiently carefully
    >> documented property is trying to have it both ways.
    >>
    >
    > Is there some way I could find out what other things might be like this
    > that I've overlooked in trying to learn Unicode?
    >
    >> Age is an enumerated property in the UCD. Among other things, that
    >> means that its values constitute a codespace partition. Each
    >> code point has one and only one value of the property. Both
    >> the values in DerivedAge.txt and in the XML data files reflect
    >> that interpretation.
    >>
    >> The property defined that way is not, however, as useful as the
    >> property described the way it is used for regex matches in UTS #18,
    >> because it is far more useful for regex matches to know if a
    >> character is included in Unicode Version X (or any *earlier*
    >> version), rather than to know if it was encoded exactly in
    >> Version X. So the usage of the Age property in UTS #18 just
    >> blithely assumes that interpretation, and the caution at the
    >> top of DerivedAge.txt reflects that interpretation, even though
    >> it is in direct contradiction with the data itself.
    >>
    >> Note that there are no character properties in the UCD actually
    >> defined the way the Caution at the top of DerivedAge.txt currently
    >> implies Age is interpreted. If you think this through, for
    >> example, interpreted that way, U+0041 would have multiple
    >> Age property values: 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, 4.0,
    >> 4.1, 5.0, 5.1, and soon, 5.2, because it would match a
    >> \p{age=n.n} expression for any of those values. Every character
    >> would continue to accumulate new Age values as future versions
    >> of the standard are published.
    >>
    >>> I have to assume that the documentation is right until shown
    >>> otherwise; and if it is correct, I think that proves my point. If
    >>> experienced people who work with Unicode all the time don't
    >>> understand what this property is, then something is wrong, and at a
    >>> minimum a new alias is needed to clarify things.
    >>
    >> There is definitely need for clarification here.
    >>
    >
    > Does that mean someone will look into it, or do I need to submit a
    > formal request?
    >
    >>> I also don't think that in these days of abundant cheap storage that
    >>> the Consortium should be worrying about compactness.
    >>
    >> Compactness is not the primary concern driving maintenance of
    >> UCD properties (and files) by the way.
    >>
    >>> I believe every property that is exposed in the UCD should have a
    >>> fully derived version available, probably in the extracted
    >>> directory. In 5.2 Beta, the only properties and property values that
    >>> the user has to derive (except for defaults) are Age, gc=LC, gc=C,
    >>> gc=L, gc=M, gc=N, gc=P, gc=S, and gc=Z.
    >>
    >> However, none of those are actually property values per se. They
    >> are certainly not *extracted* values.
    >>
    >
    > I guess they would go in the DerivedCoreProperties.txt or some new file,
    > should the UTC agree.
    >
    >> Each of those is a different kind of derived property value.
    >>
    >> So gc=L (which I assume you meant, rather than "gc=LC") is actually
    >
    > Actually I did mean gc=LC. Here's the entry from PropertyValueAliases.txt:
    > gc ; LC ; Cased_Letter # Ll | Lt | Lu
    >
    >> not a value of General_Category proper at all, but rather the
    >> union of the set of characters with five different values:
    >>
    >> (gc=Lu) | (gc=Ll) | (gc=Lt) | (gc=Lo) | (gc=Lm)
    >> While it is certainly easy to derive such sets from the data, it
    >> is also perfectly reasonable to ask for pre-derived listings of
    >> such derived values in the UCD. It would be up to the UTC to
    >> decide whether the extra work to maintain additional derived
    >> values for each release is worth the benefit in such cases. Note
    >> that ICU provides a generic Unicode set notation that makes it
    >> trivial to construct such sets.
    >
    > And yes, they are trivial to derive, and at this stage probably not
    > likely to change, but it's still work that has to be done, and it seems
    > to me to be better done once, centrally, than many times. So I will
    > submit a proposal for that. My request came about because I'm
    > maintaining some code that hadn't kept up with the changing definitions
    > of Case_Ignorable over the years; 5.2 Beta has that derived for us, and
    > it occurred to me to ask why I should have to derive anything.
    >>
    >> Also, regarding "Age", what you are asking in this case would be
    >> not *one* derived property, but rather a distinct derived
    >> binary property for *each* Unicode version. I.e.:
    >>
    >> Included_In_Version_1_1 --> (Age=1.1)
    >>
    >> Included_In_Version_2_0 --> (Age=1.1) | (Age=2.0)
    >>
    >> Included_In_Version_2_1 --> (Age=1.1) | (Age=2.0) | (Age=2.1)
    >>
    >> Included_In_Version_3_0 --> (Age=1.1) | (Age=2.0) | (Age=2.1) |
    >> (Age=3.0)
    >>
    >> etc., etc., for each succeeding version.
    >>
    >> IMO, it isn't actually worth the effort to define and maintain
    >> such a list of derived property values (or equivalently, just
    >> the sets of characters, without actually *naming* the properties
    >> they assume), when the derivations are so trivial based on
    >> the existing DerivedAge.txt file. This is especially true for
    >> that particular file, because all you have to do is delete
    >> all the entries below the Age of concern, and the entries
    >> above it define your set in question. No programming necessary. :-)
    >
    > I suppose that a "pseudo-property" without a 1-1 correspondence, as
    > implied in these two places in the documentation, is out of the question?
    >>
    >>> There should be files in the extracted directory that show the
    >>> derived values for all of them. There are bound to be mistakes made
    >>> when programmers re-derive them; and there is duplicated work as
    >>> well. This Age property is a case in point. I wonder how many
    >>> implementations there are out there that have it wrong.
    >>
    >> Not too many, I would wager -- since most of them would be using
    >> one or the other of the two interpretations, and would have picked
    >> the one they wanted to accomplish what they were after. It is
    >> rather unlikely that there are many applications out there using
    >> an interpretation "all characters included in Version 3.0", but
    >> which are then blindly using Age=3.0 values from DerivedAge.txt,
    >> ignoring all the characters with Age=1.1, 2.0, or 2.1, for example.
    >>
    >> --Ken
    >>
    >>
    >
    >
    >
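
    On the gc=L versus gc=LC point in the quoted exchange above, here is a
    minimal Python sketch of deriving both groupings from the two-letter
    General_Category values. It relies on the standard unicodedata module,
    whose data is whatever Unicode version the Python build ships with;
    GC_L, GC_LC and in_group are names I made up for illustration.

```python
import unicodedata

# Groupings as unions of real General_Category values.
GC_L = {"Lu", "Ll", "Lt", "Lo", "Lm"}   # gc=L, Letter
GC_LC = {"Lu", "Ll", "Lt"}              # gc=LC, Cased_Letter

def in_group(ch, group):
    """True if the character's General_Category is one of the group's values."""
    return unicodedata.category(ch) in group
```

    For instance, 'A' (Lu) falls in both groups, while U+05D0 HEBREW
    LETTER ALEF (Lo) is in gc=L but not in gc=LC.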

    -- 
    "He who cannot change the very fabric of his thought will never be able
    to change reality, and will never, therefore, make any progress" --
    Anwar Sadat
    

