RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jun 23 2003 - 18:58:57 EDT

    Actually, there are a number of loose ends still, as it appears
    that some of Rob Mount's questions were not actually answered.

    > I understand what you say about word formation, and
    > combining marks, and that the Alphabetic
    > classification should not be limited to "L"s. But
    > 30FC is of General Category "Lm" (which should be
    > included) and, since version 3.1, is classified explicitly
    > as Alphabetic in DerivedCoreProperties.txt.
    > (It appears that formal expression of the Alphabetic
    > property was moved from PropList.txt
    > to DerivedCoreProperties.txt in 3.1.) I don't understand
    > why its exclusion from the Alphabetic
    > category in 3.0.1 was not an oversight. But if not,
    > then either the consortium consensus on
    > the classification of this character has changed, or
    > the current classification is in error.

    Here's some more background for people. I realize that all
    the version information is getting bewilderingly complex, so
    not everyone is going to want to research back through all the
    versions, particularly when that would mean also trying to
    dig back through the UTC decision trail.

    From Unicode Version 2.0 to Unicode Version 3.0.1 I maintained
    the PropList.txt file. During that time, it was explicitly
    an *informative* file, and was included in the UNIDATA
    directory on that basis, as potentially helpful information only.

    The change to Unicode Version 3.1.0 was a major watershed.
    Mark Davis started maintaining the PropList.txt file (and
    a number of other files) with a different set of tools
    that specified a large number of properties as derived,
    via rule, from other properties -- hence the introduction
    of the DerivedXXX files. At this point, the UTC reexamined
    all of the character properties and changed the status
    of some of them. Some of the former properties from PropList.txt
    were made normative (and their content adjusted slightly),
    some were left informative, some were equated to derived
    properties (hence moved to other files), and some were determined
    to be uninteresting, and thus were dropped altogether.
    The format of PropList.txt also changed completely at this
    point.

    Now as regards the particular handling of U+30FC, the
    treatment in PropList.txt from Unicode 2.0 to Unicode 3.0.1
    was consistent:

    General Category = Lm
    PropList specification: [-Alphabetic] [+Diacritic] [+Extender]
                            [+Identifier_Part]
                            
    The theory behind that was that while U+30FC was Lm, like
    many other diacritic letter modifiers it wasn't formally
    part of an alphabetic or syllabic set of symbols per se,
    so wouldn't be given the Alphabetic property. However,
    other implicit derivations for word boundaries or identifier
    boundaries should include the [+Extender] characters to
    get the expected results. Hence the determination, for
    example, that U+30FC was [+Identifier_Part].

    Starting with Unicode 3.1.0 and continuing through to Unicode
    4.0.0, the treatment is still consistent, although slightly
    different:

    General Category = Lm
    PropList specification: [-Other_Alphabetic] [+Diacritic] [+Extender]
    DerivedCoreProperties: [+Alphabetic] [+ID_Continue]

    The General Category, the status as diacritic and extender,
    and the derived status as part of identifiers are unchanged.
    What has changed, however, is the interpretation of what
    "Alphabetic", as a derived property now, means. As Mark pointed
    out, it is now derived as:

    # Derived Property: Alphabetic
    # Generated from: Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic

    By *this* definition, all the Lm characters from Unicode 2.0
    on would *also* have been "Alphabetic". And Other_Alphabetic
    was consistently developed by subtracting out all the
    Lu, Ll, Lt, Lm, Lo, and Nl characters from the preexisting
    Alphabetic definition from PropList.txt.
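
    The derivation rule quoted above can be sketched in a few lines of
    Python. This is only an illustration, not the tooling used to
    generate DerivedCoreProperties.txt. The names ALPHABETIC_CATEGORIES,
    OTHER_ALPHABETIC, and is_alphabetic are hypothetical, and since the
    Python standard library does not expose Other_Alphabetic, that set
    is left empty here as a stand-in:

```python
import unicodedata

# General Categories that feed the derived Alphabetic property,
# per the "Generated from" line quoted above.
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Hypothetical stand-in for Other_Alphabetic from PropList.txt.
# The standard library does not expose it, so it is empty here;
# a real implementation would load it from the data file.
OTHER_ALPHABETIC = frozenset()


def is_alphabetic(ch, other_alphabetic=OTHER_ALPHABETIC):
    """Approximate the derived Alphabetic property for one character."""
    return (unicodedata.category(ch) in ALPHABETIC_CATEGORIES
            or ch in other_alphabetic)


prolonged_sound_mark = "\u30fc"
print(unicodedata.category(prolonged_sound_mark))  # Lm
print(is_alphabetic(prolonged_sound_mark))         # True
```

    Under this rule U+30FC qualifies through its General Category
    alone, so it needs no entry in Other_Alphabetic.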

    So the correct answer is not that the consensus about the
    behavior and properties of U+30FC has changed, but rather
    that the inclusiveness of the "Alphabetic" property changed
    a little when it was redefined to be a derived property.

    Note that for the property more relevant to determination of
    things like identifiers (now known as ID_Continue), there has been
    *no* change to the behavior of U+30FC since Unicode 2.0.
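
    The stability of the General Category can be spot-checked, at
    least in part. As a rough sketch: Python's unicodedata module
    bundles a frozen Unicode 3.2.0 database (ucd_3_2_0) next to its
    current one. That is not one of the versions discussed in this
    thread, and the module exposes only properties such as General
    Category, not Alphabetic or ID_Continue, so this confirms only one
    part of the picture:

```python
import unicodedata

ch = "\u30fc"  # KATAKANA-HIRAGANA PROLONGED SOUND MARK

# Frozen Unicode 3.2.0 database shipped with Python.
old = unicodedata.ucd_3_2_0

print(old.unidata_version)       # 3.2.0
print(old.category(ch))          # Lm
print(unicodedata.category(ch))  # Lm

# The General Category agrees between the two database versions.
same_category = old.category(ch) == unicodedata.category(ch)
print(same_category)             # True
```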

    > Here's a little more background regarding my motivation.
    > The problem occurs in a procedure
    > that evaluates whether a user-supplied name can be used
    > as an identifier - for which identification
    > of alphabetic characters is important.

    Actually, as you can see from the above discussion, and
    from the discussion of identifiers you mentioned in the
    standard, it is ID_Start and ID_Continue which are more
    relevant than "Alphabetic" per se.
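
    To illustrate the difference, here is a minimal sketch contrasting
    an isalpha()-style check with an identifier-style check. It leans
    on Python's own built-ins, which is an assumption about your
    environment: str.isalpha() tests for General Category L* rather
    than the Unicode Alphabetic property, and str.isidentifier()
    follows Python's identifier rules (built on XID_Start and
    XID_Continue), which may not match your product's rules. The
    function names are hypothetical:

```python
def valid_by_isalpha(name):
    """Naive check: every character must count as a letter."""
    return bool(name) and all(ch.isalpha() for ch in name)


def valid_by_identifier_rules(name):
    """Check using identifier start/continue classes instead."""
    return name.isidentifier()


katakana = "\u30c7\u30fc\u30bf"  # DE, PROLONGED SOUND MARK, TA
with_mark = "e\u0301tat"         # 'e' + COMBINING ACUTE ACCENT + "tat"

for name in (katakana, with_mark, "1abc"):
    print(ascii(name),
          valid_by_isalpha(name),
          valid_by_identifier_rules(name))
```

    The second name shows where the two approaches diverge: the
    combining mark is not a letter, yet it is acceptable in the
    continuation position of an identifier.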

    > One implementation of isalpha(), purportedly based on
    > Unicode 2.1, indicates that 30FC is an alpha character.
    > The current implementation from the
    > same vendor, based on 3.0.1, classifies it as non-alpha.
    > Presumably the next one will be based
    > on 3.1 or later and will reclassify it, again, as alpha.

    The vendor has done something based on its own interpretation
    of the informative data files, then. The status of U+30FC did
    not change between Unicode 2.1 and Unicode 3.0.1 in the
    informative PropList.txt data file -- so whatever the vendor
    did was on its own initiative.

    > If we can't depend on uniform behavior
    > of isalpha() we will have to eliminate its use from our
    > validation function.

    I'd advise you to check the reference that Mark supplied
    regarding the use of POSIX functions in the context of
    Unicode character properties in the Proposed Update to
    UTS #18:

    http://www.unicode.org/reports/tr18/tr18-7.html

    See, particularly, Annex C: Compatibility Properties.

    There has been a lot of confusion about what isalpha() could
    mean in the context of a Universal Character Set, and POSIX
    provided little guidance for how to make the extensions.
    Note that Java and Perl handle this differently from
    people who follow the recommendations of ISO/IEC TR 10176 (which
    excludes combining marks based on its own theory of what should
    be included in identifiers).

    >
    > So I am trying to discover why the behavior of isalpha()
    > has changed. Here are the possibilities:
    > 1) the previous implementation was incorrect and the current
    > one is fixed;
    > 2) the current implementation is flawed because it does not
    > conform to the documented standard;
    > 3) the current implementation is flawed because it's based on
    > incorrect documentation of the standard;
    > 4) both implementations are correct but are
    > based on different, incompatible standards;
    > 5) something else I don't yet understand.

    5. None of the above.
       5a. The property was informative in the first place, so a
           claim of conformance prior to the mechanisms put in place
           in Unicode 3.1.0 was a little out-of-place, anyway.
       5b. The Alphabetic property for U+30FC did not change between
           Unicode 2.1 and Unicode 3.0.1, so your vendor's change must
           rest on some extraneous factor, not on any change in
           PropList.txt or in its documentation.
       5c. What changed beginning with Unicode 3.1.0 was the scope
           of the Alphabetic property itself (based on its switch to
           being a derived property), rather than any implication for
           how the particular character U+30FC should behave in
           implementations.
           
    >
    > The overriding assumption for this entire discussion is that
    > the behavior of isalpha() should
    > be governed by the Unicode Alphabetic property. That seems
    > reasonable to me and is, in fact, the vendor's claim.

    This is, in fact, what the UTC is now formally recommending, in
    the Proposed Update for UTS #18. It is not, however, what every
    vendor does for an isalpha() implementation in detail.

    > If not, (or even if so) perhaps someone can recommend a better
    > (or more stable) API for discovery of Unicode character metrics
    > upon which we might base
    > our identifier validation and other character processing logic.

    Unless you are specifically depending on Windows platform
    APIs to make such determinations, I would suggest the ICU
    implementation of character properties as likely to be
    the most accurate and up-to-date in a generally available
    cross-platform library.

    --Ken


