RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jun 23 2003 - 18:58:57 EDT

Maybe in reply to: Marco Cimarosti: "RE: Classification of U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK"

Actually, there are a number of loose ends still, as it appears
that some of Rob Mount's questions were not actually answered.

> I understand what you say about word formation, and
> combining marks, and that the Alphabetic
> classification should not be limited to "L"s. But
> 30FC is of General Category "Lm" (which should be
> included) and, since version 3.1, is classified explicitly
> as Alphabetic in DerivedCoreProperties.txt.
> (It appears that formal expression of the Alphabetic
> property was moved from PropList.txt
> to DerivedCoreProperties.txt in 3.1.) I don't understand
> why its exclusion from the Alphabetic
> category in 3.0.1 was not an oversight. But if not,
> then either the consortium consensus on
> the classification of this character has changed, or
> the current classification is in error.

Here's some more background for people. I realize that all
the version information is getting bewilderingly complex, so
not everyone is going to want to research back through all the
versions, particularly when that would mean also trying to
dig back through the UTC decision trail.

From Unicode Version 2.0 to Unicode Version 3.0.1 I maintained
the PropList.txt file. During that time, it was explicitly
an *informative* file only, and was included in the UNIDATA
directory on that basis, as potentially helpful information.

The change to Unicode Version 3.1.0 was a major watershed.
Mark Davis started maintaining the PropList.txt file (and
a number of other files) with a different set of tools
that specified a large number of properties as derived,
via rule, from other properties -- hence the introduction
of the DerivedXXX files. At this point, the UTC reexamined
all of the character properties and changed the status
of some of them. Some of the former properties from PropList.txt
were made normative (and their content adjusted slightly),
some were left informative, some were equated to derived
properties (hence moved to other files), and some were determined
to be uninteresting, and thus were dropped altogether.
The format of PropList.txt also changed completely at this
point.

Now as regards the particular handling of U+30FC, the
treatment in PropList.txt from Unicode 2.0 to Unicode 3.0.1
was consistent:

General Category = Lm
PropList specification: [-Alphabetic] [+Diacritic] [+Extender]
[+Identifier_Part]

The theory behind that was that while U+30FC was Lm, like
many other diacritic letter modifiers it wasn't formally
part of an alphabetic or syllabic set of symbols per se,
so wouldn't be given the Alphabetic property. However,
other implicit derivations for word boundaries or identifier
boundaries should include the [+Extender] characters to
get the expected results. Hence the determination, for
example, that U+30FC was [+Identifier_Part].

Starting with Unicode 3.1.0 and continuing through to Unicode
4.0.0, the treatment is still consistent, although slightly
different:

General Category = Lm
PropList specification: [-Other_Alphabetic] [+Diacritic] [+Extender]
DerivedCoreProperties: [+Alphabetic] [+ID_Continue]

The General Category, the status as diacritic and extender,
and the derived status as part of identifiers are unchanged.
What has changed, however, is the interpretation of what
"Alphabetic", as a derived property now, means. As Mark pointed
out, it is now derived as:

# Derived Property: Alphabetic
# Generated from: Lu+Ll+Lt+Lm+Lo+Nl + Other_Alphabetic

By *this* definition, all the Lm characters from Unicode 2.0
on would *also* have been "Alphabetic". And Other_Alphabetic
was consistently developed by subtracting out all the
Lu, Ll, Lt, Lm, Lo, and Nl characters from the preexisting
Alphabetic definition from PropList.txt.
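
The derivation rule quoted above can be sketched in a few lines of
Python. This is only an illustration, not the tooling used to generate
DerivedCoreProperties.txt: it relies on the standard unicodedata module
for the General Category, and OTHER_ALPHABETIC_SAMPLE is a hypothetical
one-entry stand-in for the real Other_Alphabetic list in PropList.txt.
The answers also depend on the Unicode version bundled with the Python
interpreter in use.

```python
import unicodedata

# General Categories that feed the derived Alphabetic property,
# per the "Generated from" comment quoted above.
ALPHABETIC_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

# Hypothetical, deliberately tiny stand-in for Other_Alphabetic.
# The real list lives in PropList.txt and is much longer.
OTHER_ALPHABETIC_SAMPLE = {0x0345}  # COMBINING GREEK YPOGEGRAMMENI (Mn)


def is_alphabetic(ch: str) -> bool:
    """Approximate the derived Alphabetic property for one character."""
    return (unicodedata.category(ch) in ALPHABETIC_CATEGORIES
            or ord(ch) in OTHER_ALPHABETIC_SAMPLE)


prolonged_sound_mark = "\u30FC"
print(unicodedata.category(prolonged_sound_mark))  # Lm
print(is_alphabetic(prolonged_sound_mark))         # True
```

Because U+30FC is Lm, it falls inside the derived set by category
alone, without needing an Other_Alphabetic entry.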

So the correct answer is not that the consensus about the
behavior and properties of U+30FC has changed, but rather
that the inclusiveness of the "Alphabetic" property changed
a little when it was redefined to be a derived property.

Note that for the property more relevant to determination of
things like identifiers (now known as ID_Continue), there has been
*no* change to the behavior of U+30FC since Unicode 2.0.

> Here's a little more background regarding my motivation.
> The problem occurs in a procedure
> that evaluates whether a user-supplied name can be used
> as an identifier - for which identification
> of alphabetic characters is important.

Actually, as you can see from the above discussion, and
from the discussion of identifiers you mentioned in the
standard, it is ID_Start and ID_Continue which are more
relevant than "Alphabetic" per se.
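
As a rough illustration of validating a name against identifier
properties rather than "Alphabetic", here is a minimal Python sketch.
It is an assumption-laden approximation: it uses General Categories
only, and ignores the Other_ID_Start and Other_ID_Continue additions
and any exclusions that the real ID_Start and ID_Continue derivations
apply. The function name is_valid_identifier is made up for this
example. Python's built-in str.isidentifier() is shown for comparison;
it follows the related XID_Start/XID_Continue properties and also
accepts a leading underscore, so it is not an exact equivalent either.

```python
import unicodedata

# Approximation: categories allowed for the first character.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
# Approximation: additional categories allowed after the first character.
CONTINUE_CATEGORIES = START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def is_valid_identifier(name: str) -> bool:
    """Category-based approximation of an ID_Start ID_Continue* check."""
    if not name:
        return False
    if unicodedata.category(name[0]) not in START_CATEGORIES:
        return False
    return all(unicodedata.category(c) in CONTINUE_CATEGORIES
               for c in name[1:])


ka_with_long_mark = "\u30AB\u30FC"  # KATAKANA LETTER KA + U+30FC
print(is_valid_identifier(ka_with_long_mark))  # True
print(is_valid_identifier("1abc"))             # False: digit cannot start
print(ka_with_long_mark.isidentifier())        # True
```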

> One implementation of isalpha(), purportedly based on
> Unicode 2.1, indicates that 30FC is an alpha character.
> The current implementation from the
> same vendor, based on 3.0.1, classifies it as non-alpha.
> Presumably the next one will be based
> on 3.1 or later and will reclassify it, again, as alpha.

The vendor has done something based on its own interpretation
of the informative data files, then. The status of U+30FC did
not change between Unicode 2.1 and Unicode 3.0.1 in the
informative PropList.txt data file -- so whatever they did
was on their own hook.

> If we can't depend on uniform behavior
> of isalpha() we will have to eliminate its use from our
> validation function.

I'd advise you to check the reference that Mark supplied
regarding the use of POSIX functions in the context of
Unicode character properties in the Proposed Update to
UTS #18:

http://www.unicode.org/reports/tr18/tr18-7.html

See, particularly, Annex C: Compatibility Properties.

There has been a lot of confusion about what isalpha() could
mean in the context of a Universal Character Set, and POSIX
provided little guidance for how to make the extensions.
Note that Java and Perl are handling this differently than
people who follow the recommendations of ISO TR 10176 (which
excludes combining marks based on its own theory of what should
be included in identifiers).
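
To make the divergence concrete, here is a small Python example of one
more interpretation. Python's str.isalpha() is documented as testing
for the letter General Categories (Lm, Lt, Lu, Ll, Lo), which is not
the same set as the derived Alphabetic property: Nl characters and
Other_Alphabetic combining marks are left out. This is offered only as
an example of how one runtime answers the question, not as a
recommendation, and the results depend on the Unicode version the
interpreter ships with.

```python
import unicodedata

samples = {
    "\u30FC": "KATAKANA-HIRAGANA PROLONGED SOUND MARK",
    "\u2160": "ROMAN NUMERAL ONE",
    "\u0301": "COMBINING ACUTE ACCENT",
}

for ch, name in samples.items():
    # Show the code point, its General Category, and Python's isalpha() answer.
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} "
          f"isalpha={ch.isalpha()} {name}")
```

U+30FC (Lm) is reported as alpha, while U+2160 (Nl) is not, even though
Nl is part of the derived Alphabetic definition.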

>
> So I am trying to discover why the behavior of isalpha()
> has changed. Here are the possibilities:
> 1) the previous implementation was incorrect and the current
> one is fixed;
> 2) the current implementation is flawed because it does not
> conform to the documented standard;
> 3) the current implementation is flawed because it's based on
> incorrect documentation of the standard;
> 4) both implementations are correct but are
> based on different, incompatible standards;
> 5) something else I don't yet understand.

5. None of the above.
   5a. The property was informative in the first place, so a
       claim of conformance prior to the mechanisms put in place
       in Unicode 3.1.0 was a little out-of-place, anyway.
   5b. The Alphabetic property for U+30FC did not change between
       Unicode 2.1 and Unicode 3.0.1, so your vendor's change must
       stem from some extraneous factor, not from a change in
       PropList.txt or in its documentation.
   5c. What changed beginning with Unicode 3.1.0 was the scope
       of the Alphabetic property itself (based on its switch to
       being a derived property), rather than any implication for
       how the particular character U+30FC should behave in
       implementations.

>
> The overriding assumption for this entire discussion is that
> the behavior of isalpha() should
> be governed by the Unicode Alphabetic property. That seems
> reasonable to me and is, in fact, the vendor's claim.

This is, in fact, what the UTC is now formally recommending, in
the Proposed Update for UTS #18. It is not, however, what every
vendor does for an isalpha() implementation in detail.

> If not, (or even if so) perhaps someone can recommend a better
> (or more stable) API for discovery of Unicode character metrics
> upon which we might base
> our identifier validation and other character processing logic.

Unless you are specifically depending on Windows platform
APIs to make such determinations, I would suggest the ICU
implementation of character properties as likely to be
the most accurate and up-to-date in a generally available
cross-platform library.

--Ken

