Re: UCD stability

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Mar 11 2005 - 20:23:51 CST

    Erik,

    If you are going to do things like pass these raw calculations
    along to the IDN list, ostensibly as some measure of stability
    of the UCD data, then you should take into consideration another
    metric.

    The raw number of characters changing is less reflective of
    stability than the number of *decisions* that were taken to
    change a property (of one or more characters).

    I have interspersed some notes among Andrew West's calculated
    numbers below, to help put this in context.
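
    To make the two metrics concrete, here is a minimal sketch of how
    one might diff General Category values between two versions of
    UnicodeData.txt. It relies only on the file's semicolon-separated
    layout (code point in field 0, General Category in field 2). The
    function names and the three-field toy rows are hypothetical, and
    grouping changes by (old, new) category pair is only a crude
    stand-in for counting decisions; the real count has to come from
    the record of what was actually decided.

```python
from collections import defaultdict


def parse_gc(unicode_data_text):
    """Map code point -> General Category from UnicodeData.txt-style text."""
    gc = {}
    for line in unicode_data_text.splitlines():
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        # Field 0 is the hex code point, field 2 the General Category.
        gc[int(fields[0], 16)] = fields[2]
    return gc


def gc_changes(old_text, new_text):
    """Group code points whose category changed by (old_gc, new_gc)."""
    old, new = parse_gc(old_text), parse_gc(new_text)
    changes = defaultdict(list)
    # Only characters present in both versions can "change".
    for cp in sorted(old.keys() & new.keys()):
        if old[cp] != new[cp]:
            changes[(old[cp], new[cp])].append(cp)
    return dict(changes)


# Toy rows, truncated to three fields for illustration only;
# these are NOT complete lines from any released data file.
OLD = """\
0041;LATIN CAPITAL LETTER A;Lu
30FB;KATAKANA MIDDLE DOT;Pc
FF65;HALFWIDTH KATAKANA MIDDLE DOT;Pc
"""
NEW = """\
0041;LATIN CAPITAL LETTER A;Lu
30FB;KATAKANA MIDDLE DOT;Po
FF65;HALFWIDTH KATAKANA MIDDLE DOT;Po
"""

changes = gc_changes(OLD, NEW)
changed_chars = sum(len(cps) for cps in changes.values())
print(changed_chars, "characters changed in", len(changes), "group(s)")
```

    On the toy rows the raw count is two characters, but both fall
    into a single Pc --> Po group, which is the kind of gap between
    the two metrics that the notes below point out.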

    > Andrew C. West wrote:
    > > According to my calculations, the number of characters which changed their
    > > General Category from one version of Unicode to the next is :
    > >
    > > 1.1.5 -> 2.0.14 = 474 (1.384%)

    Many, many changes, since 1.1.5 was developed in house,
    without general public review, and 2.0.14 (the
    data version corresponding to Unicode 2.0) was the first
    public release of the data files.

    > > 2.0.14 -> 2.1.2 = 1 (0.0025%)

    1 decision

    > > 2.1.2 -> 2.1.5 = 16 (0.0410%)

    2 decisions: addition of Pi/Pf subcategories, and 1 fix for 8 Tibetan
    characters

    > > 2.1.5 -> 2.1.8 = 18 (0.0462%)

    1 decision: changes to converge identifier definitions

    > > 2.1.8 -> 2.1.9 = 3 (0.0077%)

    2 decisions: fix for Greek numeral signs; fix for halfwidth forms light
    vertical

    > > 2.1.9 -> 3.0.0 = 85 (0.2182%)

    I'd have to dig further for this, but these were likely mostly
    changes involved in nailing down normalization for Unicode 3.0.

    > > 3.0.0 -> 3.0.1 = 0 (0%)
    > > 3.0.1 -> 3.1.0 = 3 (0.0061%)

    1 decision: 3 Runic golden numbers

    > > 3.1.0 -> 3.2.0 = 7 (0.0074%)

    5 decisions: 2 fixes for Khmer signs, 1 for Tamil aytham, 1 for
    Arabic end of ayah (architectural), 1 for the 3 Mongolian free
    variation selectors

    > > 3.2.0 -> 4.0.0 = 16 (0.0168%)

    2 decisions: 1 fix for 12 modifier letters, 1 fix for decimal digit
    alignment

    > > 4.0.0 -> 4.0.1 = 1 (0.0010%)

    1 decision: fix for ZWSP

    > > 4.0.1 -> 4.1.0 = 12 (0.0124%)

    3 decisions: 1 fix for Ethiopic digits, 1 for 2 Katakana middle dots,
    1 for Yi syllable wu

    > >
    > > I don't know what this tells you about the stability of the UCD data though.

    The significant point of instability in General Category
    assignments was in establishing Unicode 2.0 data files
    (now more than 8 years in the past).

    There was a significant hiccup for Unicode 3.0, at the point
    when it became clear that normalization stability was going
    to be a major issue, and when the data was culled for
    consistency under canonical and compatibility equivalence.

    Since that time, the UTC has been very conservative, indeed,
    in approving any General Category change for an existing
    character. The types of changes have been limited to:

      A. Clarification regarding obscure characters for which
         insufficient information was available earlier.
         
      B. Establishment of further data consistency constraints
         (this impacted some numeric categories, and also
         explains the change for the Katakana middle dot)
         
      C. Implementation issues with a few format characters
         (ZWSP, Arabic end of ayah, Mongolian free variation selectors)
         
    Since the publication of Unicode 3.0 in 2000, the only
    characters in significantly common use that had any General
    Category change were:

       U+0B83 TAMIL SIGN VISARGA (=aytham, Tamil data)
       U+200B ZERO WIDTH SPACE (mostly relevant to Thai data)
       U+30FB KATAKANA MIDDLE DOT (Japanese)
       
    Of those 3, only U+30FB would exist in any commonly
    interchanged character set other than Unicode, and
    *that* change merely moved it from one punctuation
    subclass to another (gc=Pc --> gc=Po) -- and
    was additionally a *reversion* to the General Category
    assignment that U+30FB had in 2.1.5 and earlier.
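
    For anyone who wants to check these three characters locally, the
    following is a small sketch using Python's standard unicodedata
    module, which exposes both the interpreter's bundled database and
    a frozen Unicode 3.2.0 snapshot (unicodedata.ucd_3_2_0). The
    helper name gc_report is hypothetical. Going by the notes above,
    one would expect the 3.2.0 column to show the older values for
    U+200B and U+30FB and the already-changed value for U+0B83, but
    that is an expectation to verify, not a guaranteed output.

```python
import unicodedata

# The three common-use characters listed above.
CHARS = ["\u0b83", "\u200b", "\u30fb"]


def gc_report(chars=CHARS):
    """Return (code point, name, gc in Unicode 3.2.0, gc in bundled UCD)."""
    rows = []
    for ch in chars:
        rows.append((
            f"U+{ord(ch):04X}",
            unicodedata.name(ch),
            unicodedata.ucd_3_2_0.category(ch),  # frozen 3.2.0 snapshot
            unicodedata.category(ch),            # interpreter's bundled version
        ))
    return rows


print("bundled UCD version:", unicodedata.unidata_version)
for row in gc_report():
    print(*row)
```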
         
    --Ken

         


