Re: How many characters?

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Nov 23 2005 - 18:23:45 CST

    Mark said:

    > These figures depend on what precisely is meant by the label.

    Of course, but the labels have been intended to have precise
    meanings since we first started publishing historical lists of the
    "Number of Assigned Characters" in Unicode 3.0 back in 2000.
    It isn't as if we can just wave our arms around, say the
    labels mean whatever somebody might decide they mean, and
    then change the statistics every year based on that.

    > For
    > example, if Han Compatibility is taken as meaning:
    >
    > Ideographic=True and Decomposition_Type!=None
    >
    > Then divided by BMP or SMP, that gives (U4.1):
    >
    > [[:ideographic:]&[:^decomposition_type=none:]&[\u0000-\uFFFF]]
    > 399 Code Points
    >
    > [[:ideographic:]&[:^decomposition_type=none:]&[^\u0000-\uFFFF]]
    > 542 Code Points
    >
    > for a total of 941 Code Points. However, that includes 3 characters not
    > called CJK compatibility in their names. Or it could be going by the
    > block name (and then excluding unassigned code points).

    But of course, "Han Compatibility" in the stats doesn't mean the former.
    It means and always has meant the number of assigned characters in
    the following two blocks:

    F900..FAFF; CJK Compatibility Ideographs
    2F800..2FA1F; CJK Compatibility Ideographs Supplement

    despite the fact that 12 of the ideographs in the CJK Compatibility
    Ideographs block have the Unified_Ideograph property, and

    despite the fact that there are a number of characters outside
    of those blocks which have the Ideographic property, some of
    which also have decompositions.
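
    A minimal sketch of the block-based count described above, assuming
    Python's unicodedata module. Its data follows whatever Unicode version
    the interpreter ships, not 4.1, so the figures it prints need not match
    the ones quoted in this thread. The module does not expose
    Unified_Ideograph, so "has no decomposition mapping" is used here as a
    stand-in for finding the block members that a decomposition-based
    definition would leave out; that substitution is an assumption of the
    sketch.

```python
import unicodedata

# Block ranges as quoted above (inclusive).
BLOCKS = {
    "CJK Compatibility Ideographs": (0xF900, 0xFAFF),
    "CJK Compatibility Ideographs Supplement": (0x2F800, 0x2FA1F),
}


def assigned_in_block(first, last):
    """Code points in [first, last] whose General_Category is not Cn."""
    return [cp for cp in range(first, last + 1)
            if unicodedata.category(chr(cp)) != "Cn"]


# "Han Compatibility" in the block-based sense: assigned characters per block.
counts = {name: len(assigned_in_block(*rng)) for name, rng in BLOCKS.items()}
han_compatibility = sum(counts.values())

# Assigned members of the BMP block with no decomposition mapping.
# A definition based on Decomposition_Type != None would not count these.
no_decomp = [cp for cp in assigned_in_block(*BLOCKS["CJK Compatibility Ideographs"])
             if not unicodedata.decomposition(chr(cp))]

if __name__ == "__main__":
    print("Unicode data version:", unicodedata.unidata_version)
    for name, n in counts.items():
        print(f"{name}: {n}")
    print("Han Compatibility (both blocks):", han_compatibility)
    print("No decomposition:", " ".join(f"U+{cp:04X}" for cp in no_decomp))
```

    The point of the sketch is only that the two definitions select
    different sets, which is why the label has to be pinned to one of them.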

    >
    > Similarly, the label Alphabetics and Symbols is not actually Alphabetic
    > union Symbol: it is really (I guess) for the BMP
                                 ^^^^^^^^^
                                 
    No guessing necessary, since "Graphic" in Tables D-2 and D-3
    of Unicode 4.0 was quite consciously and deliberately aligned with
    Table 2-2. And the "Alphabetic, Symbols" line is part of the summation
    of values that leads to Graphic characters as a subtotal.
                                 
    >
    > [[:gc=letter:][:gc=number:][:gc=symbol:][:gc=mark:][:gc=punctuation:][:gc=separator:]&[\u0000-\uFFFF]]
    > minus the other listed stuff:
    > Han (URO), Han Extension A, Han Compatibility, Hangul Syllables.
    >
    > Here is the breakdown I get for 4.1, using the main properties that we
    > list in Chapter 2.
    >
    >
    >                          BMP       SMP       All
    > [:gc=letter:]         46,618    44,777    91,395
    > [:gc=number:]            514       181       695
    > [:gc=symbol:]          3,339       619     3,958
    > [:gc=mark:]              723       286     1,009
    > [:gc=punctuation:]       428        12       440
    > [:gc=separator:]          20         0        20
    > *Subtotal:*          *51,642*  *45,875*  *97,517*

    This subtotal is incorrect (as a count of Graphic
    characters), because it doesn't distinguish
    those [:gc=separator:] which *are* Graphic characters from
    those which are not.
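
    To make the difference concrete, here is a rough sketch that tallies
    General_Category per plane and reports both the naive subtotal (all of
    gc=separator included) and a subtotal that keeps only gc=Zs from the
    separators. It assumes Python's unicodedata module, so the numbers
    reflect that module's Unicode version rather than 4.1; the helper names
    (plane, tally_by_gc, subtotals) are made up for the illustration.

```python
import unicodedata
from collections import Counter

# First letters of the major classes that appear in the quoted table.
MAJOR = {"L", "N", "S", "M", "P", "Z"}


def plane(cp):
    """Label a code point as BMP or supplementary."""
    return "BMP" if cp <= 0xFFFF else "Supplementary"


def tally_by_gc():
    """Count code points per (General_Category, plane label)."""
    counts = Counter()
    for cp in range(0x110000):
        counts[(unicodedata.category(chr(cp)), plane(cp))] += 1
    return counts


def subtotals(counts, where):
    """Return (naive subtotal, subtotal without Zl and Zp) for one plane label."""
    naive = sum(n for (gc, p), n in counts.items()
                if p == where and gc[0] in MAJOR)
    non_graphic = sum(n for (gc, p), n in counts.items()
                      if p == where and gc in ("Zl", "Zp"))
    return naive, naive - non_graphic


counts = tally_by_gc()

if __name__ == "__main__":
    for where in ("BMP", "Supplementary"):
        naive, graphic = subtotals(counts, where)
        print(f"{where}: naive={naive:,} graphic={graphic:,}")
```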

    > [:gc=control:]            65         0        65
    > [:gc=format:]             33       105       138

    This is incorrect, and is making the same mistake that Peter made
    first, and then Asmus.

    Format controls as defined in Chapter 2 are the union of
    gc=Cf and gc=Zl and gc=Zp.
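
    That union can be sketched as follows, again assuming Python's
    unicodedata module (so the counts follow its Unicode version, not 4.1).
    The names FORMAT_GCS, is_format_control and count_format_controls are
    hypothetical, chosen for this illustration only.

```python
import unicodedata

# Format controls in the Chapter 2 sense stated above: Cf, Zl and Zp together.
FORMAT_GCS = {"Cf", "Zl", "Zp"}


def is_format_control(ch):
    """True if the character's General_Category is Cf, Zl or Zp."""
    return unicodedata.category(ch) in FORMAT_GCS


def count_format_controls():
    """Return (BMP count, supplementary count) of format controls."""
    bmp = sum(is_format_control(chr(cp)) for cp in range(0x10000))
    supplementary = sum(is_format_control(chr(cp))
                        for cp in range(0x10000, 0x110000))
    return bmp, supplementary


if __name__ == "__main__":
    bmp, supplementary = count_format_controls()
    print("Unicode data version:", unicodedata.unidata_version)
    print(f"Format controls: BMP={bmp} supplementary={supplementary}")
```

    Counting gc=Cf alone misses U+2028 LINE SEPARATOR and U+2029 PARAGRAPH
    SEPARATOR, which is the discrepancy at issue here.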

     
    > Mark
    >
    > BTW, we really should dispense with the now-artificial distinction
    > between SMP and BMP for the figures nowadays.

    I disagree.

    The distinction is a real, not artificial one. And while there
    are rhetorical reasons for deemphasizing it and encouraging everyone
    to implement all code points equally, I am not in favor of
    scrubbing the stats of the differences.

    --Ken


