From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Nov 23 2005 - 18:23:45 CST
Mark said:
> These figures depend on what precisely is meant by the label.
Of course, but the labels have been intended to have precise
meanings since we first started publishing historical lists of the
"Number of Assigned Characters" in Unicode 3.0 back in 2000.
It isn't as if we can just wave our arms around, say the
labels mean whatever somebody might decide they mean, and
then change the statistics every year based on that.
> For
> example, if Han Compatibility is taken as meaning:
>
> Ideographic=True and Decomposition_Type!=None
>
> Then divided by BMP or SMP, that gives (U4.1):
>
> [[:ideographic:]&[:^decomposition_type=none:]&[\u0000-\uFFFF]]
> 399 Code Points
>
> [[:ideographic:]&[:^decomposition_type=none:]&[^\u0000-\uFFFF]]
> 542 Code Points
>
> for a total of 941 Code Points. However, that includes 3 characters not
> called CJK compatibility in their names. Or it could be going by the
> block name (and then excluding unassigned code points).
But of course, "Han Compatibility" in the stats doesn't mean the former.
It means and always has meant the number of assigned characters in
the following two blocks:
F900..FAFF; CJK Compatibility Ideographs
2F800..2FA1F; CJK Compatibility Ideographs Supplement
despite the fact that 12 of the ideographs in the CJK Compatibility
Ideographs block have the Unified_Ideograph property, and
despite the fact that there are a number of characters outside
of those blocks which have the Ideographic property, some of
which also have decompositions.
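As a rough illustration of the block-based definition, the count can
be sketched in Python. This is only a sketch: it assumes "assigned"
can be approximated as "General_Category is not Cn", and it uses
whatever Unicode version the local Python's unicodedata module ships,
so the totals need not match the 4.1 figures. The function and
constant names are made up for this example.

```python
import unicodedata

# Block ranges as listed above (inclusive on both ends).
HAN_COMPATIBILITY_BLOCKS = [
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x2F800, 0x2FA1F),  # CJK Compatibility Ideographs Supplement
]

def assigned_in_range(first, last):
    """Count code points in first..last whose General_Category is not Cn.

    Assumption: "not Cn" stands in for "assigned character" here.
    """
    return sum(
        1
        for cp in range(first, last + 1)
        if unicodedata.category(chr(cp)) != "Cn"
    )

def han_compatibility_count():
    """Sum the assigned characters over the two compatibility blocks."""
    return sum(assigned_in_range(a, b) for a, b in HAN_COMPATIBILITY_BLOCKS)

if __name__ == "__main__":
    # Print the Unicode version alongside the count, since the count
    # depends on the version of the character database in use.
    print(unicodedata.unidata_version, han_compatibility_count())
```

Counting by block range rather than by property keeps the figure
independent of which characters happen to be Ideographic or to have
decompositions.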
>
> Similarly, the label Alphabetics and Symbols is not actually Alphabetic
> union Symbol: it is really (I guess) for the BMP
                             ^^^^^^^^^
No guessing necessary, since "Graphic" in Tables D-2 and D-3
of Unicode 4.0 was quite consciously and deliberately aligned with
Table 2-2. And the "Alphabetic, Symbols" line is part of the summation
of values that leads to Graphic characters as a subtotal.
>
>
> [[:gc=letter:][:gc=number:][:gc=symbol:][:gc=mark:][:gc=punctuation:][:gc=separator:]&[\u0000-\uFFFF]]
> minus the other listed stuff:
> Han (URO), Han Extension A, Han Compatibility, Hangul Syllables.
>
> Here is the breakdown I get for 4.1, using the main properties that we
> list in Chapter 2.
>
>
>                         BMP      SMP      All
> [:gc=letter:]        46,618   44,777   91,395
> [:gc=number:]           514      181      695
> [:gc=symbol:]         3,339      619    3,958
> [:gc=mark:]             723      286    1,009
> [:gc=punctuation:]      428       12      440
> [:gc=separator:]         20        0       20
> *Subtotal:*         *51,642* *45,875* *97,517*
This subtotal is incorrect (as a count of Graphic
characters), because it doesn't distinguish
those [:gc=separator:] which *are* Graphic characters from
those which are not.
> [:gc=control:]           65        0       65
> [:gc=format:]            33      105      138
This is incorrect, and is making the same mistake that Peter made
first, and then Asmus.
Format controls as defined in Chapter 2 are the union of
gc=Cf and gc=Zl and gc=Zp.
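A minimal sketch of that classification in Python, for anyone who
wants to rerun the numbers. It assumes the Chapter 2 breakdown in
which Zl and Zp count as format controls alongside Cf, so that Zs is
the only separator category left among the Graphic characters. The
names below are invented for the example, and the counts depend on
the Unicode version behind the local unicodedata module, so no
figures are quoted here.

```python
import sys
import unicodedata
from collections import Counter

# Format controls: the union of gc=Cf, gc=Zl and gc=Zp.
FORMAT_CATEGORIES = {"Cf", "Zl", "Zp"}

def code_point_type(cp):
    """Classify a code point as graphic, format, control, or other."""
    gc = unicodedata.category(chr(cp))
    if gc in FORMAT_CATEGORIES:
        return "format"
    if gc == "Cc":
        return "control"
    # Letters, marks, numbers, punctuation, symbols, plus space separators.
    if gc == "Zs" or gc[0] in "LMNPS":
        return "graphic"
    return "other"  # private use (Co), surrogate (Cs), unassigned (Cn)

def tally(first, last):
    """Count code point types over the inclusive range first..last."""
    return Counter(code_point_type(cp) for cp in range(first, last + 1))

if __name__ == "__main__":
    print("BMP", dict(tally(0x0000, 0xFFFF)))
    print("supplementary", dict(tally(0x10000, sys.maxunicode)))
```

With this split, U+2028 LINE SEPARATOR and U+2029 PARAGRAPH SEPARATOR
land in the format row rather than in the Graphic subtotal.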
> Mark
>
> BTW, we really should dispense with the now-artificial distinction
> between SMP and BMP for the figures nowadays.
I disagree.
The distinction is a real one, not an artificial one. And while there
are rhetorical reasons for deemphasizing it and encouraging everyone
to implement all code points equally, I am not in favor of
scrubbing the stats of the differences.
--Ken
This archive was generated by hypermail 2.1.5 : Wed Nov 23 2005 - 18:25:47 CST