From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed Nov 23 2005 - 14:01:57 CST
On 11/23/2005 7:43 AM, Peter Constable wrote:
>By my calculations, both you and Ken have errors in your 4.1 statistics.
>
>Re the BMP: Doing a hand count of Cf characters in TUS4.1, I come up with 33. Not 31, not 35.
>
I also find 33, as follows:
C:\UniDev\data\UNIDATA-4.1.0>findstr /B /R [A-F0-9][A-F0-9][A-F0-9][A-F0-9]; UnicodeData.txt | findstr ;Cf; | wc
33 96 1555
The first 'findstr' limits the search to the BMP (4-digit code points), the
second searches for Cf in the result, and 'wc' counts the lines found (the
first number in its output).
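
For anyone without findstr and wc at hand, roughly the same count can be sketched in Python. This is only an illustration, not the commands used above: the function name and the sample lines are made up for the example, and it compares the General_Category field exactly instead of searching the whole line for ";Cf;".

```python
# Illustrative lines in UnicodeData.txt format (semicolon-separated fields:
# code point, name, General_Category, ...). Not a complete data file.
SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;",
    "200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;",
    "E0001;LANGUAGE TAG;Cf;0;BN;;;;;N;;;;;",
]


def count_bmp_with_category(lines, category="Cf"):
    """Count lines whose code point has 4 hex digits (BMP) and whose
    General_Category field equals `category`."""
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        code_point, general_category = fields[0], fields[2]
        if len(code_point) == 4 and general_category == category:
            count += 1
    return count


print(count_bmp_with_category(SAMPLE))
```

To run it on the real data, pass an open file instead of the sample, for example count_bmp_with_category(open("UnicodeData.txt", encoding="utf-8")).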
> And I came up with the following counts for graphic characters in Unicode 4.1:
>
>Alphabetics, Symbols: 12,497
>
>
I find 12,964 characters that are [LMNPSZ][a-z], which, after subtracting
the 467 Han Compat (not 457), gives 12,497.
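
A sketch of the [LMNPSZ][a-z] count in Python, assuming the same UnicodeData.txt line format. The helper name and sample lines are hypothetical; only the figures 12,964, 467 and 12,497 come from this message.

```python
import re

# General_Category values for letters, marks, numbers, punctuation,
# symbols and separators: one of L M N P S Z followed by a lowercase letter.
GRAPHIC_GC = re.compile(r"[LMNPSZ][a-z]")

# Illustrative lines only, not a complete data file.
SAMPLE = [
    "0020;SPACE;Zs;0;WS;;;;;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;",
    "F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;",
    "1D400;MATHEMATICAL BOLD CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;",
]


def count_bmp_graphic_lines(lines):
    """Count BMP lines whose General_Category matches [LMNPSZ][a-z]."""
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 3 or len(fields[0]) != 4:
            continue  # malformed line, or code point outside the BMP
        if GRAPHIC_GC.fullmatch(fields[2]):
            count += 1
    return count


# The subtraction stated above, using the figures from this message.
alphabetics_and_symbols = 12964 - 467
print(count_bmp_graphic_lines(SAMPLE), alphabetics_and_symbols)
```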
>Han (URO): 20,924
>Han Extension A: 6,582
>Han Compatibility: 457
>
>
I find 467 as follows:
C:\UniDev\data\UNIDATA-4.1.0>findstr /B /R [A-F0-9][A-F0-9][A-F0-9][A-F0-9]; UnicodeData.txt | findstr IDEOGRAPH- | wc
467 1401 27980
These are all the BMP characters with "IDEOGRAPH-" in their name.
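
The same name-based filter, sketched in Python under the same assumptions about the file format. The function name and sample lines are illustrative; unlike the second findstr, this looks for "IDEOGRAPH-" in the name field only, not anywhere in the line.

```python
# Illustrative lines only, not a complete data file.
SAMPLE = [
    "3007;IDEOGRAPHIC NUMBER ZERO;Nl;0;L;;;;0;N;;;;;",
    "F900;CJK COMPATIBILITY IDEOGRAPH-F900;Lo;0;L;8C48;;;;N;;;;;",
    "FA0E;CJK COMPATIBILITY IDEOGRAPH-FA0E;Lo;0;L;;;;;N;;;;;",
    "2F800;CJK COMPATIBILITY IDEOGRAPH-2F800;Lo;0;L;4E3D;;;;N;;;;;",
]


def count_bmp_name_contains(lines, needle="IDEOGRAPH-"):
    """Count BMP lines (4-digit code point) whose name contains `needle`."""
    count = 0
    for line in lines:
        fields = line.rstrip("\n").split(";")
        if len(fields) < 2 or len(fields[0]) != 4:
            continue  # malformed line, or code point outside the BMP
        if needle in fields[1]:
            count += 1
    return count


print(count_bmp_name_contains(SAMPLE))
```

The trailing hyphen matters: it keeps names such as IDEOGRAPHIC NUMBER ZERO out of the count.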
>Hangul Syllables: 11,172
>Total Graphic characters: 51,642
>
>
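
As a cross-check on the category list quoted above, the five figures can simply be added up. A small sketch (the dictionary keys and function name are just labels for this example): with 467 for Han Compatibility the sum equals the quoted total of 51,642, while with 457 it falls ten short.

```python
# Category counts quoted above, excluding Han Compatibility.
OTHER_GRAPHIC = {
    "Alphabetics, Symbols": 12497,
    "Han (URO)": 20924,
    "Han Extension A": 6582,
    "Hangul Syllables": 11172,
}


def graphic_total(han_compat):
    """Sum the quoted category counts plus a Han Compatibility count."""
    return sum(OTHER_GRAPHIC.values()) + han_compat


print(graphic_total(467))  # 51642
print(graphic_total(457))  # 51632
```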
>Re the supplementary planes: My numbers agree with yours.
>
>Overall, then, I believe the correct numbers for TUS4.1 are as follows:
>
>Unicode 4.1:
>
> 51642 graphic characters assigned (BMP)
> 33 format control characters assigned (BMP)
> 65 control characters assigned (BMP)
> 6400 private use characters assigned (BMP)
> 2048 surrogate code points designated (BMP)
> 34 noncharacter code points designated (BMP)
> 5314 reserved code points (BMP)
> 45875 graphic characters assigned (supplementary planes)
> 105 format characters assigned (supplementary planes)
> 131068 private use characters assigned (supplementary planes)
> 32 noncharacter code points designated (supplementary planes)
> 871496 reserved code points (supplementary planes)
>------------------------------------------------------------------
> 1114112 code points altogether
>
>
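
The table above can be checked for internal consistency by summing each group. A minimal sketch, with the numbers copied from the table and the variable names invented for the example:

```python
# Rows of the Unicode 4.1 table quoted above.
BMP = {
    "graphic": 51642,
    "format control": 33,
    "control": 65,
    "private use": 6400,
    "surrogate": 2048,
    "noncharacter": 34,
    "reserved": 5314,
}
SUPPLEMENTARY = {
    "graphic": 45875,
    "format": 105,
    "private use": 131068,
    "noncharacter": 32,
    "reserved": 871496,
}

bmp_total = sum(BMP.values())
supplementary_total = sum(SUPPLEMENTARY.values())

# One plane holds 0x10000 code points; there are 16 supplementary planes.
print(bmp_total == 0x10000)
print(supplementary_total == 16 * 0x10000)
print(bmp_total + supplementary_total)
```

Note that this check cannot decide between the competing BMP figures in this thread: 51644 + 31, 51640 + 35 and 51642 + 33 all give 51675 graphic plus format characters, so only the split between the two rows is in dispute.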
>I haven't looked at 5.0 numbers; let's see if we can agree on 4.1 numbers, though.
>
>
>Peter Constable
>
>
>
>
>>-----Original Message-----
>>From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
>>Behalf Of Andrew West
>>Sent: Wednesday, November 23, 2005 4:26 AM
>>To: unicode@unicode.org
>>Subject: Re: How many characters?
>>
>>On 22/11/05, Kenneth Whistler <kenw@sybase.com> wrote:
>>
>>
>>>Unicode 4.1:
>>>
>>> 51644 graphic characters assigned (BMP)
>>> 31 format control characters assigned (BMP)
>>> 65 control characters assigned (BMP)
>>> 6400 private use characters assigned (BMP)
>>> 2048 surrogate code points designated (BMP)
>>> 34 noncharacter code points designated (BMP)
>>> 5314 reserved code points (BMP)
>>> 45980 graphic characters assigned (supplementary planes)
>>> 131068 private use characters assigned (supplementary planes)
>>> 32 noncharacter code points designated (supplementary planes)
>>> 871496 reserved code points (supplementary planes)
>>>------------------------------------------------------------------
>>>1114112 code points altogether
>>>
>>>Unicode 5.0:
>>>
>>> 51986 graphic characters assigned (BMP)
>>> 31 format control characters assigned (BMP)
>>> 65 control characters assigned (BMP)
>>> 6400 private use characters assigned (BMP)
>>> 2048 surrogate code points designated (BMP)
>>> 34 noncharacter code points designated (BMP)
>>> 4972 reserved code points (BMP)
>>> 47007 graphic characters assigned (supplementary planes)
>>> 131068 private use characters assigned (supplementary planes)
>>> 32 noncharacter code points designated (supplementary planes)
>>> 870469 reserved code points (supplementary planes)
>>>------------------------------------------------------------------
>>>1114112 code points altogether
>>>
>>>
>>>
>>Ken may perhaps have forgotten that the 4.0 figures wrongly count five
>>format characters as graphic characters, and so after adjusting for
>>the longstanding out by two error the 4.1 figures for format
>>characters are still out by four due to the change in GC of U+200B to
>>Cf in 4.0.1. By my calculations the correct values for 4.1 are:
>>
>>Unicode 4.1:
>>
>> 51640 graphic characters assigned (BMP)
>> 35 format control characters assigned (BMP)
>> 65 control characters assigned (BMP)
>> 6400 private use characters assigned (BMP)
>> 2048 surrogate code points designated (BMP)
>> 34 noncharacter code points designated (BMP)
>> 5314 reserved code points (BMP)
>> 45875 graphic characters assigned (supplementary planes)
>> 105 format characters assigned (supplementary planes)
>>131068 private use characters assigned (supplementary planes)
>> 32 noncharacter code points designated (supplementary planes)
>>871496 reserved code points (supplementary planes)
>>------------------------------------------------------------------
>>1114112 code points altogether
>>
>>Based on the latest publicly available version of the 5.0 UCD data, I
>>get the following figures for 5.0. My figures have two less BMP and
>>two more SMP characters than Ken's figures, but I haven't
>>cross-checked with N2991 yet (N2991 states there are 1,359 new
>>characters, but this must be a typo for 1,369), so I'm not sure who's
>>correct.
>>
>>Unicode 5.0:
>>
>> 51980 graphic characters assigned (BMP)
>> 35 format control characters assigned (BMP)
>> 65 control characters assigned (BMP)
>> 6400 private use characters assigned (BMP)
>> 2048 surrogate code points designated (BMP)
>> 34 noncharacter code points designated (BMP)
>> 4974 reserved code points (BMP)
>> 46904 graphic characters assigned (supplementary planes)
>> 105 format characters assigned (supplementary planes)
>>131068 private use characters assigned (supplementary planes)
>> 32 noncharacter code points designated (supplementary planes)
>>870467 reserved code points (supplementary planes)
>>------------------------------------------------------------------
>>1114112 code points altogether
>>
>>Andrew
>>
>>
>>