RE: How many characters?

From: Peter Constable (petercon@microsoft.com)
Date: Wed Nov 23 2005 - 10:07:19 CST


    [I replied earlier, but that response seems to have gotten lost.]

    I think both you and Ken are wrong re 4.1. For the BMP, I did a hand count of Cf characters, and came up with 33, not 31 or 35. I also did counts on various categories of graphic characters and got the following:

    Alphabetics, Symbols: 12,497
    Han (URO): 20,924
    Han Extension A: 6,582
    Han Compatibility: 467
    Hangul Syllables: 11,172
    Total Graphic characters: 51,642
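The block-sized entries in this list can be cross-checked from code point ranges. A minimal sketch, assuming the range boundaries below for Unicode 4.1 (they are my reading of the 4.1 code charts, not figures taken from this message). The "Alphabetics, Symbols" hand count has no single range, so it is taken from the list as-is, and all names in the code are illustrative:

```python
# Inclusive code point ranges ASSUMED for Unicode 4.1 (not stated in this thread).
RANGES_4_1 = {
    "Han (URO)": [(0x4E00, 0x9FBB)],
    "Han Extension A": [(0x3400, 0x4DB5)],
    "Han Compatibility": [(0xF900, 0xFA2D), (0xFA30, 0xFA6A), (0xFA70, 0xFAD9)],
    "Hangul Syllables": [(0xAC00, 0xD7A3)],
}


def range_total(ranges):
    """Number of code points covered by a list of inclusive ranges."""
    return sum(last - first + 1 for first, last in ranges)


block_counts = {name: range_total(r) for name, r in RANGES_4_1.items()}

# Hand count from the list above; it cannot be derived from one range.
ALPHABETICS_SYMBOLS = 12_497

graphic_total = ALPHABETICS_SYMBOLS + sum(block_counts.values())

for name, count in block_counts.items():
    print(f"{name}: {count:,}")
print(f"Total graphic characters: {graphic_total:,}")
```

Under those assumed ranges the total comes out at the 51,642 given above.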

    Thus, I get the following for 4.1:

    Unicode 4.1:
     
       51642 graphic characters assigned (BMP)
          33 format control characters assigned (BMP)
          65 control characters assigned (BMP)
        6400 private use characters assigned (BMP)
        2048 surrogate code points designated (BMP)
          34 noncharacter code points designated (BMP)
        5314 reserved code points (BMP)
       45875 graphic characters assigned (supplementary planes)
         105 format characters assigned (supplementary planes)
      131068 private use characters assigned (supplementary planes)
          32 noncharacter code points designated (supplementary planes)
      871496 reserved code points (supplementary planes)
    ------------------------------------------------------------------
     1114112 code points altogether
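The control, private use, surrogate, noncharacter and reserved rows for the BMP are the same in every 4.1 tally in this thread, so the disagreement is only about how the remaining code points split between graphic and format characters. A small sketch of that check, using the figures given in this thread (the variable names are illustrative):

```python
BMP_SIZE = 0x10000  # 65,536 code points in one plane

# BMP rows on which all the 4.1 tallies in this thread agree.
undisputed_bmp = {
    "control": 65,
    "private use": 6400,
    "surrogate": 2048,
    "noncharacter": 34,
    "reserved": 5314,
}

# Whatever is left must be graphic + format characters combined.
graphic_plus_format = BMP_SIZE - sum(undisputed_bmp.values())

# (graphic, format) splits proposed in the thread for the BMP in 4.1.
splits = {
    "Ken": (51644, 31),
    "Andrew": (51640, 35),
    "Peter (hand count)": (51642, 33),
}

print(f"graphic + format must equal {graphic_plus_format}")
for who, (graphic, fmt) in splits.items():
    status = "ok" if graphic + fmt == graphic_plus_format else "MISMATCH"
    print(f"{who}: {graphic} + {fmt} = {graphic + fmt} ({status})")
```

All three splits satisfy the plane total, so the sum alone cannot decide between them; only a count of the Cf characters themselves can.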

    Peter Constable

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > Behalf Of Andrew West
    > Sent: Wednesday, November 23, 2005 4:26 AM
    > To: unicode@unicode.org
    > Subject: Re: How many characters?
    >
    > On 22/11/05, Kenneth Whistler <kenw@sybase.com> wrote:
    > >
    > > Unicode 4.1:
    > >
    > > 51644 graphic characters assigned (BMP)
    > > 31 format control characters assigned (BMP)
    > > 65 control characters assigned (BMP)
    > > 6400 private use characters assigned (BMP)
    > > 2048 surrogate code points designated (BMP)
    > > 34 noncharacter code points designated (BMP)
    > > 5314 reserved code points (BMP)
    > > 45980 graphic characters assigned (supplementary planes)
    > > 131068 private use characters assigned (supplementary planes)
    > > 32 noncharacter code points designated (supplementary planes)
    > > 871496 reserved code points (supplementary planes)
    > > ------------------------------------------------------------------
    > > 1114112 code points altogether
    > >
    > > Unicode 5.0:
    > >
    > > 51986 graphic characters assigned (BMP)
    > > 31 format control characters assigned (BMP)
    > > 65 control characters assigned (BMP)
    > > 6400 private use characters assigned (BMP)
    > > 2048 surrogate code points designated (BMP)
    > > 34 noncharacter code points designated (BMP)
    > > 4972 reserved code points (BMP)
    > > 47007 graphic characters assigned (supplementary planes)
    > > 131068 private use characters assigned (supplementary planes)
    > > 32 noncharacter code points designated (supplementary planes)
    > > 870469 reserved code points (supplementary planes)
    > > ------------------------------------------------------------------
    > > 1114112 code points altogether
    > >
    >
    > Ken may perhaps have forgotten that the 4.0 figures wrongly count five
    > format characters as graphic characters, and so after adjusting for
    > the longstanding out by two error the 4.1 figures for format
    > characters are still out by four due to the change in GC of U+200B to
    > Cf in 4.0.1. By my calculations the correct values for 4.1 are:
    >
    > Unicode 4.1:
    >
    > 51640 graphic characters assigned (BMP)
    > 35 format control characters assigned (BMP)
    > 65 control characters assigned (BMP)
    > 6400 private use characters assigned (BMP)
    > 2048 surrogate code points designated (BMP)
    > 34 noncharacter code points designated (BMP)
    > 5314 reserved code points (BMP)
    > 45875 graphic characters assigned (supplementary planes)
    > 105 format characters assigned (supplementary planes)
    > 131068 private use characters assigned (supplementary planes)
    > 32 noncharacter code points designated (supplementary planes)
    > 871496 reserved code points (supplementary planes)
    > ------------------------------------------------------------------
    > 1114112 code points altogether
    >
    > Based on the latest publicly available version of the 5.0 UCD data, I
    > get the following figures for 5.0. My figures have two fewer BMP and
    > two more SMP characters than Ken's figures, but I haven't
    > cross-checked with N2991 yet (N2991 states there are 1,359 new
    > characters, but this must be a typo for 1,369), so I'm not sure who's
    > correct.
    >
    > Unicode 5.0:
    >
    > 51980 graphic characters assigned (BMP)
    > 35 format control characters assigned (BMP)
    > 65 control characters assigned (BMP)
    > 6400 private use characters assigned (BMP)
    > 2048 surrogate code points designated (BMP)
    > 34 noncharacter code points designated (BMP)
    > 4974 reserved code points (BMP)
    > 46904 graphic characters assigned (supplementary planes)
    > 105 format characters assigned (supplementary planes)
    > 131068 private use characters assigned (supplementary planes)
    > 32 noncharacter code points designated (supplementary planes)
    > 870467 reserved code points (supplementary planes)
    > ------------------------------------------------------------------
    > 1114112 code points altogether
    >
    > Andrew
    >
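The quoted tallies can be checked mechanically: each table should add up to 1,114,112 code points, and the growth in assigned graphic and format characters from 4.1 to 5.0 gives the number of new characters. A sketch using the figures quoted above; the data layout is my own, and a 0 is used for the supplementary format row in Ken's tables, which do not list one separately:

```python
TOTAL_CODE_POINTS = 0x110000  # 17 planes of 65,536 code points = 1,114,112

# BMP rows: graphic, format, control, private use, surrogate, noncharacter, reserved
# Supplementary rows: graphic, format, private use, noncharacter, reserved
tallies = {
    ("Ken", "4.1"): ([51644, 31, 65, 6400, 2048, 34, 5314],
                     [45980, 0, 131068, 32, 871496]),
    ("Ken", "5.0"): ([51986, 31, 65, 6400, 2048, 34, 4972],
                     [47007, 0, 131068, 32, 870469]),
    ("Andrew", "4.1"): ([51640, 35, 65, 6400, 2048, 34, 5314],
                        [45875, 105, 131068, 32, 871496]),
    ("Andrew", "5.0"): ([51980, 35, 65, 6400, 2048, 34, 4974],
                        [46904, 105, 131068, 32, 870467]),
}


def grand_total(bmp, supp):
    """All code points accounted for by one table."""
    return sum(bmp) + sum(supp)


def assigned(bmp, supp):
    """Graphic plus format characters, BMP and supplementary planes."""
    return bmp[0] + bmp[1] + supp[0] + supp[1]


totals = {key: grand_total(*rows) for key, rows in tallies.items()}
new_in_5_0 = {
    who: assigned(*tallies[(who, "5.0")]) - assigned(*tallies[(who, "4.1")])
    for who in ("Ken", "Andrew")
}

for key, total in totals.items():
    print(key, total, "ok" if total == TOTAL_CODE_POINTS else "MISMATCH")
print("new characters in 5.0:", new_in_5_0)
```

Both sets of figures give the same growth from 4.1 to 5.0, which is the 1,369 that Andrew suggests N2991 intended.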


