Re: How many characters?

From: Andrew West (andrewcwest@gmail.com)
Date: Wed Nov 23 2005 - 06:26:28 CST

    On 22/11/05, Kenneth Whistler <kenw@sybase.com> wrote:
    >
    > Unicode 4.1:
    >
    > 51644 graphic characters assigned (BMP)
    > 31 format control characters assigned (BMP)
    > 65 control characters assigned (BMP)
    > 6400 private use characters assigned (BMP)
    > 2048 surrogate code points designated (BMP)
    > 34 noncharacter code points designated (BMP)
    > 5314 reserved code points (BMP)
    > 45980 graphic characters assigned (supplementary planes)
    > 131068 private use characters assigned (supplementary planes)
    > 32 noncharacter code points designated (supplementary planes)
    > 871496 reserved code points (supplementary planes)
    > ------------------------------------------------------------------
    > 1114112 code points altogether
    >
    > Unicode 5.0:
    >
    > 51986 graphic characters assigned (BMP)
    > 31 format control characters assigned (BMP)
    > 65 control characters assigned (BMP)
    > 6400 private use characters assigned (BMP)
    > 2048 surrogate code points designated (BMP)
    > 34 noncharacter code points designated (BMP)
    > 4972 reserved code points (BMP)
    > 47007 graphic characters assigned (supplementary planes)
    > 131068 private use characters assigned (supplementary planes)
    > 32 noncharacter code points designated (supplementary planes)
    > 870469 reserved code points (supplementary planes)
    > ------------------------------------------------------------------
    > 1114112 code points altogether
    >

    Ken may perhaps have forgotten that the 4.0 figures wrongly count five
    format characters as graphic characters. So even after adjusting for
    the longstanding out-by-two error, the 4.1 figures for format
    characters are still out by four, due to the change in General
    Category (GC) of U+200B to Cf in 4.0.1. By my calculations the correct
    values for 4.1 are:

    Unicode 4.1:

     51640 graphic characters assigned (BMP)
        35 format control characters assigned (BMP)
        65 control characters assigned (BMP)
      6400 private use characters assigned (BMP)
      2048 surrogate code points designated (BMP)
        34 noncharacter code points designated (BMP)
      5314 reserved code points (BMP)
     45875 graphic characters assigned (supplementary planes)
       105 format characters assigned (supplementary planes)
    131068 private use characters assigned (supplementary planes)
        32 noncharacter code points designated (supplementary planes)
    871496 reserved code points (supplementary planes)
    ------------------------------------------------------------------
    1114112 code points altogether
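
For anyone who wants to reproduce a tally of this shape, here is a minimal sketch in Python. It is not the method used for the figures above. It assumes the code-point types of the Unicode Standard's "Types of Code Points" table (format = Cf, Zl, Zp; graphic = every other assigned category apart from Cc, Co and Cs), and it reads whatever UCD version is bundled with the running Python interpreter, so the graphic, format and reserved counts will not match the 4.1 or 5.0 figures. The function names are illustrative only.

```python
import unicodedata
from collections import Counter


def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def classify(cp: int) -> str:
    # Map a General Category value onto the row labels used in the tallies.
    gc = unicodedata.category(chr(cp))
    if gc == "Cn":
        return "noncharacter" if is_noncharacter(cp) else "reserved"
    if gc in ("Cf", "Zl", "Zp"):  # assumption: Zl and Zp count as format
        return "format"
    fixed = {"Cc": "control", "Co": "private use", "Cs": "surrogate"}
    return fixed.get(gc, "graphic")


def tally() -> Counter:
    counts = Counter()
    for cp in range(0x110000):
        region = "BMP" if cp <= 0xFFFF else "supplementary planes"
        counts[(classify(cp), region)] += 1
    return counts


counts = tally()
print("UCD version in this interpreter:", unicodedata.unidata_version)
for (kind, region), n in sorted(counts.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(f"{n:8d} {kind} ({region})")
print(f"{sum(counts.values()):8d} code points altogether")
```

The control, private use, surrogate and noncharacter rows are fixed by the architecture of the standard, so those should come out the same as in the tallies here whatever the UCD version.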

    Based on the latest publicly available version of the 5.0 UCD data, I
    get the following figures for 5.0. My figures have two fewer BMP and
    two more SMP characters than Ken's figures, but I haven't
    cross-checked with N2991 yet (N2991 states there are 1,359 new
    characters, but this must be a typo for 1,369), so I'm not sure who's
    correct.

    Unicode 5.0:

     51980 graphic characters assigned (BMP)
        35 format control characters assigned (BMP)
        65 control characters assigned (BMP)
      6400 private use characters assigned (BMP)
      2048 surrogate code points designated (BMP)
        34 noncharacter code points designated (BMP)
      4974 reserved code points (BMP)
     46904 graphic characters assigned (supplementary planes)
       105 format characters assigned (supplementary planes)
    131068 private use characters assigned (supplementary planes)
        32 noncharacter code points designated (supplementary planes)
    870467 reserved code points (supplementary planes)
    ------------------------------------------------------------------
    1114112 code points altogether
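
The arithmetic in the four tallies can be checked mechanically. The sketch below simply transcribes the graphic, format and reserved rows from this message (the class and variable names are my own) and verifies that each table fills the BMP and the supplementary planes exactly, then compares the 4.1 and 5.0 totals.

```python
from typing import NamedTuple

BMP_SIZE = 0x10000            # 65,536 code points
SUPP_SIZE = 16 * 0x10000      # 1,048,576 code points

# Rows that are identical in every tally quoted in this message.
FIXED_BMP = 65 + 6400 + 2048 + 34   # control, private use, surrogate, noncharacter
FIXED_SUPP = 131068 + 32            # private use, noncharacter


class Tally(NamedTuple):
    graphic_bmp: int
    format_bmp: int
    reserved_bmp: int
    graphic_supp: int
    format_supp: int   # 0 where the table has no such row
    reserved_supp: int

    def bmp_total(self) -> int:
        return self.graphic_bmp + self.format_bmp + FIXED_BMP + self.reserved_bmp

    def supp_total(self) -> int:
        return self.graphic_supp + self.format_supp + FIXED_SUPP + self.reserved_supp

    def assigned_bmp(self) -> int:
        return self.graphic_bmp + self.format_bmp

    def assigned_supp(self) -> int:
        return self.graphic_supp + self.format_supp


ken_41 = Tally(51644, 31, 5314, 45980, 0, 871496)
ken_50 = Tally(51986, 31, 4972, 47007, 0, 870469)
andrew_41 = Tally(51640, 35, 5314, 45875, 105, 871496)
andrew_50 = Tally(51980, 35, 4974, 46904, 105, 870467)

for name, t in [("Ken 4.1", ken_41), ("Ken 5.0", ken_50),
                ("Andrew 4.1", andrew_41), ("Andrew 5.0", andrew_50)]:
    ok = t.bmp_total() == BMP_SIZE and t.supp_total() == SUPP_SIZE
    print(f"{name}: BMP {t.bmp_total()}, supplementary {t.supp_total()}, consistent: {ok}")


def new_characters(old: Tally, new: Tally) -> int:
    # Graphic plus format characters added between two versions.
    return (new.assigned_bmp() + new.assigned_supp()) - (old.assigned_bmp() + old.assigned_supp())


print("New in 5.0 (Ken's figures):   ", new_characters(ken_41, ken_50))
print("New in 5.0 (Andrew's figures):", new_characters(andrew_41, andrew_50))
print("BMP difference, Andrew - Ken (5.0):          ", andrew_50.assigned_bmp() - ken_50.assigned_bmp())
print("Supplementary difference, Andrew - Ken (5.0):", andrew_50.assigned_supp() - ken_50.assigned_supp())
```

All four tables sum to 1114112, and both sets of figures give 1,369 new graphic and format characters in 5.0. The two tallies disagree only on where two of those characters fall.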

    Andrew


