Re: Does Unicode 4.1 change NFC?

From: Philippe VERDY (verdy_p@wanadoo.fr)
Date: Sun Apr 03 2005 - 15:57:24 CST

    "Marcin 'Qrczak' Kowalczyk" <qrczak@knm.org.pl>
    > I said *new* CJK compatibility ideographs. U+FA70..U+FAD9 were
    > unassigned in earlier versions of Unicode.

    I have just checked the new UCD, and you're right (but the previous message claiming that NFC was changed in 4.1 was wrong, or at least severely misleading, and I was not its author).

    So for reference, the new characters from the UCD are:

    # Newly assigned in Unicode 4.1.0 (XXX, 2005)

    0237..0241 ; 4.1 # [11] LATIN SMALL LETTER DOTLESS J..LATIN CAPITAL LETTER GLOTTAL STOP
    0358..035C ; 4.1 # [5] COMBINING DOT ABOVE RIGHT..COMBINING DOUBLE BREVE BELOW
    03FC..03FF ; 4.1 # [4] GREEK RHO WITH STROKE SYMBOL..GREEK CAPITAL REVERSED DOTTED LUNATE SIGMA SYMBOL
    04F6..04F7 ; 4.1 # [2] CYRILLIC CAPITAL LETTER GHE WITH DESCENDER..CYRILLIC SMALL LETTER GHE WITH DESCENDER
    05A2 ; 4.1 # HEBREW ACCENT ATNAH HAFUKH
    05C5..05C7 ; 4.1 # [3] HEBREW MARK LOWER DOT..HEBREW POINT QAMATS QATAN
    060B ; 4.1 # AFGHANI SIGN
    061E ; 4.1 # ARABIC TRIPLE DOT PUNCTUATION MARK
    0659..065E ; 4.1 # [6] ARABIC ZWARAKAY..ARABIC FATHA WITH TWO DOTS
    0750..076D ; 4.1 # [30] ARABIC LETTER BEH WITH THREE DOTS HORIZONTALLY BELOW..ARABIC LETTER SEEN WITH TWO DOTS VERTICALLY ABOVE
    097D ; 4.1 # DEVANAGARI LETTER GLOTTAL STOP
    09CE ; 4.1 # BENGALI LETTER KHANDA TA
    0BB6 ; 4.1 # TAMIL LETTER SHA
    0BE6 ; 4.1 # TAMIL DIGIT ZERO
    0FD0..0FD1 ; 4.1 # [2] TIBETAN MARK BSKA- SHOG GI MGO RGYAN..TIBETAN MARK MNYAM YIG GI MGO RGYAN
    10F9..10FA ; 4.1 # [2] GEORGIAN LETTER TURNED GAN..GEORGIAN LETTER AIN
    10FC ; 4.1 # MODIFIER LETTER GEORGIAN NAR
    1207 ; 4.1 # ETHIOPIC SYLLABLE HOA
    1247 ; 4.1 # ETHIOPIC SYLLABLE QOA
    1287 ; 4.1 # ETHIOPIC SYLLABLE XOA
    12AF ; 4.1 # ETHIOPIC SYLLABLE KOA
    12CF ; 4.1 # ETHIOPIC SYLLABLE WOA
    12EF ; 4.1 # ETHIOPIC SYLLABLE YOA
    130F ; 4.1 # ETHIOPIC SYLLABLE GOA
    131F ; 4.1 # ETHIOPIC SYLLABLE GGWAA
    1347 ; 4.1 # ETHIOPIC SYLLABLE TZOA
    135F..1360 ; 4.1 # [2] ETHIOPIC COMBINING GEMINATION MARK..ETHIOPIC SECTION MARK
    1380..1399 ; 4.1 # [26] ETHIOPIC SYLLABLE SEBATBEIT MWA..ETHIOPIC TONAL MARK KURT
    1980..19A9 ; 4.1 # [42] NEW TAI LUE LETTER HIGH QA..NEW TAI LUE LETTER LOW XVA
    19B0..19C9 ; 4.1 # [26] NEW TAI LUE VOWEL SIGN VOWEL SHORTENER..NEW TAI LUE TONE MARK-2
    19D0..19D9 ; 4.1 # [10] NEW TAI LUE DIGIT ZERO..NEW TAI LUE DIGIT NINE
    19DE..19DF ; 4.1 # [2] NEW TAI LUE SIGN LAE..NEW TAI LUE SIGN LAEV
    1A00..1A1B ; 4.1 # [28] BUGINESE LETTER KA..BUGINESE VOWEL SIGN AE
    1A1E..1A1F ; 4.1 # [2] BUGINESE PALLAWA..BUGINESE END OF SECTION
    1D6C..1DC3 ; 4.1 # [88] LATIN SMALL LETTER B WITH MIDDLE TILDE..COMBINING SUSPENSION MARK
    2055..2056 ; 4.1 # [2] FLOWER PUNCTUATION MARK..THREE DOT PUNCTUATION
    2058..205E ; 4.1 # [7] FOUR DOT PUNCTUATION..VERTICAL FOUR DOTS
    2090..2094 ; 4.1 # [5] LATIN SUBSCRIPT SMALL LETTER A..LATIN SUBSCRIPT SMALL LETTER SCHWA
    20B2..20B5 ; 4.1 # [4] GUARANI SIGN..CEDI SIGN
    20EB ; 4.1 # COMBINING LONG DOUBLE SOLIDUS OVERLAY
    213C ; 4.1 # DOUBLE-STRUCK SMALL PI
    214C ; 4.1 # PER SIGN
    23D1..23DB ; 4.1 # [11] METRICAL BREVE..FUSE
    2618 ; 4.1 # SHAMROCK
    267E..267F ; 4.1 # [2] PERMANENT PAPER SIGN..WHEELCHAIR SYMBOL
    2692..269C ; 4.1 # [11] HAMMER AND PICK..FLEUR-DE-LIS
    26A2..26B1 ; 4.1 # [16] DOUBLED FEMALE SIGN..FUNERAL URN
    27C0..27C6 ; 4.1 # [7] THREE DIMENSIONAL ANGLE..RIGHT S-SHAPED BAG DELIMITER
    2B0E..2B13 ; 4.1 # [6] RIGHTWARDS ARROW WITH TIP DOWNWARDS..SQUARE WITH BOTTOM HALF BLACK
    2C00..2C2E ; 4.1 # [47] GLAGOLITIC CAPITAL LETTER AZU..GLAGOLITIC CAPITAL LETTER LATINATE MYSLITE
    2C30..2C5E ; 4.1 # [47] GLAGOLITIC SMALL LETTER AZU..GLAGOLITIC SMALL LETTER LATINATE MYSLITE
    2C80..2CEA ; 4.1 # [107] COPTIC CAPITAL LETTER ALFA..COPTIC SYMBOL SHIMA SIMA
    2CF9..2D25 ; 4.1 # [45] COPTIC OLD NUBIAN FULL STOP..GEORGIAN SMALL LETTER HOE
    2D30..2D65 ; 4.1 # [54] TIFINAGH LETTER YA..TIFINAGH LETTER YAZZ
    2D6F ; 4.1 # TIFINAGH MODIFIER LETTER LABIALIZATION MARK
    2D80..2D96 ; 4.1 # [23] ETHIOPIC SYLLABLE LOA..ETHIOPIC SYLLABLE GGWE
    2DA0..2DA6 ; 4.1 # [7] ETHIOPIC SYLLABLE SSA..ETHIOPIC SYLLABLE SSO
    2DA8..2DAE ; 4.1 # [7] ETHIOPIC SYLLABLE CCA..ETHIOPIC SYLLABLE CCO
    2DB0..2DB6 ; 4.1 # [7] ETHIOPIC SYLLABLE ZZA..ETHIOPIC SYLLABLE ZZO
    2DB8..2DBE ; 4.1 # [7] ETHIOPIC SYLLABLE CCHA..ETHIOPIC SYLLABLE CCHO
    2DC0..2DC6 ; 4.1 # [7] ETHIOPIC SYLLABLE QYA..ETHIOPIC SYLLABLE QYO
    2DC8..2DCE ; 4.1 # [7] ETHIOPIC SYLLABLE KYA..ETHIOPIC SYLLABLE KYO
    2DD0..2DD6 ; 4.1 # [7] ETHIOPIC SYLLABLE XYA..ETHIOPIC SYLLABLE XYO
    2DD8..2DDE ; 4.1 # [7] ETHIOPIC SYLLABLE GYA..ETHIOPIC SYLLABLE GYO
    2E00..2E17 ; 4.1 # [24] RIGHT ANGLE SUBSTITUTION MARKER..DOUBLE OBLIQUE HYPHEN
    2E1C..2E1D ; 4.1 # [2] LEFT LOW PARAPHRASE BRACKET..RIGHT LOW PARAPHRASE BRACKET
    31C0..31CF ; 4.1 # [16] CJK STROKE T..CJK STROKE N
    327E ; 4.1 # CIRCLED HANGUL IEUNG U
    9FA6..9FBB ; 4.1 # [22] CJK UNIFIED IDEOGRAPH-9FA6..CJK UNIFIED IDEOGRAPH-9FBB
    A700..A716 ; 4.1 # [23] MODIFIER LETTER CHINESE TONE YIN PING..MODIFIER LETTER EXTRA-LOW LEFT-STEM TONE BAR
    A800..A82B ; 4.1 # [44] SYLOTI NAGRI LETTER A..SYLOTI NAGRI POETRY MARK-4
    FA70..FAD9 ; 4.1 # [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9
    FE10..FE19 ; 4.1 # [10] PRESENTATION FORM FOR VERTICAL COMMA..PRESENTATION FORM FOR VERTICAL HORIZONTAL ELLIPSIS
    10140..1018A ; 4.1 # [75] GREEK ACROPHONIC ATTIC ONE QUARTER..GREEK ZERO SIGN
    103A0..103C3 ; 4.1 # [36] OLD PERSIAN SIGN A..OLD PERSIAN SIGN HA
    103C8..103D5 ; 4.1 # [14] OLD PERSIAN SIGN AURAMAZDAA..OLD PERSIAN NUMBER HUNDRED
    10A00..10A03 ; 4.1 # [4] KHAROSHTHI LETTER A..KHAROSHTHI VOWEL SIGN VOCALIC R
    10A05..10A06 ; 4.1 # [2] KHAROSHTHI VOWEL SIGN E..KHAROSHTHI VOWEL SIGN O
    10A0C..10A13 ; 4.1 # [8] KHAROSHTHI VOWEL LENGTH MARK..KHAROSHTHI LETTER GHA
    10A15..10A17 ; 4.1 # [3] KHAROSHTHI LETTER CA..KHAROSHTHI LETTER JA
    10A19..10A33 ; 4.1 # [27] KHAROSHTHI LETTER NYA..KHAROSHTHI LETTER TTTHA
    10A38..10A3A ; 4.1 # [3] KHAROSHTHI SIGN BAR ABOVE..KHAROSHTHI SIGN DOT BELOW
    10A3F..10A47 ; 4.1 # [9] KHAROSHTHI VIRAMA..KHAROSHTHI NUMBER ONE THOUSAND
    10A50..10A58 ; 4.1 # [9] KHAROSHTHI PUNCTUATION DOT..KHAROSHTHI PUNCTUATION LINES
    1D200..1D245 ; 4.1 # [70] GREEK VOCAL NOTATION SYMBOL-1..GREEK MUSICAL LEIMMA
    1D6A4..1D6A5 ; 4.1 # [2] MATHEMATICAL ITALIC SMALL DOTLESS I..MATHEMATICAL ITALIC SMALL DOTLESS J

    # Total code points: 1273
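
    For anyone who wants to re-check the per-range counts or the total, here is a minimal Python sketch that parses lines in the format shown above. The names parse_age_lines and count_code_points are hypothetical helpers of mine, not part of the UCD or of any library, and the sample contains only three lines copied from the list, so its count is for that sample only.

```python
import re

# Matches "XXXX ; age" or "XXXX..YYYY ; age" at the start of a line.
# Lines that start with "#" (comments) and blank lines do not match.
LINE_RE = re.compile(
    r"^\s*([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\S+)"
)


def parse_age_lines(text):
    """Return a list of (first, last, age) tuples, one per data line."""
    ranges = []
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m is None:
            continue  # comment or blank line
        first = int(m.group(1), 16)
        # A single code point is treated as a range of length one.
        last = int(m.group(2), 16) if m.group(2) else first
        ranges.append((first, last, m.group(3)))
    return ranges


def count_code_points(ranges):
    """Sum the sizes of all (inclusive) ranges."""
    return sum(last - first + 1 for first, last, _age in ranges)


# Three lines copied from the list above; NOT the full 4.1 data.
SAMPLE = """
# Newly assigned in Unicode 4.1.0
0237..0241 ; 4.1 # [11] LATIN SMALL LETTER DOTLESS J..LATIN CAPITAL LETTER GLOTTAL STOP
05A2 ; 4.1 # HEBREW ACCENT ATNAH HAFUKH
FA70..FAD9 ; 4.1 # [106] CJK COMPATIBILITY IDEOGRAPH-FA70..CJK COMPATIBILITY IDEOGRAPH-FAD9
"""

if __name__ == "__main__":
    sample_ranges = parse_age_lines(SAMPLE)
    print(count_code_points(sample_ranges))
```

    Feeding the complete list to parse_age_lines instead of SAMPLE gives a total that can be compared with the "Total code points" line.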

    Yes, a new normalizer is needed, but only for newly encoded, *conforming* documents that include these code points.
    For all other documents, the previous normalizer can still be used interchangeably.
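
    As a rough illustration of that test, the Python sketch below checks whether a string contains any code point from a table of ranges newly assigned in 4.1. The name uses_new_in_41 and the table NEW_IN_41_SUBSET are assumptions of mine for illustration; the table holds only five of the ranges listed above, so a real check would need the full list.

```python
# A SUBSET of the ranges newly assigned in Unicode 4.1, copied from the
# list above. Illustrative only: a real check needs every range.
NEW_IN_41_SUBSET = [
    (0x0237, 0x0241),  # LATIN SMALL LETTER DOTLESS J..LATIN CAPITAL LETTER GLOTTAL STOP
    (0x0358, 0x035C),  # COMBINING DOT ABOVE RIGHT..COMBINING DOUBLE BREVE BELOW
    (0x1D6C, 0x1DC3),  # LATIN SMALL LETTER B WITH MIDDLE TILDE..COMBINING SUSPENSION MARK
    (0x2090, 0x2094),  # LATIN SUBSCRIPT SMALL LETTER A..LATIN SUBSCRIPT SMALL LETTER SCHWA
    (0xFA70, 0xFAD9),  # CJK COMPATIBILITY IDEOGRAPH-FA70..FAD9
]


def uses_new_in_41(text, ranges=NEW_IN_41_SUBSET):
    """Return True if any character of `text` falls in one of `ranges`.

    If this returns False for the full 4.1 table, the text contains no
    code point that was unassigned before 4.1, which is the case where
    the older normalizer is described above as still usable.
    """
    for ch in text:
        cp = ord(ch)
        if any(first <= cp <= last for first, last in ranges):
            return True
    return False


if __name__ == "__main__":
    print(uses_new_in_41("plain ASCII text"))
    print(uses_new_in_41("dotless j: \u0237"))
    print(uses_new_in_41("new compatibility ideograph: \uFA70"))
```

    A linear scan over the table is enough for a handful of ranges; with the full list, a sorted table and a binary search would be the more usual choice.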

    I am particularly, and immediately, interested in the following new code points for Latin (all in the BMP):
    0237..0241 ; 4.1 # [11] LATIN SMALL LETTER DOTLESS J..LATIN CAPITAL LETTER GLOTTAL STOP
    0358..035C ; 4.1 # [5] COMBINING DOT ABOVE RIGHT..COMBINING DOUBLE BREVE BELOW
    1D6C..1DC3 ; 4.1 # [88] LATIN SMALL LETTER B WITH MIDDLE TILDE..COMBINING SUSPENSION MARK
    2090..2094 ; 4.1 # [5] LATIN SUBSCRIPT SMALL LETTER A..LATIN SUBSCRIPT SMALL LETTER SCHWA
    (and I think these new characters will interest many people)...


