L2/04-083
The Script property for 4.0 characters
Eric Muller, Adobe Systems Inc.
February 2, 2004
In looking at the script property for the combining
characters, I noticed a couple of strange things:
-
The variation selectors in the BMP, U+FE00..U+FE0F, have
the script INHERITED, while those in plane 14, U+E0100..U+E01EF,
have the script COMMON.
-
The Indic SIGN NUKTA characters have their corresponding
script (e.g., U+093C DEVANAGARI SIGN NUKTA has the script
DEVANAGARI), except for U+0CBC KANNADA SIGN NUKTA, which has the
script COMMON.
Looking a bit more carefully, I noticed the following
pattern: all the combining characters with the script COMMON are new
in Unicode 4.0; conversely, among the combining characters new in
4.0, the Gujarati and Limbu ones have their respective scripts while
all the others have the COMMON script.
Add the fact that COMMON is the script for characters not
listed explicitly in Scripts.txt, and I believe that we essentially
forgot to assign the script property for most 4.0 combining
characters. Checking a bit further, I believe that statement extends
to base characters as well (although not every instance of COMMON is
highly suspicious for those).
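The defaulting rule mentioned above (a character not listed in Scripts.txt gets COMMON) can be sketched as follows. This is only an illustration: the function names `parse_ranges` and `script_of` are hypothetical, and `SAMPLE` is a made-up two-line excerpt in the Scripts.txt layout that mirrors the observations in this document, not a copy of the real file.

```python
import re

# Matches "XXXX ; VALUE" or "XXXX..YYYY ; VALUE" (comments are stripped first).
LINE_RE = re.compile(r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)")

def parse_ranges(text):
    """Return a list of (first, last, value) tuples from UCD-style text."""
    ranges = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the trailing comment
        m = LINE_RE.match(line)
        if m:
            first = int(m.group(1), 16)
            last = int(m.group(2), 16) if m.group(2) else first
            ranges.append((first, last, m.group(3).upper()))
    return ranges

def script_of(cp, ranges, default="COMMON"):
    """Look up a code point; anything not listed falls back to the default."""
    for first, last, value in ranges:
        if first <= cp <= last:
            return value
    return default

# Hypothetical excerpt, for illustration only.
SAMPLE = """\
093C          ; DEVANAGARI # Mn       DEVANAGARI SIGN NUKTA
FE00..FE0F    ; INHERITED # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
"""

ranges = parse_ranges(SAMPLE)
print(script_of(0x093C, ranges))   # DEVANAGARI
print(script_of(0xFE00, ranges))   # INHERITED
print(script_of(0x0CBC, ranges))   # COMMON (not listed, so the default applies)
print(script_of(0xE0100, ranges))  # COMMON (not listed, so the default applies)
```

With this behavior, a forgotten assignment is indistinguishable from a deliberate COMMON, which is why the omission is easy to miss.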
I don’t know if we forgot to do the work, or if we
forgot or lost the update to Scripts.txt. It may be worth tracking
what did or did not happen, so as to fix our process.
Here is an attempt to repair this. These are all the 4.0
characters with the COMMON script, together with a proposed change
if needed. I based the proposed assignments on similarity with
pre-4.0 characters, as noted.
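One way such a list could be derived mechanically is to cross-reference the Age property with the Script property. The sketch below assumes both inputs use the UCD "range ; value" layout (as in DerivedAge.txt and Scripts.txt); the helper names and the small sample strings are hypothetical and serve only to show the technique, not the actual file contents.

```python
def parse_property(text):
    """Map each listed code point to its value (UCD 'range ; value' format)."""
    table = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0]  # drop the trailing comment
        if ";" not in data:
            continue
        span, value = (field.strip() for field in data.split(";", 1))
        first, _, last = span.partition("..")
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            table[cp] = value
    return table

def new_common_characters(scripts_text, ages_text, version="4.0"):
    """Code points introduced in `version` whose script is COMMON,
    either explicitly or by default because they are not listed."""
    scripts = parse_property(scripts_text)
    ages = parse_property(ages_text)
    return sorted(
        cp for cp, age in ages.items()
        if age == version and scripts.get(cp, "COMMON").upper() == "COMMON"
    )

# Hypothetical excerpts, for illustration only.
SCRIPTS_SAMPLE = """\
093C        ; DEVANAGARI
FE00..FE0F  ; INHERITED
"""
AGES_SAMPLE = """\
093C         ; 1.1
FE00..FE0F   ; 3.2
0350..0357   ; 4.0
0CBC         ; 4.0
E0100..E01EF ; 4.0
"""

result = new_common_characters(SCRIPTS_SAMPLE, AGES_SAMPLE)
print(len(result))                       # 249
print(f"{result[0]:04X}..{result[-1]:04X}")  # 0350..E01EF
```

Run against the complete data files, the same two functions would list every candidate for review rather than relying on a manual scan.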
-
02EF Sk MODIFIER LETTER LOW DOWN ARROWHEAD
02F0 Sk MODIFIER LETTER LOW UP ARROWHEAD
02F1 Sk MODIFIER LETTER LOW LEFT ARROWHEAD
02F2 Sk MODIFIER LETTER LOW RIGHT ARROWHEAD
02F3 Sk MODIFIER LETTER LOW RING
02F4 Sk MODIFIER LETTER MIDDLE GRAVE ACCENT
02F5 Sk MODIFIER LETTER MIDDLE DOUBLE GRAVE ACCENT
02F6 Sk MODIFIER LETTER MIDDLE DOUBLE ACUTE ACCENT
02F7 Sk MODIFIER LETTER LOW TILDE
02F8 Sk MODIFIER LETTER RAISED COLON
02F9 Sk MODIFIER LETTER BEGIN HIGH TONE
02FA Sk MODIFIER LETTER END HIGH TONE
02FB Sk MODIFIER LETTER BEGIN LOW TONE
02FC Sk MODIFIER LETTER END LOW TONE
02FD Sk MODIFIER LETTER SHELF
02FE Sk MODIFIER LETTER OPEN SHELF
02FF Sk MODIFIER LETTER LOW LEFT ARROW
Those are probably OK; they match the assignments for U+02B9..U+02DF.
-
0350 Mn COMBINING RIGHT ARROWHEAD ABOVE
0351 Mn COMBINING LEFT HALF RING ABOVE
0352 Mn COMBINING FERMATA
0353 Mn COMBINING X BELOW
0354 Mn COMBINING LEFT ARROWHEAD BELOW
0355 Mn COMBINING RIGHT ARROWHEAD BELOW
0356 Mn COMBINING RIGHT ARROWHEAD AND UP ARROWHEAD BELOW
0357 Mn COMBINING RIGHT HALF RING ABOVE
035D Mn COMBINING DOUBLE BREVE
035E Mn COMBINING DOUBLE MACRON
035F Mn COMBINING DOUBLE MACRON BELOW
INHERITED, to match the other characters in the Combining Diacritical Marks block.
-
0600 Cf ARABIC NUMBER SIGN
0601 Cf ARABIC SIGN SANAH
0602 Cf ARABIC FOOTNOTE MARKER
0603 Cf ARABIC SIGN SAFHA
INHERITED, to match U+06DD ARABIC END OF AYAH, the only pre-4.0
Cf character in the Arabic block.
-
060D Po ARABIC DATE SEPARATOR
060E So ARABIC POETIC VERSE SIGN
060F So ARABIC SIGN MISRA
COMMON is probably ok, to match the other Po and So in the Arabic block.
-
0610 Mn ARABIC SIGN SALLALLAHOU ALAYHE WASSALLAM
0611 Mn ARABIC SIGN ALAYHE ASSALLAM
0612 Mn ARABIC SIGN RAHMATULLAH ALAYHE
0613 Mn ARABIC SIGN RADI ALLAHOU ANHU
0614 Mn ARABIC SIGN TAKHALLUS
0615 Mn ARABIC SMALL HIGH TAH
0656 Mn ARABIC SUBSCRIPT ALEF
0657 Mn ARABIC INVERTED DAMMA
0658 Mn ARABIC MARK NOON GHUNNA
INHERITED to match the other combining characters in the Arabic block.
-
0A01 Mn GURMUKHI SIGN ADAK BINDI
GURMUKHI to match the other Gurmukhi combining characters.
-
0AF1 Sc GUJARATI RUPEE SIGN
COMMON to match U+09F3 BENGALI RUPEE SIGN
-
0BF9 Sc TAMIL RUPEE SIGN
COMMON to match U+09F3 BENGALI RUPEE SIGN
-
0BF3 So TAMIL DAY SIGN
0BF4 So TAMIL MONTH SIGN
0BF5 So TAMIL YEAR SIGN
0BF6 So TAMIL DEBIT SIGN
0BF7 So TAMIL CREDIT SIGN
0BF8 So TAMIL AS ABOVE SIGN
0BFA So TAMIL NUMBER SIGN
Not sure.
-
0CBC Mn KANNADA SIGN NUKTA
KANNADA, to match the other Indic nuktas
-
17DD Mn KHMER SIGN ATTHACAN
KHMER to match the other Khmer combining characters
-
17F0 No KHMER SYMBOL LEK ATTAK SON
17F1 No KHMER SYMBOL LEK ATTAK MUOY
17F2 No KHMER SYMBOL LEK ATTAK PII
17F3 No KHMER SYMBOL LEK ATTAK BEI
17F4 No KHMER SYMBOL LEK ATTAK BUON
17F5 No KHMER SYMBOL LEK ATTAK PRAM
17F6 No KHMER SYMBOL LEK ATTAK PRAM-MUOY
17F7 No KHMER SYMBOL LEK ATTAK PRAM-PII
17F8 No KHMER SYMBOL LEK ATTAK PRAM-BEI
17F9 No KHMER SYMBOL LEK ATTAK PRAM-BUON
COMMON to match U+17D7 KHMER SIGN LEK TOO.
-
1940 So LIMBU SIGN LOO
Not sure
-
1944 Po LIMBU EXCLAMATION MARK
1945 Po LIMBU QUESTION MARK
COMMON to match the other xxx QUESTION MARK and xxx EXCLAMATION MARK characters
-
19E0 So KHMER SYMBOL PATHAMASAT
19E1 So KHMER SYMBOL MUOY KOET
19E2 So KHMER SYMBOL PII KOET
19E3 So KHMER SYMBOL BEI KOET
19E4 So KHMER SYMBOL BUON KOET
19E5 So KHMER SYMBOL PRAM KOET
19E6 So KHMER SYMBOL PRAM-MUOY KOET
19E7 So KHMER SYMBOL PRAM-PII KOET
19E8 So KHMER SYMBOL PRAM-BEI KOET
19E9 So KHMER SYMBOL PRAM-BUON KOET
19EA So KHMER SYMBOL DAP KOET
19EB So KHMER SYMBOL DAP-MUOY KOET
19EC So KHMER SYMBOL DAP-PII KOET
19ED So KHMER SYMBOL DAP-BEI KOET
19EE So KHMER SYMBOL DAP-BUON KOET
19EF So KHMER SYMBOL DAP-PRAM KOET
19F0 So KHMER SYMBOL TUTEYASAT
19F1 So KHMER SYMBOL MUOY ROC
19F2 So KHMER SYMBOL PII ROC
19F3 So KHMER SYMBOL BEI ROC
19F4 So KHMER SYMBOL BUON ROC
19F5 So KHMER SYMBOL PRAM ROC
19F6 So KHMER SYMBOL PRAM-MUOY ROC
19F7 So KHMER SYMBOL PRAM-PII ROC
19F8 So KHMER SYMBOL PRAM-BEI ROC
19F9 So KHMER SYMBOL PRAM-BUON ROC
19FA So KHMER SYMBOL DAP ROC
19FB So KHMER SYMBOL DAP-MUOY ROC
19FC So KHMER SYMBOL DAP-PII ROC
19FD So KHMER SYMBOL DAP-BEI ROC
19FE So KHMER SYMBOL DAP-BUON ROC
19FF So KHMER SYMBOL DAP-PRAM ROC
Not sure
-
2053 Po SWUNG DASH
2054 Pc INVERTED UNDERTIE
213B So FACSIMILE SIGN
23CF So EJECT SYMBOL
23D0 So VERTICAL LINE EXTENSION
24FF No NEGATIVE CIRCLED DIGIT ZERO
2614 So UMBRELLA WITH RAIN DROPS
2615 So HOT BEVERAGE
COMMON is fine
-
268A So MONOGRAM FOR YANG
268B So MONOGRAM FOR YIN
268C So DIGRAM FOR GREATER YANG
268D So DIGRAM FOR LESSER YIN
268E So DIGRAM FOR LESSER YANG
268F So DIGRAM FOR GREATER YIN
Not sure
-
2690 So WHITE FLAG
2691 So BLACK FLAG
26A0 So WARNING SIGN
26A1 So HIGH VOLTAGE SIGN
2B00 So NORTH EAST WHITE ARROW
2B01 So NORTH WEST WHITE ARROW
2B02 So SOUTH EAST WHITE ARROW
2B03 So SOUTH WEST WHITE ARROW
2B04 So LEFT RIGHT WHITE ARROW
2B05 So LEFTWARDS BLACK ARROW
2B06 So UPWARDS BLACK ARROW
2B07 So DOWNWARDS BLACK ARROW
2B08 So NORTH EAST BLACK ARROW
2B09 So NORTH WEST BLACK ARROW
2B0A So SOUTH EAST BLACK ARROW
2B0B So SOUTH WEST BLACK ARROW
2B0C So LEFT RIGHT BLACK ARROW
2B0D So UP DOWN BLACK ARROW
COMMON is fine
-
321D So PARENTHESIZED KOREAN CHARACTER OJEON
321E So PARENTHESIZED KOREAN CHARACTER O HU
3250 So PARTNERSHIP SIGN
327C So CIRCLED KOREAN CHARACTER CHAMKO
327D So CIRCLED KOREAN CHARACTER JUEUI
32CC So SQUARE HG
32CD So SQUARE ERG
32CE So SQUARE EV
32CF So LIMITED LIABILITY SIGN
COMMON matches all the other characters in that block.
-
3377 So SQUARE DM
3378 So SQUARE DM SQUARED
3379 So SQUARE DM CUBED
337A So SQUARE IU
33DE So SQUARE V OVER M
33DF So SQUARE A OVER M
33FF So SQUARE GAL
COMMON matches all the other characters in that block.
-
4DC0 So HEXAGRAM FOR THE CREATIVE HEAVEN
...
4DFF So HEXAGRAM FOR BEFORE COMPLETION
New script?
-
FDFD So ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM
ARABIC, to match the other U+FDFx ligatures (except U+FDFC
RIAL SIGN, which is COMMON like the other currencies)
-
FE47 Ps PRESENTATION FORM FOR VERTICAL LEFT SQUARE BRACKET
FE48 Pe PRESENTATION FORM FOR VERTICAL RIGHT SQUARE BRACKET
COMMON matches the rest of the block
-
10100 Po AEGEAN WORD SEPARATOR LINE
...
1013F So AEGEAN MEASURE THIRD SUBUNIT
New script?
-
1039F Po UGARITIC WORD DIVIDER
COMMON matches other punctuation.
-
1D300 So MONOGRAM FOR EARTH
...
1D356 So TETRAGRAM FOR FOSTERING
New script?
-
1D4C1 Ll MATHEMATICAL SCRIPT SMALL L
COMMON matches the other MATHEMATICAL characters.
-
E0100 Mn VARIATION SELECTOR-17
...
E01EF Mn VARIATION SELECTOR-256
INHERITED, to match the BMP VARIATION SELECTOR-xx characters
Assuming that the overall assessment is correct, and given the
scope of the changes, I would strongly recommend an independent
verification (e.g., to make sure I did not drop some 4.0 character
in preparing this document).
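A verification of this kind could be as simple as a set comparison between the code points covered by the proposal and the code points expected from the data files. The sketch below is hypothetical throughout: the function names are invented, and the "proposal" ranges deliberately omit U+035F to simulate a dropped character.

```python
def expand(ranges):
    """Expand inclusive (first, last) pairs into a set of code points."""
    covered = set()
    for first, last in ranges:
        covered.update(range(first, last + 1))
    return covered

def coverage_report(proposal_ranges, expected):
    """Return (missing, extra): expected code points the proposal does not
    cover, and covered code points that were not expected."""
    covered = expand(proposal_ranges)
    expected = set(expected)
    return sorted(expected - covered), sorted(covered - expected)

# Expected: two of the groups listed in this document.
expected = expand([(0x0350, 0x0357), (0x035D, 0x035F)])

# Proposal with a simulated omission (U+035F left out on purpose).
proposal = [(0x0350, 0x0357), (0x035D, 0x035E)]

missing, extra = coverage_report(proposal, expected)
print([f"U+{cp:04X}" for cp in missing])  # ['U+035F']
print(extra)                              # []
```

An empty `missing` list and an empty `extra` list would indicate that the proposal and the expected set agree.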
Document History
Author: Eric Muller
Revision | Date | Comments |
1 | February 2, 2004 | First version |