L2/09-029 R2

Script Edge Cases

Mark Davis, 2009-01-30 (R2)
Live doc: https://docs.google.com/Doc?id=dfqr8rd5_371fzqg88g8

In most cases, the assignment of scripts works quite well. There are, however, some edge cases that people may stumble over (as I did). I suggest that we document such cases in a new section of TR24.

In particular, the characters that are associated with multiple scripts may need to be grouped with each of the scripts in particular applications. (See, for example, the mockup at http://macchiato.com/picker/MyApplication.html.)

The following is a suggested list of characters for such a section, with guesses as to how to document them. (I give some suggested property changes, but even if we don't make any, I think it is important to document the situation in one place.) The characters are all listed here for discussion; for the actual text they could be represented more compactly by ranges.

The tentative values I have are marked with @..., just to make it easy to extract the information in tools.

Note: if a PDF is in the doc registry, an HTML version should be there also, so that the links work. In the notes below, "explicit script" means a script other than Common or Inherited.
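
As a rough illustration of the kind of tool extraction the @ markers are meant to allow, here is a minimal Python sketch. It assumes that each @ line applies to every following "U+XXXX" line until the next @ line; the function names, regular expressions, and sample text are hypothetical and are not part of any existing tool.

```python
import re
from collections import defaultdict

# Assumed line shapes: "@Value1, Value2" and "U+XXXX ( c ) CHARACTER NAME".
TAG = re.compile(r"^@(.+)$")
CODE_POINT = re.compile(r"^U\+([0-9A-F]{4,6})\b")


def extract_tentative_values(text):
    """Map each code point to the values of the nearest preceding @ line.

    If a code point is listed more than once, the last listing wins.
    """
    values = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        tag = TAG.match(line)
        if tag:
            current = [v.strip() for v in tag.group(1).split(",")]
            continue
        cp = CODE_POINT.match(line)
        if cp and current is not None:
            values[int(cp.group(1), 16)] = current
    return values


def group_by_value(values):
    """Invert the mapping so a character is grouped with each of its values."""
    groups = defaultdict(list)
    for cp, tags in sorted(values.items()):
        for tag in tags:
            groups[tag].append(cp)
    return dict(groups)


SAMPLE = """\
@Arabic, Syriac, Thaana

Arabic - Punctuation

U+060C ( ، ) ARABIC COMMA

@Runic

Runic - Punctuation

U+16EB ( ᛫ ) RUNIC SINGLE PUNCTUATION
"""

values = extract_tentative_values(SAMPLE)
for value, cps in group_by_value(values).items():
    print(value, [f"U+{cp:04X}" for cp in cps])
```

The inverted mapping is the shape an application such as a character picker might want: a character tagged with several scripts shows up under each of them.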



@Latin

(inc. Phonetic alphabets)

Am guessing that these are functionally Latin script (including phonetic alphabets like IPA, UPA). Are they used with other scripts? Cyrillic? Greek?

Basic Latin - ASCII punctuation and symbols

U+005E ( ^ ) CIRCUMFLEX ACCENT
U+0060 ( ` ) GRAVE ACCENT

Latin 1 Supplement - Latin-1 punctuation and symbols

U+00A8 ( ¨ ) DIAERESIS
U+00AF ( ¯ ) MACRON
U+00B4 ( ´ ) ACUTE ACCENT
U+00B8 ( ¸ ) CEDILLA

Spacing Modifier Letters - Miscellaneous phonetic modifiers

U+02B9 ( ʹ ) MODIFIER LETTER PRIME
U+02BA ( ʺ ) MODIFIER LETTER DOUBLE PRIME
U+02BB ( ʻ ) MODIFIER LETTER TURNED COMMA
U+02BC ( ʼ ) MODIFIER LETTER APOSTROPHE
U+02BD ( ʽ ) MODIFIER LETTER REVERSED COMMA
U+02BE ( ʾ ) MODIFIER LETTER RIGHT HALF RING
U+02BF ( ʿ ) MODIFIER LETTER LEFT HALF RING
U+02C0 ( ˀ ) MODIFIER LETTER GLOTTAL STOP
U+02C1 ( ˁ ) MODIFIER LETTER REVERSED GLOTTAL STOP
U+02C2 ( ˂ ) MODIFIER LETTER LEFT ARROWHEAD
U+02C3 ( ˃ ) MODIFIER LETTER RIGHT ARROWHEAD
U+02C4 ( ˄ ) MODIFIER LETTER UP ARROWHEAD
U+02C5 ( ˅ ) MODIFIER LETTER DOWN ARROWHEAD
U+02C6 ( ˆ ) MODIFIER LETTER CIRCUMFLEX ACCENT
U+02C7 ( ˇ ) CARON
U+02C8 ( ˈ ) MODIFIER LETTER VERTICAL LINE
U+02C9 ( ˉ ) MODIFIER LETTER MACRON
U+02CA ( ˊ ) MODIFIER LETTER ACUTE ACCENT
U+02CB ( ˋ ) MODIFIER LETTER GRAVE ACCENT
U+02CC ( ˌ ) MODIFIER LETTER LOW VERTICAL LINE
U+02CD ( ˍ ) MODIFIER LETTER LOW MACRON
U+02CE ( ˎ ) MODIFIER LETTER LOW GRAVE ACCENT
U+02CF ( ˏ ) MODIFIER LETTER LOW ACUTE ACCENT
U+02D0 ( ː ) MODIFIER LETTER TRIANGULAR COLON
U+02D1 ( ˑ ) MODIFIER LETTER HALF TRIANGULAR COLON
U+02D2 ( ˒ ) MODIFIER LETTER CENTRED RIGHT HALF RING
U+02D3 ( ˓ ) MODIFIER LETTER CENTRED LEFT HALF RING
U+02D4 ( ˔ ) MODIFIER LETTER UP TACK
U+02D5 ( ˕ ) MODIFIER LETTER DOWN TACK
U+02D6 ( ˖ ) MODIFIER LETTER PLUS SIGN
U+02D7 ( ˗ ) MODIFIER LETTER MINUS SIGN

Spacing Modifier Letters - Spacing clones of diacritics

U+02D8 ( ˘ ) BREVE
U+02D9 ( ˙ ) DOT ABOVE
U+02DA ( ˚ ) RING ABOVE
U+02DB ( ˛ ) OGONEK
U+02DC ( ˜ ) SMALL TILDE
U+02DD ( ˝ ) DOUBLE ACUTE ACCENT

Spacing Modifier Letters - Additions based on 1989 IPA

U+02DE ( ˞ ) MODIFIER LETTER RHOTIC HOOK
U+02DF ( ˟ ) MODIFIER LETTER CROSS ACCENT

Spacing Modifier Letters - Tone letters

U+02E5 ( ˥ ) MODIFIER LETTER EXTRA-HIGH TONE BAR
U+02E6 ( ˦ ) MODIFIER LETTER HIGH TONE BAR
U+02E7 ( ˧ ) MODIFIER LETTER MID TONE BAR
U+02E8 ( ˨ ) MODIFIER LETTER LOW TONE BAR
U+02E9 ( ˩ ) MODIFIER LETTER EXTRA-LOW TONE BAR

Spacing Modifier Letters - IPA modifiers

U+02EC ( ˬ ) MODIFIER LETTER VOICING
U+02ED ( ˭ ) MODIFIER LETTER UNASPIRATED

Spacing Modifier Letters - Other modifier letter

U+02EE ( ˮ ) MODIFIER LETTER DOUBLE APOSTROPHE

(The following set appears to be for use in Latin/IPA according to WG2 docs)

Modifier Tone Letters - Corner tone marks for Chinese

U+A700 ( ꜀ ) MODIFIER LETTER CHINESE TONE YIN PING
U+A701 ( ꜁ ) MODIFIER LETTER CHINESE TONE YANG PING
U+A702 ( ꜂ ) MODIFIER LETTER CHINESE TONE YIN SHANG
U+A703 ( ꜃ ) MODIFIER LETTER CHINESE TONE YANG SHANG
U+A704 ( ꜄ ) MODIFIER LETTER CHINESE TONE YIN QU
U+A705 ( ꜅ ) MODIFIER LETTER CHINESE TONE YANG QU
U+A706 ( ꜆ ) MODIFIER LETTER CHINESE TONE YIN RU
U+A707 ( ꜇ ) MODIFIER LETTER CHINESE TONE YANG RU

Modifier Tone Letters - Dotted tone letters

U+A708 ( ꜈ ) MODIFIER LETTER EXTRA-HIGH DOTTED TONE BAR
U+A709 ( ꜉ ) MODIFIER LETTER HIGH DOTTED TONE BAR
U+A70A ( ꜊ ) MODIFIER LETTER MID DOTTED TONE BAR
U+A70B ( ꜋ ) MODIFIER LETTER LOW DOTTED TONE BAR
U+A70C ( ꜌ ) MODIFIER LETTER EXTRA-LOW DOTTED TONE BAR
U+A70D ( ꜍ ) MODIFIER LETTER EXTRA-HIGH DOTTED LEFT-STEM TONE BAR
U+A70E ( ꜎ ) MODIFIER LETTER HIGH DOTTED LEFT-STEM TONE BAR
U+A70F ( ꜏ ) MODIFIER LETTER MID DOTTED LEFT-STEM TONE BAR
U+A710 ( ꜐ ) MODIFIER LETTER LOW DOTTED LEFT-STEM TONE BAR
U+A711 ( ꜑ ) MODIFIER LETTER EXTRA-LOW DOTTED LEFT-STEM TONE BAR

Modifier Tone Letters - Left-stem tone letters

U+A712 ( ꜒ ) MODIFIER LETTER EXTRA-HIGH LEFT-STEM TONE BAR
U+A713 ( ꜓ ) MODIFIER LETTER HIGH LEFT-STEM TONE BAR
U+A714 ( ꜔ ) MODIFIER LETTER MID LEFT-STEM TONE BAR
U+A715 ( ꜕ ) MODIFIER LETTER LOW LEFT-STEM TONE BAR
U+A716 ( ꜖ ) MODIFIER LETTER EXTRA-LOW LEFT-STEM TONE BAR

Modifier Tone Letters - Chinantec tone marks

U+A717 ( ꜗ ) MODIFIER LETTER DOT VERTICAL BAR
U+A718 ( ꜘ ) MODIFIER LETTER DOT SLASH
U+A719 ( ꜙ ) MODIFIER LETTER DOT HORIZONTAL BAR
U+A71A ( ꜚ ) MODIFIER LETTER LOWER RIGHT CORNER ANGLE

Modifier Tone Letters - Africanist tone letters

U+A71B ( ꜛ ) MODIFIER LETTER RAISED UP ARROW
U+A71C ( ꜜ ) MODIFIER LETTER RAISED DOWN ARROW
U+A71D ( ꜝ ) MODIFIER LETTER RAISED EXCLAMATION MARK
U+A71E ( ꜞ ) MODIFIER LETTER RAISED INVERTED EXCLAMATION MARK
U+A71F ( ꜟ ) MODIFIER LETTER LOW INVERTED EXCLAMATION MARK

Latin Extended D - Modifier letters

U+A788 ( ꞈ ) MODIFIER LETTER LOW CIRCUMFLEX ACCENT
U+A789 ( ꞉ ) MODIFIER LETTER COLON
U+A78A ( ꞊ ) MODIFIER LETTER SHORT EQUALS SIGN

Spacing Modifier Letters - UPA modifiers

U+02EF ( ˯ ) MODIFIER LETTER LOW DOWN ARROWHEAD
U+02F0 ( ˰ ) MODIFIER LETTER LOW UP ARROWHEAD
U+02F1 ( ˱ ) MODIFIER LETTER LOW LEFT ARROWHEAD
U+02F2 ( ˲ ) MODIFIER LETTER LOW RIGHT ARROWHEAD
U+02F3 ( ˳ ) MODIFIER LETTER LOW RING
U+02F4 ( ˴ ) MODIFIER LETTER MIDDLE GRAVE ACCENT
U+02F5 ( ˵ ) MODIFIER LETTER MIDDLE DOUBLE GRAVE ACCENT
U+02F6 ( ˶ ) MODIFIER LETTER MIDDLE DOUBLE ACUTE ACCENT
U+02F7 ( ˷ ) MODIFIER LETTER LOW TILDE
U+02F8 ( ˸ ) MODIFIER LETTER RAISED COLON
U+02F9 ( ˹ ) MODIFIER LETTER BEGIN HIGH TONE
U+02FA ( ˺ ) MODIFIER LETTER END HIGH TONE
U+02FB ( ˻ ) MODIFIER LETTER BEGIN LOW TONE
U+02FC ( ˼ ) MODIFIER LETTER END LOW TONE
U+02FD ( ˽ ) MODIFIER LETTER SHELF
U+02FE ( ˾ ) MODIFIER LETTER OPEN SHELF
U+02FF ( ˿ ) MODIFIER LETTER LOW LEFT ARROW

Latin Extended D - Additions for UPA

U+A720 ( ꜠ ) MODIFIER LETTER STRESS AND HIGH TONE
U+A721 ( ꜡ ) MODIFIER LETTER STRESS AND LOW TONE

@Latin, Greek, Cyrillic

While the following have the form of Greek or Cyrillic letters, they are functionally Latin/Phonetic, which should be noted.

Phonetic Extensions - Greek letters

U+1D26 ( ᴦ ) GREEK LETTER SMALL CAPITAL GAMMA
U+1D27 ( ᴧ ) GREEK LETTER SMALL CAPITAL LAMDA
U+1D28 ( ᴨ ) GREEK LETTER SMALL CAPITAL PI
U+1D29 ( ᴩ ) GREEK LETTER SMALL CAPITAL RHO
U+1D2A ( ᴪ ) GREEK LETTER SMALL CAPITAL PSI

Phonetic Extensions - Cyrillic letter

U+1D2B ( ᴫ ) CYRILLIC LETTER SMALL CAPITAL EL

Phonetic Extensions - Greek superscript modifier letters

U+1D5D ( ᵝ ) MODIFIER LETTER SMALL BETA
U+1D5E ( ᵞ ) MODIFIER LETTER SMALL GREEK GAMMA
U+1D5F ( ᵟ ) MODIFIER LETTER SMALL DELTA
U+1D60 ( ᵠ ) MODIFIER LETTER SMALL GREEK PHI
U+1D61 ( ᵡ ) MODIFIER LETTER SMALL CHI

Phonetic Extensions - Greek subscript modifier letters

U+1D66 ( ᵦ ) GREEK SUBSCRIPT SMALL LETTER BETA
U+1D67 ( ᵧ ) GREEK SUBSCRIPT SMALL LETTER GAMMA
U+1D68 ( ᵨ ) GREEK SUBSCRIPT SMALL LETTER RHO
U+1D69 ( ᵩ ) GREEK SUBSCRIPT SMALL LETTER PHI
U+1D6A ( ᵪ ) GREEK SUBSCRIPT SMALL LETTER CHI

Phonetic Extensions - Caucasian linguistics

U+1D78 ( ᵸ ) MODIFIER LETTER CYRILLIC EN

Phonetic Extensions Supplement - Modifier letters

U+1DBF ( ᶿ ) MODIFIER LETTER SMALL THETA

@Greek

These appear to have no explicit script just because they map to general punctuation marks or modifier letters.

Greek And Coptic - Numeral signs

U+0374 ( ʹ ) GREEK NUMERAL SIGN

Greek And Coptic - Punctuation

U+037E ( ; ) GREEK QUESTION MARK

Greek And Coptic - Spacing accent marks

U+0385 ( ΅ ) GREEK DIALYTIKA TONOS

Greek And Coptic - Punctuation

U+0387 ( · ) GREEK ANO TELEIA

In contrast, the following does have an explicit script, and is the only Sk (Modifier_Symbol) that does. It is also odd because it is Sk, while the corresponding U+0374 is a Modifier_Letter.

U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN
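
The General_Category and canonical decomposition values behind these observations can be inspected with Python's standard unicodedata module. This is only a sketch: the module does not expose the Script property, so it shows just the category and the decomposition mapping, and the helper name describe is my own.

```python
import unicodedata

# The four characters listed under @Greek, plus U+0375 for contrast.
GREEK_EDGE_CASES = (0x0374, 0x037E, 0x0385, 0x0387, 0x0375)


def describe(cp):
    """Return (General_Category, decomposition mapping) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.decomposition(ch)


for cp in GREEK_EDGE_CASES:
    gc, decomp = describe(cp)
    name = unicodedata.name(chr(cp))
    print(f"U+{cp:04X} {name}: gc={gc}, decomposition={decomp or 'none'}")
```

U+0374, U+037E, and U+0387 each have a singleton canonical mapping to a character outside the Greek block, while U+0375 has no decomposition at all.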

@Armenian, Georgian

Armenian - Punctuation

U+0589 ( ։ ) ARMENIAN FULL STOP

@Arabic

Arabic - Subtending marks

U+0600 ( ؀ ) ARABIC NUMBER SIGN
U+0601 ( ؁ ) ARABIC SIGN SANAH
U+0602 ( ؂ ) ARABIC FOOTNOTE MARKER
U+0603 ( ؃ ) ARABIC SIGN SAFHA

Arabic Presentation Forms A - Symbol

U+FDFD ( ﷽ ) ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM

@Arabic, Thaana

Arabic - Arabic-Indic digits

(Note that the U+06Fx EXTENDED ARABIC-INDIC DIGIT x characters already have the specific script Arabic.)
U+0660 ( ٠ ) ARABIC-INDIC DIGIT ZERO
U+0661 ( ١ ) ARABIC-INDIC DIGIT ONE
U+0662 ( ٢ ) ARABIC-INDIC DIGIT TWO
U+0663 ( ٣ ) ARABIC-INDIC DIGIT THREE
U+0664 ( ٤ ) ARABIC-INDIC DIGIT FOUR
U+0665 ( ٥ ) ARABIC-INDIC DIGIT FIVE
U+0666 ( ٦ ) ARABIC-INDIC DIGIT SIX
U+0667 ( ٧ ) ARABIC-INDIC DIGIT SEVEN
U+0668 ( ٨ ) ARABIC-INDIC DIGIT EIGHT
U+0669 ( ٩ ) ARABIC-INDIC DIGIT NINE

@Arabic, Syriac, Thaana

Arabic - Punctuation

U+060C ( ، ) ARABIC COMMA
U+061B ( ‎؛‎ ) ARABIC SEMICOLON
U+061F ( ‎؟‎ ) ARABIC QUESTION MARK

@Common

Arabic - Koranic annotation signs

U+06DD ( ۝ ) ARABIC END OF AYAH

@Arabic, Syriac

Are there any others of these that are not used with Syriac?

Arabic - Based on ISO 8859-6

U+0640 ( ‎ـ‎ ) ARABIC TATWEEL

Arabic - Points from ISO 8859-6

U+064B ( ً ) ARABIC FATHATAN
U+064C ( ٌ ) ARABIC DAMMATAN
U+064D ( ٍ ) ARABIC KASRATAN
U+064E ( َ ) ARABIC FATHA
U+064F ( ُ ) ARABIC DAMMA
U+0650 ( ِ ) ARABIC KASRA
U+0651 ( ّ ) ARABIC SHADDA
U+0652 ( ْ ) ARABIC SUKUN

Arabic - Combining maddah and hamza

U+0653 ( ٓ ) ARABIC MADDAH ABOVE
U+0654 ( ٔ ) ARABIC HAMZA ABOVE
U+0655 ( ٕ ) ARABIC HAMZA BELOW

Arabic - Point

U+0670 ( ٰ ) ARABIC LETTER SUPERSCRIPT ALEF

@Bopomofo

These appear to be just Bopomofo script.

Spacing Modifier Letters - Extended Bopomofo tone marks

U+02EA ( ˪ ) MODIFIER LETTER YIN DEPARTING TONE MARK
U+02EB ( ˫ ) MODIFIER LETTER YANG DEPARTING TONE MARK

@Devanagari

Am guessing these are Devanagari script.

Devanagari - Various signs

U+0951 ( ॑ ) DEVANAGARI STRESS SIGN UDATTA
U+0952 ( ॒ ) DEVANAGARI STRESS SIGN ANUDATTA

Devanagari - Devanagari-specific additions

U+0970 ( ॰ ) DEVANAGARI ABBREVIATION SIGN

@Devanagari, Bengali, Gurmukhi, Gujarati, Oriya, Tamil, Telugu, Kannada, Malayalam

The annotations say "scripts of India" for the first set, and "The Vedic signs for jihvamuliya and upadhmaniya were encoded in the Kannada block, but are intended for general Vedic use with all scripts", but the former probably doesn't include Arabic (Urdu), and the latter probably also means "with all Indic scripts". Am guessing that "Indic script" means the above.

Devanagari - Generic punctuation for scripts of India

U+0964 ( । ) DEVANAGARI DANDA
U+0965 ( ॥ ) DEVANAGARI DOUBLE DANDA

Kannada - Vedic signs

U+0CF1 ( ೱ ) KANNADA SIGN JIHVAMULIYA
U+0CF2 ( ೲ ) KANNADA SIGN UPADHMANIYA

@Georgian, Latin, Cyrillic, Greek, Coptic

Note: historic

Georgian - Punctuation

U+10FB ( ჻ ) GEORGIAN PARAGRAPH SEPARATOR

@Runic

Am guessing that the following should be Runic script.

Runic - Punctuation

U+16EB ( ᛫ ) RUNIC SINGLE PUNCTUATION
U+16EC ( ᛬ ) RUNIC MULTIPLE PUNCTUATION
U+16ED ( ᛭ ) RUNIC CROSS PUNCTUATION

@Hanunoo, Tagalog, Buhid, Tagbanwa

Don't know exactly what "Philippine scripts" is supposed to be; am guessing the above.

Hanunoo - Generic punctuation for Philippine scripts

U+1735 ( ᜵ ) PHILIPPINE SINGLE PUNCTUATION
U+1736 ( ᜶ ) PHILIPPINE DOUBLE PUNCTUATION

@Mongolian, Phags-Pa

Am guessing that the following should be Mongolian script.

Mongolian - Punctuation

U+1802 ( ᠂ ) MONGOLIAN COMMA
U+1803 ( ᠃ ) MONGOLIAN FULL STOP
U+1805 ( ᠅ ) MONGOLIAN FOUR DOTS

@Hiragana, Katakana

Think these are pretty clearly just Hiragana and Katakana.

CJK Symbols And Punctuation - Other CJK symbols

U+3031 ( 〱 ) VERTICAL KANA REPEAT MARK
U+3032 ( 〲 ) VERTICAL KANA REPEAT WITH VOICED SOUND MARK
U+3033 ( 〳 ) VERTICAL KANA REPEAT MARK UPPER HALF
U+3034 ( 〴 ) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF
U+3035 ( 〵 ) VERTICAL KANA REPEAT MARK LOWER HALF

Hiragana - Voicing marks

U+3099 ( ゙ ) COMBINING KATAKANA-HIRAGANA VOICED SOUND MARK
U+309A ( ゚ ) COMBINING KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK
U+309B ( ゛ ) KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C ( ゜ ) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

Katakana - Katakana punctuation

U+30A0 ( ゠ ) KATAKANA-HIRAGANA DOUBLE HYPHEN

Katakana - Conjunction and length marks

U+30FB ( ・ ) KATAKANA MIDDLE DOT
U+30FC ( ー ) KATAKANA-HIRAGANA PROLONGED SOUND MARK

Halfwidth And Fullwidth Forms - Halfwidth Katakana variants

U+FF65 ( ・ ) HALFWIDTH KATAKANA MIDDLE DOT
U+FF70 ( ー ) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF9E ( ゙ ) HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F ( ゚ ) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

@Hangul

CJK Symbols And Punctuation - Diacritics

U+302E ( 〮 ) HANGUL SINGLE DOT TONE MARK
U+302F ( 〯 ) HANGUL DOUBLE DOT TONE MARK

@Han

Later add Tangut, Jurchen, Khitan, once encoded.
[:Block=Kanbun:]
[:Block=Ideographic_Description_Characters:]

CJK Strokes - CJK strokes

U+31C0 ( ㇀ ) CJK STROKE T
U+31C1 ( ㇁ ) CJK STROKE WG
U+31C2 ( ㇂ ) CJK STROKE XG
U+31C3 ( ㇃ ) CJK STROKE BXG
U+31C4 ( ㇄ ) CJK STROKE SW
U+31C5 ( ㇅ ) CJK STROKE HZZ
U+31C6 ( ㇆ ) CJK STROKE HZG
U+31C7 ( ㇇ ) CJK STROKE HP
U+31C8 ( ㇈ ) CJK STROKE HZWG
U+31C9 ( ㇉ ) CJK STROKE SZWG
U+31CA ( ㇊ ) CJK STROKE HZT
U+31CB ( ㇋ ) CJK STROKE HZZP
U+31CC ( ㇌ ) CJK STROKE HPWG
U+31CD ( ㇍ ) CJK STROKE HZW
U+31CE ( ㇎ ) CJK STROKE HZZZ
U+31CF ( ㇏ ) CJK STROKE N
U+31D0 ( ㇐ ) CJK STROKE H
U+31D1 ( ㇑ ) CJK STROKE S
U+31D2 ( ㇒ ) CJK STROKE P
U+31D3 ( ㇓ ) CJK STROKE SP
U+31D4 ( ㇔ ) CJK STROKE D
U+31D5 ( ㇕ ) CJK STROKE HZ
U+31D6 ( ㇖ ) CJK STROKE HG
U+31D7 ( ㇗ ) CJK STROKE SZ
U+31D8 ( ㇘ ) CJK STROKE SWZ
U+31D9 ( ㇙ ) CJK STROKE ST
U+31DA ( ㇚ ) CJK STROKE SG
U+31DB ( ㇛ ) CJK STROKE PD
U+31DC ( ㇜ ) CJK STROKE PZ
U+31DD ( ㇝ ) CJK STROKE TN
U+31DE ( ㇞ ) CJK STROKE SZZ
U+31DF ( ㇟ ) CJK STROKE SWG
U+31E0 ( ㇠ ) CJK STROKE HXWG
U+31E1 ( ㇡ ) CJK STROKE HZZZG
U+31E2 ( ㇢ ) CJK STROKE PG
U+31E3 ( ㇣ ) CJK STROKE Q

@Han, Hangul, Hiragana, Katakana, Bopomofo, Yi, Phags-pa, Tibetan

CJK Symbols And Punctuation - CJK symbols and punctuation

U+3001 ( 、 ) IDEOGRAPHIC COMMA
U+3002 ( 。 ) IDEOGRAPHIC FULL STOP

Halfwidth And Fullwidth Forms - Halfwidth CJK punctuation

U+FF61 ( 。 ) HALFWIDTH IDEOGRAPHIC FULL STOP
U+FF64 ( 、 ) HALFWIDTH IDEOGRAPHIC COMMA

@Hiragana, Katakana

Jpan = Japanese (alias for Han + Hiragana + Katakana)

CJK Symbols And Punctuation - CJK symbols

U+3012 ( 〒 ) POSTAL MARK

@Hangul

Kore = Korean (alias for Hangul + Han)

Enclosed CJK Letters And Months - Symbol

U+327F ( ㉿ ) KOREAN STANDARD SYMBOL

@Han, Hangul, Hiragana, Katakana, Bopomofo

For easier comparison, these are also broken down by General Category

General-Category=Punctuation

CJK Symbols And Punctuation - CJK symbols and punctuation

U+3001 ( 、 ) IDEOGRAPHIC COMMA
U+3002 ( 。 ) IDEOGRAPHIC FULL STOP
U+3003 ( 〃 ) DITTO MARK
U+301C ( 〜 ) WAVE DASH
U+301D ( 〝 ) REVERSED DOUBLE PRIME QUOTATION MARK
U+301E ( 〞 ) DOUBLE PRIME QUOTATION MARK
U+301F ( 〟 ) LOW DOUBLE PRIME QUOTATION MARK

CJK Symbols And Punctuation - Other CJK symbols

U+3030 ( 〰 ) WAVY DASH

CJK Symbols And Punctuation - Other CJK punctuation

U+303D ( 〽 ) PART ALTERNATION MARK

CJK Compatibility Forms - Sidelining emphasis marks

U+FE45 ( ﹅ ) SESAME DOT
U+FE46 ( ﹆ ) WHITE SESAME DOT

General-Category=Symbol

CJK Symbols And Punctuation - CJK symbols and punctuation

U+3004 ( 〄 ) JAPANESE INDUSTRIAL STANDARD SYMBOL

CJK Symbols And Punctuation - CJK symbols and punctuation

U+3020 ( 〠 ) POSTAL MARK FACE

CJK Symbols And Punctuation - CJK symbols

U+3013 ( 〓 ) GETA MARK

CJK Symbols And Punctuation - Other CJK symbols

U+3037 ( 〷 ) IDEOGRAPHIC TELEGRAPH LINE FEED SEPARATOR SYMBOL

CJK Symbols And Punctuation - Special CJK indicators

U+303E ( 〾 ) IDEOGRAPHIC VARIATION INDICATOR
U+303F ( 〿 ) IDEOGRAPHIC HALF FILL SPACE

General-Category=Letter

CJK Symbols And Punctuation - CJK symbols and punctuation

U+3006 ( 〆 ) IDEOGRAPHIC CLOSING MARK

CJK Symbols And Punctuation - Other CJK punctuation

U+303C ( 〼 ) MASU MARK

@Common

Han, Hangul, Hiragana, Katakana, Bopomofo, Yi, Phags-pa, Tibetan, other scripts of China, but probably just better treated as Common

CJK Symbols And Punctuation - CJK angle brackets

U+3008 ( 〈 ) LEFT ANGLE BRACKET
U+3009 ( 〉 ) RIGHT ANGLE BRACKET
U+300A ( 《 ) LEFT DOUBLE ANGLE BRACKET
U+300B ( 》 ) RIGHT DOUBLE ANGLE BRACKET

U+300C ( 「 ) LEFT CORNER BRACKET
U+300D ( 」 ) RIGHT CORNER BRACKET
U+300E ( 『 ) LEFT WHITE CORNER BRACKET
U+300F ( 』 ) RIGHT WHITE CORNER BRACKET

U+3010 ( 【 ) LEFT BLACK LENTICULAR BRACKET
U+3011 ( 】 ) RIGHT BLACK LENTICULAR BRACKET

U+3014 ( 〔 ) LEFT TORTOISE SHELL BRACKET
U+3015 ( 〕 ) RIGHT TORTOISE SHELL BRACKET
U+3016 ( 〖 ) LEFT WHITE LENTICULAR BRACKET
U+3017 ( 〗 ) RIGHT WHITE LENTICULAR BRACKET
U+3018 ( 〘 ) LEFT WHITE TORTOISE SHELL BRACKET
U+3019 ( 〙 ) RIGHT WHITE TORTOISE SHELL BRACKET
U+301A ( 〚 ) LEFT WHITE SQUARE BRACKET
U+301B ( 〛 ) RIGHT WHITE SQUARE BRACKET

@Han, Bopomofo

CJK Symbols And Punctuation - Diacritics

U+302A ( 〪 ) IDEOGRAPHIC LEVEL TONE MARK
U+302B ( 〫 ) IDEOGRAPHIC RISING TONE MARK
U+302C ( 〬 ) IDEOGRAPHIC DEPARTING TONE MARK
U+302D ( 〭 ) IDEOGRAPHIC ENTERING TONE MARK

Letterlike symbols

@No-Change

The following have specific scripts:

Arabic - Letterlike symbol

U+0608 ( ‎؈‎ ) ARABIC RAY

Letterlike Symbols - Letterlike symbols

U+2126 ( Ω ) OHM SIGN
U+212A ( K ) KELVIN SIGN
U+212B ( Å ) ANGSTROM SIGN
U+2132 ( Ⅎ ) TURNED CAPITAL F

Letterlike Symbols - Lowercase Claudian letter

U+214E ( ⅎ ) TURNED SMALL F
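
One observation that may be relevant here (my own, not a claim about how the script values were originally assigned): three of these characters have singleton canonical mappings to ordinary Greek or Latin letters, and the two turned F characters form a case pair. A minimal Python sketch using the standard unicodedata module:

```python
import unicodedata

# Letterlike symbols listed above as having specific scripts.
SCRIPTED_LETTERLIKE = (0x2126, 0x212A, 0x212B, 0x2132, 0x214E)


def canonical_mapping(cp):
    """Return the decomposition mapping field, or '' if there is none."""
    return unicodedata.decomposition(chr(cp))


for cp in SCRIPTED_LETTERLIKE:
    ch = chr(cp)
    print(
        f"U+{cp:04X} {unicodedata.name(ch)}: gc={unicodedata.category(ch)}, "
        f"decomposition={canonical_mapping(cp) or 'none'}"
    )

# U+2132 and U+214E are related by simple case mapping.
print(chr(0x2132).lower() == chr(0x214E))
```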

The following, including a number that are apparently similar, do not. (Math characters have been removed.) I am guessing these should be treated like Latin.

@Latin

Letterlike Symbols - Letterlike symbols

U+2100 ( ℀ ) ACCOUNT OF
U+2101 ( ℁ ) ADDRESSED TO THE SUBJECT
U+2103 ( ℃ ) DEGREE CELSIUS
U+2104 ( ℄ ) CENTRE LINE SYMBOL
U+2105 ( ℅ ) CARE OF
U+2106 ( ℆ ) CADA UNA
U+2107 ( ℇ ) EULER CONSTANT
U+2108 ( ℈ ) SCRUPLE
U+2109 ( ℉ ) DEGREE FAHRENHEIT
U+2114 ( ℔ ) L B BAR SYMBOL
U+2116 ( № ) NUMERO SIGN
U+2117 ( ℗ ) SOUND RECORDING COPYRIGHT
U+2118 ( ℘ ) SCRIPT CAPITAL P
U+211E ( ℞ ) PRESCRIPTION TAKE
U+211F ( ℟ ) RESPONSE
U+2120 ( ℠ ) SERVICE MARK
U+2121 ( ℡ ) TELEPHONE SIGN
U+2122 ( ™ ) TRADE MARK SIGN
U+2123 ( ℣ ) VERSICLE
U+2125 ( ℥ ) OUNCE SIGN
U+2127 ( ℧ ) INVERTED OHM SIGN
U+212E ( ℮ ) ESTIMATED SYMBOL

Letterlike Symbols - Additional letterlike symbols

U+2139 ( ℹ ) INFORMATION SOURCE
U+213A ( ℺ ) ROTATED CAPITAL Q
U+213B ( ℻ ) FACSIMILE SIGN
U+214A ( ⅊ ) PROPERTY LINE
U+214C ( ⅌ ) PER SIGN
U+214D ( ⅍ ) AKTIESELSKAB

Math Symbols with specific scripts

The following are the only Sm (Math_Symbol) characters that either have explicit scripts (856 GC=Sm characters and 1,027 Math=true characters do not) or are letterlike symbols. The Arabic ones seem OK (if they are not used in Syriac, etc.). Should the Greek one be Common script?

@Common

Greek And Coptic - Variant letterforms and symbols

U+03F6 ( ϶ ) GREEK REVERSED LUNATE EPSILON SYMBOL

@No-Change

Arabic - Radix symbols

U+0606 ( ؆ ) ARABIC-INDIC CUBE ROOT
U+0607 ( ؇ ) ARABIC-INDIC FOURTH ROOT

Arabic - Letterlike symbol

U+0608 ( ‎؈‎ ) ARABIC RAY

Letterlike Symbols - Double-struck large operator

U+2140 ( ⅀ ) DOUBLE-STRUCK N-ARY SUMMATION

Letterlike Symbols - Additional letterlike symbols

U+2141 ( ⅁ ) TURNED SANS-SERIF CAPITAL G
U+2142 ( ⅂ ) TURNED SANS-SERIF CAPITAL L
U+2143 ( ⅃ ) REVERSED SANS-SERIF CAPITAL L
U+2144 ( ⅄ ) TURNED SANS-SERIF CAPITAL Y
U+214B ( ⅋ ) TURNED AMPERSAND
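
The gc=Sm side of this can be checked with Python's standard unicodedata module, as in the sketch below. Two caveats: the module does not expose the Script or Math properties, and the total it prints reflects whatever Unicode version the interpreter ships with, so it will not necessarily match the counts quoted above. The helper name is my own.

```python
import sys
import unicodedata

# The characters listed in this section.
LISTED = (0x03F6, 0x0606, 0x0607, 0x0608,
          0x2140, 0x2141, 0x2142, 0x2143, 0x2144, 0x214B)


def code_points_with_category(gc):
    """Return every code point whose General_Category equals gc."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == gc]


sm = set(code_points_with_category("Sm"))
print(len(sm), "characters have gc=Sm in Unicode", unicodedata.unidata_version)
for cp in LISTED:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: gc=Sm is {cp in sm}")
```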


Characters whose canonical equivalents don't match in script

[Ed note: see also actions 110-A092 and 113-A036, and L2/07-071.]

The following characters are Script=Greek, but their canonical equivalents are Script=Common. These are the only such characters that change from an explicit script to Common.

Greek Extended - Precomposed polytonic Greek

U+1FC1 ( ῁ ) GREEK DIALYTIKA AND PERISPOMENI
U+1FED ( ῭ ) GREEK DIALYTIKA AND VARIA
U+1FEE ( ΅ ) GREEK DIALYTIKA AND OXIA
U+1FEF ( ` ) GREEK VARIA
U+1FFD ( ´ ) GREEK OXIA
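
The canonical equivalents in question can be listed by normalizing each character to NFD. The sketch below uses Python's standard unicodedata module; since that module has no Script lookup, the script values of the results still have to be read from Scripts.txt. The helper name is my own.

```python
import unicodedata

GREEK_TO_COMMON = (0x1FC1, 0x1FED, 0x1FEE, 0x1FEF, 0x1FFD)


def canonical_equivalent(cp):
    """Return the full canonical decomposition (NFD) as a list of code points."""
    return [ord(c) for c in unicodedata.normalize("NFD", chr(cp))]


for cp in GREEK_TO_COMMON:
    parts = " ".join(f"U+{p:04X}" for p in canonical_equivalent(cp))
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))} -> {parts}")
```

In each case the base character of the result (U+00A8, U+0060, or U+00B4) comes from the ASCII or Latin-1 range rather than from a Greek block.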

This is not an issue for the other Modifier Symbols (Sk) in Greek blocks:

Greek And Coptic - Numeral signs

U+0375 ( ͵ ) GREEK LOWER NUMERAL SIGN // the only one without a compat decomp.

Greek And Coptic - Spacing accent marks

U+0384 ( ΄ ) GREEK TONOS
U+0385 ( ΅ ) GREEK DIALYTIKA TONOS

Greek Extended - Precomposed polytonic Greek

U+1FBD ( ᾽ ) GREEK KORONIS
U+1FBF ( ᾿ ) GREEK PSILI
U+1FC0 ( ῀ ) GREEK PERISPOMENI
U+1FCD ( ῍ ) GREEK PSILI AND VARIA
U+1FCE ( ῎ ) GREEK PSILI AND OXIA
U+1FCF ( ῏ ) GREEK PSILI AND PERISPOMENI
U+1FDD ( ῝ ) GREEK DASIA AND VARIA
U+1FDE ( ῞ ) GREEK DASIA AND OXIA
U+1FDF ( ῟ ) GREEK DASIA AND PERISPOMENI
U+1FFE ( ῾ ) GREEK DASIA

Or the other Modifier Letters (Lm) in Greek blocks:

Greek And Coptic - Numeral signs

U+0374 ( ʹ ) GREEK NUMERAL SIGN

Greek And Coptic - Iota subscript

U+037A ( ͺ ) GREEK YPOGEGRAMMENI

While not having to do with Script, I ran across the following also:

Mc (Spacing_Mark) without script

These would probably be better as Sk (Modifier_Symbol). We should at least document them as the only cases of Mc that are not letter-like.

@Sk

Musical Symbols - Stems

U+1D165 ( 𝅥 ) MUSICAL SYMBOL COMBINING STEM
U+1D166 ( 𝅦 ) MUSICAL SYMBOL COMBINING SPRECHGESANG STEM

Musical Symbols - Augmentation dot

U+1D16D ( 𝅭 ) MUSICAL SYMBOL COMBINING AUGMENTATION DOT

Musical Symbols - Flags

U+1D16E ( 𝅮 ) MUSICAL SYMBOL COMBINING FLAG-1
U+1D16F ( 𝅯 ) MUSICAL SYMBOL COMBINING FLAG-2
U+1D170 ( 𝅰 ) MUSICAL SYMBOL COMBINING FLAG-3
U+1D171 ( 𝅱 ) MUSICAL SYMBOL COMBINING FLAG-4
U+1D172 ( 𝅲 ) MUSICAL SYMBOL COMBINING FLAG-5
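
A quick way to confirm the current General_Category of these characters is to scan the Musical Symbols block (U+1D100..U+1D1FF) for gc=Mc. This is a minimal sketch using Python's standard unicodedata module; the block range is hard-coded because the module does not expose block names, and the helper name is my own.

```python
import unicodedata

# The eight characters listed above.
MUSICAL_MC = [0x1D165, 0x1D166, *range(0x1D16D, 0x1D173)]


def spacing_marks_in(start, end):
    """Return the code points in [start, end] whose General_Category is Mc."""
    return [cp for cp in range(start, end + 1)
            if unicodedata.category(chr(cp)) == "Mc"]


for cp in spacing_marks_in(0x1D100, 0x1D1FF):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```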

Math Symbols not marked with Sm

While Letter in form, these should all behave like Sm: they should not case fold or be treated as parts of words.

@Sm

[:block=Mathematical Alphanumeric Symbols:]
// plus [[:subhead=/(?i)letterlike/:][:block=/(?i)letterlike/:]&[:math:]-[:sm:]]

Letterlike Symbols - Letterlike symbols

U+2102 ( ℂ ) DOUBLE-STRUCK CAPITAL C
U+210A ( ℊ ) SCRIPT SMALL G
U+210B ( ℋ ) SCRIPT CAPITAL H
U+210C ( ℌ ) BLACK-LETTER CAPITAL H
U+210D ( ℍ ) DOUBLE-STRUCK CAPITAL H
U+210E ( ℎ ) PLANCK CONSTANT
U+210F ( ℏ ) PLANCK CONSTANT OVER TWO PI
U+2110 ( ℐ ) SCRIPT CAPITAL I
U+2111 ( ℑ ) BLACK-LETTER CAPITAL I
U+2112 ( ℒ ) SCRIPT CAPITAL L
U+2113 ( ℓ ) SCRIPT SMALL L
U+2115 ( ℕ ) DOUBLE-STRUCK CAPITAL N
U+2119 ( ℙ ) DOUBLE-STRUCK CAPITAL P
U+211A ( ℚ ) DOUBLE-STRUCK CAPITAL Q
U+211B ( ℛ ) SCRIPT CAPITAL R
U+211C ( ℜ ) BLACK-LETTER CAPITAL R
U+211D ( ℝ ) DOUBLE-STRUCK CAPITAL R
U+2124 ( ℤ ) DOUBLE-STRUCK CAPITAL Z
U+2128 ( ℨ ) BLACK-LETTER CAPITAL Z
U+2129 ( ℩ ) TURNED GREEK SMALL LETTER IOTA
U+212C ( ℬ ) SCRIPT CAPITAL B
U+212D ( ℭ ) BLACK-LETTER CAPITAL C
U+212F ( ℯ ) SCRIPT SMALL E
U+2130 ( ℰ ) SCRIPT CAPITAL E
U+2131 ( ℱ ) SCRIPT CAPITAL F
U+2133 ( ℳ ) SCRIPT CAPITAL M
U+2134 ( ℴ ) SCRIPT SMALL O

Letterlike Symbols - Hebrew letterlike math symbols

U+2135 ( ℵ ) ALEF SYMBOL
U+2136 ( ℶ ) BET SYMBOL
U+2137 ( ℷ ) GIMEL SYMBOL
U+2138 ( ℸ ) DALET SYMBOL

Letterlike Symbols - Additional letterlike symbols

U+213C ( ℼ ) DOUBLE-STRUCK SMALL PI
U+213D ( ℽ ) DOUBLE-STRUCK SMALL GAMMA
U+213E ( ℾ ) DOUBLE-STRUCK CAPITAL GAMMA
U+213F ( ℿ ) DOUBLE-STRUCK CAPITAL PI

Letterlike Symbols - Double-struck italic math symbols

U+2145 ( ⅅ ) DOUBLE-STRUCK ITALIC CAPITAL D
U+2146 ( ⅆ ) DOUBLE-STRUCK ITALIC SMALL D
U+2147 ( ⅇ ) DOUBLE-STRUCK ITALIC SMALL E
U+2148 ( ⅈ ) DOUBLE-STRUCK ITALIC SMALL I
U+2149 ( ⅉ ) DOUBLE-STRUCK ITALIC SMALL J


Circled & Parenthesized

Note: all circled alphanumerics are So
ⓐⒶ ⓑⒷ ⓒⒸ ⓓⒹ ⓔⒺ ⓕⒻ ⓖ Ⓖ ⓗⒽ ⓘⒾ ⓙⒿ ⓚⓀ ⓛⓁ ⓜⓂ ⓝ Ⓝ ⓞⓄ ⓟⓅ ⓠⓆ ⓡⓇ ⓢⓈ ⓣⓉ ⓤ Ⓤ ⓥⓋ ⓦⓌ ⓧⓍ ⓨⓎ ⓩⓏ ㉠ ㉮ ㉡ ㉯ ㉢ ㉰ ㉣ ㉱ ㉤ ㉲ ㉥ ㉳ ㉦ ㉴ ㉧ ㉵ ㉾ ㉨ ㉶ ㉽ ㉩ ㉷ ㉼ ㉪ ㉸ ㉫ ㉹ ㉬ ㉺ ㉭ ㉻ ㋐-㋾ ㊤ ㊦ ㊥ ㊭ ㊡ ㊝ ㊢ ㊘ ㊩ ㊯ ㊞ ㊨ ㊔ ㊏ ㊰ ㊛ ㊫ ㊪ ㊧ ㊐ ㊊ ㊒ ㊍ ㊑ ㊣ ㊌ ㊟ ㊋ ㊕ ㊚ ㊬ ㊓ ㊗ ㊙ ㊖ ㊮ ㊜ ㊎ ㊠
except the following:
⓪ ① ⑩-⑲ ② ⑳ ㉑-㉙ ③ ㉚-㉟ ㊱-㊴ ④ ㊵-㊾ ⑤ ㊿ ⑥-⑨
 ㊀ ㊆ ㊂ ㊈ ㊁ ㊄ ㊇ ㊅ ㊉ ㊃

The Western numbers are understandable: the uncircled values are gc=Number. The uncircled Han characters, however, are not.

Similarly, all parenthesized alphanumerics are So
⒜-⒵ ㈀ ㈎ ㈁ ㈏ ㈂ ㈐ ㈃ ㈑ ㈄ ㈒ ㈅ ㈓ ㈆ ㈔ ㈇ ㈕ ㈝ ㈞ ㈈ ㈖ ㈜ ㈉ ㈗ ㈊ ㈘ ㈋ ㈙ ㈌ ㈚ ㈍ ㈛ ㈹ ㈽ ㉁ ㈸ ㈿ ㈴ ㈺ ㈯ ㈻ ㈰ ㈪ ㈲ ㈭ ㈱ ㈬ ㈫ ㈵ ㈼ ㈳ ㈷ ㉀ ㉂ ㉃ ㈶ ㈾ ㈮
except the following:
⑴ ⑽-⒆ ⑵ ⒇ ⑶-⑼
㈠ ㈦ ㈢ ㈨ ㈡ ㈤ ㈧ ㈥ ㈩ ㈣

So should these be treated as So?

Enclosed CJK Letters And Months - Circled ideographs

U+3280 ( ㊀ ) CIRCLED IDEOGRAPH ONE
U+3281 ( ㊁ ) CIRCLED IDEOGRAPH TWO
U+3282 ( ㊂ ) CIRCLED IDEOGRAPH THREE
U+3283 ( ㊃ ) CIRCLED IDEOGRAPH FOUR
U+3284 ( ㊄ ) CIRCLED IDEOGRAPH FIVE
U+3285 ( ㊅ ) CIRCLED IDEOGRAPH SIX
U+3286 ( ㊆ ) CIRCLED IDEOGRAPH SEVEN
U+3287 ( ㊇ ) CIRCLED IDEOGRAPH EIGHT
U+3288 ( ㊈ ) CIRCLED IDEOGRAPH NINE
U+3289 ( ㊉ ) CIRCLED IDEOGRAPH TEN

Enclosed CJK Letters And Months - Parenthesized ideographs

U+3220 ( ㈠ ) PARENTHESIZED IDEOGRAPH ONE
U+3221 ( ㈡ ) PARENTHESIZED IDEOGRAPH TWO
U+3222 ( ㈢ ) PARENTHESIZED IDEOGRAPH THREE
U+3223 ( ㈣ ) PARENTHESIZED IDEOGRAPH FOUR
U+3224 ( ㈤ ) PARENTHESIZED IDEOGRAPH FIVE
U+3225 ( ㈥ ) PARENTHESIZED IDEOGRAPH SIX
U+3226 ( ㈦ ) PARENTHESIZED IDEOGRAPH SEVEN
U+3227 ( ㈧ ) PARENTHESIZED IDEOGRAPH EIGHT
U+3228 ( ㈨ ) PARENTHESIZED IDEOGRAPH NINE
U+3229 ( ㈩ ) PARENTHESIZED IDEOGRAPH TEN
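
The mismatch described above can be seen by comparing the General_Category of the enclosed forms with that of their unenclosed counterparts. This is a minimal sketch using Python's standard unicodedata module; the particular pairs chosen are my own examples.

```python
import unicodedata


def gc(ch):
    """Return the General_Category of a single character."""
    return unicodedata.category(ch)


PAIRS = (
    ("\u2460", "1"),       # CIRCLED DIGIT ONE vs. DIGIT ONE
    ("\u2474", "1"),       # PARENTHESIZED DIGIT ONE vs. DIGIT ONE
    ("\u24D0", "a"),       # CIRCLED LATIN SMALL LETTER A vs. LATIN SMALL LETTER A
    ("\u249C", "a"),       # PARENTHESIZED LATIN SMALL LETTER A vs. LATIN SMALL LETTER A
    ("\u3280", "\u4E00"),  # CIRCLED IDEOGRAPH ONE vs. the ideograph for one
    ("\u3220", "\u4E00"),  # PARENTHESIZED IDEOGRAPH ONE vs. the ideograph for one
)

for enclosed, plain in PAIRS:
    print(
        f"U+{ord(enclosed):04X} gc={gc(enclosed)}  vs.  "
        f"U+{ord(plain):04X} gc={gc(plain)}"
    )

# The enclosed ideographs also carry a numeric value.
print(unicodedata.numeric("\u3280"), unicodedata.numeric("\u3289"))
```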