The Script_Extensions property values for some characters contain Hiragana, Katakana, or Bopomofo, when they should only contain Han. The UTC is considering removing Hiragana, Katakana, or Bopomofo in these cases, and would like feedback on any characters that should not be changed, and on any others that should be.
Mistaken Script_Extensions values cause false positives in confusability code and other processing. For example, they cause the following to be considered whole-script confusables:
ー U+30FC KATAKANA-HIRAGANA PROLONGED SOUND MARK
㇐ U+31D0 CJK STROKE H
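One simplified way to see the effect is to intersect the script sets of the two characters. The sketch below is not the full UTS #39 confusable detection algorithm; it only shows which scripts the pair shares before and after the proposed change. The value used for U+30FC (Hiragana, Katakana) is an assumption not stated in this document, the values for U+31D0 come from the "from:" and "to:" lines of the proposal, and `shared_scripts` is a hypothetical helper name.

```python
# Script_Extensions values here are illustrative assumptions:
# U+30FC is assumed to be {Hiragana, Katakana}; U+31D0 uses the
# "from:" (current) and "to:" (proposed) values listed in this proposal.
SCX_CURRENT = {
    0x30FC: {"Hiragana", "Katakana"},
    0x31D0: {"Bopomofo", "Han", "Hangul", "Hiragana", "Katakana"},
}
SCX_PROPOSED = {
    0x30FC: {"Hiragana", "Katakana"},
    0x31D0: {"Han", "Hangul"},
}


def shared_scripts(text, scx):
    """Intersect the script sets of every character in text."""
    sets = [scx[ord(ch)] for ch in text]
    return set.intersection(*sets) if sets else set()


pair = "\u30FC\u31D0"  # PROLONGED SOUND MARK followed by CJK STROKE H
print(sorted(shared_scripts(pair, SCX_CURRENT)))   # scripts shared today
print(sorted(shared_scripts(pair, SCX_PROPOSED)))  # scripts shared after the change
```

Under these assumptions the pair shares Hiragana and Katakana with the current values and shares no script with the proposed values, which is why a per-script confusability check would stop pairing them.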
Hiragana and Katakana are, of course, part of the Japanese writing system, which also uses Han. But the following characters are not part of the Hiragana and Katakana scripts, and should have those scripts removed from their Script_Extensions values. Similarly, Bopomofo should be removed where it appears below.
The list excludes characters that don’t contain ideographs or CJK strokes. However, it includes a few others that appear to be specifically for use with ideographs, like IDEOGRAPHIC ANNOTATION LINKING MARK or IDEOGRAPHIC VARIATION INDICATOR, and don’t seem particularly likely to be interspersed with pure Hiragana or Katakana text. In review, please pay special attention to those characters.
The analysis also picked up 6 circled ideographic characters that have Script_Extensions=Common when they probably should have Script_Extensions=Han, so those are also included.
from: Bopomofo,Han,Hangul,Hiragana,Katakana
to: Han,Hangul
303E ; IDEOGRAPHIC VARIATION INDICATOR
303F ; IDEOGRAPHIC HALF FILL SPACE
31C0..31E3 ; CJK STROKE T .. CJK STROKE Q
3220..3243 ; PARENTHESIZED IDEOGRAPH ONE .. PARENTHESIZED IDEOGRAPH REACH
3280..32B0 ; CIRCLED IDEOGRAPH ONE .. CIRCLED IDEOGRAPH NIGHT
32C0..32CB ; IDEOGRAPHIC TELEGRAPH SYMBOL FOR JANUARY .. IDEOGRAPHIC TELEGRAPH SYMBOL FOR DECEMBER
3358..3370 ; IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR ZERO .. IDEOGRAPHIC TELEGRAPH SYMBOL FOR HOUR TWENTY-FOUR
337B..337F ; SQUARE ERA NAME HEISEI .. SQUARE CORPORATION
33E0..33FE ; IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY ONE .. IDEOGRAPHIC TELEGRAPH SYMBOL FOR DAY THIRTY-ONE
from: Han,Hiragana,Katakana
to: Han
3190..319F ; IDEOGRAPHIC ANNOTATION LINKING MARK .. IDEOGRAPHIC ANNOTATION MAN MARK
from: Common
to: Han
3244..3247 ; CIRCLED IDEOGRAPH QUESTION .. CIRCLED IDEOGRAPH KOTO
1F250 ; CIRCLED IDEOGRAPH ADVANTAGE
1F251 ; CIRCLED IDEOGRAPH ACCEPT
The full set of characters that would be affected is:
[〾〿㆐-㆟㇀-㇣㈠-㉇㊀-㊰㋀-㋋㍘-㍰㍻-㍿㏠-㏾🉐🉑]
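For reviewers who want to work with the affected characters programmatically, the following minimal sketch expands the code point ranges listed in the three "from:/to:" groups above into a single set. The dictionary layout and the names `PROPOSED_CHANGES` and `expand` are hypothetical choices for illustration; only the ranges themselves come from this document.

```python
# Inclusive code point ranges copied from the three "from:/to:" groups above.
PROPOSED_CHANGES = {
    "Bopomofo,Han,Hangul,Hiragana,Katakana -> Han,Hangul": [
        (0x303E, 0x303E), (0x303F, 0x303F), (0x31C0, 0x31E3),
        (0x3220, 0x3243), (0x3280, 0x32B0), (0x32C0, 0x32CB),
        (0x3358, 0x3370), (0x337B, 0x337F), (0x33E0, 0x33FE),
    ],
    "Han,Hiragana,Katakana -> Han": [(0x3190, 0x319F)],
    "Common -> Han": [
        (0x3244, 0x3247), (0x1F250, 0x1F250), (0x1F251, 0x1F251),
    ],
}


def expand(ranges):
    """Turn a list of inclusive (lo, hi) pairs into a set of code points."""
    return {cp for lo, hi in ranges for cp in range(lo, hi + 1)}


affected = set()
for label, ranges in PROPOSED_CHANGES.items():
    code_points = expand(ranges)
    print(f"{label}: {len(code_points)} code points")
    affected |= code_points

print("total affected:", len(affected))
print("U+31D0 affected:", 0x31D0 in affected)
```

The resulting set can then be compared against a local copy of the Script_Extensions data, or used to filter test strings during review.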
For comparison, the following list includes other characters whose Script_Extensions values contain Han and Hiragana, Katakana, or Bopomofo. These are not currently part of the proposal, but we’d like feedback as to whether any should be.
Script_Extensions=Bopomofo,Han items: 4
302A..302D ; IDEOGRAPHIC LEVEL TONE MARK .. IDEOGRAPHIC ENTERING TONE MARK // GC=NSM
Script_Extensions=Bopomofo,Han,Hangul,Hiragana,Katakana items: 10
3003 ; DITTO MARK
3013 ; GETA MARK // GC=Other_Symbol
301C..301F ; WAVE DASH .. LOW DOUBLE PRIME QUOTATION MARK
3030 ; WAVY DASH
3037 ; IDEOGRAPHIC TELEGRAPH LINE FEED SEPARATOR SYMBOL // GC=Other_Symbol
FE45 ; SESAME DOT
FE46 ; WHITE SESAME DOT
Script_Extensions=Bopomofo,Han,Hangul,Hiragana,Katakana,Yi items: 26
3001 ; IDEOGRAPHIC COMMA
3002 ; IDEOGRAPHIC FULL STOP
3008..3011 ; LEFT ANGLE BRACKET .. RIGHT BLACK LENTICULAR BRACKET
3014..301B ; LEFT TORTOISE SHELL BRACKET .. RIGHT WHITE SQUARE BRACKET
30FB ; KATAKANA MIDDLE DOT
FF61..FF65 ; HALFWIDTH IDEOGRAPHIC FULL STOP .. HALFWIDTH KATAKANA MIDDLE DOT
Script_Extensions=Han,Hiragana,Katakana items: 3
3006 ; IDEOGRAPHIC CLOSING MARK // GC=Other_Letter
303C ; MASU MARK // GC=Other_Letter
303D ; PART ALTERNATION MARK
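The "items: N" counts in the comparison listing above can be cross-checked against the ranges with a small parser. This is a sketch only: the regular expressions and the `tally` helper are hypothetical, and the listing is transcribed from this document with each range kept on one line.

```python
import re

# Comparison listing from above, with wrapped range names joined onto one line.
LISTING = """\
Script_Extensions=Bopomofo,Han items: 4
302A..302D ; IDEOGRAPHIC LEVEL TONE MARK .. IDEOGRAPHIC ENTERING TONE MARK
Script_Extensions=Bopomofo,Han,Hangul,Hiragana,Katakana items: 10
3003 ; DITTO MARK
3013 ; GETA MARK
301C..301F ; WAVE DASH .. LOW DOUBLE PRIME QUOTATION MARK
3030 ; WAVY DASH
3037 ; IDEOGRAPHIC TELEGRAPH LINE FEED SEPARATOR SYMBOL
FE45 ; SESAME DOT
FE46 ; WHITE SESAME DOT
Script_Extensions=Bopomofo,Han,Hangul,Hiragana,Katakana,Yi items: 26
3001 ; IDEOGRAPHIC COMMA
3002 ; IDEOGRAPHIC FULL STOP
3008..3011 ; LEFT ANGLE BRACKET .. RIGHT BLACK LENTICULAR BRACKET
3014..301B ; LEFT TORTOISE SHELL BRACKET .. RIGHT WHITE SQUARE BRACKET
30FB ; KATAKANA MIDDLE DOT
FF61..FF65 ; HALFWIDTH IDEOGRAPHIC FULL STOP .. HALFWIDTH KATAKANA MIDDLE DOT
Script_Extensions=Han,Hiragana,Katakana items: 3
3006 ; IDEOGRAPHIC CLOSING MARK
303C ; MASU MARK
303D ; PART ALTERNATION MARK
"""

HEADER = re.compile(r"^Script_Extensions=(\S+) items: (\d+)$")
DATA = re.compile(r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))? ;")


def tally(listing):
    """Return {script_list: (declared_count, counted_code_points)}."""
    result = {}
    current = None
    for line in listing.splitlines():
        if m := HEADER.match(line):
            current = m.group(1)
            result[current] = [int(m.group(2)), 0]
        elif (m := DATA.match(line)) and current is not None:
            lo = int(m.group(1), 16)
            hi = int(m.group(2) or m.group(1), 16)  # single code point if no range
            result[current][1] += hi - lo + 1
    return {key: tuple(value) for key, value in result.items()}


for scripts, (declared, counted) in tally(LISTING).items():
    status = "ok" if declared == counted else "MISMATCH"
    print(f"{scripts}: declared {declared}, counted {counted} ({status})")
```

A mismatch between the declared and counted values would point to a transcription error in either the header or a range.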
The full set of 43 comparison characters is: