L2/01-192

Problems on Interoperability between Unicode and CJK Local Encodings

This page introduces problems around conversion between Unicode and CJK local encodings, mainly those involving non-letter symbols.

EUC-JP round-trip compatibility

This is the easiest problem. I mean that it is easy to see that a problem exists, not that the problem is easy to solve.

In the CJK world, CES (Character Encoding Scheme) and CCS (Coded Character Set) are actually different concepts; i.e., one CES may contain multiple CCSs. For example, EUC-JP is a CES which includes the CCSs ASCII and JIS X 0208 (and optionally JIS X 0201 Kana and JIS X 0212).

The Unicode Consortium supplies a conversion table from JIS X 0208 to Unicode (http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT). This table (version 0.9, 1994-03-08) maps 0x2140 in JIS X 0208 to U+005C (REVERSE SOLIDUS). Though this is fine when JIS X 0208 is used on its own, it causes a code point conflict when JIS X 0208 is combined with ASCII in EUC-JP, because ASCII 0x5C is also mapped to U+005C.

Implementing EUC-JP with JIS X 0212 raises one more conflict: 0x2237 in JIS X 0212 is mapped to U+007E by http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0212.TXT, which collides with ASCII 0x7E.
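The backslash conflict can be made concrete with a small sketch in Python. The dictionary is a hand-copied excerpt, not the real mapping files: ASCII 0x5C is assumed to map to itself, and the JIS X 0208 entry for 0x2140 is the one from JIS0208.TXT cited above. The names `TO_UCS`, `find_conflicts`, and `euc_jp_bytes` are hypothetical and exist only for this illustration.

```python
# Hand-copied excerpt: (coded character set, code) -> Unicode code point.
# This is illustrative data, not a parse of the real mapping tables.
TO_UCS = {
    ("ASCII", 0x5C): 0x005C,
    ("JISX0208", 0x2140): 0x005C,
    ("JISX0208", 0x2141): 0x301C,
}


def find_conflicts(to_ucs):
    """Return {ucs: [sources]} for every UCS code point with 2+ sources."""
    sources = {}
    for src, ucs in to_ucs.items():
        sources.setdefault(ucs, []).append(src)
    return {ucs: srcs for ucs, srcs in sources.items() if len(srcs) > 1}


def euc_jp_bytes(src):
    """Encode one (ccs, code) pair as EUC-JP bytes (ASCII and JIS X 0208 only)."""
    ccs, code = src
    if ccs == "ASCII":
        return bytes([code])
    # JIS X 0208 in EUC-JP: set the high bit of both bytes.
    return bytes([(code >> 8) | 0x80, (code & 0xFF) | 0x80])


conflicts = find_conflicts(TO_UCS)
for ucs, srcs in conflicts.items():
    encodings = ", ".join(euc_jp_bytes(s).hex() for s in srcs)
    print(f"U+{ucs:04X} is the target of EUC-JP byte sequences: {encodings}")
```

Two different EUC-JP byte sequences reach the same Unicode character, so a conversion back from Unicode can restore only one of them. The same check applies to the JIS X 0212 tilde case.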

Conversion tables differ between vendors

There are many CESs (Character Encoding Schemes) which use a common CCS (Coded Character Set). For example, CESs such as EUC-JP, Shift_JIS, and CP932 all include JIS X 0208 as a CCS.

For these CESs, a character from the same CCS should be mapped to the same UCS character. However, this does not hold for dozens of characters.

The following table lists characters for which the same character in JIS X 0208 (and other CCSs) is mapped to different code points by different conversion tables.

 
 
 
 
 
----------------------------------------------------------------------------------------------------
ORIGINAL                        Converted** to U+????/EastAsianWidth
CCS     Shift_JIS*  EUC-JP*     0208     SJIS     CP932    APPLE    0221A    0221B    JAVAA    JAVAB
----------------------------------------------------------------------------------------------------
[ASCII]
0x5C    ----        0x5C        ----     ----     ----     ----     ----     005C/Na  ----     005C/Na
0x7E    ----        0x7E        ----     ----     ----     ----     ----     007E/Na  ----     007E/Na
[JISX0201 Roman]
0x5C    0x5C        ----        ----     00A5/Na  005C/Na  00A5/Na  00A5/Na  ----     005C/Na  00A5/Na
0x7E    0x7E        ----        ----     203E/N   007E/Na  007E/Na  203E/N   ----     007E/Na  203E/N
[JISX0208]
0x2131  0x81 0x50   0xA1 0xB1   FFE3/F   FFE3/F   FFE3/F   FFE3/F   FFE3/F   203E/N   FFE3/F   FFE3/F
0x213D  0x81 0x5C   0xA1 0xBD   2015/A   2015/A   2015/A   2014/A   2014/A   2014/A   2015/A   2015/A
0x2140  0x81 0x5F   0xA1 0xC0   005C/Na  005C/Na  FF3C/F   FF3C/F   005C/Na  FF3C/F   FF3C/F   FF3C/F
0x2141  0x81 0x60   0xA1 0xC1   301C/W   301C/W   FF5E/F   301C/W   301C/W   301C/W   301C/W   301C/W
0x2142  0x81 0x61   0xA1 0xC2   2016/A   2016/A   2225/A   2016/A   2016/A   2016/A   2016/A   2016/A
0x215D  0x81 0x7C   0xA1 0xDD   2212/N   2212/N   FF0D/F   2212/N   2212/N   2212/N   2212/N   2212/N
0x216F  0x81 0x8F   0xA1 0xEF   FFE5/F   FFE5/F   FFE5/F   FFE5/F   FFE5/F   00A5/Na  FFE5/F   FFE5/F
0x2171  0x81 0x91   0xA1 0xF1   00A2/Na  00A2/Na  FFE0/F   00A2/Na  00A2/Na  00A2/Na  00A2/Na  00A2/Na
0x2172  0x81 0x92   0xA1 0xF2   00A3/Na  00A3/Na  FFE1/F   00A3/Na  00A3/Na  00A3/Na  00A3/Na  00A3/Na
0x224C  0x81 0xCA   0xA2 0xCC   00AC/Na  00AC/Na  FFE2/F   00AC/Na  00AC/Na  00AC/Na  00AC/Na  00AC/Na
[JISX0212]
0x2237  ----        0x8F,A2,B7  ----     ----     ----     ----     007E/Na  FF5E/F   ----     ----
----------------------------------------------------------------------------------------------------

Note 1 This table mentions Japanese encodings only.

Note 2 This table doesn't contain vendors' extended characters (characters that are invalid in formal EUC-JP and Shift_JIS).

Note * Converted from ASCII, JISX0201 Roman, and JISX0208 algorithmically. The algorithm for EUC-JP is described in http://www.unicode.org/Public/MAPPINGS/EASTASIA/JIS/JIS0208.TXT. The algorithm to convert from JIS X 0208 to Shift_JIS is:

out1 = ((in1 - 1) >> 1) + ((in1 <= 0x5e) ? 0x71 : 0xb1);
out2 = in2 + ((in1 & 1) ? ((in2 < 0x60) ? 0x1f : 0x20) : 0x7e);

where in1 and in2 are the 1st and 2nd bytes of JIS X 0208 respectively and out1 and out2 are the 1st and 2nd bytes of Shift_JIS. The Shift_JIS value is used as the original code for the "SJIS", "CP932", "Win98", and "Apple" conversions, because all of them (other than Shift_JIS itself) are supersets of Shift_JIS.
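The conversion can be sketched in Python as follows. This is a minimal illustration, not the code used to build the table. The function names `jis0208_to_sjis` and `jis0208_to_eucjp` are hypothetical, and the EUC-JP rule (set the high bit of both bytes) is stated from general knowledge of EUC-JP rather than copied from JIS0208.TXT. The conditional term is computed separately so that it is clearly added to the shifted first byte.

```python
def jis0208_to_sjis(in1, in2):
    """Convert a JIS X 0208 byte pair (0x21..0x7E each) to Shift_JIS bytes."""
    if not (0x21 <= in1 <= 0x7E and 0x21 <= in2 <= 0x7E):
        raise ValueError("not a JIS X 0208 byte pair")
    # Two JIS rows share one Shift_JIS lead byte.
    out1 = ((in1 - 1) >> 1) + (0x71 if in1 <= 0x5E else 0xB1)
    if in1 & 1:
        # Odd row: lower half of the trail-byte range, skipping 0x7F.
        out2 = in2 + (0x1F if in2 < 0x60 else 0x20)
    else:
        # Even row: upper half of the trail-byte range.
        out2 = in2 + 0x7E
    return out1, out2


def jis0208_to_eucjp(in1, in2):
    """EUC-JP sets the high bit of both JIS X 0208 bytes."""
    return in1 | 0x80, in2 | 0x80


for code in (0x2131, 0x2140, 0x216F, 0x224C):
    hi, lo = code >> 8, code & 0xFF
    s1, s2 = jis0208_to_sjis(hi, lo)
    e1, e2 = jis0208_to_eucjp(hi, lo)
    print(f"0x{code:04X}  SJIS 0x{s1:02X} 0x{s2:02X}  EUC-JP 0x{e1:02X} 0x{e2:02X}")
```

The printed byte pairs agree with the Shift_JIS* and EUC-JP* columns of the table for these four rows.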

Note **

Thus, the same character in Japanese encodings is mapped to different Unicode characters depending on the conversion table. In particular, CP932 (which has relatively more differences) is called Shift_JIS in Microsoft operating systems and is very widely used. This will cause serious problems in the future as Unicode becomes more popular in Japan.
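A comparison of this kind can be sketched in Python. The data below is a hand-copied excerpt of the table above (three of its columns and five of its JIS X 0208 rows), not a parse of the vendors' files. `TABLES` and `disagreements` are hypothetical names used only here.

```python
# Excerpt of the table above: JIS X 0208 code -> Unicode code point,
# keyed by conversion table name. Illustrative data only.
TABLES = {
    "0208": {0x2131: 0xFFE3, 0x2140: 0x005C, 0x2141: 0x301C,
             0x215D: 0x2212, 0x2171: 0x00A2},
    "CP932": {0x2131: 0xFFE3, 0x2140: 0xFF3C, 0x2141: 0xFF5E,
              0x215D: 0xFF0D, 0x2171: 0xFFE0},
    "APPLE": {0x2131: 0xFFE3, 0x2140: 0xFF3C, 0x2141: 0x301C,
              0x215D: 0x2212, 0x2171: 0x00A2},
}


def disagreements(tables):
    """Return {code: {table_name: ucs}} for codes the tables map differently."""
    result = {}
    codes = set().union(*(t.keys() for t in tables.values()))
    for code in sorted(codes):
        mapped = {name: t[code] for name, t in tables.items() if code in t}
        if len(set(mapped.values())) > 1:
            result[code] = mapped
    return result


for code, mapped in disagreements(TABLES).items():
    detail = "  ".join(f"{name}=U+{ucs:04X}" for name, ucs in mapped.items())
    print(f"0x{code:04X}: {detail}")
```

In this excerpt only 0x2131 is mapped identically by all three tables; the other four codes are reported.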

Width problems

Computers have been used for many years in the CJK world, as in the Euro-American world. Ideographs have occupied two columns in terminal-based software and hardware ever since CJK people began to use ideographs on computers. Thus, there are singlewidth or narrow ("Hankaku" in Japanese) characters and doublewidth or wide ("Zenkaku" in Japanese) characters. Though no official standard specifies the width of characters (at least in Japan), the concept of width is a very strong de facto standard in the CJK world.

In CJK local encodings, it is very easy to tell whether a character is singlewidth or doublewidth: characters from ISO 646 (ASCII, JIS X 0201 Roman, and so on) and JIS X 0201 Kana are singlewidth and all others are doublewidth. CJK people have relied widely on this de facto standard for decades, and IMO this proves that it has no fatal problems. Thus, Unicode and its conversion tables are responsible for the problem I am going to explain below.

The Unicode Consortium supplies Unicode Standard Annex #11, East Asian Width (UAX#11, formerly UTR#11), in order to keep compatibility with the CJK de facto standard. It classifies UCS characters into a few categories: "N", "A", "H", "W", "F", and "Na".

To keep compatibility with the CJK de facto standard, characters from ISO 646 (ASCII, JIS X 0201 Roman, and so on) and JIS X 0201 Kana have to be "Na" or "H", and all other characters in CJK encodings have to be "W", "F", or "A". In addition, the appearance of "N" should be regarded as a bug of UAX#11.
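The rule can be written down as a short Python sketch. The one-byte test for "singlewidth" is an assumption that holds for local codes as they appear in tables such as SHIFTJIS.TXT, and `expected_columns` and `is_width_compatible` are hypothetical names. `unicodedata.east_asian_width` is the standard-library lookup; it reflects the Unicode version bundled with the Python interpreter, which may classify some characters differently from the EastAsianWidth.txt examined on this page.

```python
import unicodedata


def expected_columns(local_code):
    """De facto CJK rule: one-byte local codes are singlewidth, others doublewidth."""
    return 1 if local_code < 0x100 else 2


def is_width_compatible(local_code, eaw_class):
    """Check an East Asian Width class against the rule described above."""
    if expected_columns(local_code) == 1:
        return eaw_class in ("Na", "H")
    return eaw_class in ("W", "F", "A")


# Look up the width class of a few unambiguous characters:
# ASCII letter, halfwidth katakana, ideograph, fullwidth Latin letter.
for ch in ("A", "\uFF71", "\u6F22", "\uFF21"):
    print(f"U+{ord(ch):04X} {unicodedata.east_asian_width(ch)}")

# A two-byte local code mapped to a "Na" character violates the rule.
print(is_width_compatible(0x2140, "Na"))
```

With this rule, a JIS X 0208 code such as 0x2140 is compatible only if its Unicode target is "W", "F", or "A".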

I checked with a script and found the following problems in EastAsianWidth.txt:

FILE JIS0208.TXT------
0x2140  U+005C  Na  # REVERSE SOLIDUS
0x215D  U+2212  N   # MINUS SIGN
0x2171  U+00A2  Na  # CENT SIGN
0x2172  U+00A3  Na  # POUND SIGN
0x224C  U+00AC  Na  # NOT SIGN
FILE JIS0212.TXT------
0x2234  U+00AF  Na  # MACRON
0x2237  U+007E  Na  # TILDE
0x2238  U+0384  N   # GREEK TONOS
0x2239  U+0385  N   # GREEK DIALYTIKA TONOS
0x2243  U+00A6  Na  # BROKEN BAR
0x226D  U+00A9  N   # COPYRIGHT SIGN
0x226E  U+00AE  N   # REGISTERED SIGN
0x2271  U+2116  N   # NUMERO SIGN
0x2661  U+0386  N   # GREEK CAPITAL LETTER ALPHA WITH TONOS
0x2662  U+0388  N   # GREEK CAPITAL LETTER EPSILON WITH TONOS
0x2663  U+0389  N   # GREEK CAPITAL LETTER ETA WITH TONOS
0x2664  U+038A  N   # GREEK CAPITAL LETTER IOTA WITH TONOS
0x2665  U+03AA  N   # GREEK CAPITAL LETTER IOTA WITH DIALYTIKA
0x2667  U+038C  N   # GREEK CAPITAL LETTER OMICRON WITH TONOS
0x2669  U+038E  N   # GREEK CAPITAL LETTER UPSILON WITH TONOS
0x266A  U+03AB  N   # GREEK CAPITAL LETTER UPSILON WITH DIALYTIKA
0x266C  U+038F  N   # GREEK CAPITAL LETTER OMEGA WITH TONOS
0x2671  U+03AC  N   # GREEK SMALL LETTER ALPHA WITH TONOS
0x2672  U+03AD  N   # GREEK SMALL LETTER EPSILON WITH TONOS
0x2673  U+03AE  N   # GREEK SMALL LETTER ETA WITH TONOS
0x2674  U+03AF  N   # GREEK SMALL LETTER IOTA WITH TONOS
0x2675  U+03CA  N   # GREEK SMALL LETTER IOTA WITH DIALYTIKA
0x2676  U+0390  N   # GREEK SMALL LETTER IOTA WITH DIALYTIKA AND TONOS
0x2677  U+03CC  N   # GREEK SMALL LETTER OMICRON WITH TONOS
0x2678  U+03C2  N   # GREEK SMALL LETTER FINAL SIGMA
0x2679  U+03CD  N   # GREEK SMALL LETTER UPSILON WITH TONOS
0x267A  U+03CB  N   # GREEK SMALL LETTER UPSILON WITH DIALYTIKA
0x267B  U+03B0  N   # GREEK SMALL LETTER UPSILON WITH DIALYTIKA AND TONOS
0x267C  U+03CE  N   # GREEK SMALL LETTER OMEGA WITH TONOS
0x2742  U+0402  N   # CYRILLIC CAPITAL LETTER DJE
0x2743  U+0403  N   # CYRILLIC CAPITAL LETTER GJE
0x2744  U+0404  N   # CYRILLIC CAPITAL LETTER UKRAINIAN IE
0x2745  U+0405  N   # CYRILLIC CAPITAL LETTER DZE
0x2746  U+0406  N   # CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
0x2747  U+0407  N   # CYRILLIC CAPITAL LETTER YI
0x2748  U+0408  N   # CYRILLIC CAPITAL LETTER JE
0x2749  U+0409  N   # CYRILLIC CAPITAL LETTER LJE
0x274A  U+040A  N   # CYRILLIC CAPITAL LETTER NJE
0x274B  U+040B  N   # CYRILLIC CAPITAL LETTER TSHE
0x274C  U+040C  N   # CYRILLIC CAPITAL LETTER KJE
0x274D  U+040E  N   # CYRILLIC CAPITAL LETTER SHORT U
0x274E  U+040F  N   # CYRILLIC CAPITAL LETTER DZHE
0x2772  U+0452  N   # CYRILLIC SMALL LETTER DJE
0x2773  U+0453  N   # CYRILLIC SMALL LETTER GJE
0x2774  U+0454  N   # CYRILLIC SMALL LETTER UKRAINIAN IE
0x2775  U+0455  N   # CYRILLIC SMALL LETTER DZE
0x2776  U+0456  N   # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
0x2777  U+0457  N   # CYRILLIC SMALL LETTER YI
0x2778  U+0458  N   # CYRILLIC SMALL LETTER JE
0x2779  U+0459  N   # CYRILLIC SMALL LETTER LJE
0x277A  U+045A  N   # CYRILLIC SMALL LETTER NJE
0x277B  U+045B  N   # CYRILLIC SMALL LETTER TSHE
0x277C  U+045C  N   # CYRILLIC SMALL LETTER KJE
0x277D  U+045E  N   # CYRILLIC SMALL LETTER SHORT U
0x277E  U+045F  N   # CYRILLIC SMALL LETTER DZHE
0x2922  U+0110  N   # LATIN CAPITAL LETTER D WITH STROKE
0x294B  U+014B  N   # LATIN SMALL LETTER ENG
0x2A21  U+00C1  N   # LATIN CAPITAL LETTER A WITH ACUTE
0x2A22  U+00C0  N   # LATIN CAPITAL LETTER A WITH GRAVE
0x2A23  U+00C4  N   # LATIN CAPITAL LETTER A WITH DIAERESIS
0x2A24  U+00C2  N   # LATIN CAPITAL LETTER A WITH CIRCUMFLEX
0x2A25  U+0102  N   # LATIN CAPITAL LETTER A WITH BREVE
0x2A26  U+01CD  N   # LATIN CAPITAL LETTER A WITH CARON
0x2A27  U+0100  N   # LATIN CAPITAL LETTER A WITH MACRON
0x2A28  U+0104  N   # LATIN CAPITAL LETTER A WITH OGONEK
0x2A29  U+00C5  N   # LATIN CAPITAL LETTER A WITH RING ABOVE
0x2A2A  U+00C3  N   # LATIN CAPITAL LETTER A WITH TILDE
0x2A2B  U+0106  N   # LATIN CAPITAL LETTER C WITH ACUTE
0x2A2C  U+0108  N   # LATIN CAPITAL LETTER C WITH CIRCUMFLEX
0x2A2D  U+010C  N   # LATIN CAPITAL LETTER C WITH CARON
0x2A2E  U+00C7  N   # LATIN CAPITAL LETTER C WITH CEDILLA
0x2A2F  U+010A  N   # LATIN CAPITAL LETTER C WITH DOT ABOVE
0x2A30  U+010E  N   # LATIN CAPITAL LETTER D WITH CARON
0x2A31  U+00C9  N   # LATIN CAPITAL LETTER E WITH ACUTE
0x2A32  U+00C8  N   # LATIN CAPITAL LETTER E WITH GRAVE
0x2A33  U+00CB  N   # LATIN CAPITAL LETTER E WITH DIAERESIS
0x2A34  U+00CA  N   # LATIN CAPITAL LETTER E WITH CIRCUMFLEX
0x2A35  U+011A  N   # LATIN CAPITAL LETTER E WITH CARON
0x2A36  U+0116  N   # LATIN CAPITAL LETTER E WITH DOT ABOVE
0x2A37  U+0112  N   # LATIN CAPITAL LETTER E WITH MACRON
0x2A38  U+0118  N   # LATIN CAPITAL LETTER E WITH OGONEK
0x2A3A  U+011C  N   # LATIN CAPITAL LETTER G WITH CIRCUMFLEX
0x2A3B  U+011E  N   # LATIN CAPITAL LETTER G WITH BREVE
0x2A3C  U+0122  N   # LATIN CAPITAL LETTER G WITH CEDILLA
0x2A3D  U+0120  N   # LATIN CAPITAL LETTER G WITH DOT ABOVE
0x2A3E  U+0124  N   # LATIN CAPITAL LETTER H WITH CIRCUMFLEX
0x2A3F  U+00CD  N   # LATIN CAPITAL LETTER I WITH ACUTE
0x2A40  U+00CC  N   # LATIN CAPITAL LETTER I WITH GRAVE
0x2A41  U+00CF  N   # LATIN CAPITAL LETTER I WITH DIAERESIS
0x2A42  U+00CE  N   # LATIN CAPITAL LETTER I WITH CIRCUMFLEX
0x2A43  U+01CF  N   # LATIN CAPITAL LETTER I WITH CARON
0x2A44  U+0130  N   # LATIN CAPITAL LETTER I WITH DOT ABOVE
0x2A45  U+012A  N   # LATIN CAPITAL LETTER I WITH MACRON
0x2A46  U+012E  N   # LATIN CAPITAL LETTER I WITH OGONEK
0x2A47  U+0128  N   # LATIN CAPITAL LETTER I WITH TILDE
0x2A48  U+0134  N   # LATIN CAPITAL LETTER J WITH CIRCUMFLEX
0x2A49  U+0136  N   # LATIN CAPITAL LETTER K WITH CEDILLA
0x2A4A  U+0139  N   # LATIN CAPITAL LETTER L WITH ACUTE
0x2A4B  U+013D  N   # LATIN CAPITAL LETTER L WITH CARON
0x2A4C  U+013B  N   # LATIN CAPITAL LETTER L WITH CEDILLA
0x2A4D  U+0143  N   # LATIN CAPITAL LETTER N WITH ACUTE
0x2A4E  U+0147  N   # LATIN CAPITAL LETTER N WITH CARON
0x2A4F  U+0145  N   # LATIN CAPITAL LETTER N WITH CEDILLA
0x2A50  U+00D1  N   # LATIN CAPITAL LETTER N WITH TILDE
0x2A51  U+00D3  N   # LATIN CAPITAL LETTER O WITH ACUTE
0x2A52  U+00D2  N   # LATIN CAPITAL LETTER O WITH GRAVE
0x2A53  U+00D6  N   # LATIN CAPITAL LETTER O WITH DIAERESIS
0x2A54  U+00D4  N   # LATIN CAPITAL LETTER O WITH CIRCUMFLEX
0x2A55  U+01D1  N   # LATIN CAPITAL LETTER O WITH CARON
0x2A56  U+0150  N   # LATIN CAPITAL LETTER O WITH DOUBLE ACUTE
0x2A57  U+014C  N   # LATIN CAPITAL LETTER O WITH MACRON
0x2A58  U+00D5  N   # LATIN CAPITAL LETTER O WITH TILDE
0x2A59  U+0154  N   # LATIN CAPITAL LETTER R WITH ACUTE
0x2A5A  U+0158  N   # LATIN CAPITAL LETTER R WITH CARON
0x2A5B  U+0156  N   # LATIN CAPITAL LETTER R WITH CEDILLA
0x2A5C  U+015A  N   # LATIN CAPITAL LETTER S WITH ACUTE
0x2A5D  U+015C  N   # LATIN CAPITAL LETTER S WITH CIRCUMFLEX
0x2A5E  U+0160  N   # LATIN CAPITAL LETTER S WITH CARON
0x2A5F  U+015E  N   # LATIN CAPITAL LETTER S WITH CEDILLA
0x2A60  U+0164  N   # LATIN CAPITAL LETTER T WITH CARON
0x2A61  U+0162  N   # LATIN CAPITAL LETTER T WITH CEDILLA
0x2A62  U+00DA  N   # LATIN CAPITAL LETTER U WITH ACUTE
0x2A63  U+00D9  N   # LATIN CAPITAL LETTER U WITH GRAVE
0x2A64  U+00DC  N   # LATIN CAPITAL LETTER U WITH DIAERESIS
0x2A65  U+00DB  N   # LATIN CAPITAL LETTER U WITH CIRCUMFLEX
0x2A66  U+016C  N   # LATIN CAPITAL LETTER U WITH BREVE
0x2A67  U+01D3  N   # LATIN CAPITAL LETTER U WITH CARON
0x2A68  U+0170  N   # LATIN CAPITAL LETTER U WITH DOUBLE ACUTE
0x2A69  U+016A  N   # LATIN CAPITAL LETTER U WITH MACRON
0x2A6A  U+0172  N   # LATIN CAPITAL LETTER U WITH OGONEK
0x2A6B  U+016E  N   # LATIN CAPITAL LETTER U WITH RING ABOVE
0x2A6C  U+0168  N   # LATIN CAPITAL LETTER U WITH TILDE
0x2A6D  U+01D7  N   # LATIN CAPITAL LETTER U WITH DIAERESIS AND ACUTE
0x2A6E  U+01DB  N   # LATIN CAPITAL LETTER U WITH DIAERESIS AND GRAVE
0x2A6F  U+01D9  N   # LATIN CAPITAL LETTER U WITH DIAERESIS AND CARON
0x2A70  U+01D5  N   # LATIN CAPITAL LETTER U WITH DIAERESIS AND MACRON
0x2A71  U+0174  N   # LATIN CAPITAL LETTER W WITH CIRCUMFLEX
0x2A72  U+00DD  N   # LATIN CAPITAL LETTER Y WITH ACUTE
0x2A73  U+0178  N   # LATIN CAPITAL LETTER Y WITH DIAERESIS
0x2A74  U+0176  N   # LATIN CAPITAL LETTER Y WITH CIRCUMFLEX
0x2A75  U+0179  N   # LATIN CAPITAL LETTER Z WITH ACUTE
0x2A76  U+017D  N   # LATIN CAPITAL LETTER Z WITH CARON
0x2A77  U+017B  N   # LATIN CAPITAL LETTER Z WITH DOT ABOVE
0x2B23  U+00E4  N   # LATIN SMALL LETTER A WITH DIAERESIS
0x2B24  U+00E2  N   # LATIN SMALL LETTER A WITH CIRCUMFLEX
0x2B25  U+0103  N   # LATIN SMALL LETTER A WITH BREVE
0x2B28  U+0105  N   # LATIN SMALL LETTER A WITH OGONEK
0x2B29  U+00E5  N   # LATIN SMALL LETTER A WITH RING ABOVE
0x2B2A  U+00E3  N   # LATIN SMALL LETTER A WITH TILDE
0x2B2B  U+0107  N   # LATIN SMALL LETTER C WITH ACUTE
0x2B2C  U+0109  N   # LATIN SMALL LETTER C WITH CIRCUMFLEX
0x2B2D  U+010D  N   # LATIN SMALL LETTER C WITH CARON
0x2B2E  U+00E7  N   # LATIN SMALL LETTER C WITH CEDILLA
0x2B2F  U+010B  N   # LATIN SMALL LETTER C WITH DOT ABOVE
0x2B30  U+010F  N   # LATIN SMALL LETTER D WITH CARON
0x2B33  U+00EB  N   # LATIN SMALL LETTER E WITH DIAERESIS
0x2B36  U+0117  N   # LATIN SMALL LETTER E WITH DOT ABOVE
0x2B38  U+0119  N   # LATIN SMALL LETTER E WITH OGONEK
0x2B39  U+01F5  N   # LATIN SMALL LETTER G WITH ACUTE
0x2B3A  U+011D  N   # LATIN SMALL LETTER G WITH CIRCUMFLEX
0x2B3B  U+011F  N   # LATIN SMALL LETTER G WITH BREVE
0x2B3D  U+0121  N   # LATIN SMALL LETTER G WITH DOT ABOVE
0x2B3E  U+0125  N   # LATIN SMALL LETTER H WITH CIRCUMFLEX
0x2B41  U+00EF  N   # LATIN SMALL LETTER I WITH DIAERESIS
0x2B42  U+00EE  N   # LATIN SMALL LETTER I WITH CIRCUMFLEX
0x2B46  U+012F  N   # LATIN SMALL LETTER I WITH OGONEK
0x2B47  U+0129  N   # LATIN SMALL LETTER I WITH TILDE
0x2B48  U+0135  N   # LATIN SMALL LETTER J WITH CIRCUMFLEX
0x2B49  U+0137  N   # LATIN SMALL LETTER K WITH CEDILLA
0x2B4A  U+013A  N   # LATIN SMALL LETTER L WITH ACUTE
0x2B4B  U+013E  N   # LATIN SMALL LETTER L WITH CARON
0x2B4C  U+013C  N   # LATIN SMALL LETTER L WITH CEDILLA
0x2B4F  U+0146  N   # LATIN SMALL LETTER N WITH CEDILLA
0x2B50  U+00F1  N   # LATIN SMALL LETTER N WITH TILDE
0x2B53  U+00F6  N   # LATIN SMALL LETTER O WITH DIAERESIS
0x2B54  U+00F4  N   # LATIN SMALL LETTER O WITH CIRCUMFLEX
0x2B56  U+0151  N   # LATIN SMALL LETTER O WITH DOUBLE ACUTE
0x2B58  U+00F5  N   # LATIN SMALL LETTER O WITH TILDE
0x2B59  U+0155  N   # LATIN SMALL LETTER R WITH ACUTE
0x2B5A  U+0159  N   # LATIN SMALL LETTER R WITH CARON
0x2B5B  U+0157  N   # LATIN SMALL LETTER R WITH CEDILLA
0x2B5C  U+015B  N   # LATIN SMALL LETTER S WITH ACUTE
0x2B5D  U+015D  N   # LATIN SMALL LETTER S WITH CIRCUMFLEX
0x2B5E  U+0161  N   # LATIN SMALL LETTER S WITH CARON
0x2B5F  U+015F  N   # LATIN SMALL LETTER S WITH CEDILLA
0x2B60  U+0165  N   # LATIN SMALL LETTER T WITH CARON
0x2B61  U+0163  N   # LATIN SMALL LETTER T WITH CEDILLA
0x2B65  U+00FB  N   # LATIN SMALL LETTER U WITH CIRCUMFLEX
0x2B66  U+016D  N   # LATIN SMALL LETTER U WITH BREVE
0x2B68  U+0171  N   # LATIN SMALL LETTER U WITH DOUBLE ACUTE
0x2B6A  U+0173  N   # LATIN SMALL LETTER U WITH OGONEK
0x2B6B  U+016F  N   # LATIN SMALL LETTER U WITH RING ABOVE
0x2B6C  U+0169  N   # LATIN SMALL LETTER U WITH TILDE
0x2B71  U+0175  N   # LATIN SMALL LETTER W WITH CIRCUMFLEX
0x2B72  U+00FD  N   # LATIN SMALL LETTER Y WITH ACUTE
0x2B73  U+00FF  N   # LATIN SMALL LETTER Y WITH DIAERESIS
0x2B74  U+0177  N   # LATIN SMALL LETTER Y WITH CIRCUMFLEX
0x2B75  U+017A  N   # LATIN SMALL LETTER Z WITH ACUTE
0x2B76  U+017E  N   # LATIN SMALL LETTER Z WITH CARON
0x2B77  U+017C  N   # LATIN SMALL LETTER Z WITH DOT ABOVE
FILE SHIFTJIS.TXT------
0x7E    U+203E  N   # OVERLINE
0x815F  U+005C  Na  # REVERSE SOLIDUS
0x817C  U+2212  N   # MINUS SIGN
0x8191  U+00A2  Na  # CENT SIGN
0x8192  U+00A3  Na  # POUND SIGN
0x81CA  U+00AC  Na  # NOT SIGN
FILE CP932.TXT------
0x8782  U+2116  N   #NUMERO SIGN
0xFA59  U+2116  N   #NUMERO SIGN
FILE JAPANESE.TXT------
FILE GB2312.TXT------
0x216D  U+2116  N   # NUMERO SIGN
FILE CHINSIMP.TXT------
FILE BIG5.TXT------
0xA145  U+2022  N   # BULLET
0xA14E  U+FF64  H   # HALFWIDTH IDEOGRAPHIC COMMA
0xA1C2  U+203E  N   # OVERLINE
0xA1F2  U+2641  N   # EARTH
0xA244  U+00A5  Na  # YEN SIGN
0xA246  U+00A2  Na  # CENT SIGN
0xA247  U+00A3  Na  # POUND SIGN
FILE CHINTRAD.TXT------
FILE KSX1001.TXT------
FILE KOREAN.TXT------

The script is as follows:

#!/usr/bin/perl
# Report mapping-table entries whose East Asian Width class conflicts with
# the width implied by the local code (one byte = singlewidth).
use strict;
use warnings;

# Load code point -> width class from EastAsianWidth.txt.
my %width;
open(my $eaw, '<', "EastAsianWidth.txt") || die "Cannot open width file.";
while (my $line = <$eaw>) {
    next unless $line =~ /^([0-9A-F]+);([A-Za-z]+)/;
    $width{$1} = $2;
}
close($eaw);

# Check one tab-separated mapping table. The three numbers are the column
# indexes of the local code, the UCS code, and the comment.
sub checkfile {
    my ($file, $localcolumn, $ucscolumn, $commentcolumn) = @_;
    open(my $fh, '<', $file) || die "Cannot open $file";
    print "FILE $file------\n";
    while (my $line = <$fh>) {
        next if $line =~ /^\#/;
        chomp($line);
        my @list = split(/\t/, $line);
        my $loc = $list[$localcolumn];
        my $ucs = $list[$ucscolumn];
        # Skip lines without a usable local code or UCS value.
        next unless defined $loc && $loc =~ /^0x[0-9A-Fa-f]+$/;
        next unless defined $ucs && $ucs =~ /^0x[0-9A-Fa-f]+$/;
        # hex() is required: Perl does not numify "0x..." strings.
        my $locnum = hex($loc);
        my $ucsnum = hex($ucs);
        # Skip C0 and C1 control characters.
        next if $ucsnum < 0x20 || ($ucsnum >= 0x7f && $ucsnum <= 0x9f);
        $ucs =~ s/^0x//;
        my $w = $width{$ucs} // "";
        my $com = $list[$commentcolumn] // "";
        if ($locnum < 0x100 &&
            ($w eq "W" || $w eq "F" || $w eq "A" || $w eq "N")) {
            # Singlewidth local code mapped to a non-narrow character.
            print "$loc\tU+$ucs\t$w\t$com\n";
        } elsif ($locnum > 0x100 &&
            ($w eq "N" || $w eq "H" || $w eq "Na")) {
            # Doublewidth local code mapped to a narrow or neutral character.
            print "$loc\tU+$ucs\t$w\t$com\n";
        }
    }
    close($fh);
}

checkfile("JIS0208.TXT", 1, 2, 3);
checkfile("JIS0212.TXT", 0, 1, 2);
checkfile("SHIFTJIS.TXT", 0, 1, 2);
checkfile("CP932.TXT", 0, 1, 2);
checkfile("JAPANESE.TXT", 0, 1, 2);
checkfile("GB2312.TXT", 0, 1, 2);
checkfile("CHINSIMP.TXT", 0, 1, 2);
checkfile("BIG5.TXT", 0, 1, 2);
checkfile("CHINTRAD.TXT", 0, 1, 2);
checkfile("KSX1001.TXT", 0, 1, 2);
checkfile("KOREAN.TXT", 0, 1, 2);

Note that this research is limited: only conversion tables from the Unicode Consortium were examined.

This result can be regarded as a bug of UAX#11 or a bug of the conversion tables. In some cases, the problem can be fixed by modifying only UAX#11, like the following:

However, for U+005C REVERSE SOLIDUS (\), we cannot modify UAX#11 to satisfy all the encodings I tested, because some tables (such as JIS0208.TXT and SHIFTJIS.TXT) need U+005C to be doublewidth while other tables (such as CP932) need U+005C to be singlewidth. As a standard, Unicode could classify U+005C as "A". However, some software will treat "A" characters as doublewidth when compatibility is needed. Thus, the only solution is to modify the conversion tables. I imagine that the most moderate solution is to classify U+005C as "Na" and modify JIS0208.TXT, SHIFTJIS.TXT, and JIS X 0221 to convert JIS X 0208 0x2140 to U+FF3C (＼).

Other similar problematic characters are:

I think the line

0xA14E  U+FF64  H   # HALFWIDTH IDEOGRAPHIC COMMA

in BIG5.TXT is a bug. 0xA14E must be doublewidth while U+FF64 must be "H". Thus, the conversion table BIG5.TXT must be modified.

JIS X 0213

The Unicode Consortium has not yet released a conversion table for JIS X 0213. Since this new Japanese national standard includes many non-letter symbols, new examples of these problems will appear.

I imagine that many more non-letter symbols will need to be classified as "A" in UAX#11.

ASCII and JIS X 0201 Roman

When converting EUC-JP and Shift_JIS, the handling of 0x5c and 0x7e can be a problem. Since both encodings have a long history and Japanese people have a lot of experience in handling them, I will now introduce how it is done.

The solution is very simple: just regard YEN SIGN and REVERSE SOLIDUS as different glyphs of the same character. Then the distinction between ASCII and JIS X 0201 Roman can be neglected.

Thus, when a Japanese person says "Shift_JIS", it almost always means "CP932". (Most Japanese people know nothing about encodings; a certain number of people [Windows and Macintosh users] know the word "Shift_JIS" as the only usable encoding.)

Please don't blame Japanese people who are not aware of the distinction between Shift_JIS and CP932. The only difference between Shift_JIS and CP932 used to be that CP932 has extension characters. It is the introduction of Unicode, and conversion to and from it, that brought a confusing incompatibility of non-letter symbols between Shift_JIS and CP932.

The reason why I wrote that "Shift_JIS" almost always means "CP932" is the following. For example, DOS/Windows programmers write YEN SIGN + "n" to mean a newline (in Shift_JIS or, strictly speaking, CP932). DOS/Windows uses YEN SIGN (0x5c) as the directory name separator. This is why Microsoft cannot convert 0x5c in CP932 to any character other than U+005C.

Not only Windows users but also UNIX users have regarded 0x5c in Shift_JIS as an ambiguous character that is either YEN SIGN or REVERSE SOLIDUS. For example, popular Japanese encoding converters such as nkf and qkc don't care about the distinction between ASCII and JIS X 0201 Roman. When I use TeraTerm, a telnet/ssh client for Windows, and see a YEN SIGN, I often read it as a REVERSE SOLIDUS according to the context. (When the writer is Japanese, it means YEN SIGN in most cases. When the writer is not Japanese, it always means REVERSE SOLIDUS.)

Thus, I don't complain if 0x5c in Shift_JIS is mapped to U+005C. Rather, distinguishing them (i.e., being strict about the official standards) might confuse many Japanese people.
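This convention can be sketched as a small folding step applied to text after it has been converted to Unicode. It is only an illustration of the idea: `FOLD` and `fold_roman` are hypothetical names, and folding OVERLINE into TILDE is an assumption made here by analogy for 0x7e, since the text above spells out only the 0x5c case.

```python
# Treat the JIS X 0201 Roman variants as glyph variants of the ASCII
# characters that share their byte values (0x5c and 0x7e).
FOLD = str.maketrans({
    "\u00A5": "\u005C",  # YEN SIGN -> REVERSE SOLIDUS (0x5c)
    "\u203E": "\u007E",  # OVERLINE -> TILDE (0x7e), assumed by analogy
})


def fold_roman(text):
    """Fold YEN SIGN and OVERLINE into their ASCII counterparts."""
    return text.translate(FOLD)


# A DOS/Windows path that was decoded with YEN SIGN as the separator.
print(fold_roman("C:\u00A5TEMP\u00A5a.txt"))
```

After folding, a path or escape sequence reads the same whichever conversion table produced it. The cost is that a genuine YEN SIGN can no longer be told apart from a backslash.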


Tomohiro KUBOTA <kubota@debian.org>