Re: Round trip mapping - SJIS to Unicode

From: Ken Lunde (lunde@adobe.com)
Date: Mon Aug 24 1998 - 10:43:09 EDT


Abhishek,

You wrote:

>> It does not matter because the characters are just duplicates.
>> For e.g. in the case of 0x879c --> Ux222a --> 0x81be, 0x879c, Ux222a and
>> 0x879c represent the same character. 0x879c and 0x81be are just duplicates.

The fundamental issue here is how to handle such beasts. Luckily, in
the case of Microsoft's Japanese character set, for every case of
duplicate encoding, there is *always* a preferred code point for
round-trip. For example, it is easy to force 0x81BE and 0x879C to
become U+222A. But when you convert back into Shift-JIS, which code
point should you use? In this case, 0x81BE is the preferred code point.
0x879C is a character from NEC Row 13, which was developed to work
with JIS78. However, several characters in NEC Row 13 were later added
to JIS83 (in Row 2), which makes those NEC Row 13 characters duplicates
whenever the rest of the character set conforms to JIS83 (or JIS90 or
JIS97), as is the case for the Windows-J character set.

Anyway, the XKP specification defines what the preferred mappings
should be for such cases. See:

  http://www.xkp.or.jp/

The basic rules are:

o If the character is in both JIS83 and NEC Row 13 (and possibly an
  IBM Selected character), the JIS83 code point is preferred.

o If the character is in the IBM Selected set (NEC and IBM positions),
  the IBM position is preferred.

o If the character is in NEC Row 13 and the IBM Selected set, the NEC
  Row 13 code point is preferred.
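
As a rough illustration, the three rules above amount to a fixed
preference order among the regions a duplicate can live in: JIS83
first, then NEC Row 13, then the IBM position, then the NEC position
of the IBM Selected set. The Python sketch below is only that, a
sketch. The Shift-JIS ranges used to classify a code point, and the
names REGIONS, rank, and preferred, are my own assumptions for
illustration and are not taken from the XKP specification.

```python
# Assumed Shift-JIS ranges for the vendor regions (illustrative, not from XKP).
# Each entry: (label, first code, last code, rank); a lower rank is preferred.
REGIONS = [
    ("NEC Row 13",                  0x8740, 0x879E, 1),
    ("IBM position",                0xFA40, 0xFC4B, 2),
    ("NEC position (IBM Selected)", 0xED40, 0xEEFC, 3),
]


def rank(sjis):
    """Return the preference rank of a two-byte Shift-JIS code point."""
    for _label, first, last, r in REGIONS:
        if first <= sjis <= last:
            return r
    # Anything outside the vendor regions is treated here as JIS83.
    return 0


def preferred(duplicates):
    """Pick the round-trip code point among duplicate Shift-JIS encodings."""
    return min(duplicates, key=rank)


if __name__ == "__main__":
    for group in ([0x81BE, 0x879C], [0x81CA, 0xEEF9, 0xFA54]):
        codes = ", ".join(f"0x{c:04X}" for c in group)
        print(f"{codes} -> preferred 0x{preferred(group):04X}")
```

Ranking by region keeps the rules in one small table, so a different
preference order only requires changing the rank column.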

I have developed a machine-readable file that contains these mappings,
and demonstrates what the preferred ones are. If anyone is interested,
I can send it privately (it is about 100K).

Interestingly, there is one case of three mappings:

  0x81CA, 0xEEF9, 0xFA54 -> U+FFE2 -> 0x81CA
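
One way to check how a given converter treats these duplicates is to
decode each Shift-JIS code point and encode the result again. The
sketch below uses Python's built-in "cp932" codec as a stand-in for
the Windows-J character set. That substitution is my assumption, and
whether a particular codec follows the preferred mappings described
here is exactly what the script lets you inspect. The helper name
round_trip is hypothetical.

```python
def round_trip(sjis_bytes, codec="cp932"):
    """Decode one Shift-JIS character, then encode it back with the same codec."""
    ch = sjis_bytes.decode(codec)
    return ch, ch.encode(codec)


if __name__ == "__main__":
    # The duplicate encodings mentioned in this message.
    samples = (b"\x87\x9c", b"\x81\xbe", b"\x81\xca", b"\xee\xf9", b"\xfa\x54")
    for raw in samples:
        ch, back = round_trip(raw)
        print(f"0x{raw.hex().upper()} -> U+{ord(ch):04X} -> 0x{back.hex().upper()}")
```

If the last column differs from the first for a given line, that
input is a non-preferred duplicate for the codec being tested.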

Hope this helps...

-- Ken


