Re: Fun with UDCs in Shift-JIS

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Thu Jan 17 2002 - 12:23:30 EST

Previous message: Addison Phillips [wM]: "RE: Fun with UDCs in Shift-JIS"
In reply to: Lars Marius Garshol: "Fun with UDCs in Shift-JIS"
Next in thread: Thomas Chan: "Re: Fun with UDCs in Shift-JIS"
Reply: Thomas Chan: "Re: Fun with UDCs in Shift-JIS"
Reply: Lars Marius Garshol: "Re: Fun with UDCs in Shift-JIS"

Lars Marius Garshol wrote:

> I've just discovered that it seems that Shift-JIS encodes a number of
> User-Defined Characters in the 0xF040 to 0xFCFC range, and that these

Yes, and every implementor may assign characters to them as they see fit.

> characters are used in web pages. Does anyone know of a source of

The problem is that most likely they are all tagged as charset="Shift_JIS", with nothing to indicate which variant of Shift-JIS is actually in use. Unreliable tagging is very common. That's one good reason why we all advocate Unicode...

> mappings for these characters, or even have information about what
> kinds of characters are found in this area?

Given how many Windows machines there are, and given that Shift-JIS seems to be more popular on Windows than on Unixes, let's look at the Shift-JIS<->Unicode mapping table for windows-932: http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
(From our collection of mapping tables at http://oss.software.ibm.com/icu/charset/)

Shift-JIS F040..F9FC (10 lead bytes x 188 trail bytes = 1880 code points) appears to be contiguously and linearly mapped to U+E000..U+E757.
Some further Shift-JIS UDCs map to Unicode CJK compatibility characters U+FAxx.
Note that Windows uses some of the Unicode BMP PUA space for CJK characters in Unicode mode, for fonts and actual text processing.
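
To make the linear mapping concrete, here is a minimal Python sketch of the arithmetic. The function name sjis_udc_to_pua is made up for illustration, and the formula assumes the windows-932 layout described above (trail bytes 0x40..0x7E and 0x80..0xFC, with 0x7F skipped). Other Shift-JIS variants may assign this range differently, so treat it as a first guess rather than a definitive decoder.

```python
def sjis_udc_to_pua(lead: int, trail: int) -> int:
    """Map a windows-932 UDC byte pair (F040..F9FC) to a PUA code point.

    Hypothetical helper: assumes the contiguous, linear windows-932
    layout; it says nothing about other Shift-JIS variants.
    """
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError(f"lead byte {lead:#04x} is outside F0..F9")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        raise ValueError(f"trail byte {trail:#04x} is not a valid trail byte")
    # Each lead byte covers 188 trail bytes: 0x40..0x7E (63 values)
    # and 0x80..0xFC (125 values). 0x7F is never a trail byte.
    offset = trail - 0x40 if trail < 0x7F else trail - 0x41
    return 0xE000 + (lead - 0xF0) * 188 + offset


if __name__ == "__main__":
    for lead, trail in [(0xF0, 0x40), (0xF0, 0x80), (0xF1, 0x40), (0xF9, 0xFC)]:
        print(f"{lead:02X}{trail:02X} -> U+{sjis_udc_to_pua(lead, trail):04X}")
```

The two end points, F040 and F9FC, come out as U+E000 and U+E757, matching the range in the table.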

Other Shift-JIS variants from different platforms will use different assignments, but I would try the Windows variant first for whatever web page you are looking at. As a receiver, maybe you can figure out which platform generated the file from a <meta> tag or the HTTP server identification.

As a recommendation, if you _have_ to _generate_ Shift-JIS web pages, you should avoid UDCs and instead use NCRs (numeric character references, with Unicode non-PUA[!] code points).
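
One way to follow that recommendation on the generating side, sketched in Python. The function name to_shift_jis_html is hypothetical, and the sketch assumes that Python's built-in "shift_jis" codec is an acceptable stand-in for the common, non-UDC repertoire. Anything the codec cannot encode is written as a decimal NCR via the standard "xmlcharrefreplace" error handler, and PUA input is rejected up front because an NCR for a PUA code point would be just as ambiguous as a UDC.

```python
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF  # BMP Private Use Area


def to_shift_jis_html(text: str) -> bytes:
    """Encode HTML text as Shift-JIS, using NCRs instead of UDCs.

    Hypothetical helper: raises ValueError for PUA characters, since
    they carry no interoperable meaning for the receiver.
    """
    for ch in text:
        if PUA_FIRST <= ord(ch) <= PUA_LAST:
            raise ValueError(f"private-use character U+{ord(ch):04X} in input")
    # Characters missing from the codec's table become &#NNNN; references.
    return text.encode("shift_jis", errors="xmlcharrefreplace")


if __name__ == "__main__":
    # The euro sign is not in the shift_jis table, so it is emitted as an NCR.
    print(to_shift_jis_html("price: \u20ac5"))
```

The receiver then only needs the standard part of the mapping table plus NCR support, whichever platform variant it assumes.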

The W3C has a page about the problems with Japanese charset identifiers and mapping tables.

markus


This archive was generated by hypermail 2.1.2 : Thu Jan 17 2002 - 11:45:03 EST