Lars Marius Garshol wrote:
> I've just discovered that it seems that Shift-JIS encodes a number of
> User-Defined Characters in the 0xF040 to 0xFCFC range, and that these
Yes, and every implementor may assign characters to them as they see fit.
> characters are used in web pages. Does anyone know of a source of
The problem is that most likely they are all tagged as charset="Shift_JIS", with nothing to indicate which variant of the Shift-JIS encoding is actually in use. Unreliable tagging is very common. That's one good reason why we all advocate Unicode...
> mappings for these characters, or even have information about what
> kinds of characters are found in this area?
Given how many Windows machines there are, and given that Shift-JIS seems to be more popular on Windows than on Unix systems, let's look at the Shift-JIS<->Unicode mapping table for windows-932: http://oss.software.ibm.com/cvs/icu/charset/data/xml/windows-932-2000.xml?rev=1.1&content-type=text/x-cvsweb-markup
(From our collection of mapping tables at http://oss.software.ibm.com/icu/charset/)
Shift-JIS F040..F9FC appears to be contiguously and linearly mapped to U+E000..U+E757.
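To make the linear mapping concrete, here is a minimal sketch in Python. It assumes the usual Shift-JIS trail byte ranges 0x40..0x7E and 0x80..0xFC (188 trail bytes per lead byte, skipping 0x7F), which is what makes 10 lead bytes cover exactly U+E000..U+E757. The function name sjis_udc_to_pua is made up for illustration and is not taken from ICU or from the mapping table itself.

```python
TRAILS_PER_LEAD = 188  # 0x40..0x7E (63 bytes) plus 0x80..0xFC (125 bytes)


def sjis_udc_to_pua(lead: int, trail: int) -> int:
    """Map a windows-932 UDC byte pair (F040..F9FC) to a PUA code point.

    Hypothetical helper: assumes the contiguous, linear layout described
    above, with 0x7F skipped as a trail byte.
    """
    if not 0xF0 <= lead <= 0xF9:
        raise ValueError("lead byte outside the UDC range F0..F9")
    if not (0x40 <= trail <= 0x7E or 0x80 <= trail <= 0xFC):
        raise ValueError("invalid Shift-JIS trail byte")
    # Close the gap left by the unused trail byte 0x7F.
    index = trail - 0x40 if trail < 0x7F else trail - 0x41
    return 0xE000 + (lead - 0xF0) * TRAILS_PER_LEAD + index


if __name__ == "__main__":
    for lead, trail in [(0xF0, 0x40), (0xF0, 0x80), (0xF1, 0x40), (0xF9, 0xFC)]:
        print(f"{lead:02X}{trail:02X} -> U+{sjis_udc_to_pua(lead, trail):04X}")
```

Other Shift-JIS variants would need their own table; this arithmetic only describes the windows-932 assignment.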
Some further Shift-JIS UDCs map to Unicode CJK compatibility characters U+FAxx.
Note that Windows uses some of the Unicode BMP PUA space for CJK characters in Unicode mode, for fonts and actual text processing.
Other Shift-JIS variants from different platforms will use different assignments, but I would try the Windows variant first for whatever web page you are looking at. As a receiver, maybe you can figure out which platform generated the file, from a <meta> tag or from the HTTP server identification.
As a recommendation, if you _have_ to _generate_ Shift-JIS web pages, you should avoid UDCs and instead use numeric character references (NCRs), with Unicode non-PUA[!] code points.
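One way to follow this recommendation, sketched in Python: the standard "xmlcharrefreplace" error handler turns every character the target codec cannot encode into a decimal NCR. The helper name to_shift_jis_html is made up for illustration, and the choice to reject PUA characters outright (rather than try to map them) is an assumption of this sketch. Python's "shift_jis" codec is used here; vendor extensions that it does not cover simply come out as NCRs too.

```python
def to_shift_jis_html(text: str) -> bytes:
    """Encode text for a Shift_JIS page, using NCRs instead of UDCs.

    Hypothetical helper: refuses BMP private-use characters, since an
    NCR for a PUA code point is no more portable than a UDC.
    """
    for ch in text:
        if 0xE000 <= ord(ch) <= 0xF8FF:
            raise ValueError(f"private-use character U+{ord(ch):04X} in input")
    # Unencodable characters become decimal NCRs such as &#8364;
    return text.encode("shift_jis", errors="xmlcharrefreplace")


if __name__ == "__main__":
    print(to_shift_jis_html("price: 100\u20ac"))
```

This only covers character data; text that may contain "&" or "<" still needs ordinary HTML escaping before encoding.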
The W3C has a page about the problems with Japanese charset identifiers and mapping tables.
markus