It was IE (re: Unicode SGML entities in application/x-www-form-urlencoded)

From: Adrian Havill (havill@threeweb.ad.jp)
Date: Sat Apr 18 1998 - 23:38:34 EDT


Finally found the problem.

IE 4 is outputting SGML entities which map to Unicode when the character input
into a <FORM> is not in the target character set (such as the one in the charset
attribute in the Content-Type... either set via a <META> tag or by the HTTP
server).

In other words, if someone types "Nihongo" in Han characters followed by U+0100
(Latin Capital Letter A with Macron) {this can be done with the Windows NT
"Unicode Character Map" tool and cut-n-paste} into a form whose encoding is set
to "iso-8859-1", a Japanese version of IE 4 will send the macroned A out as
"&#256;" and will encode the "Nihongo" in Shift-JIS.

On a U.S. version, both the "Nihongo" and the macroned "A" will get SGML
entitified (into three entities), because neither is in the "native" character
set.
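
The observed behavior can be approximated in a few lines of Python. This is only
a sketch of what the browser appears to do, not IE's actual code: the function
name encode_field is hypothetical, and using "shift_jis" and "cp1252" as the
"native" character sets of the Japanese and U.S. versions is an assumption.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

def encode_field(text: str, native_charset: str) -> str:
    """Approximate the observed form-field encoding (hypothetical helper).

    Characters that exist in native_charset are sent as bytes in that
    charset; everything else becomes a decimal "&#NNNN;" reference.
    """
    raw = text.encode(native_charset, errors="xmlcharrefreplace")
    # Percent-encode the result as application/x-www-form-urlencoded would.
    return quote_from_bytes(raw)

# "Nihongo" in Han characters followed by U+0100 (A with macron).
sample = "\u65e5\u672c\u8a9e\u0100"

japanese = encode_field(sample, "shift_jis")  # assumed Japanese native charset
us = encode_field(sample, "cp1252")           # assumed U.S. native charset

# Undo the percent-encoding to see what was actually put on the wire.
print(unquote_to_bytes(japanese))
print(unquote_to_bytes(us))
```

With the Japanese charset only the macroned A turns into a reference; with the
U.S. charset every character in the sample does.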

Bizarre because the character set for the page is ISO-8859-1, but it's still
sending MBCS Shift-JIS.

Even more bizarre is that this is NOT done if the character is in ISO-8859-1.
Instead, such a character gets "remapped" into the nearest ASCII look-alike
(C with cedilla gets mapped to "C", E with acute gets mapped to "E"), which I
feel is _bad_ behavior in terms of I18N.
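
For illustration, a look-alike remapping of this kind can be imitated by
decomposing each character and dropping the combining marks. This is an
assumption about the effect only; how IE actually picks the look-alike is not
known, and the helper name ascii_lookalike is made up.

```python
import unicodedata

def ascii_lookalike(text: str) -> str:
    """Strip accents by canonical decomposition (hypothetical helper).

    NFD splits "C with cedilla" into "C" plus a combining cedilla; the
    combining marks are then discarded, which loses information.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# C with cedilla followed by E with acute.
print(ascii_lookalike("\u00c7\u00c9"))
```

Characters with no decomposition (sharp s, for instance) pass through this
sketch unchanged, so it is not a full model of the remapping.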

The entity substitution (&#xxxxx;) happens even if the FORM is encoded as
"multipart/form-data", which is capable of specifying the character set in the
Content-Type. That is a shame, because the nice thing about the new encoding
type is that it's supposed to help alleviate the lack-of-character-set-info
problem that plagues application/x-www-form-urlencoded.

====

So my revised question is: Aside from the bizarre policy for remapping (which is
definitely a BUG), is the sending of Unicode that's not in the ASCII subset as
SGML numeric generic entities in application/x-www-form-urlencoded (which is not
designed to handle anything other than ASCII) to be expected from here on? The
HTML 4.0 docs do not seem to get into this, instead advising that
application/x-www-form-urlencoded is designed for ASCII only (although in
practice it is not restricted to that) and recommending multipart/form-data for
all other character encodings.
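
If this behavior is here to stay, a receiving script would have to undo it. A
minimal sketch, assuming the server already knows (or guesses) the byte charset
the browser used; the function name decode_field is hypothetical:

```python
import html
from urllib.parse import quote_from_bytes, unquote_to_bytes

def decode_field(value: str, charset: str) -> str:
    """Decode one urlencoded form value (hypothetical helper).

    charset is an assumption supplied by the caller; the urlencoded
    request itself carries no character set information.
    """
    # "+" stands for a space in application/x-www-form-urlencoded.
    raw = unquote_to_bytes(value.replace("+", " "))
    text = raw.decode(charset, errors="replace")
    # Turn "&#256;" style references back into the characters they name.
    return html.unescape(text)

# Build a value the way the Japanese browser is described as sending it.
wire = quote_from_bytes("\u65e5\u672c\u8a9e".encode("shift_jis") + b"&#256;")
print(decode_field(wire, "shift_jis") == "\u65e5\u672c\u8a9e\u0100")
```

Note the ambiguity this leaves: a user who literally types the seven characters
"&#256;" into the form is indistinguishable from one who entered U+0100.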


