Re: It was IE (re: Unicode SGML entities in application/x-www-form-urlencoded)

From: Chris Wendt (christw@microsoft.com)
Date: Sun Apr 19 1998 - 20:13:48 EDT


>IE 4 is outputting SGML entities which map to Unicode when the character
>input into a <FORM> is not in the target character set

This part of your analysis of IE4 behavior is correct.

>On a U.S. version, both the "Nihongo" and the Macroned "A" will get
>SGML entitified (into three entities), because neither is in the "native"
>character set.

This part of the analysis is not correct. The encoding of the FORM data does
not depend on the language of the browser or the language of the OS, but on
the charset that IE recognizes the page to be encoded in. You can force a
certain encoding by the standard methods: <META http-equiv....charset=....>
in the document, or the HTTP header. An additional requirement is that the
appropriate language support is installed on the client machine. For
example, the label <META HTTP-EQUIV="Content-Type" content="text/html;
charset=shift_jis"> is only recognized if Japanese support is present
on the system. Otherwise, whatever encoding the user chose under the
View.Language menu is used in the FORM submission.

>Bizarre because the character set for the page is ISO-8859-1, but it's
>still sending MBCS Shift-JIS.

This would not happen if the page were labeled as iso-8859-1. Every language
version of Windows 95 and Windows NT has support for iso-8859-1, including
Japanese Windows 95. I recommend labeling the page as UTF-8; then you
don't have to worry about the &#nnnnn; encoded characters, and you always get
nice, clean UTF-8 in the response.
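
As a rough illustration of the server side of that recommendation, here is
a minimal Python sketch. It assumes the page was labeled UTF-8 and that the
browser therefore percent-encoded UTF-8 bytes; the field name "q" and the
request body are made up for the example.

```python
from urllib.parse import parse_qs

# Hypothetical application/x-www-form-urlencoded body: "Nihongo" in Han
# characters followed by U+0100, sent as percent-encoded UTF-8 bytes.
body = "q=%E6%97%A5%E6%9C%AC%E8%AA%9E%C4%80"

# errors="strict" makes a mislabeled submission fail loudly instead of
# silently turning into replacement characters.
fields = parse_qs(body, encoding="utf-8", errors="strict")
text = fields["q"][0]

# No &#nnnnn; references to undo: every character arrives as itself.
print([f"U+{ord(c):04X}" for c in text])
```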

>Instead, if the character is in ISO-8859-1, the character gets "remapped"
>into the nearest ASCII look-alike (C with cedilla gets mapped to "C",
>E with acute gets mapped to "E"), which I feel is _bad_ behavior in
>terms of I18N.

Correct. If the submission is in Shift-JIS, some of the Latin1 8-bit
characters get remapped to the "nearest" unaccented us-ascii character.
Honestly, I don't know whether it would be better to always and consistently
retain the true Latin1 character in &#nnnnn; notation. Personally I lean in
your direction: better to retain the original values.
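
The difference between the two policies can be sketched in Python. This is
not IE's actual mapping table: `best_fit` is a hypothetical stand-in that
simply strips accents via Unicode decomposition, and Python's
`xmlcharrefreplace` error handler plays the role of the &#nnnnn; notation.

```python
import unicodedata

def best_fit(text: str) -> str:
    """Hypothetical lossy remapping: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def keep_as_ncr(text: str, charset: str) -> bytes:
    """Encode text; characters the charset lacks become &#nnnnn; references."""
    return text.encode(charset, errors="xmlcharrefreplace")

sample = "\u00c7\u00c9"  # C with cedilla, E with acute

print(best_fit(sample))                  # the accents are gone for good
print(keep_as_ncr(sample, "shift_jis"))  # the code points can be recovered
```

With the first policy the server cannot tell a remapped "C" from a typed
"C"; with the second the original value survives the round trip.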

>So my revised question is: Aside from the bizarre policy for remapping
>(which is definitely a BUG),

OK, maybe a BUG or maybe an unfortunate design decision. I am willing to
hear other opinions on this.

>is the sending of Unicode that's not in the ASCII subset as
>SGML numeric generic entities in application/x-www-form-urlencoded
>(which is not designed to handle anything other than ASCII) to be
>expected from here on?

See http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.3, which gives
the FORM author a method to specify which charsets are acceptable to the
server servicing the ACTION. The user agent is encouraged to choose among
the charsets listed, preferably one that can hold the user-generated data
without applying any additional escaping techniques. I expect a future
version of Internet Explorer to choose only UTF-8 when it is listed as an
acceptable choice, in the case where it detects characters that don't fit
in the FORM page's document charset.
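
The selection rule described above might be sketched as follows. This is an
illustration only, not the algorithm of any shipping browser; the function
name `choose_charset` is hypothetical, and Python's codecs stand in for
whatever conversion tables a user agent has installed.

```python
from typing import Optional

def choose_charset(text: str, accept_charset: str) -> Optional[str]:
    """Return the first listed charset that holds text without escaping.

    accept_charset is the attribute value, a space- and/or comma-delimited
    list of charset names. Returns None if nothing listed fits.
    """
    for name in accept_charset.replace(",", " ").split():
        try:
            text.encode(name)
        except (UnicodeEncodeError, LookupError):
            # Either the text does not fit, or no codec by that name exists.
            continue
        return name
    return None

user_input = "\u65e5\u672c\u8a9e\u0100"  # "Nihongo" plus A with macron
print(choose_charset(user_input, "iso-8859-1, shift_jis, utf-8"))
```

A None result corresponds to the case where the data fits none of the
listed charsets.
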
HTML4 does not give a recommendation on what to do with characters that still
don't fit in _any_ of the charsets listed in the FORM's accept-charset, or
what to do if none are listed. It would be safe, but inconvenient, for the
user agent to prevent any input that would not fit. I am not guaranteeing
that future versions of Internet Explorer will indeed prevent non-fitting
input, so for the time being I recommend coding for the &#nnnnn; method or
specifying UTF-8 as the FORM page's document charset.
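
"Coding for the &#nnnnn; method" on the server could look roughly like the
Python sketch below. It assumes the server already knows which charset the
submission is in (here shift_jis); the helper name `decode_field` and the
sample value are made up for the example.

```python
import re
from urllib.parse import quote, unquote_to_bytes

NCR = re.compile(r"&#([0-9]+);")

def _expand(match: "re.Match[str]") -> str:
    code = int(match.group(1))
    # Leave out-of-range references untouched rather than raising.
    return chr(code) if code <= 0x10FFFF else match.group(0)

def decode_field(raw: str, charset: str) -> str:
    """Decode one urlencoded form value, then expand &#nnnnn; references."""
    data = unquote_to_bytes(raw.replace("+", " "))  # '+' means space
    text = data.decode(charset)
    # Caveat: a user who literally typed "&#256;" cannot be told apart
    # from a character the browser escaped.
    return NCR.sub(_expand, text)

# Simulated submission: Shift-JIS bytes for "Nihongo" plus an escaped U+0100.
raw = quote("\u65e5\u672c\u8a9e".encode("shift_jis") + b"&#256;")
print(decode_field(raw, "shift_jis"))
```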

-----Original Message-----
From: Adrian Havill <havill@threeweb.ad.jp>
To: Unicode List <unicode@unicode.org>
Date: Saturday, April 18, 1998 9:02 PM
Subject: It was IE (re: Unicode SGML entities in
application/x-www-form-urlencoded)

Finally found the problem.

IE 4 is outputting SGML entities which map to Unicode when the character input
into a <FORM> is not in the target character set (such as the one in the
charset attribute in the Content-Type... either set via a <META> tag or by the
HTTP server).

In other words, if someone types "Nihongo" in Han characters followed by
U+0100 (Latin Capital Letter A with Macron) {This can be done by using the
Windows NT "Unicode Character Map" tool and cut-n-paste} into a form that has
encoding set to "iso-8859-1," it will send the Macroned A out as "&#256;" and
will encode the "Nihongo" in Shift-JIS on a Japanese version of IE 4.

On a U.S. version, both the "Nihongo" and the Macroned "A" will get SGML
entitified (into three entities), because neither is in the "native" character
set.

Bizarre because the character set for the page is ISO-8859-1, but it's still
sending MBCS Shift-JIS.

Even more bizarre is that this is NOT done if the character is in ISO-8859-1.
Instead, if the character is in ISO-8859-1, the character gets "remapped" into
the nearest ASCII look-alike (C with cedilla gets mapped to "C", E with acute
gets mapped to "E"), which I feel is _bad_ behavior in terms of I18N.

This will be done (&#xxxxx;) even if the FORM is encoded as
"multipart/form-data"... which is capable of specifying the character set in
the Content-Type. That is a shame, because the nice thing about the new
encoding type is that it's supposed to help alleviate the lack of character
set info that plagues application/x-www-form-urlencoded.

====

So my revised question is: Aside from the bizarre policy for remapping (which
is definitely a BUG), is the sending of Unicode that's not in the ASCII subset
as SGML numeric generic entities in application/x-www-form-urlencoded (which
is not designed to handle anything other than ASCII) to be expected from here
on? The HTML 4.0 docs do not seem to get into this, instead advising that
application/x-www-form-urlencoded is designed for ASCII only (although it's
not used for this) and recommending multipart/form-data for all other
character encodings.


