Unicode SGML entities in application/x-www-form-urlencoded stream

From: Adrian Havill (havill@threeweb.ad.jp)
Date: Sat Apr 18 1998 - 22:48:00 EDT


Every once in a while, we receive data from HTML <FORM>s (blindly relayed
by simple CGI programs) in this notation:

&#39740;&#29983;&#30000;&#26179;&#19968;

(the above is a Japanese proper name in Han characters).

However, we've been unable to reproduce this output, so we can't tell which
browser produces it (I guess it pays to record the User-Agent header) or under
what conditions the Unicode references are sent. We've tried forcing kanji into
<FORM>s on pages encoded in ISO-8859-1, using GET, POST, etc., but can't find
the magic combination. We suspect that it's a "fourth generation browser",
meaning Communicator or Explorer 4, as this output has only recently begun to
appear. We also suspect that it happens when the browser is forced to send a
character that's not in the character set used to encode the original HTML page
(normally, browsers submit form data from Japanese pages in a multibyte
character set, or MBCS).
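
To make the suspicion concrete, here is a minimal Python sketch of what such a
browser might be doing. This is a guess at the mechanism, not confirmed browser
behavior; the function name and the default page charset are assumptions made
for illustration only.

```python
from urllib.parse import quote_plus


def encode_field_for_page_charset(text, page_charset="iso-8859-1"):
    """Hypothetical model of the suspected browser behavior.

    Characters that the page's charset cannot represent are replaced by
    decimal numeric character references (&#NNNNN;), and the resulting
    bytes are then percent-encoded for the form-urlencoded stream.
    """
    # xmlcharrefreplace substitutes &#NNNNN; for unencodable characters.
    raw = text.encode(page_charset, errors="xmlcharrefreplace")
    # quote_plus accepts bytes and escapes '&', '#' and ';'.
    return quote_plus(raw)


if __name__ == "__main__":
    # First character of the name above (code point 39740).
    print(encode_field_for_page_charset(chr(39740)))
```

If this guess is right, the CGI would see the reference only after
percent-decoding, which matches the notation in the data we receive.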

We want to add code to our library to handle SGML numeric character references
to ISO 10646/Unicode code points (the &#NNNNN; notation above) in the data that
browsers send to CGIs, so we're interested in how this works.
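
As a starting point, the handling could look something like the Python sketch
below. It assumes the form body has not yet been percent-decoded and that the
page charset is known; the names decode_ncrs and parse_form are hypothetical,
not part of any existing library.

```python
import re
from urllib.parse import parse_qs

# Decimal (&#12345;) or hexadecimal (&#x4E00;) numeric character references.
_NCR = re.compile(r"&#(?:([0-9]+)|[xX]([0-9A-Fa-f]+));")


def decode_ncrs(text):
    """Replace numeric character references with the characters they name."""
    def repl(match):
        try:
            if match.group(1):
                code_point = int(match.group(1))
            else:
                code_point = int(match.group(2), 16)
            return chr(code_point)
        except (ValueError, OverflowError):
            # Out-of-range value: leave the reference untouched.
            return match.group(0)
    return _NCR.sub(repl, text)


def parse_form(body, charset="iso-8859-1"):
    """Percent-decode a form-urlencoded body, then expand the references."""
    fields = parse_qs(body, keep_blank_values=True, encoding=charset)
    return {name: [decode_ncrs(value) for value in values]
            for name, values in fields.items()}


if __name__ == "__main__":
    body = "name=%26%2339740%3B%26%2329983%3B%26%2330000%3B%26%2326179%3B%26%2319968%3B"
    print(parse_form(body))
```

One caveat with this approach: a user who literally types "&#39740;" into a
field produces the same bytes, so the two cases cannot be told apart from the
stream alone.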

Any advice/tips would be appreciated.


