From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jan 14 2004 - 09:59:08 EST
From: Peter Kirk
> Well, attached is a Yahoo groups form (saved by my browser) similar to
> the one which caused me problems.
The "Reply" form in Yahoo Groups is coded in "windows-1252".
It uses the following form declaration:
<form method="post" action="/group/...">
If it had not indicated a value for the "method" attribute, the submission
method would have been "GET" by default. Instead, here, the submission will
be a POSTed entity.
It does not indicate a value for the missing "enctype" attribute, so the
form data is encoded using "Content-Type: application/x-www-form-urlencoded"
(the browser should indicate this Content-Type: header explicitly for the
POSTed entity).
It does not indicate a value for the missing "accept-charset" attribute, so
the browser is expected to use accept-charset="UNKNOWN", and, as specified
in the HTML reference, the browser "may" use the charset used on the HTML
form page, i.e. "windows-1252". The browser is still not allowed to encode
non-Windows-1252 characters that are part of the form data using numeric
character entities. This is not supported by the
"application/x-www-form-urlencoded" content type, which just consists of
an "&"-separated list of "name=value" pairs, where each byte that encodes
a character of a "name" or "value" and is not URL-safe is coded as a %XX
triplet.
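
As a rough illustration of the format described above, here is a minimal
Python sketch of how form data becomes an "&"-separated list of
"name=value" pairs with %XX triplets. Only the "message" field name comes
from the Yahoo form; the "subject" field and both values are made up for
the example.

```python
from urllib.parse import urlencode

# Hypothetical form data; both é and è exist in windows-1252.
fields = {"subject": "Re: test", "message": "café & crème"}

# Each name and value is first converted to windows-1252 bytes, then every
# byte that is not URL-safe becomes a %XX triplet; spaces become "+".
body = urlencode(fields, encoding="windows-1252")
print(body)
```

Note that the "&" inside the value is escaped as %26, so it cannot be
confused with the "&" that separates the pairs.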
The indicated submission format does not allow sending anything other than
windows-1252 characters, so any character in the form data which does
not exist in this charset should be detected and rejected by the browser,
which should ask the user to modify the form data or to accept that some
characters will be replaced by '?' once converted to windows-1252. The other
solution would be to use the UTF-8 encoding (the one recommended for
URLs) instead of windows-1252 prior to performing the URL-encoding. This is
what should be done, but a missing "accept-charset" attribute, which means
UNKNOWN, is not clear about what browsers should do, notably because
the GET method does not allow specifying explicitly the charset used to
create the URL-encoded query string. But as we are using a POST method,
there is an entity attached to the HTTP POST request.
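
The two options described here (detect the offending characters and fall
back to '?', or switch to UTF-8) can be sketched in Python as follows. The
helper name unencodable() and the sample text are assumptions for the
example, not anything a real browser exposes.

```python
def unencodable(text, charset="windows-1252"):
    """Return the distinct characters of text that charset cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad

text = "schwa: \u0259"          # U+0259 does not exist in windows-1252

offending = unencodable(text)   # what the browser should warn about
lossy = text.encode("windows-1252", errors="replace")   # '?' substitution
lossless = text.encode("utf-8")                          # keeps the character

print(offending, lossy, lossless)
```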
This entity created by the browser should then specify the charset actually
used:
Content-Type: application/x-www-form-urlencoded; charset=windows-1252
(if it uses the suggestion given by the HTML4 reference of using the same
charset as the HTML page), or:
Content-Type: application/x-www-form-urlencoded; charset=UTF-8
Then the entity body should consist of the URL-encoding of a "&" separated
list of "name=value" pairs encoded with the charset indicated by the
Content-Type header above.
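
A minimal sketch of such an entity, assuming the browser labels the body
with the charset it actually used. The function name build_entity() is
hypothetical, and the Content-Length header is added only to make the
example complete.

```python
from urllib.parse import urlencode

def build_entity(fields, charset):
    """Build (headers, body) for a POSTed urlencoded form in the given charset.

    Raises UnicodeEncodeError if a character does not exist in charset.
    """
    # The URL-encoded result is pure ASCII, whatever charset was used.
    body = urlencode(fields, encoding=charset).encode("ascii")
    headers = {
        "Content-Type": "application/x-www-form-urlencoded; charset=" + charset,
        "Content-Length": str(len(body)),
    }
    return headers, body

# U+0259 is sent as the two UTF-8 bytes C9 99, each one %XX-escaped.
headers, body = build_entity({"message": "\u0259"}, "UTF-8")
print(headers["Content-Type"])
print(body)

# The same data cannot be sent as windows-1252 at all.
try:
    build_entity({"message": "\u0259"}, "windows-1252")
    failed = False
except UnicodeEncodeError:
    failed = True
```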
If the browser chooses the first suggestion, then it won't be able to encode
any non-Windows-1252 character. But the browser can still use the second
solution without even needing any numeric character entities (which are only
needed within XML/HTML/XHTML documents, but have no meaning in an
"application/x-www-form-urlencoded" document).
There is NOTHING in this form that allows a browser to use a numeric
character entity "&#601;". This is true even if the form data present in the
HTML form page was fed with numeric character entities like "&#601;",
which are supposed to encode a single character and not the 6-character
string "&#601;".
Note that the Yahoo reply form uses this element to feed the reply text:
<textarea name="message" rows=20 cols=70 wrap="hard">content of the
message</textarea>
where the "content of the message" will probably contain numeric or named
character entities like "&gt; " at the beginning of each quoted line in the
initial reply text. If there's a "&#601;" there, it means a single character
that is part of the displayed initial text, and that your browser should
display correctly within the rendered form. However, if your browser
submits the form using the "windows-1252" charset, it won't be able to send
it correctly, as the submission format is
"application/x-www-form-urlencoded". So the browser should either ask the
user to edit the message until this non-windows-1252 character is removed or
replaced, or it should ask the user for permission to replace it with "?".
If your browser silently encodes it with a numeric character reference, this
violates all standards. In this case, this is a bug in the browser, which
would have done better to silently use the UTF-8 encoding if it does not
want to bother the user with a request for permission to replace characters
with "?".
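
To make the difference concrete, here is a small Python sketch that
reproduces what the buggy behaviour would put on the wire, next to the
UTF-8 alternative. It only imitates the symptom with Python's
"xmlcharrefreplace" error handler; how any particular browser implements
this internally is an assumption.

```python
import html
from urllib.parse import quote_plus, parse_qs

# In the HTML page, the 6-character entity stands for ONE character.
char = html.unescape("&#601;")

# Suggested behaviour: send the character itself, encoded as UTF-8.
good = "message=" + quote_plus(char, encoding="utf-8")

# Buggy behaviour: put the entity text back into the windows-1252 bytes.
buggy_bytes = char.encode("windows-1252", errors="xmlcharrefreplace")
buggy = "message=" + quote_plus(buggy_bytes)

# A server decoding the buggy body as windows-1252 receives six characters.
received = parse_qs(buggy, encoding="windows-1252")["message"][0]

print(good, buggy, repr(received))
```

The server has no way to tell whether the user typed the schwa or
literally typed an ampersand, a hash sign, three digits and a semicolon.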
The alert prompt however should be displayed by the browser before
submitting the form data, if the form had specified an "accept-charset"
attribute specifying the "windows-1252" charset explicitly and exclusively
without allowing "UTF-8" (because in this case the browser will not have the
"UNKNOWN" default value for the missing "accept-charset" attribute, which
is, in my opinion, the only case where a charset suggested by the HTML form
page encoding may be silently replaced by another, preferably UTF-8 as it
keeps all characters present in the form data).
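
The policy argued for above could be sketched roughly as follows. This
reflects the opinion expressed in this message, not behaviour mandated by
the HTML 4 reference; the function names and the use of None to mean "the
browser must prompt the user" are assumptions for the example.

```python
def can_encode(text, charset):
    """True if every character of text exists in charset."""
    try:
        text.encode(charset)
        return True
    except (UnicodeEncodeError, LookupError):
        return False

def choose_charset(text, page_charset, accept_charset=None):
    """Pick the charset for the submission, or None if the user must be asked.

    accept_charset is the raw attribute value (a space- and/or
    comma-separated list), or None when the attribute is missing (UNKNOWN).
    """
    if accept_charset is None:
        # UNKNOWN: try the page charset, else silently fall back to UTF-8.
        return page_charset if can_encode(text, page_charset) else "UTF-8"
    # Explicit list: only the listed charsets may be used.
    for charset in accept_charset.replace(",", " ").split():
        if can_encode(text, charset):
            return charset
    return None  # no listed charset fits: alert the user before submitting

print(choose_charset("\u0259", "windows-1252"))
print(choose_charset("\u0259", "windows-1252", "windows-1252"))
```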