Re: German characters not correct in output webform

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Jan 13 2004 - 18:59:01 EST

    From: "Peter Kirk" <peterkirk@qaya.org>
    > On 13/01/2004 13:35, Philippe Verdy wrote:
    >
    > > ...
    > >
    > >If your form page uses ISO-8859-1, then specify explicitly the ISO-8859-1
    > >encoding as the one to use for submitting forms, as an explicit attribute
    > >of your <form> element. But then visitors won't be able to send other
    > >characters than ISO-8859-1 in their form data, whether the form method
    > >is GET with URL-encoding, or POST in standard form-data format.
    > >
    > >
    > Is this actually true? Other characters can be entered into an
    > ISO-8859-1 form in the format "&#nnn;"; or at least Mozilla 1.5 uses
    > this format. I suspect this is what happened to me recently when I typed
    > a schwa into a message in the webmail interface of a Yahoo group, and
    > this appeared in my mail received from the group as "&#601;" - because
    > the message source contained "&amp;#601;". The problem seems to be that
    > the process reading the form data was not expecting this format and so
    > took the & as a literal rather than as an escape.

    It's true that you can pre-fill the form data within your HTML page encoded
    with ISO-8859-1, using numeric character references to specify
    non-ISO-8859-1 characters. If you then submit it with a form specifying that
    it should be encoded with ISO-8859-1, the browser may not notice that this
    pre-filled data (which still appeared correct in the rendered form) was
    bogus and normally impossible to encode with ISO-8859-1.
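
    To make the problem concrete, here is a minimal Python sketch of what can
    happen to a schwa (U+0259) when something has to squeeze it into
    ISO-8859-1. Python's codec error handlers only stand in for the choices a
    browser could make; this is an illustration under that assumption, not a
    description of what any particular browser does.

```python
# Illustrative only: Python's codec error handlers stand in for the choices
# a browser could make; no browser is claimed to work exactly this way.
text = "Schwa: \u0259"  # U+0259 is not part of ISO-8859-1


def encode_strict(value, charset="iso-8859-1"):
    """Return the encoded bytes, or None if some character cannot be encoded."""
    try:
        return value.encode(charset)
    except UnicodeEncodeError:
        return None


strict = encode_strict(text)                                    # refuses
replaced = text.encode("iso-8859-1", errors="replace")          # lossy "?"
as_ncr = text.encode("iso-8859-1", errors="xmlcharrefreplace")  # "&#601;"
print(strict, replaced, as_ncr)
```

    The third variant produces the "&#601;" form that Peter saw; the receiving
    process then has no way of telling it apart from text the user typed.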

    What browsers do when they find form data that cannot be encoded with the
    specified charset is still unpredictable. Normally the form data in the
    browser should be re-encoded in the specified encoding. But the browser
    should then show the user immediately that some pre-filled data in the form
    is bogus, with the affected characters appearing as "?". If the browser
    does not do that, because it prefers to render the form even with bogus
    data that is impossible to submit as is, then it should check that the
    edited form data can be safely encoded into the target encoding specified
    in the form, or into the encoding of the HTML page if none is specified.
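
    The check described above could look roughly like the following Python
    sketch. The function names submission_charset and unencodable_chars are
    made up for this example, and the fallback rule (form encoding first, then
    page encoding) simply restates the paragraph above rather than any
    browser's actual code.

```python
def submission_charset(form_charset, page_charset):
    """Pick the form's declared encoding, else fall back to the page's."""
    return form_charset or page_charset


def unencodable_chars(value, charset):
    """Return the characters of value that cannot be encoded in charset."""
    bad = []
    for ch in value:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            bad.append(ch)
    return bad


# Hypothetical form: no encoding declared on the form, page is ISO-8859-1.
charset = submission_charset(None, "iso-8859-1")
problems = unencodable_chars("Gr\u00fc\u00dfe \u0259", charset)  # "Grüße ə"
if problems:
    print("Cannot be sent as", charset, ":", problems)
```

    A non-empty result is the point at which the user should be told, before
    anything is submitted.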

    The HTML forms I have seen almost never specify the encoding for submitting
    form data. So most browsers assume that form data uses the same encoding as
    the HTML page, even if there are numeric character references.

    But your claim that a browser would send form data containing numeric
    character references is wrong here: it violates the format needed for
    forms submitted by the "GET" method (should be UTF-8 unless something else
    is specified or the HTML form is not encoded with UTF-8, and then
    URL-encoded), or by the "POST" method. I don't know which formats other
    than these two are supported by browsers, but I think that browsers should
    now adopt some XML format for form data submitted by "POST". This way,
    browsers will be able to use numeric character references for characters
    not supported in the selected target encoding.
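
    For the existing URL-encoded format, the following Python sketch shows how
    the same field comes out under UTF-8 and under ISO-8859-1, and why a
    numeric character reference inside it is ambiguous. It uses
    urllib.parse.urlencode from the standard library; the field names and
    values are invented for the example, and the "xmlcharrefreplace" fallback
    is only a stand-in for the behaviour Peter reported.

```python
from urllib.parse import urlencode

fields = {"name": "Gr\u00fc\u00dfe"}  # "Grüße", all within ISO-8859-1
as_utf8 = urlencode(fields, encoding="utf-8")          # two bytes per umlaut
as_latin1 = urlencode(fields, encoding="iso-8859-1")   # one byte per umlaut

schwa = {"q": "\u0259"}  # U+0259 has no ISO-8859-1 code
try:
    urlencode(schwa, encoding="iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

# Falling back to a numeric character reference yields exactly the same
# bytes as a user who literally typed the seven characters "&#601;".
ncr_fallback = urlencode(schwa, encoding="iso-8859-1",
                         errors="xmlcharrefreplace")
typed_literally = urlencode({"q": "&#601;"}, encoding="iso-8859-1")
print(as_utf8, as_latin1, latin1_ok, ncr_fallback == typed_literally)
```

    Since the two last values are identical, the server cannot know whether
    the "&" was meant as an escape or as a literal.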

    As UTF-8 is also the default encoding for XML files, browsers would in fact
    not need to specify it in the XML declaration of their POSTed form data
    document.
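
    As a sketch of what such an XML submission might look like, here is a
    small Python example using xml.etree.ElementTree. The <form> and <field>
    element names are purely hypothetical (no such schema is defined, as far
    as I know), and form_to_xml is a name invented for this example. It
    relies on the standard library's serializer writing numeric character
    references for characters the chosen encoding lacks.

```python
import xml.etree.ElementTree as ET


def form_to_xml(fields, charset):
    """Serialize form fields into a hypothetical <form>/<field> document."""
    root = ET.Element("form")  # element names are assumptions, not a standard
    for name, value in fields.items():
        field = ET.SubElement(root, "field", {"name": name})
        field.text = value
    return ET.tostring(root, encoding=charset)


fields = {"q": "\u0259"}  # schwa, outside ISO-8859-1
latin1_doc = form_to_xml(fields, "iso-8859-1")  # schwa written as a reference
utf8_doc = form_to_xml(fields, "utf-8")         # schwa written directly

# An XML parser resolves the reference, so the receiver gets the schwa back.
decoded = ET.fromstring(latin1_doc).find("field").text
print(latin1_doc, utf8_doc, decoded == "\u0259")
```

    Unlike in the URL-encoded case, the reference here is unambiguous: a
    literal ampersand typed by the user would itself be escaped as "&amp;".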

    Is there now a defined schema for sending POST data, with a registered
    media type supported by browsers, that could be specified as the format
    attribute of the HTML form? Will Apache or script processors like PHP
    support this new XML-formatted form data, instead of the legacy
    URL-formatted data and the poor, INI-like, POST variable assignments?

    Browsers that don't support the new format would still use the default
    formats for GET and POST, but there it will be impossible to encode all
    characters if the target submission encoding is not UTF-8. This
    impossibility of encoding such characters properly in the submitted form
    data should be signaled to the user, instead of the data being sent
    unreliably and invisibly. I think it's a deficiency of browsers, and
    something that the W3C has not specified with enough precision for it to
    be corrected in Internet Explorer-based and Mozilla Gecko-based browsers
    and in Opera (which together are now more than 98% of the total browser
    market).



    This archive was generated by hypermail 2.1.5 : Tue Jan 13 2004 - 19:32:59 EST