Re: German characters not correct in output webform

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jan 14 2004 - 08:42:11 EST


    From: "Peter Kirk" <peterkirk@qaya.org>
    > On 13/01/2004 15:59, Philippe Verdy wrote:
    >
    > >From: "Peter Kirk" <peterkirk@qaya.org>
    > >
    > >
    > > ...
    > >
    > >>Is this actually true? Other characters can be entered into an
    > >>ISO-8859-1 form in the format "&#nnn;"; or at least Mozilla 1.5 uses
    > >>this format. I suspect this is what happened to me recently when I typed
    > >>a schwa into a message in the webmail interface of a Yahoo group, and
    > >>this appeared in my mail received from the group as "&#601;" - because
    > >>the message source contained "&amp;#601;". The problem seems to be that
    > >>the process reading the form data was not expecting this format and so
    > >>took the & as a literal rather than as an escape.
    > >>
    > >>
    > >
    > >It's true that you can pre-feed the form data within your HTML page
    > >encoded with ISO-8859-1 using numeric character entities to specify
    > >non-ISO-8859-1 characters. If you try to submit it with a form specifying
    > >that it should be encoded with ISO-8859-1, the browser may not notice that
    > >this pre-fed data (which still appeared correct in the rendered form) was
    > >bogus and normally impossible to encode with ISO-8859-1.
    > >
    > >
    > >
    > Just to clarify: the data I was entering was not bogus, but was exactly
    > what I wanted to enter and was legal content for the e-mail which I
    > wanted to send to the list. The error was at Yahoo, or possibly in my
    > browser, in not supporting the characters which I wanted to use. I was
    > not informed of any restriction or problem.

    Can you share the URL of your entry form, or an HTML snapshot of the form
    page? It may reveal whether the problem lies in the HTML page itself, which
    does allow prefeeding an entry form with characters that won't be mapped
    correctly to the format specified for the submitted data.
    I have seen some references to the new XForms schema, but it is not usable
    in HTML 4, because it requires recoding the entry form with a <model> section
    (the <form> element is obsoleted in XForms).

    I would have preferred to see a formal proposal at the W3C specifying an XML
    submission format usable in HTML 4 entry forms. For now, I think that using
    numeric character entities violates the format defined for
    <form method="POST"> and the URL encoding used for <form method="GET">, as
    neither submission format is XML. Browsers should inform their users that
    some of their form data cannot be encoded safely in the target charset if
    this specified or implied charset is not a Unicode encoding scheme (UTF-8,
    UTF-16, UTF-32, SCSU) or a Unicode-compatible encoding.
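
    To make the failure mode concrete, here is a minimal Python sketch of the
    check a browser could run before submitting. The function names are
    hypothetical, and the "xmlcharrefreplace" fallback only imitates the
    "&#nnn;" behaviour Peter observed; it is not taken from any browser's code.

```python
from html import escape


def unencodable_chars(text, charset):
    """Return the distinct characters of `text` that `charset` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


def lossy_fallback(text, charset):
    """Imitate the silent fallback: unencodable characters become &#nnn;."""
    return text.encode(charset, errors="xmlcharrefreplace")


# U+0259 is the schwa from the thread; U+00FC (u with diaeresis) fits in ISO-8859-1.
message = "\u0259 and \u00fc"

# A browser could warn the user whenever this list is not empty.
print(unencodable_chars(message, "iso-8859-1"))

# Without a warning, the schwa is submitted as the six ASCII bytes "&#601;".
submitted = lossy_fallback("\u0259", "iso-8859-1").decode("ascii")
print(submitted)

# A receiver that treats the data as plain text escapes the ampersand again,
# which gives the "&amp;#601;" seen in the message source.
print(escape(submitted))
```

    Once the fallback has been applied, the receiver cannot tell whether the
    user typed a schwa or the literal text "&#601;", which is why the warning
    has to happen on the browser side.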

    This proposed format would have deprecated the old format for POST data, if
    it had used a well-defined and standardized XML schema, immediately
    recognizable by web servers like Apache or script engines like PHP.
    Basically, it should consist of an unordered list of (form input id, form
    input value) pairs, both members of each pair being codable as text elements
    or element attribute values, and accepting numeric character entities for
    characters that can't be encoded in the target charset. It would be
    compatible with XForms by using an implicit model, associated with the
    specified format registered and documented by the W3C.
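
    As a rough illustration of what such a payload could look like, here is a
    Python sketch using the standard xml.etree.ElementTree module. No schema
    has been agreed, so the <formdata> and <field id="..."> names are invented
    for this example only.

```python
import xml.etree.ElementTree as ET


def build_formdata(fields, charset):
    """Serialize (input id, input value) pairs as a small XML document."""
    # <formdata> and <field> are hypothetical element names, not a W3C schema.
    root = ET.Element("formdata")
    for field_id, value in fields.items():
        field = ET.SubElement(root, "field", {"id": field_id})
        field.text = value
    # ElementTree writes characters missing from `charset` as &#nnn; references.
    return ET.tostring(root, encoding=charset)


def parse_formdata(payload):
    """Recover the pairs; the XML parser resolves the numeric references."""
    root = ET.fromstring(payload)
    return {field.get("id"): field.text or "" for field in root.iter("field")}


fields = {"subject": "Gr\u00fc\u00dfe", "body": "schwa: \u0259"}
payload = build_formdata(fields, "iso-8859-1")
print(payload)
print(parse_formdata(payload) == fields)
```

    Because the payload is XML, a numeric character reference in it is
    unambiguous: a literal ampersand typed by the user would be serialized as
    "&amp;", so the server can always tell the two cases apart.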

    So instead of using <form> with the implicit method="GET" and
    enctype="application/x-www-form-urlencoded", or <form method="POST"
    enctype="multipart/form-data">, which both assume a default
    accept-charset="UNKNOWN" meaning the charset used to get the HTML document
    containing the form, I would have liked to see:
        <form method="POST" enctype="text/xml-formdata"
              accept-charset="ISO-8859-1">
    which specifies that the server is able to process form data encoded as
    an XML document conforming to the XML schema specified by the registered MIME
    type "text/xml-formdata", this XML document being preferably encoded with
    the "ISO-8859-1" charset, using XML numeric character entities where needed
    to represent characters that don't fit in ISO-8859-1...

    I don't know whether such an enctype value is supported in browsers, or
    whether there's any agreement about the (quite basic) XML schema to which it
    should correspond. Without it, the only solution is for web servers and
    script engines to be updated to decode the POST data correctly, using the
    charset indicated in its "Content-Type:" header (or in the headers of each
    part in the case of "multipart/form-data"). This is a real problem for HTML
    form pages not encoded with a UTF encoding scheme: Do browsers have to use
    the accept-charset attribute of <form> elements? Are they allowed to switch
    to UTF-8 and specify this encoding in the data submitted with the
    "application/x-www-form-urlencoded" or "multipart/form-data" content types?
    If so, it seems logical that your form processor will see data encoded with
    UTF-8 even though your HTML form page was coded in ISO-8859-1 with no
    accept-charset attribute (whose default value is "UNKNOWN", which is not
    necessarily the same as the charset used in the HTML form page...).
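
    A server-side sketch of that decoding step, in Python, might look as
    follows. It assumes the browser actually sends a charset parameter in the
    Content-Type header, which is exactly what is uncertain here; the function
    names and the page_charset fallback are illustrative assumptions.

```python
from email.message import Message
from urllib.parse import parse_qs


def charset_from_content_type(header_value, default):
    """Extract the charset parameter of a Content-Type value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


def decode_urlencoded_body(body, content_type, page_charset):
    """Decode an application/x-www-form-urlencoded POST body."""
    # If no charset is declared, guess that the form page's charset was used.
    charset = charset_from_content_type(content_type, page_charset)
    # The body itself is plain ASCII; the %XX escapes carry the real bytes.
    return parse_qs(body.decode("ascii"), encoding=charset, errors="strict")


body = b"name=%C3%BC"  # U+00FC (u with diaeresis) submitted as UTF-8

# Declared charset: the value decodes to the single character U+00FC.
print(decode_urlencoded_body(
    body, "application/x-www-form-urlencoded; charset=UTF-8", "iso-8859-1"))

# No declared charset: the ISO-8859-1 guess turns the same bytes into the
# two characters U+00C3 U+00BC, the typical garbled German output.
print(decode_urlencoded_body(
    body, "application/x-www-form-urlencoded", "iso-8859-1"))
```

    The second call shows the situation described above: the form processor
    assumes the page charset while the browser has silently switched to UTF-8.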

    ----
    For reference:
        http://www.w3.org/TR/html4/interact/forms.html
    [quote]
    17.3 The FORM element
    <!ELEMENT FORM - - (%block;|SCRIPT)+ -(FORM) -- interactive form -->
    <!ATTLIST FORM
      %attrs;                              -- %coreattrs, %i18n, %events --
      action      %URI;          #REQUIRED -- server-side form handler --
      method      (GET|POST)     GET       -- HTTP method used to submit the form--
      enctype     %ContentType;  "application/x-www-form-urlencoded"
      accept      %ContentTypes; #IMPLIED  -- list of MIME types for file upload --
      name        CDATA          #IMPLIED  -- name of form for scripting --
      onsubmit    %Script;       #IMPLIED  -- the form was submitted --
      onreset     %Script;       #IMPLIED  -- the form was reset --
      accept-charset %Charsets;  #IMPLIED  -- list of supported charsets --
      >
    Start tag: required, End tag: required
    Attribute definitions
    action = uri [CT]
        This attribute specifies a form processing agent. User agent behavior for
        a value other than an HTTP URI is undefined.
    method = get|post [CI]
        This attribute specifies which HTTP method will be used to submit the
        form data set. Possible (case-insensitive) values are "get" (the default)
        and "post". See the section on form submission for usage information.
    enctype = content-type [CI]
        This attribute specifies the content type used to submit the form to the
        server (when the value of method is "post"). The default value for this
        attribute is "application/x-www-form-urlencoded". The value
        "multipart/form-data" should be used in combination with the INPUT
        element, type="file".
    accept-charset = charset list [CI]
        This attribute specifies the list of character encodings for input data
        that is accepted by the server processing this form. The value is a
        space- and/or comma-delimited list of charset values. The client must
        interpret this list as an exclusive-or list, i.e., the server is able to
        accept any single character encoding per entity received.
        The default value for this attribute is the reserved string "UNKNOWN".
        User agents may interpret this value as the character encoding that was
        used to transmit the document containing this FORM element.
    accept = content-type-list [CI]
        This attribute specifies a comma-separated list of content types that a
        server processing this form will handle correctly. User agents may use
        this information to filter out non-conforming files when prompting a user
        to select files to be sent to the server (cf. the INPUT element when
        type="file").
    name = cdata [CI]
        This attribute names the element so that it may be referred to from style
        sheets or scripts. Note. This attribute has been included for backwards
        compatibility. Applications should use the id attribute to identify
        elements.
    [/quote]
    Clearly there's nothing in this normative reference that allows a browser
    to send any numeric character entities in form data submitted with
    "application/x-www-form-urlencoded" (format specified in the HTTP reference
    RFC 2616 for query strings in URLs) or "multipart/form-data" (format
    specified in the MIME multipart specification), but the last sentence in the
    paragraph describing the accept-charset attribute contains the word "may",
    which by normative definition allows a browser to use a charset other than
    the one suggested in the accept-charset attribute of the <form> element...
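
    One possible reading of the accept-charset rules quoted above, sketched in
    Python. This is an assumption about how a cooperative browser might behave,
    not a description of any existing one; pick_submission_charset is a
    hypothetical name.

```python
import re
from urllib.parse import quote


def pick_submission_charset(accept_charset, document_charset, text):
    """Return the first acceptable charset able to encode `text`, else None."""
    # The attribute value is a space- and/or comma-delimited list.
    candidates = [c for c in re.split(r"[,\s]+", accept_charset.strip()) if c]
    # "UNKNOWN" (the default) is read here as "use the document's charset".
    if not candidates or [c.upper() for c in candidates] == ["UNKNOWN"]:
        candidates = [document_charset]
    for charset in candidates:
        try:
            text.encode(charset)
            return charset
        except (UnicodeEncodeError, LookupError):
            continue
    return None  # nothing fits: the point at which to warn the user


# The schwa (U+0259) does not fit in ISO-8859-1, so UTF-8 is selected.
print(pick_submission_charset("ISO-8859-1, UTF-8", "iso-8859-1", "\u0259"))

# With the default "UNKNOWN" and an ISO-8859-1 page, no charset fits.
print(pick_submission_charset("UNKNOWN", "iso-8859-1", "\u0259"))

# The same character is sent as different bytes depending on the choice,
# so the server has to be told which charset was used.
print(quote("\u00fc", encoding="iso-8859-1"))  # %FC
print(quote("\u00fc", encoding="utf-8"))       # %C3%BC
```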
    

