Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Dec 07 2004 - 23:42:04 CST

    Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

    > An alternative can then be a mixed encoding selection:
    > - choose a legacy encoding that will most often be able to represent
    > valid filenames without loss of information (for example ISO-8859-1,
    > or Cp1252).
    > - encode the filename with it.
    > - try to decode it with a *strict* UTF-8 decoder, as if it were UTF-8
    > encoded.
    > - if there's no failure, then you must reencode the filename with
    > UTF-8 instead, even if the result is longer.
    > - if the strict UTF-8 decoding fails, you can keep the filename in the
    > first 8-bit encoding...
    > When parsing files:
    > - try decoding filenames with *strict* UTF-8 rules. If this does not
    > fail, then the filename was in fact encoded with UTF-8.
    > - if the decoding failed, decode the filename with the legacy 8-bit
    > encoding.
    >
    > But even with this scheme, you will find interoperability problems
    > because some applications will only expect the legacy encoding, or
    > only the UTF-8 encoding, without deciding...

    This technique was described as "adaptive UTF-8" by Dan Oscarsson in
    August 1998:

    http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML012/0738.html

    although he did not go as far as Philippe did, in actually checking the
    "adaptively" encoded string to make sure it would be decoded correctly.

    All the same, it was decided not to go this route, partly because the
    auto-detection capability of UTF-8 would be lost, partly because having
    multiple context-dependent encodings of the same code points would have
    been a Bad Thing (the Latin-1 bytes <99 C9> are not valid UTF-8, so
    they could be kept as is, but <C9 99> is also the valid UTF-8 sequence
    for U+0259, so those two code points would have to be re-encoded as
    UTF-8), and partly for the reason Philippe mentions -- most existing
    decoders would expect either Latin-1 or UTF-8, and would choke if handed
    a mixture of the two.
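
    For illustration only, here is a minimal Python sketch of the scheme
    Philippe describes, assuming Latin-1 as the legacy encoding. The
    function names adaptive_encode and adaptive_decode are hypothetical,
    and the fallback to UTF-8 for names that Latin-1 cannot represent is
    my assumption; the quoted description does not cover that case.

```python
def adaptive_encode(name: str, legacy: str = "latin-1") -> bytes:
    """Keep the legacy bytes unless they would also parse as UTF-8."""
    try:
        raw = name.encode(legacy)
    except UnicodeEncodeError:
        # Assumption: names the legacy encoding cannot represent
        # simply go out as UTF-8.
        return name.encode("utf-8")
    try:
        # Python's built-in UTF-8 decoder is strict by default.
        raw.decode("utf-8")
    except UnicodeDecodeError:
        # The legacy bytes cannot be mistaken for UTF-8: keep them.
        return raw
    # The legacy bytes would be read back as UTF-8, so re-encode,
    # even if the result is longer.
    return name.encode("utf-8")


def adaptive_decode(raw: bytes, legacy: str = "latin-1") -> str:
    """Try strict UTF-8 first, then fall back to the legacy encoding."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode(legacy)


# The same two code points, in opposite order, end up in different encodings.
print(adaptive_encode("\u0099\u00c9").hex(" "))
print(adaptive_encode("\u00c9\u0099").hex(" "))
```

    With these definitions, U+0099 U+00C9 stays as the two Latin-1 bytes
    <99 C9>, while U+00C9 U+0099 comes out as the four UTF-8 bytes
    <C3 89 C2 99>, because the Latin-1 form <C9 99> would otherwise be
    read back as U+0259.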

    -Doug Ewell
     Fullerton, California
     http://users.adelphia.net/~dewell/


