Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Doug Ewell (dewell@adelphia.net)
Date: Tue Dec 07 2004 - 23:42:04 CST

Next message: Doug Ewell: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"

Previous message: John Cowan: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
In reply to: Philippe Verdy: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
Next in thread: Doug Ewell: "Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)"

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

> An alternative can then be a mixed encoding selection:
> - choose a legacy encoding that will most often be able to represent
> valid filenames without loss of information (for example ISO-8859-1,
> or Cp1252).
> - encode the filename with it.
> - try to decode it with a *strict* UTF-8 decoder, as if it was UTF-8
> encoded.
> - if there's no failure, then you must reencode the filename with
> UTF-8 instead, even if the result is longer.
> - if the strict UTF-8 decoding fails, you can keep the filename in the
> first 8-bit encoding...
> When parsing files:
> - try decoding filenames with *strict* UTF-8 rules. If this does not
> fail, then the filename was effectively encoded with UTF-8.
> - if the decoding failed, decode the filename with the legacy 8-bit
> encoding.
>
> But even with this scheme, you will find interoperability problems
> because some applications will only expect the legacy encoding, or
> only the UTF-8 encoding, without deciding...
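The scheme quoted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the thread: the function names adaptive_encode and adaptive_decode are hypothetical, ISO-8859-1 is picked as the legacy encoding, and the fallback to UTF-8 for names the legacy encoding cannot represent is an assumption (the quoted text does not cover that case).

```python
# Hypothetical sketch of the mixed-encoding scheme quoted above.
# Assumption: ISO-8859-1 ("latin-1") is the chosen legacy encoding.

def adaptive_encode(name: str, legacy: str = "latin-1") -> bytes:
    """Encode a filename, keeping the legacy form only when it is safe."""
    try:
        raw = name.encode(legacy)
    except UnicodeEncodeError:
        # Assumption: names the legacy encoding cannot represent go to UTF-8.
        return name.encode("utf-8")
    try:
        # Strict check: would a UTF-8 decoder accept the legacy bytes?
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8, so a reader cannot misinterpret it; keep it.
        return raw
    # Valid as UTF-8, so it must be re-encoded, even if the result is longer.
    return name.encode("utf-8")


def adaptive_decode(raw: bytes, legacy: str = "latin-1") -> str:
    """Decode a filename: strict UTF-8 first, legacy encoding on failure."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return raw.decode(legacy)


for name in ["plain.txt", "caf\u00e9.txt", "\u00c9\u0099"]:
    stored = adaptive_encode(name)
    assert adaptive_decode(stored) == name
    print(repr(name), "->", stored.hex())
```

The round trip works only when both writer and reader apply the same rules; a decoder that assumes pure Latin-1 or pure UTF-8 will misread some of these byte strings.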

This technique was described as "adaptive UTF-8" by Dan Oscarsson in
August 1998:

http://www.unicode.org/mail-arch/unicode-ml/Archives-Old/UML012/0738.html

although he did not go as far as Philippe did, in actually checking the
"adaptively" encoded string to make sure it would be decoded correctly.

All the same, it was decided not to go this route, partly because the
auto-detection capability of UTF-8 would be lost, partly because having
multiple context-dependent encodings of the same code points would have
been a Bad Thing (<99 C9> could be encoded adaptively but <C9 99> could
not), and partly for the reason Philippe mentions -- most existing
decoders would expect either Latin-1 or UTF-8, and would choke if handed
a mixture of the two.
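The <99 C9> versus <C9 99> asymmetry can be checked with a short Python sketch. It assumes Latin-1 as the legacy encoding, and the helper name adaptive is hypothetical; it is meant only to show how the same two code points end up with different encodings depending on their order.

```python
# Hypothetical check of the context dependence described above.
# Assumption: Latin-1 is the legacy encoding.

def adaptive(text: str) -> bytes:
    """Keep Latin-1 bytes unless a strict UTF-8 decoder would accept them."""
    raw = text.encode("latin-1")
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return raw                   # invalid as UTF-8: legacy form is kept
    return text.encode("utf-8")      # valid as UTF-8: must be re-encoded


first = adaptive("\u0099\u00c9")     # Latin-1 bytes <99 C9>
second = adaptive("\u00c9\u0099")    # Latin-1 bytes <C9 99>

# <99 C9> starts with a continuation byte, so it is not valid UTF-8
# and stays as two legacy bytes.
print(first.hex())    # 99c9

# <C9 99> is a valid UTF-8 sequence (it decodes as U+0259), so the
# same two code points must be stored as four UTF-8 bytes instead.
print(second.hex())   # c389c299
```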

-Doug Ewell
Fullerton, California
http://users.adelphia.net/~dewell/

This archive was generated by hypermail 2.1.5 : Tue Dec 07 2004 - 23:43:41 CST