From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Mon Jun 06 2005 - 03:54:53 CDT
On Saturday, June 4th, 2005 21:29Z Doug Ewell wrote:
> Lasse Kärkkäinen / Tronic [email deleted] wrote:
>
>> In practice the autodetection by malformed UTF-8 seems to work
>> quite reliably and it very rarely misdetects legacy 8-bit as UTF-8
>> (in fact, I have never seen this happen).
>
> It's a contrived example, but the string "NESTLÉ™" encoded in Latin-1
It is a minor nit, but ™ (U+2122) does not appear in my Latin-1 (ISO/IEC
8859-1:1998) charts; of course, the character does appear at position 9/9 in
the Windows 1250, 1252, 1254, 1257 and 1258 codepages (and in some others
too, but those do not have É at 12/9).
> consists of the bytes 4E 45 53 54 4C C9 99. This is a valid UTF-8
> string, and SC UniPad detects it as such and renders it as "NESTLə".
Also, I understand Lasse's argument to be that a text file which shows
/zero/ malformations while being decoded as UTF-8 is very likely to be in
that encoding. I grant that examples like NESTLÉ™ can occur inside otherwise
purely English texts (that is, texts with no other accented characters), but
I would only point out that such examples must be quite rare (yes, I noticed
Doug wrote "contrived" above).
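For what it is worth, here is a small C sketch (my own illustration, not SC
UniPad's actual logic, and handling only the one- and two-byte forms this
example needs) showing why a strict decoder is entitled to accept those
seven bytes: C9 is a legal two-byte lead and 99 a legal trail byte, so C9 99
decodes to U+0259.

    /* Decode { 4E 45 53 54 4C C9 99 } as UTF-8; a sketch covering
       only the ASCII and two-byte forms this example exercises. */
    #include <stdio.h>

    int main(void)
    {
        const unsigned char s[] = { 0x4E, 0x45, 0x53, 0x54, 0x4C, 0xC9, 0x99 };
        size_t i, n = sizeof s;

        for (i = 0; i < n; ) {
            if (s[i] < 0x80) {                       /* single-byte ASCII */
                printf("U+%04X\n", s[i]);
                i += 1;
            } else if (s[i] >= 0xC2 && s[i] <= 0xDF  /* two-byte lead */
                       && i + 1 < n
                       && (s[i+1] & 0xC0) == 0x80) { /* trail byte */
                printf("U+%04X\n", ((s[i] & 0x1F) << 6) | (s[i+1] & 0x3F));
                i += 2;
            } else {
                printf("malformed at byte %u\n", (unsigned)i);
                return 1;
            }
        }
        return 0;
    }

Run against the bytes above, this reports U+004E U+0045 U+0053 U+0054 U+004C
U+0259, i.e. "NESTLə", with no malformation anywhere.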
OTOH, I do not have a very clear idea of the overhead of the full search for
malformed sequences over the whole file when the encoding is otherwise
unknown (particularly since such an algorithm would be applied in
environments where UTF-8 is the most likely encoding, which means a 100%
scan of every "good" file but a much shorter scan of a "bad" one). I only
know that this is not likely to work for a pipe-oriented program, like the
traditional Unix tools.
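To make the cost concrete, here is a rough C sketch of the kind of scan I
have in mind (again my own illustration, and simplified: it does not reject
overlong, surrogate or out-of-range sequences in the three- and four-byte
forms). Note that the verdict "looks like UTF-8" requires reading the input
to the very end, which is precisely what a filter in the middle of a pipe
cannot afford before it starts producing output.

    /* Return 1 if every byte of the stream parses as well-formed
       UTF-8 (simplified check), 0 at the first malformation. */
    #include <stdio.h>

    int looks_like_utf8(FILE *fp)
    {
        int c, cont = 0;             /* continuation bytes still owed */

        while ((c = getc(fp)) != EOF) {
            if (cont > 0) {
                if ((c & 0xC0) != 0x80)
                    return 0;        /* missing continuation byte */
                cont--;
            } else if (c < 0x80) {
                ;                    /* ASCII, always well-formed */
            } else if (c >= 0xC2 && c <= 0xDF) {
                cont = 1;            /* two-byte sequence */
            } else if (c >= 0xE0 && c <= 0xEF) {
                cont = 2;            /* three-byte sequence */
            } else if (c >= 0xF0 && c <= 0xF4) {
                cont = 3;            /* four-byte sequence */
            } else {
                return 0;            /* C0, C1, F5..FF: never valid */
            }
        }
        return cont == 0;            /* reject a truncated last sequence */
    }

A legacy 8-bit file typically fails within the first few non-ASCII bytes,
so the "bad" case is indeed cheap; the "good" case is always a full pass.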
Antoine