From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Sep 25 2006 - 08:12:37 CST
On 9/24/06, Jukka K. Korpela <jkorpela@cs.tut.fi> wrote:
>
> On Sun, 24 Sep 2006, Doug Ewell wrote:
>
> > A process that claims to be able to "support Unicode"
> > should at least be able to follow the simple rule, "If the file or
> > stream starts with EF BB BF, throw them away and treat the remainder
> > of the file or stream as UTF-8."
>
> No, that would be incorrect if the character encoding of the data has been
> declared. It would be a mistake to start interpreting the octets of data
> in a manner other than the declared encoding, at least as long as the data
> is formally correct according to the encoding.
In theory, that's correct. In practice, however, the charset is declared
incorrectly all too often. In a browser, the user can reset the charset
manually if he or she sees that it is wrong. That option is not available to
more mechanical processes like search engines -- there, the process simply
can't afford to always believe the charset parameter(s), any more than it
can always depend on the HTML being valid.
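As a rough illustration of treating the declared charset as a hint rather
than as truth, here is a minimal Python sketch. The function name
decode_stream and the fallback order (UTF-8 signature first, then the
declared charset, then UTF-8, then Latin-1) are assumptions made for this
example, not a description of what any particular search engine does. Note
that it follows Doug's rule even when a different charset is declared,
which is exactly the behavior Jukka objects to.

```python
import codecs
from typing import Optional


def decode_stream(data: bytes, declared: Optional[str] = None) -> str:
    """Hypothetical helper: decode bytes, using the declared charset only as a hint."""
    # Doug's rule: a leading EF BB BF marks the stream as UTF-8; drop the signature.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8", errors="replace")

    # Otherwise try the declared charset first, then UTF-8. A candidate is
    # accepted only if the bytes are actually well-formed in that encoding.
    candidates = [declared] if declared else []
    candidates.append("utf-8")
    for name in candidates:
        try:
            return data.decode(name)
        except (LookupError, UnicodeDecodeError):
            # LookupError: unknown charset label; UnicodeDecodeError: bytes
            # are not valid in that charset. Either way, stop believing it.
            continue

    # Assumed last resort: Latin-1 maps every byte to a character, so this
    # never fails, though the result may be wrong.
    return data.decode("latin-1")


if __name__ == "__main__":
    # Declared as ISO-8859-1, but the UTF-8 signature wins in this sketch.
    print(decode_stream(b"\xef\xbb\xbfhello", "iso-8859-1"))
    # Declared as US-ASCII, but the bytes are only valid as UTF-8.
    print(decode_stream(b"caf\xc3\xa9", "us-ascii"))
```

A real process would want better heuristics than "first encoding that
decodes cleanly", since many byte sequences are valid in several
encodings at once.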
Mark