Re: FYI: Google blog on Unicode

From: Jeroen Ruigrok van der Werven
Date: Fri Jan 29 2010 - 02:57:45 CST

    On [20100129 06:53], Simon Montagu wrote:
    >What exactly is this counting? Encodings declared internally in
    >web-pages? Encodings declared in HTTP headers? Encodings determined by
    >auto-detection? Some combination of the above?

    The article states: "This graph is from Google internal data, based on our
    indexing of web pages, and thus may vary somewhat from what other search
    engines find."

    As we all know, there are a lot of pages that declare the wrong encoding,
    either in the page's own preamble or in the HTTP headers. So I guess Google
    uses some simple algorithm that looks at what the page says and what the
    server says, checks whether that matches the bytes it actually finds on the
    page, and adjusts the encoding as necessary.
    It would not make much sense to store mojibake.
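
To make that guess concrete, here is a minimal Python sketch of such a check. It is purely illustrative and not Google's actual method, which the article does not describe. The function names, the 1024-byte scan window, the rule that valid non-ASCII UTF-8 overrides any declaration, and the windows-1252 fallback are all assumptions of mine.

```python
import re

# Hypothetical helper: find a charset declaration near the top of the page.
_META_RE = re.compile(rb'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)


def meta_charset(body):
    """Return the charset named inside the page itself, or None."""
    match = _META_RE.search(body[:1024])  # assumed scan window
    return match.group(1).decode("ascii") if match else None


def guess_encoding(body, header_charset=None):
    """Reconcile what the server says, what the page says, and the bytes."""
    # Assumption: non-ASCII bytes that form valid UTF-8 are unlikely to be
    # an accident, so the content wins over any conflicting declaration.
    try:
        body.decode("utf-8")
        if any(byte >= 0x80 for byte in body):
            return "utf-8"
    except UnicodeDecodeError:
        pass

    # Otherwise accept the first declaration that actually decodes the page:
    # HTTP header first, then the in-page declaration.
    for name in (header_charset, meta_charset(body)):
        if name is None:
            continue
        try:
            body.decode(name)
            return name.lower()
        except (LookupError, UnicodeDecodeError):
            continue  # unknown label, or label contradicted by the bytes

    # Assumed last-resort fallback when nothing else fits.
    return "windows-1252"
```

For example, a page labelled iso-8859-1 by both the server and its own preamble, but whose bytes are really UTF-8, would be recorded as UTF-8 rather than stored as mojibake.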

    Jeroen Ruigrok van der Werven <asmodai(-at-)> / asmodai
    イェルーン ラウフロック ヴァン デル ウェルヴェン | | GPG: 2EAC625B
    Though this be madness, yet there is a method in it...
