Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Tom Emerson (tree@basistech.com)
Date: Wed Aug 10 2005 - 14:21:53 CDT


    Philippe Verdy writes:
    > This is absolutely not needed for a charset detector (i.e. the detection of
    > the encoding used to serialize the text). HTML escapes are perfectly valid
    > in HTML, and even if they refer to non Latin-1 characters, this does not
    > change the fact that the page remains encoded in ISO-8859-1.
    >
    > You don't need to take HTML escapes into account when determining which
    > encoding is used, because these escapes are independent of the actual
    > encoding used.

    Agreed. But if you are interested in the language of the page as well
    as the encoding, which some applications do care about, then you have
    to take these escapes into account. And, as I said, building a model
    that accounts for language as well as encoding can help differentiate
    the various Latin-n versions.
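
    To illustrate the distinction, here is a minimal Python sketch. It
    assumes the encoding has already been detected as ISO-8859-1, and it
    uses a crude script count as a stand-in for a real language model.
    The function names decode_page and dominant_script are hypothetical,
    chosen only for this example; html.unescape and unicodedata.name are
    standard-library calls.

```python
import html
import unicodedata


def decode_page(raw: bytes, encoding: str) -> str:
    """Decode the bytes with the detected encoding, then expand HTML escapes."""
    text = raw.decode(encoding)
    return html.unescape(text)


def dominant_script(text: str) -> str:
    """Return the most frequent script among the letters in text.

    Crude heuristic (an assumption for this sketch): the first word of a
    character's Unicode name, e.g. "LATIN" or "ARABIC", names its script.
    """
    counts = {}
    for ch in text:
        if ch.isalpha():
            script = unicodedata.name(ch, "UNKNOWN").split(" ")[0]
            counts[script] = counts.get(script, 0) + 1
    return max(counts, key=counts.get) if counts else "NONE"


# Every byte here is ASCII, so the page is valid ISO-8859-1,
# yet the visible text is Arabic once the escapes are expanded.
raw = b"<p>&#1605;&#1585;&#1581;&#1576;&#1575;</p>"

before = dominant_script(raw.decode("iso-8859-1"))
after = dominant_script(decode_page(raw, "iso-8859-1"))
```

    The byte-level encoding is the same in both cases; only the view
    with the escapes expanded reflects the script a reader actually sees,
    which is what a language identifier needs.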

    -- 
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
     "You can't fake quality any more than you can fake a good meal." (W.S.B.)
    

