RE: Detecting encoding in Plain text

From: Tom Emerson (Tree@basistech.com)
Date: Mon Jan 12 2004 - 11:57:47 EST


    Perhaps a meta question is this: how often are you going to encounter
    unBOMed UTF-32 or UTF-16 text? It's pretty rare --- certainly I've never
    seen it during the development of our language/encoding identifier.
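
    For the common case where a BOM is present, the check is a simple
    prefix comparison. A minimal sketch in Python, using the BOM constants
    from the standard codecs module; the function name sniff_bom and the
    returned encoding labels are illustrative choices, not anything from
    our identifier:

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the
# UTF-16 LE BOM (FF FE), so the longer UTF-32 marks are checked first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length) if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None
```

    A result of None only says that no BOM was found; it says nothing
    about what the encoding actually is.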

    Sure, it's an interesting thought problem, but it doesn't happen.
    And fortunately detecting UTF-8 is relatively easy.
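
    One way to see why UTF-8 is easy: its multi-byte sequences follow a
    strict lead-byte/continuation-byte pattern that legacy 8-bit text
    almost never satisfies by accident. A rough sketch, assuming a strict
    decode is an acceptable well-formedness test and that the buffer is
    not cut off in the middle of a character (looks_like_utf8 is a
    hypothetical helper name):

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if data is well-formed UTF-8 with at least one multi-byte sequence."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # Pure ASCII is valid UTF-8, but it is equally valid in most legacy
    # encodings, so it is not treated as evidence for UTF-8 here.
    return any(b >= 0x80 for b in data)
```

    Treating pure ASCII as "not evidence" is a design choice of this
    sketch; a caller that only needs a safe decoder could accept it as
    UTF-8 instead.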

    The real problem is differentiating among the members of the ISO 8859-x
    family, and between EUC-CN and EUC-KR. These are wonderfully ambiguous.
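
    The ambiguity is easy to demonstrate: the same bytes decode without
    error under several of these encodings, so validity alone cannot pick
    a winner. A small sketch using Python's built-in codecs (gb2312 is
    Python's name for EUC-CN; valid_decodings is a hypothetical helper):

```python
def valid_decodings(data: bytes, candidates):
    """Return {encoding: decoded_text} for each candidate that decodes cleanly."""
    result = {}
    for enc in candidates:
        try:
            result[enc] = data.decode(enc)
        except UnicodeDecodeError:
            continue
    return result


# The same two bytes are a legal character in both EUC encodings.
euc = valid_decodings(b"\xb0\xa1", ["gb2312", "euc_kr"])

# Every byte value is assigned in these ISO 8859 parts, so any input decodes.
latin = valid_decodings(b"\xe4\xf6\xfc", ["iso8859_1", "iso8859_2", "iso8859_5"])
```

    Here euc has two entries and latin has three, each with different
    text for at least some candidates. Choosing among them requires
    knowing which decoded text looks like real language.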

    The key to doing this right is having _a_lot_ of valid training data.
    You also have to deal with oddities of language: I tried one open
    source implementation of the Cavnar and Trenkle algorithm THAT CLAIMED
    THAT SHOUTED ENGLISH WAS ACTUALLY CZECH.
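
    For reference, the general shape of that approach is to rank a
    document's most frequent character n-grams and compare the ranking
    against per-language profiles with an "out-of-place" distance. The
    sketch below is a simplified illustration, not the implementation
    mentioned above, and why that one failed is not known. The tokenizer,
    the top-300 cutoff, and the fold_case switch are assumptions made for
    the example; case folding is only one plausible way to keep all-caps
    text from looking like a different language.

```python
from collections import Counter


def ngram_profile(text: str, max_n: int = 5, top: int = 300, fold_case: bool = True):
    """Map each of the most frequent character n-grams (n = 1..max_n) to its rank."""
    if fold_case:
        text = text.casefold()
    counts = Counter()
    for token in text.split():
        padded = f"_{token}_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams absent from lang_profile get a fixed maximum penalty."""
    max_penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else max_penalty
        for gram, rank in doc_profile.items()
    )


# Toy "training data"; a real profile needs far more text than this.
english = ngram_profile("the quick brown fox jumps over the lazy dog")
folded = ngram_profile("THE QUICK BROWN FOX", fold_case=True)
raw = ngram_profile("THE QUICK BROWN FOX", fold_case=False)
```

    With this toy data, out_of_place(folded, english) is lower than
    out_of_place(raw, english), because almost none of the upper-case
    n-grams appear in a profile built from lower-case text. The language
    with the smallest distance would be reported as the match.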

    It's difficult to separate language detection from encoding
    detection when dealing with non-Unicode text.

        -tree

    --
    Tom Emerson                                          Basis Technology Corp.
    Software Architect                                 http://www.basistech.com
      "Beware the lollipop of mediocrity: lick it once and you suck forever" 
    

