RE: Detecting encoding in Plain text

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Jan 13 2004 - 06:00:50 EST


    Jon Hanna wrote:
    > False positives can be caused by the use of U+0000 (which is
    > most often encoded as 0x00) which some applications do use
    > in text files.

    I have never seen such a thing. Can you give an example?

    I can't imagine any use for a NULL in a file apart from terminating records
    or strings but, of course, a file containing records or strings is not what
    I would call a "plain-text file", or at least not a "typical" plain-text file.

    > The method can be used reliably with text files that are
    > guaranteed to contain large amounts of Latin-1

    But the Latin-1 (or even just ASCII) range contains some characters which
    are shared by most languages (space, new line and/or line feed, digits,
    punctuation), so there should be a relatively large number of Latin-1
    characters in most cases.

    Even scripts which have their own digits or punctuation often prefer
    European digits and punctuation, especially in computer usage. E.g., it
    suffices to check a few websites (or even printed matter) in Arabic to see
    that European digits are much more widespread than native digits.
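
    To make the argument concrete, here is a minimal sketch of the kind of
    zero-byte heuristic under discussion, assuming it is used to guess the
    byte order of UTF-16 text that has no BOM. The function name
    guess_utf16_byte_order and the 0.2 threshold are illustrative assumptions,
    not part of any standard or library. It relies only on the fact that
    Latin-1 characters (spaces, line ends, digits, punctuation) have a zero
    high byte in UTF-16.

```python
from typing import Optional


def guess_utf16_byte_order(data: bytes, threshold: float = 0.2) -> Optional[str]:
    """Guess the UTF-16 byte order from the positions of zero bytes.

    Hypothetical helper: the name and the threshold are assumptions made
    for illustration only.
    Returns "utf-16-le", "utf-16-be", or None when there is no clear signal.
    """
    pairs = len(data) // 2  # number of complete 16-bit code units
    if pairs == 0:
        return None

    # Count zero bytes at even and at odd offsets separately.
    even_zeros = sum(1 for i in range(0, pairs * 2, 2) if data[i] == 0)
    odd_zeros = sum(1 for i in range(1, pairs * 2, 2) if data[i] == 0)
    even_ratio = even_zeros / pairs
    odd_ratio = odd_zeros / pairs

    # Little-endian: the zero high byte of a Latin-1 character comes second.
    if odd_ratio >= threshold and even_ratio < threshold / 2:
        return "utf-16-le"
    # Big-endian: the zero high byte comes first.
    if even_ratio >= threshold and odd_ratio < threshold / 2:
        return "utf-16-be"
    # Few zero bytes, or zeros on both sides (e.g. many U+0000): no guess.
    return None
```

    Under these assumptions, a short Arabic phrase that contains spaces and
    European digits already supplies enough zero high bytes for a guess,
    while the same phrase in UTF-8 contains no zero bytes at all and yields
    None. A file that makes heavy use of U+0000 puts zeros at both offsets
    and also yields None rather than a wrong answer.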

    _ Marco


