From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Tue Jan 13 2004 - 06:00:50 EST
Jon Hanna wrote:
> False positives can be caused by the use of U+0000 (which is
> most often encoded as 0x00) which some applications do use
> in text files.
I have never seen such a thing; can you give an example?
I can't imagine any use for a NULL in a file apart from terminating records or
strings but, of course, a file containing records or strings is not what I
would call a "plain-text file", or at any rate not a "typical" plain-text file.
> The method can be used reliably with text files that are
> guaranteed to contain large amounts of Latin-1
But the Latin-1 (or even just ASCII) range contains some characters which
are shared by most languages (space, carriage return and/or line feed, digits,
punctuation), so there should be a relatively large number of Latin-1
characters in most cases.
Even scripts which have their own digits or punctuation often prefer
European digits and punctuation, especially in computer usage. E.g., it suffices
to check a few websites (or even printed matter) in Arabic to see that
European digits are much more widespread than native digits.
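To make this concrete, here is a rough sketch in Python. It assumes the
method under discussion is to guess the byte order of BOM-less UTF-16 by
looking at where the 0x00 bytes fall; that reading of the method is my
assumption. The function name guess_utf16_order, the 10% threshold and the
sample string are invented for illustration only.

```python
def guess_utf16_order(data: bytes, threshold: float = 0.1):
    """Guess the byte order of BOM-less UTF-16 from the position of 0x00 bytes.

    Characters in the Latin-1 range (U+0000..U+00FF) have a zero high byte,
    so the zero falls on even offsets in big-endian data and on odd offsets
    in little-endian data. Returns "utf-16-be", "utf-16-le" or None.
    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    pairs = len(data) // 2
    if pairs == 0:
        return None
    # Count zero bytes in the first and second byte of each 16-bit unit.
    even_zeros = sum(1 for i in range(0, pairs * 2, 2) if data[i] == 0)
    odd_zeros = sum(1 for i in range(1, pairs * 2, 2) if data[i] == 0)
    # Too few zero bytes: no evidence of UTF-16 at all.
    if max(even_zeros, odd_zeros) / pairs < threshold:
        return None
    if even_zeros > odd_zeros:
        return "utf-16-be"
    if odd_zeros > even_zeros:
        return "utf-16-le"
    return None


# Arabic letters (written as escapes) with European digits, spaces and a
# line feed: 6 of the 13 characters are in the ASCII range.
sample = "\u0633\u0639\u0631 250 \u0631\u064a\u0627\u0644\n"

print(guess_utf16_order(sample.encode("utf-16-le")))
print(guess_utf16_order(sample.encode("utf-16-be")))
print(guess_utf16_order(sample.encode("utf-8")))
# An 8-bit file with NUL-terminated strings, which is the case Jon describes.
print(guess_utf16_order(b"abc\x00def\x00"))
```

The Arabic sample is recognised in both byte orders only because of its
spaces, digits and line feed; the letters themselves contribute no zero
bytes. The last call shows the false positive: 8-bit data whose NUL
terminators happen to fall on odd offsets is taken for little-endian UTF-16.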
_ Marco