From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 10 2005 - 07:52:51 CDT
From: "Samuel Thibault" <samuel.thibault@ens-lyon.org>
> Ritesh, on Wed 10 Aug 2005 12:33:21 +0530, wrote:
>> Now we have a few users who upload files which can contain English and
>> other-language characters (here it is Arabic).
>
> Doesn't the browser tell the charset of the uploaded file?
Typically no. Not if you are just uploading a plain-text file initially
stored in your filesystem, because the browser will just infer the MIME
type of the file from filesystem properties (basically the file extension,
which for plain-text files is typically ".txt" and maps to the "text/plain"
MIME type without any charset indication), without even trying to parse the
content to see whether there is a charset "indicator" in the text file.
What you need, Ritesh, is a way to distinguish a BOM-less UTF-8 text file
from a CP1256 or ISO-Arabic (ISO 8859-6) text file. For that you'll need a
heuristic, because there is no exact algorithm: the detection of the charset
will not always return the right answer.
Typically, you can first parse the file to detect whether it has a leading
BOM. If there is a UTF-8, UTF-16BE, or UTF-16LE leading BOM, you can be
nearly sure that the encoding is correct, because none of these encoded BOMs
looks like the beginning of an ISO-Arabic or CP1256 text file.
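For illustration, a minimal sketch of that BOM check in Python (the BOM
constants and codec names are just the standard library's):

import codecs

def detect_bom(data: bytes):
    """Return (codec, bom_length) implied by a leading BOM, else (None, 0)."""
    for bom, codec in ((codecs.BOM_UTF8, 'utf-8'),          # EF BB BF
                       (codecs.BOM_UTF16_BE, 'utf-16-be'),  # FE FF
                       (codecs.BOM_UTF16_LE, 'utf-16-le')): # FF FE
        if data.startswith(bom):
            return codec, len(bom)
    return None, 0

The caller then just skips the BOM bytes:
codec, n = detect_bom(raw); text = raw[n:].decode(codec)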
Then you'll have to check whether it is a BOM-less UTF-8 or UTF-16 file:
try decoding the file completely, and if decoding succeeds with one of these
encodings, the file is most probably encoded that way. (The chance that this
answer is wrong is extremely low, notably if the file is long enough and
contains enough human language, rather than a collection of symbols and
digits with few Arabic characters.)
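A sketch of that trial decoding, assuming Python's strict codecs (a strict
UTF-8 decode raises on any invalid byte sequence, which is exactly what
makes the test discriminating):

def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if the whole byte sequence is valid in `encoding`."""
    try:
        data.decode(encoding, errors='strict')
    except UnicodeDecodeError:
        return False
    return True

Note that this test is very strong for UTF-8 (legacy 8-bit Arabic text read
as UTF-8 almost always hits an invalid sequence) but weak for BOM-less
UTF-16, since nearly any even-length byte string decodes as UTF-16; for
that case a null-byte or frequency check helps.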
If that trial fails, decode the file with CP1256. This may fail if there
are bytes in the 0x80-0x9F range that have no character mapping. If that
happens, you may then attempt to decode with ISO-Arabic; decoders for it
commonly map every possible byte value to a single, unambiguous character,
so that step will not fail, but you'll get C1 control characters for the
0x80-0x9F range, and those characters are typically not part of plain text.
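A hedged sketch of that fallback cascade (whether a strict decode can fail
at all depends on the codec's own mapping table; Python's cp1256 table, for
instance, assigns all 256 byte values, so there the priority order does all
the work):

def decode_legacy_arabic(data: bytes):
    """Try the candidate 8-bit Arabic encodings in priority order."""
    for encoding in ('cp1256', 'iso-8859-6'):
        try:
            return data.decode(encoding, errors='strict'), encoding
        except UnicodeDecodeError:
            continue   # unmapped byte: fall through to the next candidate
    return None, None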
So once you have determined the charset, the decoded file will contain only
valid Unicode characters. You'll still have to check its internal syntax to
see whether control characters are acceptable for your application. If they
are not, then the file is invalid for your application, and probably not
plain text if it contains C1 controls, or C0 controls outside {CR, LF, TAB,
FF, SUB} (SUB being Ctrl-Z, the DOS end-of-file marker).
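That last check is easy to express; a minimal sketch, using the whitelist
from the paragraph above:

# CR, LF, TAB, FF, plus the DOS end-of-file marker Ctrl-Z (SUB, U+001A).
ALLOWED_CONTROLS = {0x0D, 0x0A, 0x09, 0x0C, 0x1A}

def is_plausible_plain_text(text: str) -> bool:
    """Reject text containing C1 controls, or C0 controls not whitelisted."""
    for cp in map(ord, text):
        if (cp < 0x20 or 0x80 <= cp <= 0x9F) and cp not in ALLOWED_CONTROLS:
            return False
    return True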
Note that this SUB (Ctrl-Z) control character may be present in plain text
edited with MS-DOS tools, where it served as an end-of-file marker. Its
presence may indicate that the text file was in fact not encoded with CP1256
or ISO-Arabic but with a DOS Arabic codepage, so you may retry decoding with
that DOS codepage (again, the decoding will typically not fail, since DOS
codepages assign nearly every byte value). This control character is only
valid at the end of the file, where it can simply be ignored. It may even
survive in a UTF-8 or UTF-16 file, if the original was edited under DOS and
transcoded sometime in the past; in that case too, the character should just
be stripped.
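Which DOS codepage applies depends on where the file came from; CP720 and
CP864 are the two usual DOS Arabic codepages, so a sketch can simply try
both and strip a trailing Ctrl-Z:

def decode_dos_arabic(data: bytes):
    """Retry with common DOS Arabic codepages; drop a trailing EOF marker."""
    for encoding in ('cp720', 'cp864'):
        try:
            text = data.decode(encoding, errors='strict')
        except UnicodeDecodeError:
            continue
        return text.rstrip('\x1a'), encoding
    return None, None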