Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 10 2005 - 07:52:51 CDT

    From: "Samuel Thibault" <samuel.thibault@ens-lyon.org>
    > Ritesh, on Wed 10 Aug 2005 12:33:21 +0530, wrote:
    >> Now we have a few users who upload files which can contain English and
    >> other-language characters (here it is Arabic).
    >
    > Doesn't the browser tell the charset of the uploaded file?

    Typically no. Not if you are just uploading a plain-text file initially
    stored in your filesystem, because the browser will just figure out the MIME
    type of the file according to the filesystem properties (basically the file
    extension, which for plain-text files is typically ".txt" and maps to the
    "text/plain" MIME type without charset indication), without even trying to
    parse its content (to see if there's a charset "indicator" in the text
    file).

    What you, Ritesh, need is a way to make a distinction between a BOM-less
    UTF-8 text file and a CP1256 or ISO-Arabic text file. For that you'll need
    a heuristic, because there is no exact algorithm: the detection of the
    charset will not always return the right answer.

    Typically, you can first parse the file to detect whether it has a leading
    BOM. If there's a UTF-8, UTF-16BE or UTF-16LE leading BOM, you can be
    nearly sure that the encoding is correct, because none of these encoded
    BOMs would look like the beginning of an ISO-Arabic or CP1256 text file.
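
    In Python, that BOM check might be sketched like this (detect_bom is a
    name of my own; I ignore the UTF-32 BOMs here, as the paragraph above
    does, even though the UTF-16LE BOM bytes FF FE also start a UTF-32LE BOM):

        from typing import Optional

        def detect_bom(data: bytes) -> Optional[str]:
            """Return the encoding announced by a leading BOM, or None."""
            if data.startswith(b'\xef\xbb\xbf'):
                return 'utf-8-sig'   # Python codec that also skips the BOM
            if data.startswith(b'\xfe\xff'):
                return 'utf-16-be'
            if data.startswith(b'\xff\xfe'):
                return 'utf-16-le'
            return None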

    Then you'll have to check if it is a BOM-less UTF-8 or UTF-16 file: try
    decoding the file completely, and if it succeeds with one of these
    encodings, the file is most probably encoded with that encoding (the
    chances that the answer will be wrong are extremely low, notably if the
    file is long enough and contains enough human language, rather than a
    collection of symbols and digits with few Arabic characters).
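
    The trial decoding is just a strict decode that reports failure instead
    of raising an error (try_decode is my own helper name). One caveat:
    almost any even-length byte sequence decodes "successfully" as UTF-16,
    so that test is much weaker than the UTF-8 one:

        from typing import Optional

        def try_decode(data: bytes, encoding: str) -> Optional[str]:
            """Return the decoded text, or None if strict decoding fails."""
            try:
                return data.decode(encoding, errors='strict')
            except UnicodeDecodeError:
                return None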

    If this now fails, decode it with CP1256. This may fail if there are some
    bytes in the 0x80-0x9F range that have no character mapping. If this
    happens, you may then attempt to decode with ISO-Arabic (it will never
    fail, because the ISO-Arabic charset is complete and has an unambiguous
    single character mapping for each possible byte value; however you'll get
    C1 control characters for the 0x80-0x9F range, and these characters are
    typically not part of plain text).
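
    A sketch of that fallback chain (decode_arabic is my own name;
    'iso-8859-6' is Python's codec for ISO-Arabic). Note that Python's codec
    actually rejects the byte values ISO 8859-6 leaves unassigned, so the
    "never fails" property depends on your decoder; errors='replace' gives a
    guaranteed last resort:

        def decode_arabic(data: bytes) -> str:
            """Decode as CP1256, falling back to ISO-Arabic (ISO 8859-6)."""
            for encoding in ('cp1256', 'iso-8859-6'):
                try:
                    return data.decode(encoding, errors='strict')
                except UnicodeDecodeError:
                    continue
            # Last resort: never fails, but replaces unmappable bytes
            # with U+FFFD so the caller can still inspect the text.
            return data.decode('iso-8859-6', errors='replace')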

    So when you have finished determining the charset, the decoded file will
    contain only valid Unicode characters. You'll still have to check its
    internal syntax to see if control characters are acceptable for your
    application. If they are not, then the file is invalid for your
    application, and probably not plain text if it contains C1 controls, or
    C0 controls not in {CR, LF, TAB, FF, SUB}.
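
    That plain-text sanity check could look like this (looks_like_plain_text
    is my own name; SUB is tolerated for the reason given in the next
    paragraph):

        ALLOWED_CONTROLS = {'\t', '\n', '\r', '\f', '\x1a'}  # TAB LF CR FF SUB

        def looks_like_plain_text(text: str) -> bool:
            """Reject C1 controls and any unexpected C0 control."""
            for ch in text:
                if ch in ALLOWED_CONTROLS:
                    continue
                code = ord(ch)
                if code < 0x20 or 0x7F <= code <= 0x9F:  # C0, DEL, C1
                    return False
            return True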

    Note that the SUB control character (Ctrl-Z, U+001A), which MSDOS tools
    used as an end-of-file marker, may be present in plain text edited with
    those tools. Its presence may indicate that the text file was in fact not
    encoded with CP1256 or ISO-Arabic, but with a DOS Arabic codepage, so you
    may reattempt decoding with that DOS codepage (the decoding will not fail,
    because such a codepage is complete, like the ISO-Arabic charset). This
    control character is only valid at the end of the file, where it can
    simply be ignored. It may also survive if the original file was edited
    under DOS and transcoded to UTF-8 or UTF-16 at some time in the past; in
    that case too, it should be ignored.
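
    Putting the whole heuristic together, reusing the detect_bom, try_decode
    and looks_like_plain_text helpers sketched above ('cp720' and 'cp864' are
    Python's DOS Arabic codepages; which one, if either, applies to your
    files is an assumption you must verify). The DOS codepages come last
    because they decode almost anything, so they should only catch what the
    more specific charsets reject:

        def detect_and_decode(data: bytes):
            """Return (encoding, text) for the first plausible charset."""
            bom = detect_bom(data)
            if bom is not None:
                # 'utf-8-sig' strips the BOM itself; for the UTF-16 codecs,
                # drop the decoded BOM (U+FEFF) manually.
                return bom, data.decode(bom).lstrip('\ufeff')
            # BOM-less UTF-16 is omitted from the chain because its trial
            # decode almost never fails (see the caveat above).
            for encoding in ('utf-8', 'cp1256', 'iso-8859-6',
                             'cp720', 'cp864'):
                text = try_decode(data, encoding)
                if text is None:
                    continue
                # Drop a trailing DOS end-of-file marker (Ctrl-Z) before
                # checking for other suspicious control characters.
                text = text.rstrip('\x1a')
                if looks_like_plain_text(text):
                    return encoding, text
            raise ValueError('no plausible charset found')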


