RE: Charset determination (was: Cp1256 (Windows Arabic) Characters not supported by UTF8)

From: Jony Rosenne (rosennej@qsm.co.il)
Date: Thu Aug 11 2005 - 15:02:51 CDT

    I found that simple digram analysis is sufficient to distinguish between
    English and several encodings of Hebrew.

    Jony
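
A minimal sketch of what such a digram analysis could look like, in Python. The sample sentences, the `PROFILES` table, and the use of cosine similarity are illustrative assumptions, not the method actually used. It counts adjacent byte pairs and picks the closest reference profile. ISO-8859-8 is left out because it places the Hebrew letters at the same byte values as Windows-1255, so unpointed text looks identical in both.

```python
from collections import Counter
import math


def digram_profile(data: bytes) -> Counter:
    """Count adjacent byte pairs (digrams) in a byte string."""
    return Counter(zip(data, data[1:]))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two digram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training samples (assumption: real profiles would be built from large corpora).
ENGLISH = "the quick brown fox jumps over the lazy dog and then the other thing"
HEBREW = "שלום עולם זה טקסט קצר בעברית לצורך הדגמה של השיטה"

# Hypothetical candidate set: one English profile plus three Hebrew encodings.
HEBREW_CODECS = ("utf-8", "cp1255", "cp862")
PROFILES = {"english/ascii": digram_profile(ENGLISH.encode("ascii"))}
for codec in HEBREW_CODECS:
    PROFILES[f"hebrew/{codec}"] = digram_profile(HEBREW.encode(codec))


def classify(data: bytes) -> str:
    """Return the label of the reference profile most similar to the input bytes."""
    profile = digram_profile(data)
    return max(PROFILES, key=lambda name: cosine(profile, PROFILES[name]))
```

For example, `classify("זה טקסט בעברית".encode("cp1255"))` selects the `hebrew/cp1255` profile, because none of the other profiles share any byte pair with that input.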

    > -----Original Message-----
    > From: unicode-bounce@unicode.org
    > [mailto:unicode-bounce@unicode.org] On Behalf Of Tom Emerson
    > Sent: Thursday, August 11, 2005 3:31 PM
    > To: Philippe Verdy
    > Cc: Doug Ewell; Unicode Mailing List
    > Subject: Re: Cp1256 (Windows Arabic) Characters not supported by UTF8
    >
    >
    > Philippe Verdy writes:
    > > For the case of Arabic, the first indicator is effectively the
    > > alphabet, but I think there are similar usage patterns that help
    > > distinguish Arabic from Urdu. Anyway, the various encodings used
    > > for the Arabic script will be easily determined by letter
    > > occurrence statistics.
    >
    > Indeed: I wrote a detector for Arabic encodings that did exactly this:
    > it could differentiate between ISO-8859-6, Windows CP1256,
    > Unicode transformation formats, and ASMO-449. In this particular
    > application it was known a priori that the text was Arabic, just not
    > the encoding.
    >
    > I've found that unigram frequencies are usually enough to
    > differentiate Arabic from Persian, and bigram frequencies enough to
    > differentiate Arabic, Persian, Urdu, Pashto, and Kurdish when using an
    > encoding that supports all of the writing systems. I have not looked
    > at Uighur, though I expect bigrams will be enough there as well.
    >
    > One problem I've experienced with Urdu is the large number of
    > font-specific encodings that are out there: historically, few pages
    > have used Unicode, opting instead for a custom font and a unique
    > encoding, which may include presentation forms. This is where
    > metadata (declared lang attributes, font names, or URL information)
    > becomes absolutely necessary to identify the possible range of
    > encodings.
    >
    > Peace,
    >
    > -tree
    >
    > --
    > Tom Emerson
    > Basis Technology Corp.
    > Software Architect
    > http://www.basistech.com
    > "You can't fake quality any more than you can fake a good meal." (W.S.B.)
    >
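
The detector described in the quoted message is not shown. As a rough sketch of the general idea (decode the bytes under each candidate encoding, then score the result against Arabic letter frequencies), something like the following could work in Python. The reference sentence, the `CANDIDATES` list, and the scoring formula are assumptions for illustration only. UTF-16 and ASMO-449 are omitted; the latter, as far as I know, has no built-in Python codec.

```python
from collections import Counter

# Toy reference text (assumption: a real detector would use corpus-derived frequencies).
REFERENCE_TEXT = "هذا نص عربي قصير يستخدم لحساب تكرار الحروف في اللغة العربية"
_REF = Counter(REFERENCE_TEXT)
_REF_TOTAL = sum(_REF.values())

# Hypothetical candidate list, limited to codecs that ship with Python.
CANDIDATES = ("utf-8", "cp1256", "iso8859_6")


def score(text: str) -> float:
    """Average reference frequency of the characters in the decoded text."""
    if not text:
        return 0.0
    return sum(_REF[ch] for ch in text) / (_REF_TOTAL * len(text))


def detect_encoding(data: bytes):
    """Return the candidate whose decoding looks most like Arabic, or None."""
    best_name, best_score = None, -1.0
    for name in CANDIDATES:
        try:
            text = data.decode(name)
        except UnicodeDecodeError:
            continue  # bytes are not valid in this encoding at all
        s = score(text)
        if s > best_score:
            best_name, best_score = name, s
    return best_name
```

The design relies on the fact that a wrong decoding either fails outright or yields improbable characters: ISO-8859-6 bytes read as CP1256, for instance, turn lam into noon and yeh into a Latin "ê", which drags the score down.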


