From: Tom Emerson (tree@basistech.com)
Date: Thu Aug 11 2005 - 08:30:32 CDT
Philippe Verdy writes:
> For the case of Arabic, the first indicator is effectively the alphabet, but
> I think that there are similar usage patterns that help distinguish
> between Arabic and Urdu. Anyway, the various encodings used for the Arabic
> script will be easily determined by letter occurrence statistics.
Indeed: I wrote a detector for Arabic encodings that did exactly this.
It could differentiate between ISO-8859-6, Windows CP1256, the
Unicode transformation formats, and ASMO-449. In this particular
application it was known a priori that the text was Arabic; only
the encoding was unknown.
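
To illustrate detection by letter statistics, here is a minimal Python
sketch. It is not the detector described above: the reference sample, the
function names, and the probability floor are all assumptions made for
illustration, and a real detector would train on a large corpus. ASMO-449
and UTF-16 are left out, on the assumption that only codecs shipped with
Python's standard library are available. Each candidate encoding is used to
decode the bytes, and the decoding whose letters best match Arabic letter
frequencies wins.

```python
import math
from collections import Counter

# Tiny undiacritized Arabic sample used as a stand-in for a real corpus
# (assumption: far too small for production use).
REFERENCE = (
    "يولد جميع الناس أحرارا متساوين في الكرامة والحقوق وقد وهبوا عقلا "
    "وضميرا وعليهم أن يعامل بعضهم بعضا بروح الإخاء"
)

# Python codec names for the candidate encodings (illustrative subset).
CANDIDATES = ("utf_8", "cp1256", "iso8859_6")

# Probability assigned to characters never seen in the reference (assumption).
FLOOR = 1e-6


def build_profile(sample):
    """Return relative frequencies of the non-whitespace characters."""
    counts = Counter(ch for ch in sample if not ch.isspace())
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def score(text, profile):
    """Mean log-probability of the non-whitespace characters of text."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return float("-inf")
    return sum(math.log(profile.get(ch, FLOOR)) for ch in chars) / len(chars)


def detect_encoding(data, profile, candidates=CANDIDATES):
    """Return the candidate whose decoding looks most like the profile."""
    best, best_score = None, float("-inf")
    for name in candidates:
        try:
            text = data.decode(name)  # strict: undecodable bytes rule it out
        except UnicodeDecodeError:
            continue
        s = score(text, profile)
        if s > best_score:
            best, best_score = name, s
    return best


if __name__ == "__main__":
    profile = build_profile(REFERENCE)
    raw = "السلام عليكم ورحمة الله وبركاته".encode("iso8859_6")
    print(detect_encoding(raw, profile))
```

Decoding ISO-8859-6 bytes as CP1256 succeeds but shifts many letters and
turns some into Latin accented characters, so the frequency score, not the
decode step, is what separates those two.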
I've found that unigram frequencies are usually enough to
differentiate Arabic from Persian, and bigram frequencies enough to
differentiate Arabic, Persian, Urdu, Pashto, and Kurdish when using an
encoding that supports all of the writing systems. I have not looked
at Uighur, though I expect bigrams will be enough there as well.
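
A minimal Python sketch of n-gram language identification along these lines
follows. It is an illustration under stated assumptions, not the method used
in the experiments mentioned above: the two training samples are tiny
placeholders, only Arabic and Persian are covered, and the names `train`,
`classify`, and `FLOOR` are hypothetical. Setting `n=1` gives unigram
profiles and `n=2` gives bigram profiles.

```python
import math
from collections import Counter

# Tiny placeholder training samples (assumption: a real system would use
# much larger corpora and more languages).
SAMPLES = {
    "arabic": (
        "يولد جميع الناس أحرارا متساوين في الكرامة والحقوق وقد وهبوا عقلا "
        "وضميرا وعليهم أن يعامل بعضهم بعضا بروح الإخاء"
    ),
    "persian": (
        "تمام افراد بشر آزاد به دنیا می آیند و از لحاظ حیثیت و حقوق با هم "
        "برابرند همه دارای عقل و وجدان می باشند و باید نسبت به یکدیگر با "
        "روح برادری رفتار کنند"
    ),
}

# Probability assigned to n-grams never seen in training (assumption).
FLOOR = 1e-6


def ngrams(text, n):
    """Character n-grams of text, with runs of whitespace collapsed."""
    text = " ".join(text.split())
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n):
    """Build one relative-frequency n-gram profile per language."""
    profiles = {}
    for lang, sample in samples.items():
        counts = Counter(ngrams(sample, n))
        total = sum(counts.values())
        profiles[lang] = {g: c / total for g, c in counts.items()}
    return profiles


def classify(text, profiles, n):
    """Return the language whose profile gives the highest mean log-prob."""
    grams = ngrams(text, n)
    if not grams:
        raise ValueError("text is shorter than n characters")

    def mean_log_prob(profile):
        return sum(math.log(profile.get(g, FLOOR)) for g in grams) / len(grams)

    return max(profiles, key=lambda lang: mean_log_prob(profiles[lang]))


if __name__ == "__main__":
    for n in (1, 2):
        profiles = train(SAMPLES, n)
        print(n, classify("این کتاب برای دانش آموزان است", profiles, n))
```

With samples this small, unigrams already separate the two languages mostly
because Persian text uses letters such as U+06CC and U+06A9 where Arabic
uses U+064A and U+0643; closer pairs would need the bigram profiles and far
more training data.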
One problem I've experienced with Urdu is the large number of
font-specific encodings that are out there: historically, few pages
have used Unicode, opting instead for a custom font with its own
encoding, which may include presentation forms. This is where
metadata (declared lang attributes, font names, or URL information)
becomes absolutely necessary to narrow down the range of possible
encodings.
Peace,
-tree
-- 
Tom Emerson, Basis Technology Corp.
Software Architect
http://www.basistech.com
"You can't fake quality any more than you can fake a good meal." (W.S.B.)