Re: Cp1256 (Windows Arabic) Characters not supported by UTF8

From: Tom Emerson (tree@basistech.com)
Date: Thu Aug 11 2005 - 08:30:32 CDT

Next message: Doug Ewell: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

Previous message: Philippe Verdy: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
In reply to: Philippe Verdy: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Next in thread: Doug Ewell: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Doug Ewell: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Patrick Andries: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Reply: Jony Rosenne: "RE: Charset determination (was: Cp1256 (Windows Arabic) Characters not supported by UTF8)"

Philippe Verdy writes:
> For the case of Arabic, the first indicator is effectively the alphabet, but
> I think that there are similar usage patterns that help distinguish
> between Arabic and Urdu. Anyway, the various encodings used for the Arabic
> script will be easily determined by letter occurrence statistics.

Indeed: I wrote a detector for Arabic encodings that did exactly this.
It could differentiate between ISO-8859-6, Windows CP1256, the
Unicode transformation formats, and ASMO-449. In this particular
application it was known a priori that the text was Arabic; only
the encoding was unknown.
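
As a rough illustration of the idea (not the detector described above),
here is a minimal Python sketch. It decodes the bytes under each
candidate encoding and compares the resulting letter frequencies with a
reference profile built from known Arabic text. The names
unigram_profile and guess_encoding, the candidate list, and the choice
of cosine similarity are all assumptions made for this example.
ASMO-449 is left out because, as far as I know, Python ships no codec
for it.

```python
import math
from collections import Counter

# Candidate encodings, using Python codec names (an assumed list).
CANDIDATES = ["utf_8", "cp1256", "iso8859_6", "utf_16"]


def unigram_profile(text):
    """Count how often each letter occurs; non-letters are ignored."""
    return Counter(ch for ch in text if ch.isalpha())


def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * q.get(ch, 0) for ch, count in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def guess_encoding(data, reference, candidates=CANDIDATES):
    """Return (encoding, score) for the best-scoring candidate.

    `reference` is a unigram profile built from text known to be Arabic.
    Candidates that cannot decode the bytes at all are skipped.
    """
    best_encoding, best_score = None, -1.0
    for encoding in candidates:
        try:
            text = data.decode(encoding)
        except UnicodeDecodeError:
            continue
        score = cosine(unigram_profile(text), reference)
        if score > best_score:
            best_encoding, best_score = encoding, score
    return best_encoding, best_score
```

The interesting case is CP1256 versus ISO-8859-6: both decode many of
the same bytes to Arabic letters, but to different ones, so the wrong
guess yields a letter distribution that matches the reference poorly.
A real detector would need a reference profile built from a much larger
corpus than a sentence or two.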

I've found that unigram frequencies are usually enough to
differentiate Arabic from Persian, and bigram frequencies enough to
differentiate Arabic, Persian, Urdu, Pashto, and Kurdish when using an
encoding that supports all of the writing systems. I have not looked
at Uighur, though I expect bigrams will be enough there as well.
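
A bigram comparison along these lines can be sketched in Python as
follows. This is only an illustration under stated assumptions: the
function names, the word-internal bigram definition, the cosine
similarity measure, and the training samples passed in by the caller
are all choices made for the example, not details of the work described
above.

```python
import math
from collections import Counter


def bigram_profile(text):
    """Count adjacent letter pairs inside each whitespace-separated word."""
    counts = Counter()
    for word in text.split():
        letters = [ch for ch in word if ch.isalpha()]
        counts.update(a + b for a, b in zip(letters, letters[1:]))
    return counts


def cosine(p, q):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * q.get(key, 0) for key, count in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def train(samples):
    """Build one bigram profile per language from {language: text}."""
    return {lang: bigram_profile(text) for lang, text in samples.items()}


def identify(text, profiles):
    """Return the language whose profile is most similar to `text`."""
    query = bigram_profile(text)
    return max(profiles, key=lambda lang: cosine(query, profiles[lang]))
```

With profiles trained on a reasonable amount of text per language,
identify(text, profiles) returns the key of the closest profile. Tiny
training samples will only separate languages whose letter inventories
or common words differ visibly.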

One problem I've experienced with Urdu is the large number of
font-specific encodings that are out there: historically, few pages
have used Unicode, opting instead for a custom font with its own
encoding, which may include presentation forms. In these cases,
metadata (declared lang attributes, font names, or URL information)
is absolutely necessary to narrow down the range of possible
encodings.
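
As a hedged sketch of how such metadata might be gathered from a page,
the following Python collects declared lang attributes and font names
using the standard html.parser module. The class and function names are
made up for this example, and mapping a font name to its encoding would
still require a table that has to be compiled separately.

```python
from html.parser import HTMLParser


class HintCollector(HTMLParser):
    """Collect lang attributes and <font face="..."> names from markup."""

    def __init__(self):
        super().__init__()
        self.langs = set()
        self.fonts = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lowercase.
        for name, value in attrs:
            if not value:
                continue
            if name == "lang":
                self.langs.add(value.lower())
            elif tag == "font" and name == "face":
                # A face attribute may list several comma-separated fonts.
                for font in value.split(","):
                    if font.strip():
                        self.fonts.add(font.strip())


def collect_hints(html_text):
    """Return (langs, fonts) found in the given HTML text."""
    collector = HintCollector()
    collector.feed(html_text)
    return collector.langs, collector.fonts
```

Since tag and attribute names are ASCII, the raw bytes of a page in an
ASCII-compatible encoding can be decoded as Latin-1 just to read the
markup, before the encoding of the text content is known.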

Peace,

-tree

-- 
Tom Emerson                                          Basis Technology Corp.
Software Architect                                 http://www.basistech.com
 "You can't fake quality any more than you can fake a good meal." (W.S.B.)


This archive was generated by hypermail 2.1.5 : Thu Aug 11 2005 - 08:31:28 CDT