RE: Charset determination (was: Cp1256 (Windows Arabic) Characters not supported by UTF8)

From: Jony Rosenne (rosennej@qsm.co.il)
Date: Thu Aug 11 2005 - 15:02:51 CDT

Next message: Michael Everson: "HTML notation"

Previous message: Andy Heninger: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
In reply to: Tom Emerson: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"
Next in thread: Doug Ewell: "Re: Cp1256 (Windows Arabic) Characters not supported by UTF8"

I found that simple digram analysis is sufficient to distinguish between
English and several encodings of Hebrew.
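
A minimal sketch of what such a digram analysis might look like, not
the code I actually used. It assumes the candidates are plain ASCII
English and Hebrew in cp1255, cp862 or UTF-8, and it trains on a single
verse, which is far too little for real use. The names digrams,
build_models, score and guess are made up for the illustration.

```python
import math
from collections import Counter

# Tiny illustrative training samples (Genesis 1:1). Assumption: a real
# detector would be trained on a much larger corpus per candidate.
HEBREW_SAMPLE = "בראשית ברא אלהים את השמים ואת הארץ"
ENGLISH_SAMPLE = "In the beginning God created the heaven and the earth"


def digrams(data: bytes) -> Counter:
    """Count adjacent byte pairs in data."""
    return Counter(zip(data, data[1:]))


def build_models() -> dict:
    """Train one byte-digram model per candidate language/encoding pair."""
    models = {"english/ascii": digrams(ENGLISH_SAMPLE.encode("ascii"))}
    for enc in ("cp1255", "cp862", "utf-8"):
        models["hebrew/" + enc] = digrams(HEBREW_SAMPLE.encode(enc))
    return models


def score(data: bytes, model: Counter) -> float:
    """Log-likelihood of data under a model, with add-one smoothing."""
    total = sum(model.values()) + 256 * 256  # 65536 possible byte digrams
    return sum(n * math.log((model[pair] + 1) / total)
               for pair, n in digrams(data).items())


def guess(data: bytes, models: dict) -> str:
    """Return the name of the model under which data is most likely."""
    return max(models, key=lambda name: score(data, models[name]))
```

Calling guess(raw_bytes, build_models()) returns a label such as
"hebrew/cp862". Working on raw bytes rather than decoded characters is
the point: the same Hebrew text yields different byte pairs in each
encoding, so the digram counts identify the encoding and the language
together.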

Jony

> -----Original Message-----
> From: unicode-bounce@unicode.org
> [mailto:unicode-bounce@unicode.org] On Behalf Of Tom Emerson
> Sent: Thursday, August 11, 2005 3:31 PM
> To: Philippe Verdy
> Cc: Doug Ewell; Unicode Mailing List
> Subject: Re: Cp1256 (Windows Arabic) Characters not supported by UTF8
>
>
> Philippe Verdy writes:
> > For the case of Arabic, the first indicator is effectively the alphabet,
> > but I think that there are similar usage patterns that help make the
> > distinction between Arabic and Urdu. Anyway, the various encodings used
> > for the Arabic script will be easily determined by letter occurrence
> > statistics.
>
> Indeed: I wrote a detector for Arabic encodings that did exactly this,
> in that it could differentiate between ISO-8859-6, Windows CP1256,
> Unicode transformation formats, and ASMO-449. In this particular
> application it was known a priori that the text was Arabic, just not
> the encoding.
>
> I've found that unigram frequencies are usually enough to
> differentiate Arabic from Persian, and bigram frequencies enough to
> differentiate Arabic, Persian, Urdu, Pashto, and Kurdish when using an
> encoding that supports all of the writing systems. I have not looked
> at Uighur, though I expect bigrams will be enough there as well.
>
> One problem I've experienced with Urdu is the large number of
> font-specific encodings that are out there: historically, few pages
> have used Unicode, opting instead for a custom font and a unique
> encoding that may include presentation forms. This is when using
> metadata, whether declared lang attributes, font names, or URL
> information, is absolutely necessary to identify the possible ranges
> of encodings.
>
> Peace,
>
> -tree
>
> --
> Tom Emerson
> Software Architect
> Basis Technology Corp.
> http://www.basistech.com
> "You can't fake quality any more than you can fake a good meal." (W.S.B.)
>
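
To illustrate the unigram point in the quoted message for the simplest
case, Arabic versus Persian in Unicode text, here is a rough sketch. It
is a deliberate simplification and not Tom's method: instead of full
letter-frequency profiles it only counts a few marker letters that the
two orthographies normally encode differently. The marker sets and the
name guess_language are assumptions made for the example.

```python
from collections import Counter

# Letters typical of Persian text: peh, tcheh, jeh, gaf, keheh, Farsi yeh.
PERSIAN_MARKERS = set("\u067E\u0686\u0698\u06AF\u06A9\u06CC")
# Letters typical of Arabic text: kaf, yeh, teh marbuta.
# Assumption: these are rare in correctly encoded Persian text.
ARABIC_MARKERS = set("\u0643\u064A\u0629")


def guess_language(text: str) -> str:
    """Guess 'arabic' or 'persian' from marker-letter unigram counts."""
    counts = Counter(text)
    persian = sum(counts[ch] for ch in PERSIAN_MARKERS)
    arabic = sum(counts[ch] for ch in ARABIC_MARKERS)
    if persian == arabic:
        return "unknown"  # no markers at all, or a tie
    return "persian" if persian > arabic else "arabic"
```

This breaks down on pages that use Arabic kaf and yeh for Persian text,
which is one reason full frequency statistics, and bigrams for the
closer language pairs, do better than a fixed marker list.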


This archive was generated by hypermail 2.1.5 : Thu Aug 11 2005 - 14:04:56 CDT