From: David Starner (prosfilaes@gmail.com)
Date: Thu Oct 25 2007 - 05:41:19 CDT
On 10/25/07, Don Osborn <dzo@bisharat.net> wrote:
> I suspect this is a general problem going back to a lack of OCR that
> recognizes extended characters, or at least the scanning of this particular
> book did not recognize the characters.
It's hard to get good OCR without knowing what characters you're
looking for. An O with ~ above, after real-life typesetting and
scanning, could be a macron, a circumflex, or a tilde, or just a bare O
with a smear of ink.
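To make the ambiguity concrete, here is a minimal sketch that lists the
distinct code points an OCR engine has to choose between for that one
glyph. It uses only Python's standard unicodedata module; the helper
name candidate_readings and the choice of three marks are my own
illustration, not anything an OCR product exposes.

```python
import unicodedata

# Combining marks that can look alike above a capital O after
# printing and scanning (illustrative selection, not exhaustive).
CANDIDATE_MARKS = {
    "tilde": "\u0303",       # COMBINING TILDE
    "macron": "\u0304",      # COMBINING MACRON
    "circumflex": "\u0302",  # COMBINING CIRCUMFLEX ACCENT
}


def candidate_readings(base="O"):
    """Return {mark name: (precomposed character, Unicode name)}."""
    readings = {}
    for mark_name, mark in CANDIDATE_MARKS.items():
        # NFC folds base + combining mark into one precomposed character.
        composed = unicodedata.normalize("NFC", base + mark)
        readings[mark_name] = (composed, unicodedata.name(composed))
    return readings


for mark_name, (char, name) in candidate_readings().items():
    print(f"{mark_name}: U+{ord(char):04X} {name}")
```

Each reading is a different character as far as the text is concerned,
so an engine that does not know which ones the orthography uses has
little to go on beyond the shape of the smudge.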
> Is anyone aware of an OCR system that recognizes extended Latin characters
> from say Extended A&B, IPA, and Extended Additional ranges? That is for any
> language (orthography) including these characters?
ABBYY supports most of Latin Extended-A and some of Latin Extended-B
and Latin Extended Additional. The list of supported languages is
<http://www.abbyy.com/finereader8/?param=44927>, which should map to
the list of supported characters. It would be hard, if not impossible,
to build and test OCR for a character without a substantial corpus of
material using it; I suspect many languages are on ABBYY's list only
because their orthography is a subset of the characters already
supported for other reasons.
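The subset idea can be checked mechanically. The following is only a
sketch: the SUPPORTED string and the sample alphabet are hypothetical
placeholders I made up for illustration, not ABBYY's actual character
list or a complete orthography, and unsupported_characters is my own
helper name.

```python
import unicodedata


def unsupported_characters(alphabet, supported):
    """Return the characters of `alphabet` missing from `supported`, sorted."""
    # Normalize to NFC so precomposed and decomposed forms compare equal.
    wanted = set(unicodedata.normalize("NFC", alphabet))
    available = set(unicodedata.normalize("NFC", supported))
    return sorted(wanted - available)


# Hypothetical supported set: basic Latin lowercase plus three extended
# letters. A real check would use the vendor's published list instead.
SUPPORTED = "abcdefghijklmnopqrstuvwxyz" + "\u025b\u0254\u014b"

# Hypothetical alphabet sample that uses three hooked letters.
sample = "ab\u0253d\u0257k\u0199"

missing = unsupported_characters(sample, SUPPORTED)
print([f"U+{ord(c):04X}" for c in missing])
```

An empty result would mean the orthography comes along for free with
whatever was supported for other languages; anything left over is a
character the engine was presumably never trained or tested on.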
> I've been discussing scanning of African language materials as part of books
> online programs. The good news is a little of that has been started, but it
> is definitely not good news if the scanning is being done (in some or all
> cases) without the right OCR.
Why? Once you have the scans, you can always re-OCR them. No
automated scanning program is going to handle unusual text like
African language materials as well as someone who's focused on and
familiar with them.