From: Mark E. Shoulson (mark@kli.org)
Date: Tue Jan 31 2006 - 14:42:32 CST
Kent_Spielmann@sil.org wrote:
>Does anyone know of an OCR software solution that permits mapping to the full
>Unicode character set as output from the character recognition process?
>This needs to include mapping to base character+combining character
>combinations.
>
>
Ow. OCR to *full* Unicode sounds like it would have a lot of potential
problems. Within any given alphabet, you can usually count on letters to
look somewhat different from each other. But so many Unicode characters
resemble one another: how is the program to know which to output? An "A"
might be Latin, Greek, or Cyrillic, and all three would look identical
(not merely "similar").
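
To make the "A" problem concrete, here is a minimal sketch in Python (my
choice of language, nothing the original question specified). It only
uses the standard unicodedata module to show that the three capital
letters are distinct code points even though most fonts draw them the
same way. The names lookalikes and describe are made up for the example.

```python
import unicodedata

# Three capital letters that most fonts render identically,
# one each from the Latin, Greek, and Cyrillic scripts.
lookalikes = ["\u0041", "\u0391", "\u0410"]

def describe(ch):
    """Return the code point and formal Unicode name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

descriptions = [describe(ch) for ch in lookalikes]
for line in descriptions:
    print(line)
```

An OCR engine sees only the shape, so nothing in the image itself says
which of these three code points to emit.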
Spelling dictionaries will help, as will heuristics like "the letters
are probably all in the same alphabet" (though an alphabet may have to
be defined across blocks, as with IPA).
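
The "same alphabet" heuristic could be sketched roughly as follows. This
is an illustration under stated assumptions, not how any real OCR
product works: it assumes the recognizer hands back, for each glyph, a
list of candidate characters it cannot tell apart, and it guesses a
character's script from the first word of its Unicode name, because the
Python standard library has no lookup for the actual Script property.
The names script_of and resolve are hypothetical.

```python
import unicodedata
from collections import Counter

def script_of(ch):
    # Rough proxy for the script: the first word of the character name,
    # e.g. "LATIN", "GREEK", "CYRILLIC". This is an assumption made for
    # the sketch, not the real Unicode Script property.
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def resolve(candidates_per_glyph):
    """Pick one character per glyph, preferring the dominant script.

    candidates_per_glyph is a list of lists; each inner list holds the
    characters the (hypothetical) recognizer could not distinguish.
    """
    # Only glyphs with a single candidate get to vote on the script.
    votes = Counter()
    for candidates in candidates_per_glyph:
        if len(candidates) == 1:
            votes[script_of(candidates[0])] += 1
    dominant = votes.most_common(1)[0][0] if votes else None

    result = []
    for candidates in candidates_per_glyph:
        pick = candidates[0]  # fall back to the first candidate
        for ch in candidates:
            if script_of(ch) == dominant:
                pick = ch
                break
        result.append(pick)
    return "".join(result)

# The ambiguous "A" shape: Latin, Greek, or Cyrillic.
ambiguous_a = ["\u0041", "\u0391", "\u0410"]

# Next to an unambiguous Cyrillic letter, the Cyrillic reading wins.
cyrillic_word = resolve([["\u0414"], ambiguous_a])
# Next to an unambiguous Latin letter, the Latin reading wins.
latin_word = resolve([["D"], ambiguous_a])
```

Grouping by name rather than by block has one convenient side effect for
the IPA case: a letter such as U+0259 sits in the IPA Extensions block
but is named LATIN SMALL LETTER SCHWA, so it votes with Latin. A word
made up entirely of ambiguous shapes still gets no help from this, which
is where the spelling dictionary would have to come in.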
>We speculate the reason may be one or more of the following:
> The OCR developers may feel that, if they allow output to other code
> points, they also need to provide recognition templates for them.
> The OCR recognition software relies on spell checkers to improve output
> accuracy and apparently most spell check dictionaries do not allow
> non-ANSI characters (this is true for the Office 2003 spell checker).
> There is not enough commercial motivation for providing this capability.
>
>
Certainly sounds plausible. And there's a lot of room between "only
recognizing ANSI characters" and "full Unicode" output; being able to
catch most of Latin and also, say, Cyrillic and Greek is easier than
trying to tease out Arabic (ugh, the bidi issues involved in OCR...).
~mark