Re: Unicode enabled OCR software

From: Mark E. Shoulson (mark@kli.org)
Date: Tue Jan 31 2006 - 14:42:32 CST


    Kent_Spielmann@sil.org wrote:

    >Does anyone know of an OCR software solution that permits mapping to the full
    >Unicode character set as output from the character recognition process?
    >This needs to include mapping to base character+combining character
    >combinations.
    >
    >
    Ow. OCR to *full* Unicode sounds like it would have a lot of potential
    problems. Within any given alphabet, you can usually count on letters to
    look somewhat different from each other. But so many Unicode characters
    resemble characters in other scripts that it's unclear how the program
    is to know which one to output. An "A" might be Latin, Greek, or
    Cyrillic, and all three would look identical (not merely "similar").
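
    To make the "A" problem concrete, here is a small Python sketch using
    only the standard unicodedata module. The helper name describe() is
    made up for illustration; the three code points are the Latin, Greek,
    and Cyrillic capitals mentioned above.

```python
import unicodedata

# Three capital letters that render identically in most fonts,
# yet are distinct code points in three different scripts.
lookalikes = ["\u0041", "\u0391", "\u0410"]


def describe(ch):
    """Return (code point label, Unicode character name) for one character."""
    return f"U+{ord(ch):04X}", unicodedata.name(ch)


for ch in lookalikes:
    code_point, name = describe(ch)
    print(ch, code_point, name)
```

    An OCR engine working from pixels alone sees one shape here, so
    something other than the glyph has to pick the code point.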

    Spelling dictionaries will help, as will heuristics like "the letters
    are probably all in the same alphabet" (though alphabets might have to
    be defined across blocks, as with IPA).
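
    A rough sketch of that same-alphabet heuristic, in Python. Everything
    here is an assumption for illustration, not how any real OCR package
    works: the recognizer is imagined to hand back, for each glyph, a
    string of candidate characters that all match the shape, and the
    "alphabet" of a character is approximated by the first word of its
    Unicode name (the standard library has no script property). That proxy
    happens to span blocks, so an IPA letter such as U+0259 still counts
    as Latin. The function names are hypothetical.

```python
import unicodedata
from collections import Counter


def script_of(ch):
    """Crude script guess: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def pick_script(candidates):
    """Return the script that is a possible reading for the most glyphs.

    candidates is a non-empty list with one non-empty string per glyph,
    each string holding every character the shape could be.
    """
    votes = Counter()
    for options in candidates:
        # Each glyph votes once per script, however many candidates it has.
        for script in {script_of(ch) for ch in options}:
            votes[script] += 1
    return votes.most_common(1)[0][0]


def resolve(candidates):
    """Choose one character per glyph, preferring the winning script."""
    script = pick_script(candidates)
    chosen = []
    for options in candidates:
        matches = [ch for ch in options if script_of(ch) == script]
        # Fall back to the first candidate if the glyph has no reading
        # in the winning script.
        chosen.append(matches[0] if matches else options[0])
    return "".join(chosen)


# A four-glyph word: the first shape is unambiguous (Cyrillic DE), the
# other three could each be Latin, Greek, or Cyrillic.
glyphs = [
    "\u0414",              # only Cyrillic has this shape
    "\u0041\u0391\u0410",  # A-shaped: Latin, Greek, Cyrillic
    "\u0054\u03A4\u0422",  # T-shaped: Latin, Greek, Cyrillic
    "\u0041\u0391\u0410",  # A-shaped again
]
word = resolve(glyphs)
print([f"U+{ord(ch):04X}" for ch in word])
```

    One unambiguous letter is enough to pull the whole word into
    Cyrillic. A real system would want per-word or per-line voting and a
    dictionary check on top, since mixed-script text does exist.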

    >We speculate the reason may be one or more of the following:
    > The OCR developers may feel that, if they allow output to other code
    > points, they also need to provide recognition templates for them.
    > The OCR recognition software relies on spell checkers to improve output
    > accuracy and apparently most spell check dictionaries do not allow
    > non-ANSI characters (this is true for the Office 2003 spell checker).
    > There is not enough commercial motivation for providing this capability.
    >
    >
    Certainly sounds plausible. And there's a lot of room between "only
    recognizing ANSI characters" and "full Unicode" output; being able to
    catch most of Latin and also, say, Cyrillic and Greek is easier than
    trying to tease out Arabic (ugh, the bidi issues involved in OCR...).

    ~mark


