From: Kent_Spielmann@sil.org
Date: Tue Jan 31 2006 - 12:43:56 CST
Does anyone know of an OCR software solution that permits mapping to the
full Unicode character set as output from the character recognition process?
This needs to include mapping to base character+combining character
combinations.
All of the software we have looked at (FineReader, OmniPage, and TextBridge)
can map only to those Unicode characters that are also defined in a subset
of the ANSI code pages.
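
To make the limitation concrete, here is a minimal sketch in Python that
lists the characters in a string that an ANSI code page cannot represent.
The function name is hypothetical, and the use of Windows-1252 (cp1252) as
the code page is only an assumption for illustration; we do not know exactly
which code pages each product draws on.

```python
def outside_ansi(text, codepage="cp1252"):
    """Return the distinct characters in text that codepage cannot encode."""
    missing = []
    for ch in text:
        try:
            # Encoding a single character fails if the code page lacks it.
            ch.encode(codepage)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing
```

Run over a sample of the target text, this gives a rough list of the
characters, including combining marks, that would be lost.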
We are trying to convert documents in minority languages, as well as
linguistic documentation, and need access to a larger set of lesser-used
characters.
We find the situation curious, since the reader that we are using (ABBYY
FineReader) does output Unicode. It simply limits the selection of output
codepoints to characters previously defined in ANSI. Allowing users to
create custom mappings to "non-ANSI" Unicode codepoints would not seem to
be difficult.
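
As an illustration of the kind of custom mapping meant here, the following
is a minimal sketch in Python of a post-processing step applied to OCR
output. The table, the function name, and the choice of placeholder
characters are all hypothetical; this is not a feature of any of the
products named above.

```python
import unicodedata

# Hypothetical table: ANSI characters the OCR engine is trained to emit as
# stand-ins, mapped to the Unicode sequences actually wanted.
CUSTOM_MAP = {
    "\u00a4": "\u014b",        # CURRENCY SIGN -> LATIN SMALL LETTER ENG
    "\u00a5": "\u0254\u0301",  # YEN SIGN -> OPEN O + COMBINING ACUTE ACCENT
    "\u00a7": "\u025b\u0303",  # SECTION SIGN -> OPEN E + COMBINING TILDE
}

def remap(text, table=CUSTOM_MAP):
    """Replace each placeholder with its target sequence, then normalize to NFC."""
    mapped = "".join(table.get(ch, ch) for ch in text)
    # NFC composes where a precomposed character exists and otherwise leaves
    # the base character + combining character sequence in place.
    return unicodedata.normalize("NFC", mapped)
```

A workaround of this sort only helps when the stand-in characters never
occur in the source text itself, which is why a mapping facility inside the
OCR product would be preferable.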
We speculate the reason may be one or more of the following:
- The OCR developers may feel that, if they allow output to other code
  points, they also need to provide recognition templates for them.
- The OCR recognition software relies on spell checkers to improve output
  accuracy, and apparently most spell check dictionaries do not allow
  non-ANSI characters (this is true for the Office 2003 spell checker).
- There is not enough commercial motivation for providing this capability.
Kent Spielmann
International Linguistics Department
7500 W. Camp Wisdom Road,
Dallas, TX 75236 USA
Tel: + 1 972 708 7570