From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun May 23 2004 - 09:16:50 CDT
From: "Towheed Chowdhury" <nsumba@hotmail.com>
> How bangla ocr can be developed using current unicode?
ISO/IEC 10646 and Unicode are just standards for character encoding, not for
character rendering or presentation.
OCR is a difficult problem, but it has nothing in common with character
encoding, as it is an analysis of glyphs.
Generally, good OCR is difficult to automate without specific fonts with
simplified or slightly altered (but still readable) glyphs.
This is not a problem of Unicode.
What Unicode has done is only to add some characters that were used in the OCR
context, such as the symbols on checks that were created and printed specially
for OCR systems but had no prior meaning in linguistic, plain-text use. In
Unicode these special glyphs are coded as distinct symbols with their own
code points.
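To see those symbols for yourself, here is a minimal sketch that lists the
assigned characters in the range U+2440..U+245F, which I take to be the
"Optical Character Recognition" block. The function name ocr_block_names is
just something made up for this illustration, and the names printed depend on
the Unicode version shipped with your Python.

```python
import unicodedata


def ocr_block_names(start=0x2440, end=0x245F):
    """Return {code point: character name} for assigned characters in the range.

    The default range is assumed to be the Optical Character Recognition block.
    """
    names = {}
    for cp in range(start, end + 1):
        # unicodedata.name() returns the default (None) for unassigned code points.
        name = unicodedata.name(chr(cp), None)
        if name is not None:
            names[cp] = name
    return names


if __name__ == "__main__":
    for cp, name in ocr_block_names().items():
        print(f"U+{cp:04X}  {name}")
```

These are ordinary encoded symbols: having code points for them says nothing
about how a scanner would recognize them on paper.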
OCR already has difficulty recognizing accents on modern Latin, Greek or
Cyrillic letters, and it does not work well with other scripts (it works with
unpointed Hebrew, but fails with Arabic because of the complex joining behavior
and the very small differences between glyphs in the most widely used
typographic variants of the Arabic script).
I don't know if there has been any attempt to recognize Devanagari in India.
Hiragana and Katakana may work well in OCR, but generally Japanese texts contain
lots of Han ideographs that are very difficult to recognize with OCR due to
their graphic complexity.
Maybe there is OCR that works with basic Hangul jamos (written linearly, instead
of in syllabic squares).
In all these cases, the target encoding when parsing a scanned image of a text
is not the issue, as the difficulty is in recognizing abstract characters from
many distinct glyph shapes that will always exhibit slight variations when
scanned from printed paper.
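To make the point concrete for Bangla, here is a rough sketch of the encoding
step that comes after recognition. It assumes a hypothetical recognizer that
has already classified each glyph and emits labels such as "ka" or
"ka_ssa_conjunct"; those labels and the GLYPH_TO_TEXT table are invented for
this example and do not come from any real OCR system.

```python
import unicodedata

# Hypothetical glyph labels (as an imagined recognizer might emit them),
# mapped to the Unicode character sequences they stand for.
GLYPH_TO_TEXT = {
    "ka": "\u0995",                          # single letter KA
    "ka_ssa_conjunct": "\u0995\u09CD\u09B7",  # one joined glyph, three code points
}


def encode(labels):
    """Turn a list of recognized glyph labels into a Unicode string."""
    return "".join(GLYPH_TO_TEXT[label] for label in labels)


def describe(text):
    """List the Unicode character name of each code point in the text."""
    return [unicodedata.name(ch) for ch in text]


if __name__ == "__main__":
    text = encode(["ka", "ka_ssa_conjunct"])
    for name in describe(text):
        print(name)
```

Once the glyph is known, producing the encoded text is a table lookup. The hard
part is deciding which glyph is on the page in the first place.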
So you may want to find out whether there is existing work in India on reading
printed Devanagari texts with OCR (Devanagari is difficult to parse too, like
Arabic, because glyphs are most often joined, which makes it difficult to
separate letters or letter parts).