From: Kent_Spielmann@sil.org
Date: Tue Jan 31 2006 - 15:49:24 CST
"Mark E. Shoulson" <mark@kli.org> wrote on 01/31/2006 02:42:32 PM:
> Ow. OCR to *full* Unicode sounds like it would have a lot of potential
> problems. Within any given alphabet, you can usually count on letters to
> look somewhat different from each other, but so many Unicode characters
> resemble other ones, how is the program to know which to output? An "A"
> might be Latin, Greek, or Cyrillic, and they'd all look identical (not
> even "similar").
I'm not sure I defined the issue well enough. The point is that we want to
define a custom alphabet containing a limited subset of Unicode characters,
some of which are not in any of the ANSI codepages. Although our software
lets us define a custom alphabet, it accepts only characters that exist in
an ANSI codepage.
Case in point:
We are scanning Mixtec data that uses the following letters, all of which
are part of the official Mixtec alphabet:
Letter  | Code point(s)      | Unicode name
--------+--------------------+------------------------------------------
ɨ       | 0268               | Latin Small Letter I With Stroke
--------+--------------------+------------------------------------------
ɨ̀       | 0268+0300          | Latin Small Letter I With Stroke +
        |                    | Combining Grave Accent
--------+--------------------+------------------------------------------
ɨ́       | 0268+0301          | Latin Small Letter I With Stroke +
        |                    | Combining Acute Accent
--------+--------------------+------------------------------------------
č       | 010D               | Latin Small Letter C With Caron
--------+--------------------+------------------------------------------
Č       | 010C               | Latin Capital Letter C With Caron
--------+--------------------+------------------------------------------
ž       | 017E               | Latin Small Letter Z With Caron
--------+--------------------+------------------------------------------
Ž       | 017D               | Latin Capital Letter Z With Caron
--------+--------------------+------------------------------------------
ʔ       | 0294               | Latin Letter Glottal Stop
--------+--------------------+------------------------------------------
ⁿ       | 207F               | Superscript Latin Small Letter N
--------+--------------------+------------------------------------------
ʷ       | 02B7               | Modifier Letter Small W
None of the above code points can be output by our reader engine (nor by
any other engine we know of).
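For anyone who wants to check which of these letters fall inside an ANSI
codepage at all, here is a minimal sketch in Python. It assumes that
Python's built-in cp874 and cp1250 through cp1258 codecs are a fair
stand-in for the single-byte Windows "ANSI" codepages. The names
ANSI_CODEPAGES, MIXTEC_LETTERS, codepages_for and report are made up for
this illustration, and the result says nothing about what a particular OCR
engine will actually emit.

```python
# Single-byte Windows "ANSI" codepages available as Python codecs.
# Assumption: these approximate the codepages the OCR software accepts.
ANSI_CODEPAGES = ["cp874"] + [f"cp{n}" for n in range(1250, 1259)]

# Letters from the table above, written as escapes so the source stays ASCII.
MIXTEC_LETTERS = [
    "\u0268",        # i with stroke
    "\u0268\u0300",  # i with stroke + combining grave
    "\u0268\u0301",  # i with stroke + combining acute
    "\u010d",        # c with caron (small)
    "\u010c",        # c with caron (capital)
    "\u017e",        # z with caron (small)
    "\u017d",        # z with caron (capital)
    "\u0294",        # glottal stop
    "\u207f",        # superscript n
    "\u02b7",        # modifier letter small w
]


def codepages_for(text):
    """Return the codepages that can encode every code point in text."""
    found = []
    for codepage in ANSI_CODEPAGES:
        try:
            text.encode(codepage)
        except UnicodeEncodeError:
            continue  # at least one code point is missing from this codepage
        found.append(codepage)
    return found


def report():
    """Print one line per letter: its code points and covering codepages."""
    for letter in MIXTEC_LETTERS:
        points = "+".join(f"{ord(ch):04X}" for ch in letter)
        pages = ", ".join(codepages_for(letter)) or "none"
        print(f"{points:<10} {pages}")


if __name__ == "__main__":
    report()
```

Under these assumptions the caron letters do turn up in at least one
codepage (cp1250, for instance), while 0268, 0294, 207F and 02B7 turn up in
none, so a codepage-bound alphabet cannot hold the full Mixtec set in any
case.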
Kent