From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Dec 02 2003 - 12:18:35 EST
Peter Jacobi wrote:
> Markus Scherer <markus.scherer@jtcsv.com> wrote:
>
>>ICU 2.8 has the ability to handle m:n character conversion mappings driven
>>by simple lines in Unicode conversion tables (text files).
>
> That's a nice coincidence, to have this feature. I was wondering
> if this would enable transcoding from legacy Tamil charsets (in visual
> glyph order, like Thai) to Unicode.
Possible, but this is "just" m:n character conversion. This feature does not add arbitrary text
reordering. If you can achieve what you need with a set of m:n mappings, then you can use it by itself.
Otherwise you would have to do line/paragraph chunking and use, for example, the ICU Transliterator
classes for arbitrary Unicode-to-Unicode transforms after converting to or before converting out of
Unicode.
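
To illustrate the kind of post-conversion reordering meant here, the following is a minimal Python sketch and not ICU code. It assumes the legacy text has already been converted to Unicode but is still in visual order, so that the Tamil vowel signs drawn to the left of a consonant (U+0BC6, U+0BC7, U+0BC8) precede it in the string. The function name and the regular expression are hypothetical, and two-part vowels and the keLa/kau ambiguity are not handled.

```python
import re

# Tamil vowel signs that are displayed to the left of their consonant:
# U+0BC6 (E), U+0BC7 (EE), U+0BC8 (AI).
PREBASE_VOWEL_SIGNS = "\u0BC6-\u0BC8"
# Range covering the Tamil consonants, U+0B95 (KA) through U+0BB9 (HA).
CONSONANTS = "\u0B95-\u0BB9"

# A left-side vowel sign immediately followed by a consonant (visual order).
_VISUAL_PAIR = re.compile(f"([{PREBASE_VOWEL_SIGNS}])([{CONSONANTS}])")


def visual_to_logical(text: str) -> str:
    """Move each left-side vowel sign behind the consonant that follows it."""
    return _VISUAL_PAIR.sub(r"\2\1", text)
```

This is only safe on text known to be in visual order; applied to text that is already in logical order, it would wrongly swap a vowel sign with the next syllable's consonant.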
> I've looked at the example data files for the m:n mappings but
> it's still opaque to me, what has to go in the headers. Is there a
> point to start reading from to gain further insights?
There will be by the time ICU 2.8 is released, and it will be in the User Guide. Sorry for not
having written that yet.
However, there is actually nothing you need to do in the header. The makeconv tool will detect that
a mapping has multiple code points and/or multiple complete codepage character byte sequences and
automatically put such mappings into an appropriate data structure. This is possible because it
knows the structure of the codepage from the header information that is required anyway. (The
structure of Unicode is known anyway, and trivial in .ucm files where code points are listed.)
> I'm especially wondering, whether the converter by default will
> take the longest matching entry in an m:n table or whether
> the sequence of entries is significant. (Something must be done
> to e.g. disambiguate keLa from kau).
The sequence of entries is not significant. makeconv will sort the mappings internally for
processing before the binary table is written.
The converter must and will use the longest match - otherwise it would not be able to handle Ka vs.
Ka+semi-voiced-mark in the Japanese table.
For more contrived examples, see the test files test3.ucm and test4.ucm in icu/source/test/testdata/.
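
As a rough illustration of longest-match lookup (a Python sketch, not how ICU implements it), consider a from-Unicode table in which Ka maps on its own and Ka followed by the combining semi-voiced sound mark (U+309A) maps as a unit. The byte values are hypothetical placeholders, not taken from any real codepage, and the table and function names are made up for this example.

```python
# Hypothetical from-Unicode table: keys are code point sequences, values are
# placeholder byte strings (not taken from any real codepage).
TABLE = {
    "\u304B": b"\x01",        # HIRAGANA LETTER KA alone
    "\u304B\u309A": b"\x02",  # KA + COMBINING SEMI-VOICED SOUND MARK
    "\u309A": b"\x03",        # the combining mark on its own
}
MAX_KEY_LEN = max(len(key) for key in TABLE)


def convert_longest_match(text: str) -> bytes:
    """Convert text by always taking the longest table entry that matches."""
    out = bytearray()
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(MAX_KEY_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in TABLE:
                out += TABLE[chunk]
                i += n
                break
        else:
            raise ValueError(f"unmappable code point U+{ord(text[i]):04X}")
    return bytes(out)
```

With a shortest-match strategy, Ka followed by the mark would come out as two separate characters (b"\x01\x03" here) instead of the single combined one (b"\x02"). The order in which entries appear in the table has no effect on the result, only their length does.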
Best regards,
markus