On Wed, 20 Mar 2013 20:49:32 -0600
Karl Williamson <public_at_khwilliamson.com> wrote:
> Now back to processing general text. Doing any serious analysis of
> text will require using regular expressions. That means normalizing
> the input, as UTS 18 finally now says.
I think that change may be associated with the fact that what were
intended to be Unicode regular expressions are not in general regular
expressions: the set of strings canonically equivalent to a match of
(ab)* cannot be recognised by a finite state machine if a and b are
indecomposable and have distinct non-zero canonical combining classes!
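To illustrate the point, here is a minimal sketch using Python's
standard unicodedata module. The choice of U+0323 (class 220) and
U+0301 (class 230) for a and b is an assumption made purely for
illustration; any indecomposable pair with distinct non-zero canonical
combining classes should behave the same way.

```python
import unicodedata

# Illustrative choice of a and b (an assumption, not from the original post).
a = "\u0323"  # COMBINING DOT BELOW, canonical combining class 220
b = "\u0301"  # COMBINING ACUTE ACCENT, canonical combining class 230

def canon(s):
    """Return the NFD form, which applies canonical reordering."""
    return unicodedata.normalize("NFD", s)

n = 3
interleaved = (a + b) * n   # a match of (ab)*
blocked = a * n + b * n     # a^n b^n, not a literal match of (ab)*

# Canonical reordering sorts the marks by combining class, so both
# strings share one normal form and are canonically equivalent.
same = canon(interleaved) == canon(blocked)

# A string with unequal counts of a and b is not equivalent.
unbalanced_differs = canon(a * n + b * (n + 1)) != canon(interleaved)
```

Since a^n b^n is equivalent to (ab)^n for every n, a recogniser has to
check that the two counts agree, which a finite state machine cannot do
for unbounded n.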
> Whatever normalization you
> choose, singleton decompositions are taken.
> That means that ANO TELEIA becomes a MIDDLE DOT, and GREEK QUESTION
> MARK (U+037E) becomes a SEMICOLON (U+003B), among other things. This
> really presents a rather untenable situation for a program. You have
> to normalize, but if you do, you lose critical information.
For linguistic analysis, you need the normalisation appropriate to the
task. This is a case where Unicode normalisation generally throws away
information (namely, how the author views the characters), whereas in
analysing Burmese you may want to ignore the order of non-interacting
medial signs even though they have canonical combining class 0. I have
found it useful to use a fake UnicodeData.txt to perform a non-Unicode
normalisation using what were intended to be routines for performing
Unicode normalisation. Fake decompositions are routinely added to the
standard ones when generating the default collation weights for the
Unicode Collation Algorithm - but there the results still comply with
the principle of canonical equivalence.
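As a rough sketch of the idea (not the routines actually used), one can
run the canonical reordering step with an override table of combining
classes. The function name reorder, the table FAKE_CCC and the weights
in it are all hypothetical; the two Myanmar medial signs are used only
as sample data, with no claim about which medials actually interact.

```python
import unicodedata

def reorder(text, fake_ccc):
    """Canonical-style reordering with overridden combining classes.

    Only the reordering step is sketched here; no decomposition or
    composition is performed.
    """
    def ccc(ch):
        # Fall back to the real class when there is no override.
        return fake_ccc.get(ch, unicodedata.combining(ch))

    out, run = [], []
    for ch in text:
        if ccc(ch) == 0:
            # A class-0 character ends the current run of marks.
            out.extend(sorted(run, key=ccc))  # sorted() is stable
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=ccc))
    return "".join(out)

# Hypothetical overrides: both characters have class 0 in the real data.
FAKE_CCC = {
    "\u103D": 1,  # MYANMAR CONSONANT SIGN MEDIAL WA
    "\u103E": 2,  # MYANMAR CONSONANT SIGN MEDIAL HA
}

ka = "\u1000"  # MYANMAR LETTER KA, class 0
one = reorder(ka + "\u103E\u103D", FAKE_CCC)
two = reorder(ka + "\u103D\u103E", FAKE_CCC)
```

With these made-up weights, both orders of the two signs reduce to the
same string, so a comparison after reordering ignores their order.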
However, distinguishing U+00B7 and U+0387 would fail spectacularly
if the text had been converted to form NFC before you received it.
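A short check of this, assuming Python's standard unicodedata module
(the variable names are only for illustration):

```python
import unicodedata

ANO_TELEIA = "\u0387"      # GREEK ANO TELEIA
MIDDLE_DOT = "\u00B7"      # MIDDLE DOT
GREEK_QUESTION = "\u037E"  # GREEK QUESTION MARK
SEMICOLON = "\u003B"       # SEMICOLON

# Both characters have singleton canonical decompositions, so every
# normalization form replaces them.
folded = {
    form: (
        unicodedata.normalize(form, ANO_TELEIA),
        unicodedata.normalize(form, GREEK_QUESTION),
    )
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, (dot, mark) in folded.items():
    print(form, dot == MIDDLE_DOT, mark == SEMICOLON)
```

Once the text has been through any of these forms, nothing in the
string records which of the two characters the author typed.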
> Further, the code chart glyphs for the ANO TELEIA and the MIDDLE DOT
> differ, see attachment. If they are canonically equivalent, and one
> is a mandatory decomposition of the other, why do they have differing
> glyphs?
Because the codepoints are usually associated with different fonts?
For a more striking example, compare the code chart glyphs for U+2F831,
U+2F832 and U+2F833, which are all canonically equivalent to U+537F.
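The equivalence itself can be checked with a small sketch, again
assuming Python's standard unicodedata module:

```python
import unicodedata

# The three CJK compatibility ideographs named above.
compat = ["\U0002F831", "\U0002F832", "\U0002F833"]
unified = "\u537F"

# Each has a singleton canonical decomposition, so NFC maps all three
# to the same unified ideograph.
results = [unicodedata.normalize("NFC", ch) for ch in compat]
all_same = all(r == unified for r in results)
```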
Richard.
Received on Thu Mar 21 2013 - 17:53:46 CDT