eric.muller at efele.net
Fri Sep 16 10:47:27 CDT 2016
On 9/16/2016 8:30 AM, Janusz S. Bien wrote:
> Quote - Eric Muller <eric.muller at efele.net> (Fri, 16 Sep 2016,
>> On 9/16/2016 6:52 AM, Janusz S. Bień wrote:
>>> (when working on a corpus of historical Polish we
>>> noticed some cases where standard Unicode equivalence was not
>> I'm very interested to know more about those cases.
> For our search engine we were unable to use compatibility equivalence
> "out of the box" for splitting the ligature because it also converted
> long s to short s while we wanted to preserve the distinction.
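
To make the long s issue from the quoted message concrete, here is a minimal sketch, assuming Python 3 and its standard unicodedata module. The ligature table and the split_ligatures helper are hypothetical names invented for illustration; they are not taken from the search engine under discussion, and the table covers only the Latin ligatures at U+FB00..U+FB06.

```python
import unicodedata

# Hypothetical targeted table: split Latin ligatures only and leave
# LATIN SMALL LETTER LONG S (U+017F) untouched.
LIGATURES = {
    "\uFB00": "ff",
    "\uFB01": "fi",
    "\uFB02": "fl",
    "\uFB03": "ffi",
    "\uFB04": "ffl",
    "\uFB05": "\u017Ft",  # long s + t ligature keeps its long s
    "\uFB06": "st",
}
LIGATURE_TABLE = str.maketrans(LIGATURES)


def split_ligatures(text):
    """Split the ligatures listed above without any other folding."""
    return text.translate(LIGATURE_TABLE)


sample = "\uFB01\u017Fh"  # fi ligature, long s, h

# Compatibility normalization splits the ligature but also folds long s.
nfkc = unicodedata.normalize("NFKC", sample)

# The targeted mapping splits the ligature and preserves long s.
targeted = split_ligatures(sample)

print(ascii(nfkc))      # 'fish'
print(ascii(targeted))  # 'fi\u017fh'
```

Canonical normalization (NFC or NFD) leaves both the ligature and the long s unchanged, since each has only a compatibility decomposition.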
I am interested in the problems with *canonical* equivalence. I thought
that you were talking about those before.
Compatibility equivalence is a completely different beast. It is, IMHO,
too coarse a tool and best forgotten. For any particular task, it's
typically doing too much (e.g. long/short s folding in your case) and
too little (not everything you need). There was an attempt at improving
the situation, by providing a whole bunch of fine-grained, targeted
transformations (http://www.unicode.org/reports/tr30/), but that did not