Re: Character folding in text editors from Asmus Freytag (t) on 2016-02-21 (Unicode Mail List Archive)

From: Asmus Freytag (t) <asmus-inc_at_ix.netcom.com>
Date: Sun, 21 Feb 2016 10:32:15 -0800

On 2/21/2016 8:22 AM, Eli Zaretskii wrote:

From: "Asmus Freytag (t)" <asmus-inc@ix.netcom.com>
Date: Sat, 20 Feb 2016 14:10:04 -0800

What about language-independent character-folding: where in the
Unicode database is the data for that?

Unicode, even CLDR, doesn't nearly have enough data for the purpose.

This seems to contradict what others said: they said CLDR includes the
necessary data.  What is missing from CLDR, and how badly will the
omissions affect searching?

Depends what you are searching for.

(and as a corollary of what Elias points out, it's likely to annoy users of every language, in that it would fold essential and non-essential distinctions indiscriminately).

Users can easily turn the folding off if they don't like it or if it
gets in the way.

Depends. If a language has a set of important distinctions but text (for users working in that language) also contains noncritical distinctions, the inability to ignore just the latter would be annoying.

There are scenarios where the approximation may not matter.

Also, the sorting order for some languages is radically distinct from the "generic" one. So a language-independent folding based on generic sorting order isn't going to be ideal.
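
To make the distinction between essential and noncritical differences concrete, here is a minimal Python sketch of a language-sensitive fold. The `PROTECTED` table and the `fold` function are hypothetical and written by hand for illustration; they are not taken from Unicode or CLDR data. The only assumption about real languages is the well-known one that Swedish treats å, ä and ö as letters in their own right.

```python
import unicodedata

# Hypothetical, hand-written table (NOT from CLDR): letters whose
# diacritics are essential in a given language and must not be folded.
PROTECTED = {
    "sv": set("åäöÅÄÖ"),  # Swedish: å, ä, ö are distinct letters
    "en": set(),          # English: no protected accented letters
}

def fold(text: str, lang: str) -> str:
    """Strip combining marks, except on letters protected for `lang`."""
    keep = PROTECTED.get(lang, set())
    out = []
    # Compose first so protected letters appear as single code points.
    for ch in unicodedata.normalize("NFC", text):
        if ch in keep:
            out.append(ch)
        else:
            decomposed = unicodedata.normalize("NFD", ch)
            # Category "Mn" covers nonspacing combining marks.
            out.append("".join(c for c in decomposed
                               if unicodedata.category(c) != "Mn"))
    return "".join(out)

print(fold("älg café", "sv"))  # älg cafe
print(fold("älg café", "en"))  # alg cafe
```

With the Swedish setting the accent on "café" is folded while "älg" keeps its ä; the generic setting folds both.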


The important question is: will Emacs with this feature be more or
less useful than without it?  Another important question is whether
character folding in searches should be turned on or off by default.
IOW, should we expect more users wanting to turn it off than on?

For languages like English, folding accents by default works really well, unless someone tries to find foreign words in English text... but that would be taken care of by making the default overridable.

However, for other languages, it gives very strange (annoying) results for at least *some* words, even though it might be useful in some cases. Users might want to disable that default (or invert it) permanently.
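
As a rough illustration of an overridable default, the following Python sketch folds accents unless the caller turns folding off. The names `strip_marks`, `matching_lines` and `fold_accents` are made up for this example and do not correspond to any Emacs setting.

```python
import unicodedata

def strip_marks(s: str) -> str:
    """Decompose canonically, then drop nonspacing combining marks."""
    return "".join(c for c in unicodedata.normalize("NFD", s)
                   if unicodedata.category(c) != "Mn")

def matching_lines(lines, query, fold_accents=True):
    """Return the lines containing `query`; accent folding is on by default."""
    if fold_accents:
        q = strip_marks(query)
        return [ln for ln in lines if q in strip_marks(ln)]
    return [ln for ln in lines if query in ln]

lines = ["Send your résumé today", "We will resume at noon"]
print(matching_lines(lines, "resume"))                      # both lines
print(matching_lines(lines, "résumé", fold_accents=False))  # first line only
```

The second call is the "foreign word in English text" case: with folding switched off, only the accented spelling is found.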


AFAIU, the very least that should be provided is being able to find
decomposed characters when a composed one is searched for.  The data
for this, AFAIU, is in UnicodeData.txt in the form of the canonical
decompositions.  Is this correct?

That's generally useful, because these cases represent two encodings that are intentionally equivalent.
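
A small Python sketch of matching under canonical equivalence, using the standard `unicodedata` module, which exposes the decomposition field of UnicodeData.txt. The helper names `canonical_key` and `contains_canonical` are invented for this example, and the approach (normalize both sides to NFD, then compare) is one simple way to do it, not necessarily how any editor implements it.

```python
import unicodedata

def canonical_key(s: str) -> str:
    """Return the NFD form; canonically equivalent strings map to the same key."""
    return unicodedata.normalize("NFD", s)

def contains_canonical(text: str, query: str) -> bool:
    """Substring test that treats composed and decomposed forms as equal.

    Caveat: this compares code points, so a bare "e" in the query also
    matches the base letter of a decomposed "e" + U+0301 in the text.
    """
    return canonical_key(query) in canonical_key(text)

composed = "caf\u00e9"      # é as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# The canonical decomposition recorded in UnicodeData.txt for U+00E9:
print(unicodedata.decomposition("\u00e9"))        # 0065 0301
print(composed == decomposed)                     # False
print(contains_canonical(decomposed, composed))   # True
```

Only canonical equivalence is applied here; no diacritics are dropped, so "cafe" and "café" still differ as whole words.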

none has seen folding of diacritics as useful

Really?  So you are saying that, based on your experience, being able
to ignore diacritics in searches is not a useful feature?

No, just that there are areas of application where folding all diacritics isn't useful (remember, this was in the context of a specific use case).

A./

Received on Sun Feb 21 2016 - 12:33:25 CST
