> Date: Thu, 19 Feb 2015 13:08:57 -0800
> From: Markus Scherer <markus.icu_at_gmail.com>
> Cc: Philippe Verdy <verdy_p_at_wanadoo.fr>, Julian Bradfield <jcb+unicode_at_inf.ed.ac.uk>,
> Unicode Mailing List <unicode_at_unicode.org>
>
> Sorry, I disagree. First, collation data is overkill for search,
> since the order information is not required, so the weights are simply
> wasting storage. Second, people do want to find, e.g., "²" when they
> search for "2" etc.
>
> Depends on what you do.
The context is text search, where the user enters the search string
and specifies the strength of the required matches, and the editor
then searches a (potentially very large) buffer of text.
> "the weights are simply wasting storage" is not really
> true, you do have to encode something for which characters are same or
> different, and it turns out that that comes close to defining a sort order.
> Some people also want to ignore accents, others don't.
I think decomposition to NFKD solves these issues, doesn't it?
> As to your original question, Unicode collation would give you primary-equal
> "mem" and "sigma" characters.
> 05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM
> FB26; [63 1E, 05, 20] # Hebr Lo [1F81.0020.0005] * HEBREW LETTER WIDE FINAL MEM
> 05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM
> FB3E; [63 1E, 05, 05][, E5 B1, 05] # Hebr Lo [1F81.0020.0002][0000.005F.0002] * HEBREW LETTER MEM WITH DAGESH
>
> 03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA
> 03F2; [5F 42, 05, 10] # Grek Ll [1C95.0020.0004] * GREEK LUNATE SIGMA SYMBOL
> 1D6D3; [5F 42, 05, 17] # Zyyy Ll [1C95.0020.0005] * MATHEMATICAL BOLD SMALL FINAL SIGMA
> ...
> 03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL SIGMA
>
> You can certainly simplify a few things when you don't care about the order,
> therefore CLDR defines "search" tailorings. Some popular browsers use
> collation-based search for ctrl-F in-page search, either with strength=primary
> (ignore accent/case/etc. variants), or with asymmetric search. ICU implements
> those algorithms and carries the CLDR tailorings.
>
> See http://www.unicode.org/reports/tr10/#Searching
Thanks. I've studied that already, and I do know that collation data
can be used for search. But it's still a lot of data that I'd like to
avoid loading, if possible.
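
To illustrate how little of that data a primary-strength match actually needs, here is a rough sketch that keeps only the primary weights from the entries quoted above. The weights are copied from the quoted message and may differ in other versions of the collation data; the names QUOTED_ENTRIES, load_primaries and primary_key are hypothetical, and leaving unlisted characters as themselves is an assumption made only to keep the example short.

```python
import re

# Code point and bracketed collation elements, as quoted in the message above.
QUOTED_ENTRIES = """\
05DE; [1F81.0020.0002]
FB26; [1F81.0020.0005]
05DD; [1F81.0020.0019]
FB3E; [1F81.0020.0002][0000.005F.0002]
03C3; [1C95.0020.0002]
03F2; [1C95.0020.0004]
03C2; [1C95.0020.0019]
"""


def load_primaries(entries):
    """Map each character to a tuple of its non-zero primary weights."""
    table = {}
    for line in entries.splitlines():
        code, _, elements = line.partition(";")
        # The first field of each [pppp.ssss.tttt] element is the primary weight.
        weights = (int(p, 16) for p in re.findall(r"\[([0-9A-F]{4})\.", elements))
        # A zero primary means the element is ignorable at primary strength.
        table[chr(int(code, 16))] = tuple(w for w in weights if w != 0)
    return table


PRIMARIES = load_primaries(QUOTED_ENTRIES)


def primary_key(text, table=PRIMARIES):
    """Build a primary-strength comparison key for text."""
    key = []
    for ch in text:
        if ch in table:
            key.extend(table[ch])
        else:
            # Assumption for this sketch: unlisted characters stand for themselves.
            key.append(ch)
    return tuple(key)
```

Under this table, primary_key("\u05DD") equals primary_key("\u05DE"), and U+FB3E (mem with dagesh) reduces to the same key because its second element has a zero primary. The secondary and tertiary weights, which carry the accent and variant distinctions, are simply not stored.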
Received on Fri Feb 20 2015 - 01:53:19 CST