On Thu, Feb 19, 2015 at 12:17 PM, Eli Zaretskii <eliz_at_gnu.org> wrote:
> Sorry, I disagree. First, collation data is overkill for search,
> since the order information is not required, so the weights are simply
> wasting storage. Second, people do want to find, e.g., "²" when they
> search for "2" etc.
>
That depends on what you do. It is not really true that "the weights are
simply wasting storage": you do have to encode something that says which
characters are the same or different, and it turns out that this comes
close to defining a sort order. Some people also want to ignore accents;
others don't.
As to your original question, Unicode collation would give you
primary-equal "mem" and "sigma" characters.
05DE; [63 1E, 05, 05] # Hebr Lo [1F81.0020.0002] * HEBREW LETTER MEM
FB26; [63 1E, 05, 20] # Hebr Lo [1F81.0020.0005] * HEBREW LETTER WIDE FINAL MEM
05DD; [63 1E, 05, 2E] # Hebr Lo [1F81.0020.0019] * HEBREW LETTER FINAL MEM
FB3E; [63 1E, 05, 05][, E5 B1, 05] # Hebr Lo [1F81.0020.0002][0000.005F.0002] * HEBREW LETTER MEM WITH DAGESH
03C3; [5F 42, 05, 05] # Grek Ll [1C95.0020.0002] * GREEK SMALL LETTER SIGMA
03F2; [5F 42, 05, 10] # Grek Ll [1C95.0020.0004] * GREEK LUNATE SIGMA SYMBOL
1D6D3; [5F 42, 05, 17] # Zyyy Ll [1C95.0020.0005] * MATHEMATICAL BOLD SMALL FINAL SIGMA
...
03C2; [5F 42, 05, 33] # Grek Ll [1C95.0020.0019] * GREEK SMALL LETTER FINAL SIGMA
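
To make "primary-equal" concrete, here is a minimal Python sketch built only
from the bracketed [primary.secondary.tertiary] weights in the listing above.
The names ELEMENTS, sort_key and equal_at are made up for this example, and
the key construction is a simplification of UCA (no variable weighting, no
level separators); it is not how ICU does it internally.

```python
# Collation elements copied from the listing above, as
# (primary, secondary, tertiary) tuples. One character may map to several.
ELEMENTS = {
    "\u05DE": [(0x1F81, 0x0020, 0x0002)],      # HEBREW LETTER MEM
    "\uFB26": [(0x1F81, 0x0020, 0x0005)],      # HEBREW LETTER WIDE FINAL MEM
    "\u05DD": [(0x1F81, 0x0020, 0x0019)],      # HEBREW LETTER FINAL MEM
    "\uFB3E": [(0x1F81, 0x0020, 0x0002),
               (0x0000, 0x005F, 0x0002)],      # HEBREW LETTER MEM WITH DAGESH
    "\u03C3": [(0x1C95, 0x0020, 0x0002)],      # GREEK SMALL LETTER SIGMA
    "\u03F2": [(0x1C95, 0x0020, 0x0004)],      # GREEK LUNATE SIGMA SYMBOL
    "\U0001D6D3": [(0x1C95, 0x0020, 0x0005)],  # MATHEMATICAL BOLD SMALL FINAL SIGMA
    "\u03C2": [(0x1C95, 0x0020, 0x0019)],      # GREEK SMALL LETTER FINAL SIGMA
}


def sort_key(text, strength):
    """Simplified key: for each level up to `strength` (1=primary,
    2=secondary, 3=tertiary), collect the nonzero weights in order."""
    levels = []
    for level in range(strength):
        weights = [ce[level]
                   for ch in text
                   for ce in ELEMENTS[ch]
                   if ce[level] != 0]  # zero weights are ignorable at this level
        levels.append(tuple(weights))
    return tuple(levels)


def equal_at(a, b, strength):
    """True if a and b compare equal when only `strength` levels are used."""
    return sort_key(a, strength) == sort_key(b, strength)


mem, final_mem, mem_dagesh = "\u05DE", "\u05DD", "\uFB3E"
sigma, final_sigma = "\u03C3", "\u03C2"

print(equal_at(mem, final_mem, 1))      # same primary weight 1F81
print(equal_at(mem, final_mem, 3))      # tertiary weights 0002 vs 0019 differ
print(equal_at(mem, mem_dagesh, 1))     # the dagesh has primary weight 0000
print(equal_at(mem, mem_dagesh, 2))     # the dagesh adds a secondary weight 005F
print(equal_at(sigma, final_sigma, 1))  # same primary weight 1C95
```

With strength 1 the final and presentation forms collapse onto the base
letter; raising the strength brings the distinctions back one level at a time.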
You can certainly simplify a few things when you don't care about the
order, which is why CLDR defines "search" tailorings. Some popular browsers
use collation-based search for ctrl-F in-page search, either with
strength=primary (ignore accent/case/etc. variants), or with asymmetric
search. ICU implements those algorithms and carries the CLDR tailorings.
See http://www.unicode.org/reports/tr10/#Searching
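
As a rough sketch of what a strength=primary search does, the following
Python compares runs of primary weights instead of code points. The PRIMARY
table holds only the sigma and mem values quoted earlier in this message;
falling back to the code point for every other character is an assumption
made purely to keep the example small (real data would come from DUCET or a
CLDR tailoring). The function name find_primary is hypothetical. Asymmetric
search and match-boundary rules from UTS #10 are not covered.

```python
# Toy primary-weight table. A character maps to a list of primary weights;
# ignorable collation elements (primary 0000) are simply left out.
PRIMARY = {
    "\u03C3": [0x1C95],  # GREEK SMALL LETTER SIGMA
    "\u03C2": [0x1C95],  # GREEK SMALL LETTER FINAL SIGMA
    "\u03F2": [0x1C95],  # GREEK LUNATE SIGMA SYMBOL
    "\u05DE": [0x1F81],  # HEBREW LETTER MEM
    "\u05DD": [0x1F81],  # HEBREW LETTER FINAL MEM
    "\uFB3E": [0x1F81],  # HEBREW LETTER MEM WITH DAGESH (dagesh is primary-ignorable)
}


def primaries(ch):
    """Primary weights for one character; ASSUMPTION: unknown characters
    fall back to their code point so the sketch works on arbitrary text."""
    return PRIMARY.get(ch, [ord(ch)])


def find_primary(text, pattern):
    """Return (start, end) spans of `text` whose primary weights equal
    those of `pattern`. Naive scan, one candidate span per start index."""
    target = [w for ch in pattern for w in primaries(ch)]
    spans = []
    for start in range(len(text)):
        got = []
        for end in range(start, len(text)):
            got.extend(primaries(text[end]))
            if got == target:
                spans.append((start, end + 1))
                break
            if got != target[:len(got)]:
                break  # diverged from the pattern, try the next start
    return spans


# Searching for a medial sigma finds the final sigma at the end of the word.
print(find_primary("\u03BB\u03BF\u03B3\u03BF\u03C2", "\u03C3"))
# Searching for mem finds the final mem at the end of the word.
print(find_primary("\u05E9\u05DC\u05D5\u05DD", "\u05DE"))
```

The spans are reported in terms of the original text, so a caller can
highlight the character that is actually there rather than the one typed.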
Best regards,
markus
Received on Thu Feb 19 2015 - 15:10:01 CST