[Unicode] Common Locale Data Repository: Bug Tracking

CLDR Ticket #3560 (closed enhancement: fixed)

Opened 5 years ago

Last modified 4 years ago

Korean search collator variant for "initial consonant" search

Reported by: pedberg
Owned by: pedberg
Component: main
Data Locale:
Phase:
Review: jungshik
Weeks:
Data Xpath:
Xref:

Description

A popular way of searching for contacts in Korean (e.g. on some phones) is to use the initial consonant of each Hangul syllable; that is, a search pattern of "ㅂㅁㅇ" should find "박무이".

We could perhaps support this by adding another Korean search collator variant that tailors the V (vowel) and T (trailing consonant) jamos to have primary weight 0 (like combining marks).
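
To make the requested behavior concrete, here is a minimal Python sketch of initial-consonant matching. It does not use a collator at all; it relies only on the Hangul syllable arithmetic from the Unicode Standard, and the names `initials` and `matches_initials` are hypothetical helpers introduced for illustration.

```python
# Precomposed Hangul syllables start at U+AC00; each lead consonant
# covers a block of 21 vowels * 28 trailing slots = 588 syllables.
S_BASE = 0xAC00
S_COUNT = 11172
N_COUNT = 21 * 28

# The 19 modern lead consonants as compatibility jamo, in lead-index order.
LEADS = "".join(chr(cp) for cp in (
    0x3131, 0x3132, 0x3134, 0x3137, 0x3138, 0x3139, 0x3141,
    0x3142, 0x3143, 0x3145, 0x3146, 0x3147, 0x3148, 0x3149,
    0x314A, 0x314B, 0x314C, 0x314D, 0x314E,
))


def initials(text: str) -> str:
    """Replace each precomposed Hangul syllable with its initial consonant."""
    out = []
    for ch in text:
        offset = ord(ch) - S_BASE
        if 0 <= offset < S_COUNT:
            out.append(LEADS[offset // N_COUNT])
        else:
            out.append(ch)  # leave non-Hangul characters unchanged
    return "".join(out)


def matches_initials(pattern: str, name: str) -> bool:
    """True if the pattern of initial consonants occurs in the name."""
    return pattern in initials(name)
```

With this sketch, `matches_initials("ㅂㅁㅇ", "박무이")` is true. A collator-based approach, as proposed here, would instead get the same effect from tailored weights, so that it works inside the normal search machinery.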

Change History

comment:1 Changed 5 years ago by pedberg

  • Cc jungshik added

comment:2 Changed 5 years ago by kent.karlsson14@…

Well, "tailors the V and T jamos to have primary weight 0" does not strictly give *the* first consonant, just the leading group of consonants. To get the very first consonant of a syllable in general, one would also need to give primary weight 0 to lead consonants that are not first in the syllable, and to weight lead consonant cluster characters that begin a syllable at level 1 as if they consisted only of the first consonant of the cluster.

But maybe this suggestion is not meant to be so general.

comment:3 Changed 5 years ago by mark

  • Owner changed from somebody to pedberg
  • Priority changed from assess to medium
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 2.0

new variant: abbreviation-search
update LDML, bcp47.
trailing items will have accent weight.

comment:4 Changed 5 years ago by pedberg

Kent,
Yes, good point. I guess we would need to give all L+ consonant characters with the same initial jamo the same primary weight. We can work out those details.

comment:5 Changed 4 years ago by pedberg

  • Milestone changed from 2.0 to 2.0.1

comment:6 Changed 4 years ago by pedberg

  • Milestone changed from 2.0.1 to 21m1

I have a version of this working (just for modern Korean). In common/collation/ko.xml I add a new collation "search-lead-cons" which is like the "search" collator in Korean, except that:

  1. At the end of the current list of secondary differences after <last_primary_ignorable/>, we add all of the (modern) conjoining vowels and trailing consonants (i.e. U+1161..U+1175 and U+11A8..U+11C2) as additional secondary differences, and
  2. We replace all of the Korean rules from the search collator with the following rules for the lead consonants:
    <reset>ᄀ</reset><s>ᄁ</s><i>ᄀᄀ</i>
    <reset>ᄃ</reset><s>ᄄ</s><i>ᄃᄃ</i>
    <reset>ᄇ</reset><s>ᄈ</s><i>ᄇᄇ</i>
    <reset>ᄉ</reset><s>ᄊ</s><i>ᄉᄉ</i>
    <reset>ᄌ</reset><s>ᄍ</s><i>ᄌᄌ</i>
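
As a rough illustration of what these two changes do at the primary level, here is a toy Python model. It is only a sketch, not how the collator is implemented: it drops the vowels and trailing consonants from item 1 (which the real collator keeps as secondary differences) and folds each doubled lead consonant from item 2 into its base consonant. The function name `primary_key` is hypothetical.

```python
import unicodedata

# Modern conjoining vowels and trailing consonants (the ranges in item 1).
VOWELS = range(0x1161, 0x1175 + 1)    # 21 code points
TRAILING = range(0x11A8, 0x11C2 + 1)  # 27 code points

# Doubled lead consonants and the base consonant each one is
# only a secondary difference from (the rules in item 2).
DOUBLED_TO_BASE = {
    "\u1101": "\u1100",  # SSANGKIYEOK -> KIYEOK
    "\u1104": "\u1103",  # SSANGTIKEUT -> TIKEUT
    "\u1108": "\u1107",  # SSANGPIEUP -> PIEUP
    "\u110A": "\u1109",  # SSANGSIOS -> SIOS
    "\u110D": "\u110C",  # SSANGCIEUC -> CIEUC
}


def primary_key(text: str) -> str:
    """Toy primary-level key: lead consonants only, doubled forms folded."""
    decomposed = unicodedata.normalize("NFD", text)
    # Model the <i> rules: a repeated base consonant equals the doubled jamo.
    for doubled, base in DOUBLED_TO_BASE.items():
        decomposed = decomposed.replace(base * 2, doubled)
    key = []
    for ch in decomposed:
        if ord(ch) in VOWELS or ord(ch) in TRAILING:
            continue  # ignored at the primary level in this model
        key.append(DOUBLED_TO_BASE.get(ch, ch))
    return "".join(key)
```

In this model a syllable and its bare conjoining lead consonant get the same primary key, which is what lets a pattern of lead consonants find the corresponding Hangul sequence.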
    

But there are some issues / questions:

  1. Is the name "search-lead-cons" OK?
  2. This enables use of a sequence of conjoining lead consonants to find a sequence of Hangul (composed or decomposed) with the same initial consonants. It does not support use of compatibility jamo for this.
  3. It is only really useful when ICU's asymmetric search is used; if the search pattern is a complete Hangul, we only want to match that Hangul in the text (not other Hangul that begin with the same consonant), but if the pattern is a sequence of initial consonants, then we want to match corresponding Hangul sequences in the text.
  4. Currently this collator tickles a crashing bug in the ICU search code. Working on debugging that.
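
The asymmetric behavior described in item 3 can be sketched in plain Python as follows. This is an illustrative model under stated assumptions, not ICU's search code: it assumes the pattern uses conjoining lead consonants (per item 2), ignores the doubled-consonant rules, and returns an index counted in syllable units rather than characters. All function names are hypothetical.

```python
import unicodedata


def is_vowel_or_trailing(ch: str) -> bool:
    """True for modern conjoining vowels and trailing consonants."""
    cp = ord(ch)
    return 0x1161 <= cp <= 0x1175 or 0x11A8 <= cp <= 0x11C2


def split_units(text: str) -> list[str]:
    """Split NFD text into units: a character plus any following V/T jamo."""
    units: list[str] = []
    for ch in unicodedata.normalize("NFD", text):
        if units and is_vowel_or_trailing(ch):
            units[-1] += ch
        else:
            units.append(ch)
    return units


def unit_matches(pattern_unit: str, text_unit: str) -> bool:
    """A bare lead consonant matches any syllable with that lead;
    a complete syllable in the pattern matches only itself."""
    if len(pattern_unit) == 1:
        return pattern_unit == text_unit[0]
    return pattern_unit == text_unit


def asymmetric_find(pattern: str, text: str) -> int:
    """Return the unit index of the first match, or -1 if there is none."""
    p, t = split_units(pattern), split_units(text)
    for i in range(len(t) - len(p) + 1):
        if all(unit_matches(pu, tu) for pu, tu in zip(p, t[i:])):
            return i
    return -1
```

The asymmetry is in `unit_matches`: extra vowel and trailing jamo in the text are tolerated only when the pattern unit has none of its own.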

Given this, I think we need to wait before rolling this in. Deferring to CLDR 21.

comment:7 Changed 4 years ago by pedberg

Actually, we can address #2 and support a pattern of compatibility jamos by slightly tailoring the initial consonant rules as follows:

<reset>ᄀ</reset><s>ᄁ</s><i>ᄀᄀ</i><t>ㄲ</t>
<reset>ᄃ</reset><s>ᄄ</s><i>ᄃᄃ</i><t>ㄸ</t>
<reset>ᄇ</reset><s>ᄈ</s><i>ᄇᄇ</i><t>ㅃ</t>
<reset>ᄉ</reset><s>ᄊ</s><i>ᄉᄉ</i><t>ㅆ</t>
<reset>ᄌ</reset><s>ᄍ</s><i>ᄌᄌ</i><t>ㅉ</t>
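
For reference, the correspondence between compatibility jamo and conjoining lead consonants is visible in Unicode compatibility decomposition. The sketch below shows it as a pattern preprocessing step; this is a different technique from the tertiary tailoring above, which handles it inside the collator, and `compat_to_conjoining` is a hypothetical helper name.

```python
import unicodedata


def compat_to_conjoining(pattern: str) -> str:
    """Map compatibility jamo (U+3131...) to conjoining jamo via NFKD.

    Note that NFKD also decomposes any precomposed syllables in the input.
    """
    return unicodedata.normalize("NFKD", pattern)
```

For example, the compatibility jamo U+3132 (ㄲ) decomposes to the conjoining lead consonant U+1101, which is the pair tied together by the first rule above.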

And I have a fix for #4, the ICU bug; see http://bugs.icu-project.org/trac/ticket/8681

So we just need to agree on a name for this variant search collator.

comment:8 Changed 4 years ago by pedberg

  • Milestone changed from 21m1 to 2.0.1

Agreed to move back into 2.0.1, and make the name "searchjl". Needs to be added to BCP47 and the spec.

comment:9 Changed 4 years ago by pedberg

  • Status changed from assigned to accepted
  • Review set to jungshik

OK, in addition to the commits shown above, I also have an update for the Key/Type Definitions table in TR35 to add the "searchjl" type. My commit of this is currently failing due to a permissions error, but I will note when I have finally managed to get it committed.

comment:10 Changed 4 years ago by pedberg

OK, after an svn database permissions fix I was able to commit the TR35 change. You can see it as the yellow-highlighted section in http://www.unicode.org/draft/reports/tr35/tr35.html#Key_Type_Definitions

comment:11 Changed 4 years ago by jungshik

  • Status changed from accepted to closed
  • Resolution set to fixed

Sorry for the late reply. I prefer to have this in the root, but it's ok for now. I'll file a new bug for considering that.
