[Unicode] Common Locale Data Repository: Bug Tracking

CLDR Ticket #6267 (accepted data)

Opened 4 years ago

Last modified 18 months ago

Add more characters to Latin-ASCII transform

Reported by: mark
Owned by: mark
Component: translit
Data Locale:
Phase: rc
Review:
Weeks:
Data Xpath:
Xref: ticket:5604, ticket:6593, ticket:6594, ticket:6595

Description

We've had a request for a transform that takes everything to A-Z.

To complete that I suggest the following.

Phase 1.

  1. Add mappings to Latin-ASCII for the Latin characters in the CLDR exemplars. There are only a few:

12 [ǀ-ǃǝǯɔəɣʒʔꞌ]

  2. Add mappings to Any-Latin. This actually consists of adding transliterators (or adding rules to existing transliterators) for the characters that currently don't map to Latin. Note: we should look at the code, because the Any-Latin transliterator might not be trying the BGN variants when it should. I think many of these characters would be covered if it did.

2342 [ѣѫҗҝңүұҳҹһӊөٮٯٲٹ-ٽځڅڈډڑړږڜڢڥڧڨګںڼھہ-ۄۇۉۍېےൺ-ൿඅ-ඖක-නඳ-රලව-ෆກຂຄງຈຊຍດ-ທນ-ຟມ-ຣລວສຫອ-ະາຳຽເ-ໄໆ໐-໙ໜໝༀཀ-གང-ཇཉ-ཌཎ-དན-བམ-ཛཝ-ཨཪက-ဪဿၐ-ၕჱჲჵ-ჺሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏼក-អឥ-ឧឩ-ឳⴀ-ⴥⴰⴱⴳⴷⴹⴻ-ⴽⵀⵃ-ⵅⵇⵉⵊⵍ-ⵏⵓ-ⵖⵙ-ⵜⵟⵡ-ⵣⵥⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞ々ꀀ-ꒌꔀ-ꘌꘐ-ꘫ]

  3. Add a test to verify that there are no left-over characters.
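
A minimal sketch of what such a left-over check could look like, in Python. It does not use the real CLDR/ICU transforms: `toy_latin_ascii` is a hypothetical stand-in that only decomposes characters and drops combining marks, and `leftover_characters` is an illustrative helper name. The real test would run the actual Any-Latin and Latin-ASCII transforms over the sets listed above.

```python
import unicodedata


def toy_latin_ascii(text: str) -> str:
    """Hypothetical stand-in for Latin-ASCII: NFD-decompose, drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def leftover_characters(transform, candidates):
    """Return the candidates whose transformed form still contains non-ASCII."""
    return sorted(ch for ch in candidates if not transform(ch).isascii())


# A few sample inputs: two accented letters plus three of the twelve
# exemplar characters listed in item 1 (U+01DD, U+0259, U+0292).
candidates = ["\u00e9", "\u00f1", "\u01dd", "\u0259", "\u0292"]
leftovers = leftover_characters(toy_latin_ascii, candidates)

# A real test would fail when this list is non-empty.
print([f"U+{ord(ch):04X}" for ch in leftovers])
```

With the toy transform, the accented letters reduce to ASCII while the three exemplar characters are reported as left over, which is the kind of gap the added mappings are meant to close.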

Phase 2.

Do the above, but over all Unicode characters in the RECOMMENDED and ASPIRATIONAL scripts (maybe also excluding xmod historic), and excluding the CJK ideographs and Yi.

  1. Add the following to Latin-ASCII:

318 [Ƅ-Ɔƍ-ƏƔƛƜƟƦ-ƪƱƷ-ƿǮǶǷȜȝȠȢȣɁɂɅɊɋɐ-ɒɘɚɜ-ɞɤɥɩɮ-ɰɵɷ-ɻɿʁʃ-ʇʊʌ-ʎʓʕ-ʘʚʞʡʢʤʧ-ʩʬ-ʯʴ-ʶˠˤᴂᴈᴉᴎᴐ-ᴗᴙᴚᴝ-ᴟᴣ-ᴥᴯᴲᴻᴽᵄ-ᵆᵊᵌᵎᵓ-ᵕᵙᵚᵜᵷᵹᵼᵿᶋᶐᶔᶕᶗᶘᶚᶛᶟᶣᶥᶭᶱᶲᶴᶷᶺᶾẟₔℲⅎↀ-ↈⱠ-ⱻⱾⱿꜢ-ꞇꞋꞍꞎꞐ-ꞓꞠ-Ɦꟺ-ꟿ]

  2. Ensure the following are added to Any-Latin. (Many are archaic and can be filtered out; that just hasn't been done yet.)

2198 [Ͱ-ͳͶͷͺ-ͽϏϗ-ϡϼ-ϿѠ-ѢѤ-ѪѬ-ҁҊ-ҏҖҜҞ-ҢҤ-ҮҰҲҴ-ҸҺҼ-ӀӃ-ӉӋ-ӏӠӡӨӪӫӶӷӺ-ԧՙؠػ-ؿٱٳ-ٸٿڀڂ-ڄڇڊ-ڐڒڔڕڗڙڛڝ-ڡڣڦڪڬڮڰ-ڹڻڽڿۀۅۆۈۊێۏۑۓەۥۦۮۯۺ-ۼۿݐ-ݿޜޡޥޱࢠࢢ-ࢬॱ-ॷॹ-ॼॾॿ୲-୷ௐఽౘౙ౸-౾ೱೲഩഺഽൎ൰-൵ໞໟ༠-༳གྷཌྷདྷབྷཛྷཀྵཫཬྈ-ྌ၀-၉ၚ-ၝၡၥၦၮ-ၰၵ-ႁႎ႐-႙Ⴀ-ჅჇჍჽ-ჿᄓ-ᅠᅶ-ᆧᇃ-ᇿ፩-፼ᐁ-ᙬᙯ-ᙿឣឤឨៗៜ០-៩៰-៹᠐-᠙ᠠ-ᡷᢀ-ᢨᢪᢰ-ᣵᴦ-ᴫⴧⴭⴲⴴ-ⴶⴸⴺⴾⴿⵁⵂⵆⵈⵋⵌⵐ-ⵒⵗⵘⵝⵞⵠⵤⵦⵧⵯ〻ゕゖㄪ-ㄭㅀㅄㅤ-ㆎㆠ-ㆺㇰ-ㇿꙀ-ꙮꙿ-ꚗꣲ-ꣷꣻꥠ-ꥼꩠ-ꩶꩺꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮힰ-ퟆퟋ-ퟻﭐ-ﭕﭚ-ﭩﭮ-ﭹﭾ-ﮉﮌﮍﮖ-ﮱﯗ-ﯝﯠ-ﯧﯬﯭﯰ-ﯸﷰﷱﹳᅠᄚᄡ𐅀-𐅸𐆊𐹠-𐹾𖼀-𖽄𖽐𖾓-𖾟𛀀𛀁𞸜-𞸟𞹝𞹟𞹼𞹾]

  3. Tweak the test.

Phase 3.

(which we may never get to).

Do the remaining historic and limited-use scripts and characters (excluding CJK ideographs, and maybe Yi).

5535 [Ϣ-ϯܭ-ܯݍ-ݏ߀-ߪߴߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘ -ᚚᚠ-ᛪᛮ-ᛰᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰᤀ-ᤜ᥆-ᥭᥰ-ᥴᦀ-ᦫᧁ-ᧇ᧐-᧚ᨀ-ᨖᨠ-ᩔ᪀-᪉᪐-᪙ᪧᬅ-ᬳᭅ-ᭋ᭐-᭙ᮃ-ᮠᮮ-ᯥᰀ-ᰣ᱀-᱉ᱍ-ᱽⰀ-Ⱞⰰ-ⱞⲀ-ⳤⳫ-ⳮⳲⳳ⳽ꓐ-ꓽꚠ-ꛯꠀꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳ꣐-꣙꤀-ꤥꤰ-ꥆꦄ-ꦲ꧐-꧙ꨀ-ꨨꩀ-ꩂꩄ-ꩋ꩐-꩙ꪀ-ꪯꪱꪵꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꯀ-ꯢ꯰-꯹𐀀-𐀋𐀍-𐀦𐀨-𐀺𐀼𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐊀-𐊜𐊠-𐋐𐌀-𐌞𐌠-𐌣𐌰-𐍊𐎀-𐎝𐎠-𐏃𐏈-𐏏𐏑-𐏕𐐀-𐒝𐒠-𐒩𐠀-𐠅𐠈𐠊-𐠵𐠷𐠸𐠼𐠿-𐡕𐡘-𐡟𐤀-𐤛𐤠-𐤹𐦀-𐦷𐦾𐦿𐨀𐨐-𐨓𐨕-𐨗𐨙-𐨳𐩀-𐩇𐩠-𐩾𐬀-𐬵𐭀-𐭕𐭘-𐭲𐭸-𐭿𐰀-𐱈𑀃-𑀷𑁒-𑁯𑂃-𑂯𑃐-𑃨𑃰-𑃹𑄃-𑄦𑄶-𑄿𑆃-𑆲𑇁-𑇄𑇐-𑇙𑚀-𑚪𑛀-𑛉𒀀-𒍮𒐀-𒑢𓀀-𓐮𖠀-𖨸]

Change History

comment:1 Changed 4 years ago by emmons

  • Owner changed from anybody to mark
  • Priority changed from assess to medium
  • Type changed from unknown to enhancement
  • Status changed from new to assigned
  • Milestone changed from UNSCH to 24rc

comment:2 Changed 4 years ago by mark

  • Component changed from unknown to data-translit

comment:3 Changed 4 years ago by pedberg

  • Cc pedberg added
  • Xref set to 5604, 6593, 6594, 6595

Notes:

  • Part 1B of this (Phase 1, second item) is also covered by a separate bug, cldrbug 5604. That in turn depends on adding transforms for at least three additional scripts (Khmer, Lao, Sinhala), so it is not going to get done for 24rc; this ticket is therefore not likely to be either, at least in its full glory.
  • Also note that some interesting issues come up when converting from Any to ASCII via Latin (Any-Latin;Latin-ASCII). For example, a common ASCII representation (e.g. in chats) of the Arabic letters hamza and ain is the ASCII digits 2 and 3 respectively, because of some graphic similarity. Arabic-Latin (and thus Any-Latin) converts hamza to ʾ \u02BE (right half ring) and ain to ʿ \u02BF (left half ring), which is correct. Latin-ASCII does not currently convert \u02BE and \u02BF. If we add those conversions, should we (a) convert them based only on the code points \u02BE and \u02BF themselves (in which case we might convert them to something like left and right parens), or (b) assume that they most likely got into Latin by conversion from Arabic, in which case we might convert them to 2 and 3? I would vote for the latter. It would make Any-Latin;Latin-ASCII much more useful.
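
To make the choice between (a) and (b) concrete, here is a small Python sketch. It is not CLDR rule syntax and not the actual Latin-ASCII data: `OPTION_A`, `OPTION_B`, and `to_ascii` are hypothetical names, and the assignment of parentheses in option (a) is only an assumption based on the shapes of the half rings.

```python
HAMZA_LATIN = "\u02be"  # MODIFIER LETTER RIGHT HALF RING, Arabic-Latin output for hamza
AIN_LATIN = "\u02bf"    # MODIFIER LETTER LEFT HALF RING, Arabic-Latin output for ain

# Option (a): map by the look of the code points alone (assumed pairing).
OPTION_A = {HAMZA_LATIN: ")", AIN_LATIN: "("}

# Option (b): assume the text came from Arabic and use the chat convention.
OPTION_B = {HAMZA_LATIN: "2", AIN_LATIN: "3"}


def to_ascii(mapping: dict, text: str) -> str:
    """Apply one candidate mapping; all other characters pass through unchanged."""
    return text.translate(str.maketrans(mapping))


sample = "\u02bfarab"  # Latin transliteration starting with ain
print(to_ascii(OPTION_A, sample))  # (arab
print(to_ascii(OPTION_B, sample))  # 3arab
```

Option (b) yields the form that people already type in ASCII-only chats, which is the argument for it above. Option (a) stays neutral about where the half rings came from.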

comment:4 Changed 4 years ago by emmons

  • Milestone changed from 24rc to 25dsub

comment:5 Changed 3 years ago by emmons

  • Milestone changed from 25dsub to 25rc

Moving all 25dsub and 25design tickets to 25rc. If you plan to complete items in the 25M1 time frame, please move those tickets to 25M1.

comment:6 Changed 3 years ago by mark

  • Milestone changed from 25rc to 26rc

comment:7 Changed 3 years ago by mark

  • Milestone changed from 26rc to 27dsub

comment:8 Changed 3 years ago by markus

  • Phase set to dsub
  • Milestone changed from 27dsub to 27

comment:9 Changed 2 years ago by mark

  • Milestone changed from 27 to 28

comment:10 Changed 2 years ago by mark

  • Phase changed from dsub to rc

comment:11 Changed 2 years ago by markus

  • Type changed from enhancement to data

comment:12 Changed 2 years ago by srl

  • Status changed from assigned to accepted

comment:13 Changed 19 months ago by mark

  • Milestone changed from 28 to 29

comment:14 Changed 18 months ago by emmons

  • Milestone changed from 29 to upcoming

Auto move of all 29 -> upcoming
