CLDR Ticket #6267 (accepted data)
Add more characters to Latin-Ascii transform
Reported by: | mark | Owned by: | mark |
---|---|---|---|
Component: | translit | Data Locale: | |
Phase: | rc | Review: | |
Weeks: | | Data Xpath: | |
Xref: | | | |
Description
We've had a request for a transform that takes everything to A-Z.
To complete that I suggest the following.
Phase 1.
- Add mappings to Latin-ASCII for the Latin characters in the CLDR exemplars. There are only a few:
12 [ǀ-ǃǝǯɔəɣʒʔꞌ]
- Add mappings to Any-Latin. This actually consists of adding transliterators (or adding rules to existing transliterators) for the characters that currently don't map to Latin. Note: we should look at the code, because the Any-Latin transliterator might not be trying the BGN variants when it should. I think many of these would be covered if we fix that.
2342 [ѣѫҗҝңүұҳҹһӊөٮٯٲٹ-ٽځڅڈډڑړږڜڢڥڧڨګںڼھہ-ۄۇۉۍېےൺ-ൿඅ-ඖක-නඳ-රලව-ෆກຂຄງຈຊຍດ-ທນ-ຟມ-ຣລວສຫອ-ະາຳຽເ-ໄໆ໐-໙ໜໝༀཀ-གང-ཇཉ-ཌཎ-དན-བམ-ཛཝ-ཨཪက-ဪဿၐ-ၕჱჲჵ-ჺሀ-ቈቊ-ቍቐ-ቖቘቚ-ቝበ-ኈኊ-ኍነ-ኰኲ-ኵኸ-ኾዀዂ-ዅወ-ዖዘ-ጐጒ-ጕጘ-ፚᎀ-ᎏᎠ-Ᏼក-អឥ-ឧឩ-ឳⴀ-ⴥⴰⴱⴳⴷⴹⴻ-ⴽⵀⵃ-ⵅⵇⵉⵊⵍ-ⵏⵓ-ⵖⵙ-ⵜⵟⵡ-ⵣⵥⶀ-ⶖⶠ-ⶦⶨ-ⶮⶰ-ⶶⶸ-ⶾⷀ-ⷆⷈ-ⷎⷐ-ⷖⷘ-ⷞ々ꀀ-ꒌꔀ-ꘌꘐ-ꘫ]
- Add a test to verify that there are no left-over characters.
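The leftover check in the last step could be sketched roughly as follows. This is only an illustration, not the CLDR test: `expand_simple_set` handles just single characters and `a-b` ranges (not full UnicodeSet syntax), and `PLACEHOLDER_MAP` is a hypothetical stand-in for the real Latin-ASCII rules, with made-up targets.

```python
def expand_simple_set(pattern):
    """Expand a bracketed set of single characters and a-b ranges.

    Simplified on purpose: no escapes, properties, or nested sets.
    """
    body = pattern.strip()[1:-1]
    chars = []
    i = 0
    while i < len(body):
        if i + 2 < len(body) and body[i + 1] == "-":
            # Range such as U+01C0-U+01C3: include both endpoints.
            chars.extend(chr(cp) for cp in range(ord(body[i]), ord(body[i + 2]) + 1))
            i += 3
        else:
            chars.append(body[i])
            i += 1
    return chars


def leftover_characters(chars, transform):
    """Return the characters whose transformed form is still not pure ASCII."""
    return [c for c in chars if not transform(c).isascii()]


# The 12 Latin exemplar characters listed above, written with escapes.
PHASE1_LATIN = expand_simple_set(
    "[\u01C0-\u01C3\u01DD\u01EF\u0254\u0259\u0263\u0292\u0294\uA78C]"
)

# Hypothetical placeholder rules; these are NOT the mappings proposed in this ticket.
PLACEHOLDER_MAP = {"\u0259": "e", "\u0254": "o"}


def placeholder_transform(text):
    """Apply the placeholder rules character by character."""
    return "".join(PLACEHOLDER_MAP.get(ch, ch) for ch in text)


leftovers = leftover_characters(PHASE1_LATIN, placeholder_transform)
print(len(PHASE1_LATIN), "characters,", len(leftovers), "left over")
```

A real test would swap `placeholder_transform` for the actual transform and assert that the leftover list is empty.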
Phase 2.
Do the above, but over all Unicode characters, restricted to RECOMMENDED and ASPIRATIONAL scripts (maybe also excluding xmod historic), and excluding the CJK ideographs and Yi.
- Add the following to Latin-ASCII:
318 [Ƅ-Ɔƍ-ƏƔƛƜƟƦ-ƪƱƷ-ƿǮǶǷȜȝȠȢȣɁɂɅɊɋɐ-ɒɘɚɜ-ɞɤɥɩɮ-ɰɵɷ-ɻɿʁʃ-ʇʊʌ-ʎʓʕ-ʘʚʞʡʢʤʧ-ʩʬ-ʯʴ-ʶˠˤᴂᴈᴉᴎᴐ-ᴗᴙᴚᴝ-ᴟᴣ-ᴥᴯᴲᴻᴽᵄ-ᵆᵊᵌᵎᵓ-ᵕᵙᵚᵜᵷᵹᵼᵿᶋᶐᶔᶕᶗᶘᶚᶛᶟᶣᶥᶭᶱᶲᶴᶷᶺᶾẟₔℲⅎↀ-ↈⱠ-ⱻⱾⱿꜢ-ꞇꞋꞍꞎꞐ-ꞓꞠ-Ɦꟺ-ꟿ]
- Ensure the following are added to Any-Latin (many are archaic and can be filtered out; that just hasn't been done yet):
2198 [Ͱ-ͳͶͷͺ-ͽϏϗ-ϡϼ-ϿѠ-ѢѤ-ѪѬ-ҁҊ-ҏҖҜҞ-ҢҤ-ҮҰҲҴ-ҸҺҼ-ӀӃ-ӉӋ-ӏӠӡӨӪӫӶӷӺ-ԧՙؠػ-ؿٱٳ-ٸٿڀڂ-ڄڇڊ-ڐڒڔڕڗڙڛڝ-ڡڣڦڪڬڮڰ-ڹڻڽڿۀۅۆۈۊێۏۑۓەۥۦۮۯۺ-ۼۿݐ-ݿޜޡޥޱࢠࢢ-ࢬॱ-ॷॹ-ॼॾॿ୲-୷ௐఽౘౙ౸-౾ೱೲഩഺഽൎ൰-൵ໞໟ༠-༳གྷཌྷདྷབྷཛྷཀྵཫཬྈ-ྌ၀-၉ၚ-ၝၡၥၦၮ-ၰၵ-ႁႎ႐-႙Ⴀ-ჅჇჍჽ-ჿᄓ-ᅠᅶ-ᆧᇃ-ᇿ፩-፼ᐁ-ᙬᙯ-ᙿឣឤឨៗៜ០-៩៰-៹᠐-᠙ᠠ-ᡷᢀ-ᢨᢪᢰ-ᣵᴦ-ᴫⴧⴭⴲⴴ-ⴶⴸⴺⴾⴿⵁⵂⵆⵈⵋⵌⵐ-ⵒⵗⵘⵝⵞⵠⵤⵦⵧⵯ〻ゕゖㄪ-ㄭㅀㅄㅤ-ㆎㆠ-ㆺㇰ-ㇿꙀ-ꙮꙿ-ꚗꣲ-ꣷꣻꥠ-ꥼꩠ-ꩶꩺꬁ-ꬆꬉ-ꬎꬑ-ꬖꬠ-ꬦꬨ-ꬮힰ-ퟆퟋ-ퟻﭐ-ﭕﭚ-ﭩﭮ-ﭹﭾ-ﮉﮌﮍﮖ-ﮱﯗ-ﯝﯠ-ﯧﯬﯭﯰ-ﯸﷰﷱﹳᅠᄚᄡ𐅀-𐅸𐆊𐹠-𐹾𖼀-𖽄𖽐𖾓-𖾟𛀀𛀁𞸜-𞸟𞹝𞹟𞹼𞹾]
- Tweak the test.
Phase 3.
(which we may never get to).
Do the remaining historic and limited-use scripts and characters (excluding CJK ideographs, and maybe Yi):
5535 [Ϣ-ϯܭ-ܯݍ-ݏ߀-ߪߴߵߺࠀ-ࠕࠚࠤࠨࡀ-ࡘ -ᚚᚠ-ᛪᛮ-ᛰᜀ-ᜌᜎ-ᜑᜠ-ᜱᝀ-ᝑᝠ-ᝬᝮ-ᝰᤀ-ᤜ᥆-ᥭᥰ-ᥴᦀ-ᦫᧁ-ᧇ᧐-᧚ᨀ-ᨖᨠ-ᩔ᪀-᪉᪐-᪙ᪧᬅ-ᬳᭅ-ᭋ᭐-᭙ᮃ-ᮠᮮ-ᯥᰀ-ᰣ᱀-᱉ᱍ-ᱽⰀ-Ⱞⰰ-ⱞⲀ-ⳤⳫ-ⳮⳲⳳ⳽ꓐ-ꓽꚠ-ꛯꠀꠁꠃ-ꠅꠇ-ꠊꠌ-ꠢꡀ-ꡳꢂ-ꢳ꣐-꣙꤀-ꤥꤰ-ꥆꦄ-ꦲ꧐-꧙ꨀ-ꨨꩀ-ꩂꩄ-ꩋ꩐-꩙ꪀ-ꪯꪱꪵꪶꪹ-ꪽꫀꫂꫛ-ꫝꫠ-ꫪꫲ-ꫴꯀ-ꯢ꯰-꯹𐀀-𐀋𐀍-𐀦𐀨-𐀺𐀼𐀽𐀿-𐁍𐁐-𐁝𐂀-𐃺𐊀-𐊜𐊠-𐋐𐌀-𐌞𐌠-𐌣𐌰-𐍊𐎀-𐎝𐎠-𐏃𐏈-𐏏𐏑-𐏕𐐀-𐒝𐒠-𐒩𐠀-𐠅𐠈𐠊-𐠵𐠷𐠸𐠼𐠿-𐡕𐡘-𐡟𐤀-𐤛𐤠-𐤹𐦀-𐦷𐦾𐦿𐨀𐨐-𐨓𐨕-𐨗𐨙-𐨳𐩀-𐩇𐩠-𐩾𐬀-𐬵𐭀-𐭕𐭘-𐭲𐭸-𐭿𐰀-𐱈𑀃-𑀷𑁒-𑁯𑂃-𑂯𑃐-𑃨𑃰-𑃹𑄃-𑄦𑄶-𑄿𑆃-𑆲𑇁-𑇄𑇐-𑇙𑚀-𑚪𑛀-𑛉𒀀-𒍮𒐀-𒑢𓀀-𓐮𖠀-𖨸]
Change History
comment:1 Changed 5 years ago by emmons
- Owner changed from anybody to mark
- Priority changed from assess to medium
- Type changed from unknown to enhancement
- Status changed from new to assigned
- Milestone changed from UNSCH to 24rc
comment:3 Changed 5 years ago by pedberg
- Cc pedberg added
- Xref set to 5604, 6593, 6594, 6595
Notes:
- Part 1B of this is also covered by a separate bug, cldrbug 5604: . That in turn depends on adding transforms for at least three additional scripts (Khmer, Lao, Sinhala), so it is not going to get done for 24rc (and thus this ticket is not likely to either, at least in its full glory).
- Also note that some interesting issues come up when converting from Any to ASCII via Latin (Any-Latin;Latin-ASCII). For example, a common ASCII representation (e.g. in chats) for the Arabic letters hamza and ain is the ASCII digits 2 and 3 respectively (there is some graphic similarity). Now, Arabic-Latin (and thus Any-Latin) converts hamza to ʾ \u02BE (right half ring) and ain to ʿ \u02BF (left half ring), which is correct. Latin-ASCII does not currently convert \u02BE and \u02BF. But if we add the conversions, should we (a) convert them to something based only on the codes \u02BE and \u02BF (in which case we might convert them to something like left and right parens), or (b) convert them assuming that the likely way they got into Latin is by conversion from Arabic, in which case we might convert them to 2 and 3? I would vote for the latter. This will make Any-Latin;Latin-ASCII much more useful.
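To make the (a) versus (b) choice concrete, here is a small sketch using plain dictionaries rather than the real transforms. `ARABIC_TO_LATIN` covers only the two letters discussed, and which paren goes with which half ring in `OPTION_A` is an assumption made for illustration.

```python
# Arabic-Latin behaviour described above: hamza -> U+02BE, ain -> U+02BF.
ARABIC_TO_LATIN = {"\u0621": "\u02BE", "\u0639": "\u02BF"}

# Option (a): map by shape only. The paren assignment here is assumed.
OPTION_A = {"\u02BE": ")", "\u02BF": "("}

# Option (b): assume the half rings came from Arabic and use chat-style digits.
OPTION_B = {"\u02BE": "2", "\u02BF": "3"}


def apply_rules(mapping, text):
    """Apply a character-to-string mapping, leaving unmapped characters alone."""
    return "".join(mapping.get(ch, ch) for ch in text)


def any_to_ascii(text, latin_to_ascii):
    """Chain the two steps, mimicking Any-Latin;Latin-ASCII for this toy data."""
    return apply_rules(latin_to_ascii, apply_rules(ARABIC_TO_LATIN, text))


sample = "\u0621\u0639"  # hamza followed by ain
print(any_to_ascii(sample, OPTION_A))
print(any_to_ascii(sample, OPTION_B))
```

The trade-off with option (b) is that a half ring that was already in Latin text, and never came from Arabic, would also become a digit, since Latin-ASCII only sees the Latin characters.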
comment:5 Changed 4 years ago by emmons
- Milestone changed from 25dsub to 25rc
Moving all 25dsub and 25design tickets to 25rc. If you plan to complete items in the 25M1 time frame, please move those tickets to 25M1.
comment:14 Changed 3 years ago by emmons
- Milestone changed from 29 to upcoming
Auto move of all 29 -> upcoming