The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of January 18, 2022, since the previous cumulative document was issued prior to UTC #169 (October 2021).
The links below go directly to open PRIs and to feedback documents for them, as of January 18, 2022.
Issue  Name  Feedback Link
441  Proposed Update UAX #29, Unicode Text Segmentation  (feedback)
440  Proposed Update UTS #10, Unicode Collation Algorithm  (feedback)
439  Proposed Update UAX #50, Unicode Vertical Text Layout  (feedback) No feedback at this time
438  Proposed Update UAX #44, Unicode Character Database  (feedback)
437  Proposed Update UAX #38, Unicode Han Database (Unihan)  (feedback) No feedback at this time
436  UTS #37 Unicode Ideographic Variation Database  (feedback) No feedback at this time
435  Unicode Emoji 15.0 Provisional Candidates  (feedback)
434  CLDR Person Name Formatting  (feedback)
427  Proposed Update UTS #18, Unicode Regular Expressions  (feedback)
The links below go to locations in this document for feedback.
Feedback routed to Unihan ad hoc for evaluation
Feedback routed to Script ad hoc for evaluation
Feedback routed to Properties & Algorithms ad hoc for evaluation
Feedback routed to Emoji SC for evaluation
Feedback routed to Editorial Committee for evaluation
Other Reports
Date/Time: Fri Oct 8 20:34:45 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: Recommended addition of UAX #38
I would like to apologize if in the past I wasn't as helpful to the Unihan group as I should have been. This is a more succinct recommendation to the Unihan group: add the number of entries of each field to the description boxes of the document in question: https://unicode.org/reports/tr38/. Since Dr. Lunde already compiled the number of entries in this document: https://docs.google.com/spreadsheets/d/1_ad7Z9qqMONlK5SUfNjaSXSNIaG-0POQKTlUqCvxdhI/edit#gid=559817095 it would be trivial to add the most up-to-date counts in new versions of UAX #38. This would be convenient for users who might not want to look at different documents for that info. The counts give users a sense of how large each field is, and comparing the counts across future versions can also reflect the growth of the database.
Date/Time: Sat Nov 20 04:41:17 CST 2021
Name: Eiso Chan
Report Type: Error Report
Opt Subject: Unihan Database
The kMandarin value for U+266E8 𦛨 is lao. Maybe we need to modify it to láo based on the corresponding Traditional variant U+6725 朥. This character is very common in Teochow-Swatow Min-dialects for the local food 𦛨饼, but it’s a pity that it has not been included in TGH.
Date/Time: Sat Nov 27 22:35:35 CST 2021
Name: Jerry Rossignuolo
Report Type: Error Report
Opt Subject: 19227-n5100r-10646-6th-ed-cd3-chart.pdf
Hello, I see you have the radical "⺜" interpreted as the radical "sun". ⺜ is used as variant of 日 (sun) in the 新华字典部首 (XinHuaZiDian BuShou). Yet not all variants of a BuShou per GF 0011-2009 are of the same radical. Most of the material I am finding lists the radical ⺜ is a variant of 冃 with the meaning of cap or hat. I believe this dates back to the Shuowen Jiezi (说文解字) dictionary. http://www.shuowen.org/?bushou=%E5%86%83 I also believe ⺜ as a radical is named 冒字头 which further has me thinking this radical has the meaning of cap or hat. Yet, I am not sure. Is it possible if you could clarify this for me? I ran across this while working on a Chinese language learning tool and need to correctly identify the meaning of ⺜. Thanks, Jerry Rossignuolo
Date/Time: Sat Dec 18 19:28:56 CST 2021
Name: Richard Hsieh
Report Type: Error Report
Opt Subject: CJK chart 4E30 mixed up
4E30 丰 has HB1-A4A5 and T1-4464, which are the Traditional Chinese character. The rest are Simplified Chinese characters. They are two different characters and cannot be mixed up. I could not produce the Traditional character for the name of a person and other things, because software that carries both at the same time cannot tell them apart and places the Simplified Chinese character in place of the Traditional Chinese character.
Date/Time: Tue Jan 4 07:17:52 CST 2022
Name: Andrew Christopher West
Report Type: Other Document Submission
Opt Subject: CJK Ext. H U+31682 (UK-10989)
L2-21/053 "Additional repertoire for a future version of Unicode (post Unicode 14.0)" lists the proposed code chart for CJK Unified Ideographs Extension H. The character at U+31682 (UK-10989) is indexed as Radical 40 plus one residual stroke, and placed first under radical 40 (宀). This is clearly wrong because: 1) the character does not include radical 40; and 2) the total stroke count is 9. I suggest changing back to the original proposed index of radical 25 plus 7 residual strokes, and reordering after U+31455.
Date/Time: Thu Jan 6 20:28:54 CST 2022
Name: Ken Lunde
Report Type: Error Report
Opt Subject: Unihan Database
The kTotalStrokes property value of U+2AB8F 𪮏 (⿰手思) should be changed from 12 to 13, because its indexing radical is composed of four strokes, not three.
Date/Time: Thu Jan 6 07:53:34 CST 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: Unihan_Readings.txt
I want to report some mistakes in the Unihan Database definitions (kDefinition): “from from” (instead of “from”), “disturbe” (instead of “disturb”), “pon your mind” (instead of “on your mind”), “thron” (instead of “thorn”), “flaten”(instead of “flatten”), “name name” (instead of “name”), “chrysanthemun” (instead of “chrysanthemum”), “purpurca” (instead of “purpurea”), “force fo arms” (instead of “force of arms”), “phtholein” (instead of “phthalein”), “the the” (instead of “the”), “ber eaten” (instead of “be eaten”), “bured” (instead of “buried”), “askewd” (instead of “askew”), “foll of cloth” (instead of “roll of cloth”), “smilling” (instead of “smiling”), “without friends or relativ” (instead of “without friends or relatives”), “longtum” (instead of “longum”), “themedia forskali” (instead of “Themeda forskalii”), “circium” (instead of “Cirsium”), “bracenia” (instead of “Brasenia”), “artemesia” (instead of “artemisia”), “corp of a bird” (instead of “crop of a bird”), “stellariana” (instead of “stelleriana”), “eumenes polifomis” (apparently should be “Eumenes pomiformis”, but is this definition actually correct??), “loquatious” (instead of “loquacious”), “interprete” (instead of “interpret”), “liesure” (instead of “leisure”), “mischevious” (instead of “mischievous”), “fy” (instead of “fry”), “incorruptable” (instead of “incorruptible”), “repse” (instead of “repose”). The following words should be capitalized: “sanskrit”, “buddhist”, “daoist”, “pekinese”, “persian”. Scientific names are capitalized as well, so this should be corrected: “malus”, “canis”, “ursus”, “rubia”, “plantago”, “piper”, “caryopteris”, “hydropyrum”, “artemisia stelleriana”, “gracilaria”, “vitis”, “valeriana”, “pteris”, “ligusticum”, “allium”, “cyperus”, “lophanthus”, “arca”, “libellulidae”, “vipera”, “brachyura”, “acrida”, “cosmopsaltria”, “parasilurus”, “spheroides”, “coryphaena”, “pagrosomus”, “treron”, “grus”. 
This is questionable, but I am unsure what this is supposed to mean: “leucacene”, “suffle”.
Date/Time: Mon Jan 10 18:45:29 CST 2022
Name: Jaemin Chung
Report Type: Error Report
Opt Subject: Defect report on USourceData.txt
In USourceData.txt, some IDSes have semicolons in them. This is bad because a semicolon is already used as a delimiter. Here are the IDSes with semicolons:
UTC-03134;B;U+28559;162.9;;⿺辶⿹&P7-03;刀;UTCDoc L2/17-204;;13 12;3
UTC-03143;WS-2017;;145.12;;⿳⿱&H5-01;冖石衣;UTCDoc L2/17-204;;18;1
UTC-03156;WS-2017;;94.8;;⿱&H8-01;犬;UTCDoc L2/17-204;;12;1
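The delimiter clash described above can be detected mechanically. The following is a minimal sketch, not part of the submitted report: it assumes a well-formed USourceData.txt record has 10 semicolon-separated fields (an assumption to check against the file header), and the names `EXPECTED_FIELDS`, `ENTITY` and `suspicious_records` are hypothetical. It flags records with an unexpected field count and lists any entity-like `&...;` references found in them.

```python
import re

# Assumption: a well-formed record has 10 semicolon-separated fields.
EXPECTED_FIELDS = 10

# Entity-like references such as "&P7-03;" that carry their own semicolon.
ENTITY = re.compile(r"&[A-Za-z0-9-]+;")

# The three records quoted in the report.
REPORTED = [
    "UTC-03134;B;U+28559;162.9;;⿺辶⿹&P7-03;刀;UTCDoc L2/17-204;;13 12;3",
    "UTC-03143;WS-2017;;145.12;;⿳⿱&H5-01;冖石衣;UTCDoc L2/17-204;;18;1",
    "UTC-03156;WS-2017;;94.8;;⿱&H8-01;犬;UTCDoc L2/17-204;;12;1",
]


def suspicious_records(lines, expected=EXPECTED_FIELDS):
    """Return (id, field_count, entities) for records with a wrong field count."""
    hits = []
    for line in lines:
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(";")
        if len(fields) != expected:
            hits.append((fields[0], len(fields), ENTITY.findall(line)))
    return hits


for record_id, count, entities in suspicious_records(REPORTED):
    print(record_id, count, entities)
```

Under that assumption, each of the three reported records splits into 11 pieces, one more than expected, because the semicolon inside the IDS is read as a field boundary.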
Date/Time: Wed Sep 29 14:19:34 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: On the apparent arbitrary exclusion of the play button as a distinct character in the face of duplicates
The relevant symbols discussed are old (dating back at least to the period when cassette players were popular); they were later adopted in so many contexts that they could be said to be universal representations of their respective functions. Naturally, since they were (and still are) so important, most of them were assigned a Unicode code point in the "Miscellaneous Technical" block, with some being apparently duplicated. Here I proceed to discuss those:
⏴⏵⏶⏷ (23F4-23F7) and 🞀🞁🞂🞃 (1F780-1F783): Both sets seem to serve the same purpose, with the second set using the term "ISOSCELES RIGHT TRIANGLE" instead of simply "TRIANGLE" to distinguish them. Both sets are isosceles and have right angles, so the differences in name are not helpful. In practice, it seems like the first set tends to have a consistent advance width with padding on all sides, while the other set tends to have a tight advance width with respect to the glyph, which means that the up and down arrows end up slightly wider than the left and right ones. If this is the "true" difference between them, then the name chosen does not reflect that, and it is unclear why they couldn't be unified anyway, i.e. why was it important to have both sets?
23F9 ⏹ BLACK SQUARE FOR STOP, 25A0 ■ BLACK SQUARE, 25FC ◼ BLACK MEDIUM SQUARE, 2B1B ⬛ BLACK LARGE SQUARE, 2BC0 ⯀ BLACK SQUARE CENTRED and 1F532 🔲 BLACK SQUARE BUTTON: Out of all of them, the most generic is 25A0; perhaps it was disunified into 23F9 because in user interfaces it is important that all buttons have the same width, while 25A0 was free to lack padding on both sides. 2BC0, the "centred" one, forms part of a set where "centred" just means the figures have consistent padding on both sides. The last character is disunified on account of its different function in UIs, where it has a dual and represents a selected or unselected button. Similarly, while disunifying on account of size makes sense, either 25FC or 2B1B could have been used for the "stop" function if only one of them was declared to be it.
23FA ⏺ BLACK CIRCLE FOR RECORD, 25CF ● BLACK CIRCLE, 26AB ⚫ MEDIUM BLACK CIRCLE and 1F534 🔴 LARGE RED CIRCLE: A similar situation to the "stop" symbol applies to the "record" one, with one caveat: the symbol is often shown in red. With this in mind, not only does it make sense to disunify it from 25CF, it also makes sense to disunify it from 26AB and 1F534, on account of the stability of their colors. So there are no problematic disunifications here, except maybe 25CF ● and 2022 • BULLET, but that is independent of the issue at hand.
The only symbol NOT to be disunified was the "play" symbol, the closest matches being 25B6 ▶ BLACK RIGHT-POINTING TRIANGLE and 2BC8 ⯈ BLACK MEDIUM RIGHT-POINTING TRIANGLE CENTRED. It makes little sense to disunify the symbols already discussed, but not this one. Whatever rationale applied to the other characters should also apply to this one. I therefore highly recommend encoding a new symbol. The glyph would harmonize well with the other symbols, since it can have a smaller glyph and the necessary padding at the same time. Disunification also has the benefit of allowing fonts to depict the symbols inside an enclosure by default, since that is what users often expect. I suggest the name BLACK RIGHT-POINTING TRIANGLE FOR PLAY or BLACK RIGHT-POINTING EQUILATERAL TRIANGLE FOR PLAY. If a separate document needs to be written for it, I would gladly do so.
Date/Time: Wed Oct 13 15:00:17 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: Suggestion on the encoding of Latin Theta
In document L2/21-206, it is suggested to encode a Latin casing pair for Theta. The difference from the Greek pair Θθ (0398 and 03B8) is that the capital form always has a horizontal stroke that touches both sides of the letter, while the Greek letter can have a shorter stroke with its own serifs. Another similar pair is the Latin Ɵɵ (019F and 0275); the difference from this pair is that its lowercase is at x-height, while the orthography requires a tall glyph like the proper Greek small theta. Encoding a new pair is problematic, since phonetic notations already use the Greek code point (03B8) for the same sound. I propose some possible solutions.
1. Use the Greek pair: In order to force the preferred glyph for Latin-based orthographies, an SVS can be added, called "latin form" or "long stroke form". This would mean that the default glyph is still what the Greek users expect, and the small Theta remains untouched.
2. Use the Latin barred o pair: Similarly, in order to force the preferred glyph on the lowercase, an SVS can be added, called "theta form", "tall form" or "elongated form". This has the benefit of keeping the text completely Latin. Characters that are confusable with others, but only in certain contexts, are not new.
(Deciding between 1 and 2 depends on what the users prefer in case the default glyph has to be displayed: either an uppercase with a shorter stroke or a shorter lowercase.)
3. Just encode a small Latin Theta and make it an alternate lowercase to 019F: Such a solution is my least preferred one, since it has the same downsides as just encoding a new Latin pair.
4. Bite the bullet and encode the new pair: It wouldn't be the first time confusable characters are disunified due to problematic casing relations.
All other letters in the document are acceptable, but I would rename the first pair as LATIN CAPITAL/SMALL LETTER REVERSED GLOTTAL STOP. They should be disunified on the same basis as the regular glottal stop pair.
Date/Time: Tue Nov 16 12:31:51 CST 2021
Name: Jack Varanelli
Report Type: Error Report
Opt Subject: Unicode request for legacy Malayalam
To whom it may concern: I am a student with no real position in Unicode. However, I noticed that U+0272 and the character proposed for U+1DF27 in this document [https://www.unicode.org/L2/L2021/21156-legacy-malayalam.pdf] have the same name (LATIN SMALL LETTER N WITH LEFT HOOK). The proposed character has been added to the Unicode Pipeline, so I am left to assume its inclusion is planned. Knowing this may cause confusion, I'd advise a name change to the proposed character, if possible. Apologies if this was intentional and an oversight on my part. Sincerely, Jack Varanelli
Date/Time: Sat Sep 18 15:50:43 CDT 2021
Name: David Corbett
Report Type: Error Report
Opt Subject: Mistake about U+0953 and U+0954
Chapter 12 says “Because U+0953 and U+0954 are not intended to be used with the Devanagari script, they have no explicit property values for Indic_Positional_Category and Indic_Syllabic_Category”, but that is not true. They both still have the explicit Indic_Positional_Category value of Top.
Date/Time: Fri Oct 15 17:59:27 CDT 2021
Name: Yannick Duchêne
Report Type: Error Report
Opt Subject: UAX29
Referring to version 13, unless I’m wrong, the sample at line #1725 of WordBreakTest.txt exposes a case of a grapheme being broken. The test sequence is:
ALetter RI ZWJ RI RI ALetter
As graphemes, I believe it is:
(ALetter) (RI ZWJ) (RI RI) (ALetter)
But the sample says, as words, it is:
(ALetter) (RI ZWJ RI) (RI) (ALetter)
The third grapheme, (RI RI), is broken in two parts: its first RI goes to one word and its second RI to another word. The comment is correct about the rule applied, so maybe this is an unintended effect of the rules for word boundaries or for grapheme boundaries in UAX #29. It may not be intended, since §6 says “The other default boundary specifications never break within grapheme clusters”.
Date/Time: Mon Oct 25 15:20:07 CDT 2021
Name: Gary Wade
Report Type: Other Document Submission
Opt Subject:
Originally submitted against CLDR at https://unicode-org.atlassian.net/browse/CLDR-15118
If there is a better avenue, please provide a direct link as I saw no other appropriate place to do so.
Persian digits (U+06F0-U+06F9) are not considered Arabic Numbers in UnicodeData.txt
Values for U+06F0 to U+06F9 are considered to be European Numbers rather than Arabic Numbers, and so based on a bidi property lookup, these are not considered to be "RTL-weak" (for lack of a better phrase) like values U+0660 to U+0669, and so some algorithms will always consider them to be "LTR-weak".
06F0;EXTENDED ARABIC-INDIC DIGIT ZERO;Nd;0;EN;;0;0;0;N;EASTERN ARABIC-INDIC DIGIT ZERO;;;;
06F1;EXTENDED ARABIC-INDIC DIGIT ONE;Nd;0;EN;;1;1;1;N;EASTERN ARABIC-INDIC DIGIT ONE;;;;
06F2;EXTENDED ARABIC-INDIC DIGIT TWO;Nd;0;EN;;2;2;2;N;EASTERN ARABIC-INDIC DIGIT TWO;;;;
06F3;EXTENDED ARABIC-INDIC DIGIT THREE;Nd;0;EN;;3;3;3;N;EASTERN ARABIC-INDIC DIGIT THREE;;;;
06F4;EXTENDED ARABIC-INDIC DIGIT FOUR;Nd;0;EN;;4;4;4;N;EASTERN ARABIC-INDIC DIGIT FOUR;;;;
06F5;EXTENDED ARABIC-INDIC DIGIT FIVE;Nd;0;EN;;5;5;5;N;EASTERN ARABIC-INDIC DIGIT FIVE;;;;
06F6;EXTENDED ARABIC-INDIC DIGIT SIX;Nd;0;EN;;6;6;6;N;EASTERN ARABIC-INDIC DIGIT SIX;;;;
06F7;EXTENDED ARABIC-INDIC DIGIT SEVEN;Nd;0;EN;;7;7;7;N;EASTERN ARABIC-INDIC DIGIT SEVEN;;;;
06F8;EXTENDED ARABIC-INDIC DIGIT EIGHT;Nd;0;EN;;8;8;8;N;EASTERN ARABIC-INDIC DIGIT EIGHT;;;;
06F9;EXTENDED ARABIC-INDIC DIGIT NINE;Nd;0;EN;;9;9;9;N;EASTERN ARABIC-INDIC DIGIT NINE;;;;
It was noted that these digits are not considered Arabic digits, but since their names literally have the word "Arabic" in them, this seems incorrect; consider also by that same logic the HANIFI ROHINGYA DIGIT and RUMI digits which are considered in this class.
Date/Time: Tue Oct 26 14:05:04 CDT 2021
Name: Gary L. Wade
Report Type: Error Report
Opt Subject: UnicodeData.txt
Values for U+06F0 to U+06F9 are considered to be European Numbers (EN) rather than Arabic Numbers (AN) for the bidi class, and so based on a bidi property lookup, these are not considered to be "RTL-weak" (for lack of a better phrase) like values U+0660 to U+0669, and so some algorithms will always consider them to be "LTR-weak". Since these digits are used in Persian, which is an RTL language, these should also have the bidi class of AN just like the HANIFI ROHINGYA DIGIT and RUMI digits. To see the difference between how these digits are laid out unexpectedly, Apple's TextEdit app on the Mac running under US English can be used to enter these with the appropriate Arabic and Persian keyboards on separate lines with a space between each digit:
1. Launch TextEdit on macOS under US English locale
2. Choose Arabic keyboard
3. Type each digit with a space between each (1, space, 2, space, etc.); notice the RTL direction to lay out the text
4. Press the return key to enter a new line
5. Choose Persian keyboard
6. Type each digit with a space between each; notice the LTR direction is used to lay out the text
This software and much more expect to use the properties in UnicodeData.txt for the bidi algorithm, and adding an override in each app to make Persian digits RTL goes against its purpose.
06F0;EXTENDED ARABIC-INDIC DIGIT ZERO;Nd;0;EN;;0;0;0;N;EASTERN ARABIC-INDIC DIGIT ZERO;;;;
06F1;EXTENDED ARABIC-INDIC DIGIT ONE;Nd;0;EN;;1;1;1;N;EASTERN ARABIC-INDIC DIGIT ONE;;;;
06F2;EXTENDED ARABIC-INDIC DIGIT TWO;Nd;0;EN;;2;2;2;N;EASTERN ARABIC-INDIC DIGIT TWO;;;;
06F3;EXTENDED ARABIC-INDIC DIGIT THREE;Nd;0;EN;;3;3;3;N;EASTERN ARABIC-INDIC DIGIT THREE;;;;
06F4;EXTENDED ARABIC-INDIC DIGIT FOUR;Nd;0;EN;;4;4;4;N;EASTERN ARABIC-INDIC DIGIT FOUR;;;;
06F5;EXTENDED ARABIC-INDIC DIGIT FIVE;Nd;0;EN;;5;5;5;N;EASTERN ARABIC-INDIC DIGIT FIVE;;;;
06F6;EXTENDED ARABIC-INDIC DIGIT SIX;Nd;0;EN;;6;6;6;N;EASTERN ARABIC-INDIC DIGIT SIX;;;;
06F7;EXTENDED ARABIC-INDIC DIGIT SEVEN;Nd;0;EN;;7;7;7;N;EASTERN ARABIC-INDIC DIGIT SEVEN;;;;
06F8;EXTENDED ARABIC-INDIC DIGIT EIGHT;Nd;0;EN;;8;8;8;N;EASTERN ARABIC-INDIC DIGIT EIGHT;;;;
06F9;EXTENDED ARABIC-INDIC DIGIT NINE;Nd;0;EN;;9;9;9;N;EASTERN ARABIC-INDIC DIGIT NINE;;;;
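The property difference described in this report can be checked without any UI by looking up the Bidi_Class values directly. The following is a small sketch using Python's standard `unicodedata` module, which reflects whatever UCD version ships with the interpreter; the helper name `bidi_classes` is hypothetical.

```python
import unicodedata


def bidi_classes(first, last):
    """Return the set of Bidi_Class values for code points first..last inclusive."""
    return {unicodedata.bidirectional(chr(cp)) for cp in range(first, last + 1)}


# ARABIC-INDIC DIGIT ZERO..NINE (used with the Arabic keyboard)
arabic_indic = bidi_classes(0x0660, 0x0669)

# EXTENDED ARABIC-INDIC DIGIT ZERO..NINE (used with the Persian keyboard)
extended = bidi_classes(0x06F0, 0x06F9)

print("U+0660..U+0669:", arabic_indic)
print("U+06F0..U+06F9:", extended)
```

The first range reports AN and the second reports EN, matching the field values in the UnicodeData.txt lines quoted above.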
Date/Time: Thu Nov 25 10:22:52 CST 2021
Name: Giacomo Catenazzi
Report Type: Error Report
Opt Subject: NameAliases.txt
The C0 chart (https://www.unicode.org/charts/PDF/U0000.pdf) uses the abbreviation EM for 0x19, but in NameAliases.txt only EOM is listed as an abbreviation. Because EM is used in various ISO standards (and ANSI, and ECMA, e.g. ECMA-48, and the C0 table is also linked in ECMA-6 [ISO 646]), I think NameAliases.txt should also include a third line:
0019;END OF MEDIUM;control
0019;EOM;abbreviation
0019;EM;abbreviation <- NEW LINE HERE
Note: the name 'EM' seems available in Unicode. BTW it seems EOM was previously used in the first version of ASCII as the abbreviation of 0x03 instead of ETX (as end of message) (according to Wikipedia and the scanned version). EM will just avoid confusion, and it is more widely used (you use it on the C0 chart).
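To illustrate how the proposed line would sit alongside the existing ones, here is a minimal parsing sketch. The sample text includes the proposed `EM` line, which is an assumption of this example and not part of the published NameAliases.txt; the names `SAMPLE` and `parse_name_aliases` are hypothetical.

```python
from collections import defaultdict

# Two existing lines plus the line proposed in this report (the last one).
SAMPLE = """\
0019;END OF MEDIUM;control
0019;EOM;abbreviation
0019;EM;abbreviation
"""


def parse_name_aliases(text):
    """Map each code point to its list of (alias, type) pairs."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_point, alias, alias_type = line.split(";")
        aliases[int(code_point, 16)].append((alias, alias_type))
    return dict(aliases)


for alias, alias_type in parse_name_aliases(SAMPLE)[0x19]:
    print(f"U+0019 {alias_type}: {alias}")
```

A code point may carry several aliases of the same type, so adding a second abbreviation line would not change the shape of the parsed data.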
Date/Time: Wed Dec 1 10:57:02 CST 2021
Name: J. S. Choi
Report Type: Error Report
Opt Subject: UAX44-LM2 medial-hyphen clarification
The UAX44-LM2 rule defines “medial hyphen” as a “hyphen occurring immediately between two letters”; however, it does not clarify whether a medial hyphen also may be between a letter and a numeral. For example, if the answer is yes, then “VARIATION SELECTOR 15” and “VARIATION_SELECTOR_15” would match “VARIATION SELECTOR-15”, and if the answer is no, then they would not match.
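The two readings can be compared with a minimal sketch. It only approximates UAX44-LM2: it does not handle the special case for U+1180 HANGUL JUNGSEONG O-E, and the function `loose_key` and its `digits_count_as_letters` flag are hypothetical names introduced here to switch between the two interpretations.

```python
import re


def loose_key(name: str, digits_count_as_letters: bool) -> str:
    """Approximate UAX44-LM2 key: drop medial hyphens, whitespace, underscores; fold case.

    If digits_count_as_letters is True, a hyphen between a letter and a digit
    is also treated as medial; otherwise only hyphens between two letters are.
    """
    cls = "A-Za-z0-9" if digits_count_as_letters else "A-Za-z"
    # Remove medial hyphens first, while the surrounding spaces are still
    # present, so that a hyphen after a space (as in "LETTER -A") is kept.
    key = re.sub(rf"(?<=[{cls}])-(?=[{cls}])", "", name)
    key = re.sub(r"[\s_]", "", key)
    return key.upper()


formal = "VARIATION SELECTOR-15"
for candidate in ("VARIATION SELECTOR 15", "VARIATION_SELECTOR_15"):
    for flag in (True, False):
        match = loose_key(candidate, flag) == loose_key(formal, flag)
        print(f"{candidate!r} digits_count_as_letters={flag}: match={match}")
```

With the flag set, both candidates match the formal name; without it, the hyphen before "15" is preserved in the key of the formal name and neither candidate matches.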
Date/Time: Thu Oct 14 10:43:00 CDT 2021
Name: John B
Report Type: Other Question
Opt Subject: New unicode character unclear?
Hello, At the new emoji site, there is one listed: https://unicode.org/emoji/charts-14.0/emoji-released.html #16 is 1FAF1 200D 1FAF2 (left hand, zero width joiner, right hand) -- I'd love to know what this final single emoji is, or if there is any information available about that. Is that a bug?
Date/Time: Mon Nov 8 15:04:37 CST 2021
Name: David Corbett
Report Type: Error Report
Opt Subject: UTS #51
What happens when an emoji_zwj_sequence overlaps a text_presentation_sequence? It is not clear what to do when a text presentation selector appears at the end of an emoji zwj sequence. For example, how should <U+1F408, U+200D, U+2B1B, U+FE0E> be rendered?
• The same as <U+1F408, U+200D, U+2B1B>
• The same as <U+1F408, U+FE0E, U+2B1B, U+FE0E>
• The same as <U+1F408, U+2B1B, U+FE0E>
UTS #51 says “A text presentation selector breaks an emoji zwj sequence, preventing characters on either side from displaying as a single image. The two partial sequences should be displayed as separate images, each with presentation style as specified by any presentation selectors present, or by default style for those emoji that do not have any variation selectors.” Taken literally, that means <U+1F408, U+200D, U+2B1B, U+FE0E> is split into two sequences, <U+1F408, U+200D, U+2B1B> and an empty sequence, so the whole thing should be rendered the same as <U+1F408, U+200D, U+2B1B>. That is probably not what was intended.
Date/Time: Sat Sep 18 10:48:29 CDT 2021
Contact: noneed (at) example.com
Name: Jackie
Report Type: Error Report
Opt Subject:
Note: Fake return address was supplied, so cannot contact submitter.
Hi again, The code charts ( https://www.unicode.org/charts/ ) each should include a standard key to the symbols used (e.g., →, ~, ※, etc.). Nothing I see on the code chart PDFs defines these symbols or even links to a definition of them. I looked around and found ( https://www.unicode.org/charts/About.html#Conventions ), but I usually access the code charts from pages that contain no link to that page, and some are saved locally. Thank you!
Date/Time: Sat Sep 18 15:44:27 CDT 2021
Name: David Corbett
Report Type: Error Report
Opt Subject: Mistakes in definition D56
Definition D56 in chapter 3 says “Combining character sequences involving a variation selector (which is both default_ignorable and a combining mark), consist of only the base character followed by a single variation selector”, but that is not true. U+1031 MYANMAR VOWEL SIGN E is not a base character, but it does have a defined variation sequence. Also, you could have a sequence like <U+0030, U+FE0F, U+20E3>, which does not consist of *only* the base character followed by a single variation selector: it consists of the base, the variation selector, and another mark.
Date/Time: Mon Sep 27 18:48:22 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: On the Tiddu mark and Virama+Repha of Tulu-Tigalari
This is a response to the following document: https://www.unicode.org/L2/L2021/21210-tulu-tigalari.pdf
On page 41, section 8.2 explains the function of the mark and even compares it to a "caret". Currently the dotted circle in the representative glyph suggests this is a combining sign, but it is my opinion that this should be treated similarly to the caret: a zero-advance graphical indicator. This is because the sign is meant to be an after-the-fact addition to the text, which means it should not affect the original spacing of the text at all; this includes vowel signs that apply below the base. If the current model is used, the rendering of the script would become more complicated than it already is. This change would also make it easier to display it in more situations, like after whitespace or non-letters. Its general category would be 'Po' and its CCC would be 0. This change of properties would also disambiguate it from other characters like 208A ₊ SUBSCRIPT PLUS SIGN and 031F ◌̟ COMBINING PLUS SIGN BELOW.
I would also like to suggest encoding one more character, to reproduce the behavior on page 34, where the Virama and the Repha can fuse despite not being adjacent in the sequence. Instead, I propose encoding another character called TULU-TIGALARI VIRAMA WITH REPHA. This would reduce the complexity necessary to input this character. It can have the same properties as the Virama and be placed at 113DE, so no characters need to be shifted from their current positions.
Date/Time: Fri Oct 1 14:05:49 CDT 2021
Name: David McCreedy
Report Type: Error Report
Opt Subject: The Unicode Standard, Version 14.0.0
FYI: Section 15.15 of The Unicode Standard still lists the old Ahom block range end (Ahom: U+11700–U+1173F) instead of the 14.0 updated range end (U+1174F) at https://www.unicode.org/versions/Unicode14.0.0/ch15.pdf#G95570. Refer to the "11700..1174F; Ahom" line in http://www.unicode.org/Public/UNIDATA/Blocks.txt for confirmation. Thanks.
Date/Time: Fri Oct 1 16:13:29 CDT 2021
Name: Peter Constable
Report Type: Error Report
Opt Subject: Kayah Li code chart / NamesList.txt
Note: This has already been taken into account in the Unicode 15.0 nameslist draft.
In the Kayah Li names list, the following vowel letters are listed under the subhead "Consonants":
A922 ꤢ KAYAH LI LETTER A
A923 ꤣ KAYAH LI LETTER OE
A924 ꤤ KAYAH LI LETTER I
A925 ꤥ KAYAH LI LETTER OO
In NamesList.txt, the @Vowels subhead follows A925, but should be moved up to follow A921.
Date/Time: Wed Oct 6 14:29:53 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: Pending errata notices
This is a reminder that certain glyph corrections lack an errata notice, despite being recommended by the Script Ad-Hoc. Only the first document precedes UTC #169. My intention is to avoid the accidental omission of these by having them documented together for reference.
Canadian Syllabics: https://www.unicode.org/L2/L2021/21141-ucas-revisions.pdf (limited to the 3 yellow highlighted characters)
Old Turkic: https://www.unicode.org/L2/L2021/21153-n5163-old-turkic-glyph.pdf
Khitan Small Script: https://www.unicode.org/L2/L2021/21182-khitan-mods.pdf
Sundanese: https://www.unicode.org/L2/L2021/21221-three-sundanese-chars.pdf
Date/Time: Sat Nov 6 14:45:05 CDT 2021
Name: Jens Maurer
Report Type: Error Report
Opt Subject: NamesList.txt
https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt says, in particular,
# Note that no formal name alias for the ISO 6429 "BELL" is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.
0007;ALERT;control
0007;BEL;abbreviation
Yet, https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt says
0007 <control>
= BELL
which (according to section 24.1 of the Unicode standard) introduces the normative alias BELL. However, that is not desired according to the comment in NameAliases.txt.
Date/Time: Sat Nov 6 14:50:58 CDT 2021
Name: Jens Maurer
Report Type: Error Report
Opt Subject: NamesList.txt
https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt says
000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control
meaning that all three aliases are intended to be normative aliases per section 4.8 of the Unicode standard. However, https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt says
000A <control>
= LINE FEED (LF)
= new line (NL)
= end of line (EOL)
meaning that "new line" and "end of line" are not presented as normative aliases in CodeCharts.pdf (because they are not uppercase). (The same situation appears for other control characters that have more than one alias.)
Date/Time: Mon Nov 8 11:00:48 CST 2021
Name: Peter Constable
Report Type: Error Report
Opt Subject: UAX #44
In 5.2, the description for Extended_Pictographic says, "Note: This property is used in the regex definitions for the Default Grapheme Cluster Boundary Specification in UAX #29, Unicode Text Segmentation [UAX29], as well as for the definition ED-4 in UTS #51, Unicode Emoji [UTS51]." It fails to mention use in LB30b that was added to UAX #14 in Unicode 14.
Date/Time: Tue Nov 23 16:53:54 CST 2021
Name: Jonathan Yavner
Report Type: Error Report
Opt Subject: UAX #14
"If U+2061 CAUTION SIGN had been used, which also looks like an exclamation point inside a triangle, ..." But U+2061 is actually "FUNCTION APPLICATION", which has no appearance. The text should read "U+2621 CAUTION SIGN". This error was introduced in version 19 (dated 2006-08-22) and has lain there in plain sight ever since.
Date/Time: Wed Nov 24 14:16:50 CST 2021
Name: Petr Viktorin
Report Type: Error Report (UTR #39)
Opt Subject:
Section 4, Confusable Detection in UTR#39 refers to Section 2.9.1, Backward Compatibility in Unicode Technical Report #36. The correct section number for "Backward Compatibility" is 2.10.1. See:
https://www.unicode.org/reports/tr39/#Confusable_Detection
https://www.unicode.org/reports/tr36/#Backwards_Compatibility
Similar errors appear in 5.2 Restriction-Level Detection, 6 Development Process, 6 Development Process, and 3.1 General Security Profile for Identifiers of UTR#39.
Date/Time: Sun Dec 5 00:48:41 CST 2021
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: Core specification
The introduction of chapter 16 of the Unicode Standard, “Southeast Asia” states “The scripts of Southeast Asia are written from left to right.” This statement is not correct for all scripts of Southeast Asia; Hanifi Rohingya is written from right to left.
Date/Time: Sun Jan 2 06:27:41 CST 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UAX #14 and UAX#44
In the Unicode Standard Annex #14, it is said about hyphenation that in “German and Swedish, a consonant is sometimes doubled”. I suggest changing “German” to “pre-reform German orthography” because nowadays no consonant is struck out compared to the hyphenated form, e.g., “Schifffahrt” is written with three fs even when unhyphenated (pre-reform: “Schiffahrt”, hyphenated “Schiff- / fahrt”). Also, UAX #14 contains the doublings “the the” and “by by”. UAX #44 contains the mistakes “stabiity” (instead of “stability”), “inadvertant” (instead of “inadvertent”), “definining” (instead of “defining”), “discunifications” (instead of “disunifications”), “compatiblity” (instead of “compatibility”) and “"TU-" (kIRG_TSource0 prefix, or 'VU-" (kIRG_VSource0 pefix” (instead of “"TU-" (kIRG_TSource0) prefix, or "VU-" (kIRG_VSource0) prefix”).
Date/Time: Thu Jan 6 12:26:50 CST 2022
Name: John Hudson
Report Type: Error Report
Opt Subject:
Page 488 in the Bengali section of chapter 12 (South and Central Asia-I) of TUS discusses Jihvamuliya and Upadhmaniya in ligatures with following consonant letters, hopefully making it clear to shaping engine implementers that these character sequences should be treated as clusters for shaping purposes. A similar discussion with examples is missing from the Devanagari section of the same chapter. The Devanagari and Bengali handling of Jihvamuliya and Upadhmaniya are graphically distinct but functionally identical, and this should be reflected in parallel discussions, perhaps with added explicit statements that these sequences should be processed as clusters.
Date/Time: Fri Jan 7 16:22:34 CST 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UAX #42
UAX #42 contains the following mistakes: “the the” (instead of “the”), “intented” (instead of “intended”), “inheritence” (instead of “inheritance”), “accross” (instead of “across”), “attribues” (instead of “attributes”), “representedy” (instead of “represented”).
Date/Time: Sun Jan 16 10:47:31 CST 2022
Contact: ivanpan3@gmail.com
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UTS #51
UTS #51 contains the following mistakes: "a emoji" (instead of "an emoji"), "existing existing" (instead of "existing"), "color which is" (instead of "color is"), "should taken" (instead of "should be taken"), "is all a perfectly legitimate" (instead of "is all perfectly legitimate"), "user‘s" (instead of "user’s", note the apostrophe), "any any" (instead of "any"), "“us’" (instead of "“us”"), "”demon“" (instead of "“demon”", note the quotation marks). In some occurrences of "[CLDR]", the closing bracket is part of the link text.
Date/Time: Mon Jan 17 20:02:43 CST 2022
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: ScriptExtensions.txt
The proposal for the Tai Le script, L2/01-369, describes the use of five “existing nonspacing diacritics in the UCS” as tone marks in an older orthography of the script. Apparently this refers to the following characters from the Combining Diacritical Marks block:
U+0300 COMBINING GRAVE ACCENT
U+0301 COMBINING ACUTE ACCENT
U+0307 COMBINING DOT ABOVE
U+0308 COMBINING DIAERESIS
U+030C COMBINING CARON
The Script_Extensions property values of these characters in Unicode 14.0 do not indicate their use in the Tai Le script. They should.