The sections below contain links to permanent feedback documents for the open Public Review Issues, as well as other public feedback received as of July 11, 2022, since the previous cumulative document was issued prior to UTC #171 (April 2022).
The links below go directly to open PRIs and to feedback documents for them, as of July 11, 2022.
The links below go to locations in this document for feedback.
Feedback routed to CJK & Unihan Group for evaluation [CJK]
Feedback routed to Script ad hoc for evaluation [SAH]
Feedback routed to Properties & Algorithms Group for evaluation [PAG]
Feedback routed to Emoji SC for evaluation [ESC]
Feedback routed to Editorial Committee for evaluation [EDC]
Other Reports
Date/Time: Wed May 4 02:13:35 CDT 2022
Name: Jaemin Chung
Report Type: Error Report
Opt Subject: Radical-stroke value for U+2C4F8
The radical-stroke value for U+2C4F8 𬓸 should be changed from the current 115.10 (radical 禾) to 202.3 (radical 黍). cf. U+4D58 䵘 202.9
Date/Time: Wed May 4 16:30:12 CDT 2022
Name: Lee Collins
Report Type: Error Report
Opt Subject: Unihan_Readings.txt
Note: This issue was resolved during UTC #170.
U+7550 kDefinition is "to fill; a foll of cloth". I cannot find a word "foll" in this sense in the English dictionaries I looked at. Perhaps it is an older usage. Or, maybe it is a typo for "roll". Kangxi says that U+7550 is the same as U+5E45 幅 and defines it as "布帛廣也". Perhaps "width of cloth" is a better definition.
Date/Time: Wed Apr 13 16:38:50 CDT 2022
Name: Asmus
Report Type: Error Report
Opt Subject: TUS Chapter 14, section Phags-Pa
(1) I stumbled over a bit of editorial convention that, while correct, led me astray. (2) There also looks to be a loosely worded passage that is not actually correct.

(1) When I just now opened the section at random, it took me a while to mentally switch gears and realize that "letter o" in the passage quoted below was the Phags-pa letter. The conventions are all clear, if you know them, but 'o' unfortunately gives no internal hint that it is derived from a transcription. I wish there were something unobtrusive to help guide the reader. (It didn't help that I had "letter o", the Latin one, on my mind from some other project.) Perhaps add the script name here, even if redundant?
---
The invisible format characters U+200D ZERO WIDTH JOINER (ZWJ) and U+200C ZERO WIDTH NON-JOINER (ZWNJ) may be used to override the expected shaping behavior, in the same way that they do for Mongolian and other scripts (see Chapter 23, Special Areas and Format Characters). For example, ZWJ may be used to select the initial, medial, or final form of a letter in isolation:
<U+200D, U+A861, U+200D> selects the medial form of the letter o
<U+200D, U+A861> selects the final form of the letter o
<U+A861, U+200D> selects the initial form of the letter o
---
(2) More importantly, something seems to be misstated here: "Conversely, ZWNJ may be used to inhibit expected shaping. For example, the sequence <U+A85E, U+200C, U+A85F, U+200C, U+A860, U+200C, U+A861> selects the isolate forms of the letters i, u, e, and o." It should be the case that the isolate forms of 'i' and 'o' in this example are selected only if they do not join with surrounding characters across the boundaries of the sequence. (There is nothing in the definition of a sequence that prevents it from being embedded in other text.) (I can't be sure, but from the table it looks like all the vowels are dual-joining.) The text seems to carry an implicit assumption that the sequence is standalone.
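The three joiner sequences in the quoted passage can be built and inspected mechanically; a minimal Python sketch, using only the code points cited in the report:

```python
# Build the ZWJ sequences from the quoted TUS passage that select
# positional forms of U+A861 PHAGS-PA LETTER O.
ZWJ = "\u200D"   # U+200D ZERO WIDTH JOINER
O   = "\uA861"   # U+A861 PHAGS-PA LETTER O

medial  = ZWJ + O + ZWJ   # <U+200D, U+A861, U+200D> -> medial form
final   = ZWJ + O         # <U+200D, U+A861>         -> final form
initial = O + ZWJ         # <U+A861, U+200D>         -> initial form

for label, seq in [("medial", medial), ("final", final), ("initial", initial)]:
    print(label, [f"U+{ord(c):04X}" for c in seq])
```

As the report notes, these selections only hold when the sequence is not embedded in further joining context, which is exactly the implicit assumption being questioned.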
Date/Time: Sat Apr 16 08:57:23 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Chorasmian number seven
How should the Chorasmian number seven on page 47 of L2/18-164R2 be encoded? There is no obvious gap or longer stroke. It is therefore not clear how to use U+10FC5..U+10FC8 to represent it, or even whether it can be encoded in Unicode.
Date/Time: Fri Apr 22 20:34:20 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Dives Akuru line breaking
This is feedback on L2/22-080R. Another script with line breaks between orthographic syllables is Dives Akuru. L2/18-016R “Proposal to encode Dives Akuru in Unicode” says “A word may be broken along orthographic syllables at any position at the end of a line.” U+1193F and U+11941 would get lb=AP, the letters including the independent vowels would get lb=AK, and U+1193E would get lb=VI.
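The intent of the proposed property assignments can be illustrated with a toy model. The class names lb=AP, lb=AK, and lb=VI are taken from the feedback above; the break rule below is a deliberate simplification for illustration, not the actual UAX #14 rule set:

```python
# Toy model of line breaking between orthographic syllables.
# Simplifying assumptions: no break after an aksara prefix (AP),
# no break before a virama (VI); otherwise a break is allowed.
def break_opportunities(classes):
    """Return indices i where a break is allowed between
    classes[i] and classes[i+1] under this simplified rule."""
    return [
        i
        for i in range(len(classes) - 1)
        if classes[i] != "AP" and classes[i + 1] != "VI"
    ]

# AK VI AK AP AK: breaks fall only at syllable boundaries, i.e.
# after the virama-closed syllable and before the AP-led syllable,
# never before the VI or after the AP.
print(break_opportunities(["AK", "VI", "AK", "AP", "AK"]))  # [1, 2]
```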
Date/Time: Tue Apr 26 12:15:07 CDT 2022
Name: Sławomir Osipiuk
Report Type: Other Document Submission
Opt Subject: Feedback on L2/22-092 (Proposal to add the currency sign for the POLISH ZŁOTY to the UCS)
I would like to offer additional information which may be of interest to the submitter of L2/22-092. The original proposal omits, to its detriment, the fact that the single-character złoty symbol is also present in the 7-bit character set specified by Polish national standard BN-74/3101-01. As a national standard, this may carry more persuasive power for inclusion of this character, and the submitter may want to amend the proposal to include this information. Additionally of potential interest: BN-74/3101-01, being a national version of the 7-bit character set conforming to ISO 646, would seem a natural addition to the ISO International Register of Coded Character Sets per ISO 2022 and ISO 2375 (currently managed by the ITSCJ: https://www.itscj-ipsj.jp/english.html). However, BN-74/3101-01 was never added to the Register, for reasons I am not aware of (and the Register itself has not seen any additions since 2004). If this character set had been added in the past, then inclusion of the złoty symbol in Unicode/ISO 10646 would have been very likely.
Date/Time: Wed Apr 27 22:09:17 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Unclear phrasing re complex quadrats
Section 11.4 says “Sometimes a portion of a graphically complex quadrat could be identified as an atomically encoded character. However, in cases where the use of that atomically encoded character as a component of a quadrat sequence would cause ambiguities or uneven distribution in the structure, then a sequence of simpler hieroglyphs should be used instead, with the appropriate joining controls.” This implies that there exist four contexts for an atomically encoded character. 1. Not in a quadrat sequence 2. Causing ambiguities in a quadrat sequence 3. Causing uneven distribution in a quadrat sequence 4. In a quadrat sequence without any problems Does the fourth context really exist? Does it ever make sense to put an atomically encoded character in a quadrat sequence? I don’t think so: I think the quoted passage means that atomically encoded characters in quadrat sequences should always be avoided, because they are always either ambiguous or uneven. However, that is not actually what it says. That sentence should be reworded to something stronger: changing “in cases where” to “because” would fix it. Alternatively, if the fourth context does exist, it would be helpful for the standard to provide an example.
Date/Time: Wed May 18 02:41:15 CDT 2022
Name: Charlotte Buff
Report Type: Other Document Submission
Opt Subject: Issue with precomposed Todhri characters (L2/22-074)
The recently approved Todhri script (cf. L2/20-188r: Everson, „Proposal for encoding the Todhri script in the SMP of the UCS“) includes two letters that are formed from a base letter plus a dot diacritic: *U+105C9 TODHRI LETTER EI and *U+105E4 TODHRI LETTER U. Per consensus 171-C17, it was decided to encode these as precomposed characters with canonical decompositions featuring U+0307 COMBINING DOT ABOVE, as was suggested in L2/22-074 (Pournader, „Todhri encoding options“). However, this approach is not possible to implement as originally intended. According to section 5.1 of UAX #15, Unicode Normalization Forms: »A canonical decomposable character *must* be added to the list of post composition version exclusions when its decomposition mapping is defined to contain at least one character which was already encoded in an earlier version of the Unicode Standard.« Because COMBINING DOT ABOVE is already encoded, using it as the dot diacritic for Todhri would necessitate adding TODHRI LETTER EI and TODHRI LETTER U to the list of composition exclusions, meaning these two characters could never appear in normalised text. This would make their existence as precomposed characters rather superfluous.
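The practical effect described here can be observed with an already-encoded composition exclusion. For example, U+0958 DEVANAGARI LETTER QA has a canonical decomposition to U+0915 plus U+093C but is on the exclusion list, so it can never appear in normalised text; a short Python sketch using the standard unicodedata module:

```python
import unicodedata

# U+0958 DEVANAGARI LETTER QA is a composition exclusion: it
# canonically decomposes to <U+0915, U+093C> but is never produced
# by composition, so NFC text cannot contain it.
qa = "\u0958"
nfd = unicodedata.normalize("NFD", qa)
nfc = unicodedata.normalize("NFC", qa)

print([f"U+{ord(c):04X}" for c in nfd])  # ['U+0915', 'U+093C']
print(nfc == qa)                         # False: NFC keeps the decomposed pair
```

The precomposed Todhri letters would behave the same way, which is the superfluity the report points out.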
Date/Time: Sun Apr 17 12:41:19 CDT 2022
Name: Karl Williamson
Report Type: Other Document Submission
Opt Subject: NonBidiMirroring.txt
https://www.unicode.org/L2/L2022/22026-non-bidi-mirroring.pdf is a proposal from Kent Karlsson for the creation of this UCD file. I saw that a proposed response to it was that it was "speculative". I can tell you that Perl 5 has already had to work around the absence of such information in the UCD, and its presence would be helpful going forward. The issue for us is delimiters surrounding string-like constructs. These constructs include literal text and regular expression patterns, among others. Perl has long allowed one to use any of 4 pairs of delimiters for these, like qr(this is a pattern). The 4 sets are () <> {} []. These stem from before Unicode came along, and Unicode has since added hundreds of potential such delimiters. We've had longstanding requests to support these, and the next release of Perl will add many of them. It would have been better to have used this proposed file had it existed, and I did go looking for something suitable, to no avail. It would be better in the future to use this file, as it gets updated to correspond with new Unicode versions.
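Absent such a data file, implementations are left with heuristics. The sketch below illustrates one typical workaround in Python, pairing delimiters via General_Category (Ps/Pe) and a LEFT-to-RIGHT character-name substitution; this is exactly the kind of guesswork a dedicated UCD file would make unnecessary, not a description of Perl's actual implementation:

```python
import unicodedata

def closing_delimiter(ch):
    """Guess the closing partner of an opening punctuation character.
    Heuristic only: relies on gc=Ps plus a LEFT -> RIGHT swap in the
    character name, and returns None when no Pe partner is found."""
    if unicodedata.category(ch) != "Ps":
        return None
    name = unicodedata.name(ch, "")
    if "LEFT" not in name:
        return None
    try:
        partner = unicodedata.lookup(name.replace("LEFT", "RIGHT", 1))
    except KeyError:
        return None
    return partner if unicodedata.category(partner) == "Pe" else None

print(closing_delimiter("("))        # ')'
print(closing_delimiter("\u27E8"))   # '\u27E9' MATHEMATICAL RIGHT ANGLE BRACKET
```

The heuristic silently misses pairs whose names do not follow the LEFT/RIGHT pattern, which is one reason curated data would be preferable.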
Date/Time: Fri Apr 22 12:02:13 CDT 2022
Name: Tim Pederick
Report Type: Error Report
Opt Subject: tr15-51.html
UAX #15, §1.2 Normalization Forms, says of figures 3 to 6 that "[f]or consistency, all of these examples use Latin characters". This is not true of figure 3, in which the second example uses only the Greek characters U+2126 and U+03A9. (And to be pedantic, figure 5 has an example with only the Common characters U+0032, U+2075, and U+0035.) I don't propose replacing the examples with ones that do use Latin characters, but rather changing the note itself, or even removing it. I'm not really sure what is meant by "for consistency"; is it really "inconsistent" to use non-Latin examples? Is the intent of the note to head off complaints of Latin-script parochialism?
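The two non-Latin examples the report cites are easy to reproduce with Python's unicodedata module:

```python
import unicodedata

# Figure 3's second example: U+2126 OHM SIGN has a singleton canonical
# decomposition to U+03A9 GREEK CAPITAL LETTER OMEGA, so normalization
# replaces it -- a Greek-script example, not a Latin one.
print(unicodedata.normalize("NFC", "\u2126") == "\u03A9")  # True

# Figure 5's example built entirely from Common-script characters:
# "2\u20755" (2, SUPERSCRIPT FIVE, 5) compatibility-normalises to "255".
print(unicodedata.normalize("NFKC", "2\u20755") == "255")  # True
```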
Date/Time: Tue May 3 05:59:18 CDT 2022
Name: Henri Sivonen
Report Type: Error Report
Opt Subject: DUCET
https://github.com/unicode-org/cldr/blob/main/common/collation/he.xml has the following tailoring (apart from script reordering):
&[before 2]''<<׳ # GERESH just before APOSTROPHE (secondary difference)
&[before 2]'\"'<<״ # GERSHAYIM just before QUOTATION MARK (secondary difference)
The other Hebrew-script language in CLDR, Yiddish, has this same tailoring (and further tailorings): https://github.com/unicode-org/cldr/blob/main/common/collation/yi.xml
It seems generally unfortunate, both from the user perspective and from the binary-size perspective of shipping an implementation, when a language requires a tailoring even though that tailoring doesn't collide with the needs of other languages in CLDR. By hoisting this tailoring into DUCET, Hebrew could use the root collation with script reordering, like, for example, Greek and Georgian. The handling of й/Й in the Cyrillic script in DUCET looks like precedent for hoisting into DUCET collation complexity shared by merely the majority (not even all) of a script's languages; in this case, the tailoring applies to both languages that use the script. (I'm filing this about DUCET as opposed to CLDR root because CLDR root seeks to minimize differences from DUCET.)
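The ordering the tailoring requests can be illustrated with a toy two-level sort key. This sketch is my own simplification of the behaviour, not CLDR or ICU collation:

```python
# Toy illustration: U+05F3 HEBREW PUNCTUATION GERESH should sort
# immediately before U+0027 APOSTROPHE, differing only at the
# secondary level. Primary weight: the character (geresh mapped onto
# apostrophe); secondary weight: 0 for geresh, 1 for apostrophe.
SECONDARY = {"\u05F3": ("'", 0), "'": ("'", 1)}

def toy_key(s):
    """Two-level key: every other character gets secondary weight 0."""
    return [SECONDARY.get(c, (c, 0)) for c in s]

print(sorted(["'", "\u05F3", "a"], key=toy_key))
```

A real implementation would get this ordering from DUCET (or a tailoring) rather than a hand-built table, which is the point of the request.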
Date/Time: Tue May 3 06:00:27 CDT 2022
Name: Henri Sivonen
Report Type: Error Report
Opt Subject: DUCET
https://github.com/unicode-org/cldr/blob/main/common/collation/hy.xml has the following tailoring (apart from script reordering):
&ք<և<<<Եւ
There are no other Armenian-script languages in CLDR. It seems generally unfortunate, both from the user perspective and from the binary-size perspective of shipping an implementation, when a language requires a tailoring even though that tailoring doesn't collide with the needs of other languages in CLDR. By hoisting this tailoring into DUCET, Armenian could use the root collation with script reordering, like, for example, Greek and Georgian. The handling of й/Й in the Cyrillic script in DUCET looks like precedent for hoisting into DUCET collation complexity shared by merely the majority (not even all) of a script's languages; in this case, the tailoring applies to the only language for the script. (I'm filing this about DUCET as opposed to CLDR root because CLDR root seeks to minimize differences from DUCET.)
Date/Time: Thu May 5 19:38:08 CDT 2022
Name: Karl Wagner
Report Type: Error Report
Opt Subject: UTS #46: UNICODE IDNA COMPATIBILITY PROCESSING
UTS #46 Version: 14.0.0 Date: 2021-08-24 Revision: 27 URL: https://www.unicode.org/reports/tr46/
---
I only just started writing my own implementation of this recently, so apologies if I'm misunderstanding, but there are two locations where code points are checked. Using the same format as the IdnaTestV2.txt file for describing those locations, they would be P1 and V6 ("Processing" step 1, and "Validation" step 6).
- P1 is applied to the entire domain, as given. So it may see (decoded) Unicode text, or Punycode. It takes the value of UseSTD3ASCIIRules into account, so a domain like "≠ᢙ≯.com" triggers the error at P1 only if UseSTD3ASCIIRules=true, because it contains a code point which STD3ASCIIRules disallows. "xn--jbf911clb.com" will never trigger the error at this location, regardless of UseSTD3ASCIIRules, because it is just ASCII and hasn't been decoded yet.
- V6 is applied to the result of Punycode-decoding a domain label, so it will only see decoded Unicode text. As written, it would appear **not** to take UseSTD3ASCIIRules into consideration, meaning that both (original inputs) "≠ᢙ≯.com" and "xn--jbf911clb.com" would trigger errors at this location, regardless of UseSTD3ASCIIRules.
Here is the text of Section 4.1, Validity Criteria ( https://www.unicode.org/reports/tr46/#Validity_Criteria ), Step 6:
> Each code point in the label must only have certain status values according to Section 5, IDNA Mapping Table:
> - For Transitional Processing, each value must be valid.
> - For Nontransitional Processing, each value must be either valid or deviation.
It is not clear whether these status values are supposed to take the value of UseSTD3ASCIIRules into account. As described above, if this step does not consider UseSTD3ASCIIRules, "≠ᢙ≯.com" and "xn--jbf911clb.com" will always be invalid domains.
This leads me to believe that it **should** respect UseSTD3ASCIIRules; otherwise the parameter would be meaningless, since it would not matter that P1 considers UseSTD3ASCIIRules when V6 catches the same code points later anyway. I'll have to apologise again because I am not very familiar with the codebases I am about to cite, but from what I can glean this is actually causing confusion in practice:
- The unicode-org implementation of IDNA does not appear to consider UseSTD3ASCIIRules here: https://github.com/unicode-org/unicodetools/blob/main/unicodetools/src/main/java/org/unicode/idna/Uts46.java#L610-L625
- This appears to be confirmed by the IdnaTestV2 file. For example, Version 14.0.0 (Date: 2021-08-17, 19:34:01 GMT), lines 571 and 573:
[571] xn--jbf911clb.xn----p9j493ivi4l; ≠ᢙ≯.솣-ᡴⴀ; [V6]; xn--jbf911clb.xn----p9j493ivi4l; ; ; # ≠ᢙ≯.솣-ᡴⴀ
[573] xn--jbf911clb.xn----6zg521d196p; ≠ᢙ≯.솣-ᡴႠ; [V6]; xn--jbf911clb.xn----6zg521d196p; ; ; # ≠ᢙ≯.솣-ᡴႠ
"V6" is not an optional validation step tied to any parameter; it does not appear to be something implementations can decide whether or not applies to them. It always applies, and these domains should always be considered invalid, IIUC, according to the tests.
- The JSDOM implementation does consider UseSTD3ASCIIRules and considers these to be valid domains: https://github.com/jsdom/tr46/blob/e937be8d9c04b7938707fc3701e50118b7c023a5/index.js#L100
- Browsers effectively do the same in URLs. Safari 15 and JSDOM both consider "http://≠ᢙ≯.com.xn--jbf911clb" to be a perfectly fine URL: https://jsdom.github.io/whatwg-url/#url=aHR0cDovL+KJoOGimeKJry5jb20ueG4tLWpiZjkxMWNsYg==&base=YWJvdXQ6Ymxhbms=
So I think it is worth adding an explicit mention of UseSTD3ASCIIRules and whether or not it applies to the mapping table lookup in step V6. Thanks, Karl
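The asymmetry described here hinges on V6 operating on Punycode-decoded labels while P1 sees the raw input. The decoding step itself can be shown with Python's built-in punycode codec; the labels are copied from the report, and this is only a sketch of the decode, not of full UTS #46 processing:

```python
# The A-label "xn--jbf911clb" from the report, minus its ACE prefix,
# Punycode-decodes to a non-ASCII label. V6 runs on this decoded form,
# which is why a pure-ASCII A-label that sails through P1 can still
# fail validation once its label is decoded.
label = "xn--jbf911clb"
decoded = label[len("xn--"):].encode("ascii").decode("punycode")
print(decoded)            # the non-ASCII label from the test file
print(decoded.isascii())  # False
```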
Date/Time: Tue May 31 21:17:28 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Unclear namespace in UTS #18
UTS #18 says “The namespace for the \p{name=...} syntax is the namespace for character names plus name aliases.” This could be misinterpreted to mean that that namespace excludes code point labels, even though code point labels are discussed earlier in that section. It would be clearer to say “The namespace for the \p{name=...} syntax is the Unicode namespace for character names”, using the term defined in UAX34-D3, which in its next version will mention code point labels.
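The distinction matters in practice. Python's unicodedata.lookup, for instance, resolves character names plus formal name aliases but not code point labels, which is one concrete reading of the namespace in question:

```python
import unicodedata

# Character names and formal name aliases share one namespace:
# U+FEFF's name is ZERO WIDTH NO-BREAK SPACE, and BYTE ORDER MARK
# is its formal alias; both resolve to the same character.
print(unicodedata.lookup("ZERO WIDTH NO-BREAK SPACE") == "\uFEFF")  # True
print(unicodedata.lookup("BYTE ORDER MARK") == "\uFEFF")            # True

# Code point labels such as control-0000 are not names or aliases;
# a resolver restricted to the name-plus-alias namespace rejects them.
try:
    unicodedata.lookup("control-0000")
except KeyError:
    print("code point labels fall outside this namespace")
```

Whether a regex engine's \p{name=...} namespace should additionally admit code point labels is exactly the ambiguity the report asks UTS #18 to resolve.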
(None at this time.)
Date/Time: Fri Apr 22 11:11:19 CDT 2022
Name: Tim Pederick
Report Type: Error Report
Opt Subject: UnicodeData.txt
U+33D7 SQUARE PH has a compatibility decomposition mapping of <U+0050 LATIN CAPITAL LETTER P, U+0048 LATIN CAPITAL LETTER H>. This character would appear to be intended to represent the pH measurement in chemistry, and as such the mapping should have had different letter case: <U+0070 LATIN SMALL LETTER P, U+0048 LATIN CAPITAL LETTER H>. The Strong Normalization Stability policy says that this cannot be changed, and perhaps it is sufficiently trivial to be beneath notice, but perhaps it could be documented?
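The mapping as encoded can be confirmed with Python's unicodedata module:

```python
import unicodedata

# U+33D7 SQUARE PH: the compatibility decomposition on file is
# <square> 0050 0048, i.e. capital "PH", not the chemist's "pH".
decomp = unicodedata.decomposition("\u33D7")
print(decomp)  # '<square> 0050 0048'

# Consequently NFKC folds it to capital "PH":
print(unicodedata.normalize("NFKC", "\u33D7"))  # 'PH'
```

Under the Normalization Stability policy this mapping is frozen, so documentation (e.g. an informative note) is indeed the only available remedy.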
Date/Time: Fri Apr 22 20:39:14 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Diaeresis on capital Armenian letters
Chapter 7 says “In Armenian dialect materials, U+0308 COMBINING DIAERESIS, appears over uppercase U+0531 ayb and lowercase U+0561 ayb, and lowercase U+0585 oh and U+0578 vo.” Because all caps is used in Armenian, it appears over uppercase U+0555 oh and U+0548 vo too. http://www.nayiri.com/imagedDictionaryBrowser.jsp?dictionaryId=101&dt=HY_HY&pageNumber=577 has an example with U+0548 in the second headword of the third column and an example with U+0555 in the fourth headword of the third column; the diacritic looks like U+030F but it’s probably just U+0308. Chapter 7 should say that U+0308 is used with all six of these bases. Also, the comma after “DIAERESIS” should be removed.
Date/Time: Wed Jul 6 10:01:07 CDT 2022
Name: Deborah Anderson
Report Type: Other Document Submission
Opt Subject: Sunuwar chart glyph error
Neil Patel noticed that the glyphs for 11BD2 SUNUWAR LETTER SHYELE and 11BDC SUNUWAR LETTER SHYER were swapped in the Sunuwar code chart (p. 14 of L2/21-157R). Cf. p. 7 of the proposal, where the glyphs are correct. The correct glyphs appear in Michel Suignard's ISO/IEC 10646 repertoire proposals post Amd1 (WG2 N5181). The UTC accepted Sunuwar based on L2/21-157R. I recommend the UTC go on record noting the error in the code chart in L2/21-157R, and noting that the correct glyphs appear in WG2 N5181.