The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of September 14, 2018, since the previous cumulative document was issued prior to UTC #156 (July 2018). Some items in the Table of Contents do not have feedback here.
The links below go directly to open PRIs and to feedback documents for them, as of September 14, 2018.
Issue  Name                                              Feedback Link
379    Draft UAX #44, Unicode Character Database         (feedback): No feedback at this time
378    Draft UTR #53, Unicode Arabic Mark Rendering      (feedback): No feedback at this time
The links below go to locations in this document for feedback.
Feedback to UTC / Encoding Proposals
Feedback on UTRs / UAXes
Error Reports
Other Reports
Note: The Feedback on Encoding Proposals section this time includes feedback on the following documents:
L2/15-121R2
L2/17-373R
L2/18-282
Date/Time: Wed Sep 5 13:33:05 CDT 2018
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
L2/18-282
Opt Subject: Encoding model for a newly proposed character
Document: L2/18-282 proposes a new character for the Adlam script, arguing that none of the currently encoded characters satisfies the needs of the Adlam script, which are:

* Does not induce word or line break opportunities.
* Has General_Category Lm.
* Has a straight (not hooked) glyph.
* Has right-to-left directionality.
* Has a "transparent" joining class.

I agree that these requirements are sufficient to merit the separate encoding of a new character. However, the currently proposed character has the Script property value Adlam, when it can be argued that other orthographies may require such a character in the future, similar to the way ARABIC TATWEEL was encoded only for the Arabic script but its use quickly expanded to other right-to-left joining scripts. As such, I suggest encoding this character as a generic character with the Common script property and the name RIGHT-TO-LEFT MODIFIER LETTER APOSTROPHE. To reflect its new nature it should be encoded at U+061D, the last unassigned slot in the Arabic block, with the annotation "used for Adlam", and a proper entry should be made for it in Script_Extensions.
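For reference, a minimal sketch of how the Script_Extensions data can be consulted, assuming a local copy of ScriptExtensions.txt from the UCD; the parsing follows the published file format. The entry for U+0640 ARABIC TATWEEL, the precedent cited above, illustrates how later cross-script use is recorded.

    from collections import defaultdict

    def load_script_extensions(path="ScriptExtensions.txt"):
        """Parse ScriptExtensions.txt into {code point: set of script codes}."""
        scx = defaultdict(set)
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()  # drop comments
                if not line:
                    continue
                cps, scripts = (field.strip() for field in line.split(";"))
                lo, _, hi = cps.partition("..")
                for cp in range(int(lo, 16), int(hi or lo, 16) + 1):
                    scx[cp].update(scripts.split())
        return scx

    scx = load_script_extensions()
    # U+0640 ARABIC TATWEEL: its entry should list several right-to-left
    # joining scripts, not just Arabic.
    print(sorted(scx.get(0x0640, set())))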
Date/Time: Mon Sep 10 05:12:05 CDT 2018
Name: John Knightley
Report Type: Feedback on an Encoding Proposal
Opt Subject: Response to Proposal to Encode Two Vietnamese Alternate Reading Marks by Lee Collins
In Proposal to Encode Two Vietnamese Alternate Reading Marks by Lee Collins (WG2 N4915, L2/17-373R), a somewhat simplified picture is painted of the two reading marks. For example, the document gives the impression that the reading marks are always placed on the left, but elsewhere the same author describes the variant reading mark 个 "cá nháy" as being present as the top part of U+2B89A (V4-4078) 𫢚, which is formed from 个 over 衣; see https://hc.jsecs.org/irg/ws2017/app/index.php?id=05027 (also in the recent IRG document http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg51/IRGN2309VietnamReview.pdf ). Adding the reading mark in this case also requires merging the bottom stroke of the reading mark with the top stroke of the lower part. This complex behaviour is one indication that encoding as a combining character would not be an appropriate way to deal with this reading mark; continuing the existing practice of encoding via CJK unified ideographs would be best. Similar problems exist for the other proposed reading mark, as shown for example in figure 4 of the UK response, where the reading mark combines with 外 and two strokes are merged. The above, and other omissions, show that the existing proposal is not mature and should not be allowed to proceed.
Date/Time: Sun Jul 29 09:28:10 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Misleading phrasing about HYPHEN-MINUS in character names
UAX #34 says “The rule UAX34-R3 specifies that only medial HYPHEN-MINUS characters are ignored in comparison. It is possible for a HYPHEN-MINUS character to occur in initial position (following a SPACE) in a word in a Unicode character name.” That makes it sound like the only possible positions for HYPHEN-MINUS are medial and initial, but it can also be final.
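To make the positions concrete, here is a minimal Python sketch of a UAX34-R3-style loose comparison key. It is a simplification (the full rule also handles underscores and the special case of U+1180 HANGUL JUNGSEONG O-E); U+0FD0 is a standard example of a name with a word-final HYPHEN-MINUS.

    import re
    import unicodedata

    def loose_name_key(name):
        """Simplified UAX34-R3-style key: ignore case, spaces, and *medial*
        HYPHEN-MINUS (a hyphen with a letter or digit on both sides).
        Hyphens in other positions are kept. (The full rule also has a
        special case for U+1180 HANGUL JUNGSEONG O-E.)"""
        name = name.upper()
        name = re.sub(r"(?<=[0-9A-Z])-(?=[0-9A-Z])", "", name)  # medial only
        return name.replace(" ", "")

    # U+0FD0 has a word-final HYPHEN-MINUS (followed by a space):
    name = unicodedata.name("\u0FD0")  # TIBETAN MARK BSKA- SHOG GI MGO RGYAN
    print(name)
    print(loose_name_key(name))       # the non-medial hyphen survives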
Date/Time: Sat Jul 21 14:45:52 CDT 2018
Name: Karl Williamson
Report Type: Error Report
Opt Subject: Traditional and Simplified Han in UTS 39
Below is message <1c9f4bf5-7589-a5a6-ddd2-dd4de4e5d0a0@ix.netcom.com>, in which Asmus lays out why a passage from UTS #39 should be retracted.

The full excerpt from the UTS reads: Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD]. The criterion can only be applied if the language of the string is known to be Chinese. So, for example, the string “写真だけの結婚式” is Japanese, and should not be marked as mixed script because of a mixture of S and T characters. Testing for whether a character is S or T needs to be based not on whether the character has an S or T variant, but whether the character is an S or T variant.

There are several issues with this.

First and foremost, the definition of S and T variants is not something that is universally agreed upon. The .cn, .hk and .tw registries are using a definition of S and T variants that does not agree with the Unihan data in many particulars. Therefore, using the Unihan data would result in false positives (and false negatives).

Second, there are many characters that are variants acceptable under both "S" and "T" labels. You only have to look at the published Label Generation Rulesets (or IDN tables) for these domains to see many examples. And, as mentioned above, you cannot reverse engineer these tables from Unihan data.

Third, the same domains mentioned have a policy of delegating up to three labels to the same applicant: a "traditional" label, a "simplified" label, and a mixed label matching the spelling of the label in the original application (for situations where a mixed label is appropriate). In other words, certain mixed labels are seen as appropriate.

Fourth, the Chinese ccTLDs all have a robust policy of preventing any other mixed label that is a variant of the three from being allocated to an unrelated party. If you "know" that the language has to be Chinese because the domain is a ccTLD, then at the same time the check is superfluous. Other registries are not known to have similar policies, so for them additional spoof detection may be useful; however, it is precisely in those cases that it is impossible to know whether a label is intended to be in the Chinese language.

Fifth, generally the only thing that can be ascertained is that a label is *not* in Chinese, by virtue of having Kana or Hangul characters mixed in. The reverse is not true: you will find labels registered under .jp that do not contain Hiragana or Katakana.

Sixth, for zones that are shared by different CJK languages, the state of the art is to have a coordinated policy that prevents "random" variant labels from coexisting in the registry. An example of this kind of effort is being developed for the root zone. By definition, for the root zone there is no implied information about the language context, unlike the case for the second level, where the presence of a ccTLD in the full domain name may give a clue.

Seventh, attempting to determine whether a label is potentially valid based on variant data (of any kind) is doomed, because actual usage is not limited to "pure" labels. The variant mechanism works differently (in those registries that apply it): instead of looking at a single label, the registry can implement "mutual exclusion". Once one variant label from a given set has been delegated, all others are excluded (or in practice, all but three, which are limited to the same applicant).
Without access to the registry data, you cannot predict which variants in a set are the "good ones", and with access to the data, spoof labels are rejected and cannot be registered.

In conclusion, my recommendation would be to retract this particular passage.

A./

On 12/27/2017 1:31 PM, Karl Williamson via Unicode wrote:

In UTS #39, it says that, optionally, one may "Mark Chinese strings as “mixed script” if they contain both simplified (S) and traditional (T) Chinese characters, using the Unihan data in the Unicode Character Database [UCD]. The criterion can only be applied if the language of the string is known to be Chinese."

What does it mean for the language to "be known to be Chinese"? Is this something algorithmically determinable, or does it come from information about the input text that comes from outside the UCD? The example given shows some Hiragana in the text. That clearly indicates the language isn't Chinese, so in this example we can algorithmically rule out that it is Chinese. And what does "Chinese" really mean here?
Date/Time: Sat Aug 4 10:55:32 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Coptic Epact Numbers in UAX #31
Table 4 “Candidate Characters for Exclusion from Identifiers” of UAX #31 should list \p{block=Coptic_Epact_Numbers}. It is like the other “inappropriate technical blocks” in that its only XID_Continue character is XID_Continue because it is a combining mark, but it is only useful when used with the other characters in the block, which are not XID_Continue.
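For reference, the block's property pattern can be surveyed with the Python standard library. Python identifiers are defined in terms of XID_Start/XID_Continue (with NFKC folding), so str.isidentifier() serves here as an approximation of an XID_Continue test.

    import unicodedata

    # Survey the assigned characters of the Coptic Epact Numbers block
    # (U+102E0..U+102FB), testing each in identifier-continue position.
    for cp in range(0x102E0, 0x102FC):
        ch = chr(cp)
        xid_continue = ("a" + ch).isidentifier()
        print(f"U+{cp:04X} gc={unicodedata.category(ch)} "
              f"XID_Continue~{xid_continue} {unicodedata.name(ch)}")
    # Only U+102E0 COPTIC EPACT THOUSANDS MARK (gc=Mn) is identifier-continue;
    # the digits it must combine with are not.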
Date/Time: Sat Aug 4 11:18:49 CDT 2018
Name: Manish Goregaokar
Report Type: Error Report
Opt Subject: IdentifierType.txt not consistent about Not_XID
https://www.unicode.org/Public/security/11.0.0/IdentifierType.txt categorizes code points by their identifier type from UTS 39 (https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type). Types are allowed to overlap. There seems to be an inconsistency regarding Not_XID: not all Not_XID code points are tagged as such. For example, U+0027 APOSTROPHE is listed only as Limited, but it should be Limited Not_XID. The same goes for U+058A ARMENIAN HYPHEN and the other punctuation characters there (except for MIDDLE DOT).
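A rough audit along these lines could look like the sketch below. It assumes a local copy of IdentifierType.txt and again uses Python's identifier rules as an approximation of XID_Continue; a real audit would also skip types such as Not_Character that already imply non-identifier status.

    def parse_ucd_file(path):
        """Yield (lo, hi, values) from a UCD-style file such as
        IdentifierType.txt: 'code(..code) ; value1 value2 ... # comment'."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                cps, values = (field.strip() for field in line.split(";"))
                lo, _, hi = cps.partition("..")
                yield int(lo, 16), int(hi or lo, 16), values.split()

    # Flag entries whose code points are not XID_Continue (approximated via
    # Python's identifier rules) but whose types omit Not_XID.
    for lo, hi, values in parse_ucd_file("IdentifierType.txt"):
        if "Not_XID" in values:
            continue
        for cp in range(lo, hi + 1):
            if not ("a" + chr(cp)).isidentifier():
                print(f"U+{cp:04X} has {' '.join(values)} but is not XID_Continue")
                break  # one report per data-file line is enough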
Date/Time: Sat Aug 11 11:50:13 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Grapheme_Cluster_Break of U+1B35 BALINESE VOWEL SIGN TEDUNG
UAX #29 assigns some spacing marks Grapheme_Cluster_Break=Extend instead of SpacingMark “for canonical equivalence”. I infer the following rule: any code point which occurs as the non-first code point in a non-Hangul canonical decomposition must have Grapheme_Cluster_Break=Extend. (It would be nice to explicitly state this rule in the annex.) There is one exception: the decomposition of U+1B40 is <U+1B3E, U+1B35> and yet U+1B35 BALINESE VOWEL SIGN TEDUNG has Grapheme_Cluster_Break=SpacingMark.
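The inferred rule can be checked mechanically. Below is a sketch assuming a local copy of GraphemeBreakProperty.txt; with Unicode 11 data it should report only U+1B40 and U+1B41, both through U+1B35.

    import sys
    import unicodedata

    def load_gcb(path="GraphemeBreakProperty.txt"):
        """Parse GraphemeBreakProperty.txt into {code point: GCB value}."""
        gcb = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.split("#", 1)[0].strip()
                if not line:
                    continue
                cps, value = (field.strip() for field in line.split(";"))
                lo, _, hi = cps.partition("..")
                for cp in range(int(lo, 16), int(hi or lo, 16) + 1):
                    gcb[cp] = value
        return gcb

    gcb = load_gcb()
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        # Skip compatibility decompositions ("<tag> ...") and code points
        # without one; Hangul syllables decompose algorithmically and have
        # an empty decomposition field, so they are skipped automatically.
        if not decomp or decomp.startswith("<"):
            continue
        for trailing in decomp.split()[1:]:
            t = int(trailing, 16)
            if gcb.get(t) != "Extend":
                print(f"U+{cp:04X} decomposes through U+{t:04X} "
                      f"(GCB={gcb.get(t, 'Other')})")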
Date/Time: Sun Aug 12 18:42:31 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Context A2 in UAX #31 is too broad
UAX #31 defines context A2 for ZWNJ “in a conjunct context” as /$L $M* $V $M₁* ZWNJ/. That allows ZWNJ at the end of an identifier, where it has no visible effect. The regular expression should be /$L $M* $V $M₁* ZWNJ $L/ instead. Defining the variables in terms of Indic_Syllabic_Category would minimize false positives for the regex too.
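To illustrate the difference, here is a sketch of the two patterns over deliberately simplified, Devanagari-only stand-ins for the classes (the real definitions would be property-based, as suggested above):

    import re

    L = "[\u0915-\u0939]"    # consonants
    M = "[\u093E-\u094C]*"   # dependent vowel signs and marks, zero or more
    V = "\u094D"             # virama
    ZWNJ = "\u200C"

    current  = re.compile(f"{L}{M}{V}{M}{ZWNJ}")      # /$L $M* $V $M₁* ZWNJ/
    proposed = re.compile(f"{L}{M}{V}{M}{ZWNJ}{L}")   # ...followed by $L

    s = "\u0915\u094D\u200C"  # KA + VIRAMA + ZWNJ at the end of an identifier
    print(bool(current.search(s)))   # True: the ZWNJ matches but is invisible
    print(bool(proposed.search(s)))  # False: a following letter is required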
Date/Time: Sun Jul 22 22:17:55 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: U+166D CANADIAN SYLLABICS CHI SIGN
U+166D CANADIAN SYLLABICS CHI SIGN should have General_Category=So and Terminal_Punctuation=No. It is a logogram for “Christ”, not a mark of punctuation. For example, http://www.evertype.com/standards/sl/a08.jpg shows the beginning of the Epistle to Titus. “ᒋᓴᔅ ᙭” appears at the end of the first line and in the middle of the eleventh. Comparing to an English translation shows that they correspond to “Jesus Christ”, not “Jesus” with a punctuation mark, and are not at “the end of textual units”.
Date/Time: Wed Jul 25 09:37:28 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Confusables for ARABIC MATHEMATICAL STRETCHED letters
The ARABIC MATHEMATICAL STRETCHED letters should be confusable not with the basic Arabic letters but with the basic letters followed by alef. For example, U+1EE61 should be confusable with the sequence ⟨U+0628 U+0627⟩.
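For illustration, a one-entry sketch of a UTS #39-style skeleton with the proposed one-to-many mapping; the real skeleton uses the full confusables.txt table, and this function is a simplification of it.

    import unicodedata

    # A one-entry stand-in for confusables.txt carrying the proposed mapping:
    # U+1EE61 ARABIC MATHEMATICAL STRETCHED BEH -> <U+0628, U+0627>.
    CONFUSABLES = {
        "\U0001EE61": "\u0628\u0627",  # stretched beh ~ beh + alef
    }

    def skeleton(s):
        """Simplified UTS #39 skeleton: NFD, map each code point through the
        table (one code point may expand to a sequence), then NFD again."""
        s = unicodedata.normalize("NFD", s)
        s = "".join(CONFUSABLES.get(ch, ch) for ch in s)
        return unicodedata.normalize("NFD", s)

    print(skeleton("\U0001EE61") == skeleton("\u0628\u0627"))  # True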
Date/Time: Thu Aug 2 19:18:33 CDT 2018
Name: Ken Lunde
Report Type: Error Report
Opt Subject: Suggested additional kRSUnicode/kRSKangXi property values for U+20063 𠁣 and U+200DB 𠃛
Similar to U+29C0B 𩰋 and U+29C0A 𩰊, which use 191.-5 as their kRSUnicode and kRSKangXi property values, please consider adding 169.-4 as an additional kRSUnicode and kRSKangXi property value for U+20063 𠁣 and U+200DB 𠃛. These values are arguably more correct and, more importantly, will make the characters *much* easier to find among the nearly 90K CJK Unified Ideographs.
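The kind of radical-stroke lookup this would enable is sketched below; it assumes a local Unihan_IRGSources.txt (the file carrying kRSUnicode in recent Unihan releases; adjust the file name for other layouts).

    def find_by_rs(path, wanted):
        """Yield code points whose kRSUnicode value(s) include `wanted`,
        e.g. wanted="191.-5". Unihan lines are tab-separated:
        U+XXXX<TAB>kField<TAB>value(s)."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#") or not line.strip():
                    continue
                cp, key, value = line.rstrip("\n").split("\t", 2)
                if key == "kRSUnicode" and wanted in value.split():
                    yield cp

    for cp in find_by_rs("Unihan_IRGSources.txt", "191.-5"):
        print(cp)  # expected: U+29C0A and U+29C0B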
Date/Time: Tue Aug 7 15:00:48 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Hentaigana should have Identifier_Type=Obsolete
U+1B001 HIRAGANA LETTER ARCHAIC YE has Identifier_Type=Obsolete but the rest of the hentaigana (U+1B002 to U+1B11E) have Identifier_Type=Recommended. They are all obsolete.
Date/Time: Wed Aug 8 13:09:47 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Identifier_Type of U+05C7 HEBREW POINT QAMATS QATAN
U+05C7 HEBREW POINT QAMATS QATAN is a recent invention, so it should not have Identifier_Type=Obsolete. (Identifier_Type=Uncommon_Use is still appropriate.)
Date/Time: Wed Aug 8 13:34:14 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Inconsistent Identifier_Types for excluded scripts
Some old scripts have Identifier_Type=Exclusion|Obsolete whereas most just have Identifier_Type=Exclusion. The former (Ogham, Runic, Old Italic, Gothic, and Deseret) are not more obsolete than the latter, so all those scripts should be Obsolete or none should. Only the original Old Italic and Deseret repertoires have Identifier_Type=Exclusion|Obsolete: the letters encoded later (U+1031F, U+1032D..1032F, U+10426..10427, and U+1044E..1044F) have Identifier_Type=Exclusion. This is even more inconsistent. (Coptic is another script where some letters are Exclusion|Obsolete and some are Exclusion, but the Obsolete ones really are obsolete, so no change is necessary for Coptic.)
Date/Time: Wed Aug 8 20:43:44 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Inaccurate synthesis of U+11134 CHAKMA MAAYYAA
The section on Chakma in chapter 13 says “combinations of virama and maayyaa following a consonant are not meaningful, as both kill the inherent vowel”. Because maayyaa is also used as a gemination mark, the sequence ⟨maayyaa, virama⟩ is meaningful.
Date/Time: Thu Aug 9 07:42:48 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Ambiguity in how to use Indic siyaq numerals
L2/15-121R2 describes a style of Indic Siyaq numerals where a multiple of lakhs or crores is rendered with glyphs resembling U+1EC95 INDIC SIYAQ NUMBER TEN THOUSAND etc. instead of U+1EC7A INDIC SIYAQ NUMBER TEN etc. It offers two possible representations: “This method of writing the tens of lakhs may be mimicked by using the numbers for the ten thousands, whose shapes resemble the modified tens. While this approach does not preserve the semantic value of the number, it does offer a visual solution. [...] Another method might be to produce the alternate display using contextual substitutions in a font.” The Unicode Standard does not explain which solution to use. It should, because implementers of fonts are likely to read the proposal, be confused, and create incompatible fonts. I suggest using TEN THOUSAND for glyphs that look like TEN THOUSAND, even when they mean 10, because the encoding of Indic Siyaq numerals is in all other respects glyph-based.
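A minimal sketch of the glyph-based choice, using only the two code points named above; whether the remaining tens map the same way is an assumption left out here rather than guessed.

    TEN          = "\U0001EC7A"  # INDIC SIYAQ NUMBER TEN
    TEN_THOUSAND = "\U0001EC95"  # INDIC SIYAQ NUMBER TEN THOUSAND

    def tens_in_lakh_context(ch):
        """Return the character whose glyph matches the modified shape used
        before lakhs/crores. Only the TEN entry is sourced from the report;
        other tens are omitted rather than assumed."""
        return {TEN: TEN_THOUSAND}.get(ch, ch)

    print(f"U+{ord(tens_in_lakh_context(TEN)):04X}")  # U+1EC95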
Date/Time: Wed Aug 22 12:57:24 CDT 2018
Name: Stephen Davis
Report Type: Error Report
Opt Subject: Cherokee Nation Will Return
I noticed you branched out the Latin Extended category to subgroup D, and that one of the characters in that subgroup (U+A7AE «LATIN CAPITAL LETTER SMALL CAPITAL I») looks strikingly like a syllable character in the Cherokee block (U+13C6 «CHEROKEE LETTER QUA»). As some might have pointed out in the past, there are similar visual matchups in the ASCII Latin and Cyrillic blocks (maybe even a mockup or two in the Mathematical Symbols categories). I feel it would be appropriate to consider cross-referencing the characters in the Cherokee block with their lookalikes in other blocks, so that future fonts that implement Cherokee won't have to "reinvent the wheel" when just knocking off a corner or two will suffice. Thank you.
Date/Time: Sat Aug 18 23:24:57 CDT 2018
Name: Henrique Peron
Report Type: Other Question, Problem, or Feedback
Opt Subject: MS-DOS/IBM-DOS backwards compatibility
Good morning. I have noticed that there is a block called "Latin Extended Additional" with several precomposed Vietnamese accented letters. As I understand it, those are provided for backward compatibility with old DOS codepages and other 8-bit character sets, while nowadays Vietnamese computer users rather type decomposed characters, entering the necessary Latin letters and combining them with the necessary diacritics from the "Combining Diacritical Marks" block.

However, when it comes to the Lithuanian language, such support is not available in Unicode. Lithuanian uses, along with ordinary Latin letters, nine extra precomposed Latin letters with certain diacritical marks. Lithuanians understand these as letters in their own right, like the Spanish "ñ", which has its own position in the alphabet between "N" and "O" and is not considered an accented letter. In the DOS days, the codepages CP776, CP777 and CP778 provided precomposed letters for "accented Lithuanian". "Accented Lithuanian" is the regular set of Latin letters, along with those aforementioned nine extra precomposed letters, possibly carrying yet another diacritical mark. There are combinations like "LATIN LETTER A WITH OGONEK AND ACUTE ACCENT" and "LATIN LETTER J WITH TILDE". One situation is distinctive: when combining the small letter "i" with the acute, grave or tilde, the small letter "i" must retain the tittle, unlike the "i" in other languages written with the Latin script. Last but not least, there are a few other unusual characters in those three codepages, seemingly used in dictionaries.

There is also a Wikipedia page (https://en.wikipedia.org/wiki/Lithuanian_accentuation) which provides more information. That page shows that support for accented Lithuanian is not only needed for backward compatibility purposes. Would the Unicode Consortium be interested in providing such support?
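For reference, the combinations described are representable today as combining character sequences (though not as single precomposed code points); a minimal Python sketch, using the convention that a soft-dotted "i" keeps its tittle when an explicit U+0307 COMBINING DOT ABOVE is encoded before the accent:

    import unicodedata

    i_tilde        = "i\u0307\u0303"  # i + dot above + combining tilde
    j_tilde        = "j\u0303"        # "LATIN LETTER J WITH TILDE"
    a_ogonek_acute = "\u0105\u0301"   # ą + combining acute

    for s in (i_tilde, j_tilde, a_ogonek_acute):
        nfc = unicodedata.normalize("NFC", s)
        print(nfc, [f"U+{ord(c):04X}" for c in nfc])
    # None of these sequences composes to a single code point, but all are
    # canonically well-formed and render correctly in capable fonts.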
Date/Time: Mon Aug 27 08:16:37 CDT 2018
Name: François
Report Type: Other Question, Problem, or Feedback
Opt Subject: Ou ligature for Greek
Hello, I'm surprised the character ȣ isn't included as a Greek character, even if it is a ligature. I know this character's inclusion proposal has been rejected, but couldn't it be worth reconsidering? Its use isn't that rare in modern Greek graffiti, in Greek trademarks (https://papadopoulou.gr/), or in ancient Byzantine texts. I can't see the rationale for including a character such as ﬀ for Latin (not perceived by anyone as different from ff) while excluding ȣ, which is clearly more different from ου. Thank you!