The sections below contain links to permanent feedback documents for the open Public Review Issues, as well as other public feedback received since April 28, 2019 (the cutoff of the previous cumulative document, which was issued prior to UTC #159 in April 2019).
The links below go directly to open PRIs and to feedback documents for them, as of July 18, 2019.
Issue  Name  Feedback Link
400  Proposed Update UAX #38, Unicode Han Database (Unihan)  (feedback)
399  Proposed Update UAX #45, U-source Ideographs  (feedback)
398  Proposed Update UAX #44, Unicode Character Database  (feedback) No feedback at this time
397  Proposed Draft UTR #54, Unicode Mongolian 12.1 Baseline  (feedback) No feedback at this time
396  Proposed Update UAX #29, Unicode Text Segmentation  (feedback)
395  Proposed Update UAX #15, Unicode Normalization Forms  (feedback) No feedback at this time
The links below go to locations in this document for feedback.
Feedback to UTC / Encoding Proposals
Feedback on UTRs / UAXes
Error Reports
Other Reports
Note: The Feedback on Encoding Proposals section this time includes feedback on the following documents:
L2/10-345
L2/17-236
L2/17-300
L2/17-326
L2/18-182
L2/18-183
L2/18-198
L2/18-242
L2/19-005R
L2/19-091
L2/19-172
L2/19-199
L2/19-203
Date/Time: Tue May 7 15:58:33 CDT 2019
Name: David Corbett
Report Type: Feedback on an Encoding Proposal
Opt Subject: Comments on L2/19-005R
L2/19-005R “Proposal to encode ORIYA SIGN OVERLINE in the UCS” should explain where to put the proposed code point in a syllable. I assume it is either meant to immediately follow the vowel sign or to immediately follow the base consonant. Either way, Indic_Syllabic_Category=Vowel_Dependent is not appropriate for this character: above-base dependent vowels are encoded between pre- and post-base dependent vowels, so the overline would have to follow U+0B47 ORIYA VOWEL SIGN E but precede U+0B3E ORIYA VOWEL SIGN AA. It should instead have InSC=Nukta or Syllable_Modifier, depending on where it is meant to go. The proposal claims that “if the combining macron were to be used, it would not be supported in the general Indic rendering system implementation requirement. If the combining macron were used, script runs in Oriya would break”. That is not true. U+0304 COMBINING MACRON is a common-script character so it does not break script runs. There is no Indic rendering system requirement that all marks be Indic-specific. For example, U+20F0 COMBINING ASTERISK ABOVE is used in Devanagari without any rendering system problems. Therefore that argument should be removed from the proposal.
Date/Time: Sat May 25 11:43:13 CDT 2019
Name: William Overington
Report Type: Feedback on an Encoding Proposal
Opt Subject: Five requested items of feedback for L2/19-203 Working Draft for Proposed Update UTS #51, Unicode Emoji
In L2/19-203, Working Draft for Proposed Update UTS #51, Unicode Emoji, there is a request for feedback on five specific issues. Here is my feedback.

Issue 1: Length

I suggest using the direct method of the tag digits in the uncompressed format. Although compression might save 30% to 40% of the bytes for one QID emoji character on its own, the saving in bytes as a percentage of a whole document could be much lower, depending upon how many QID emoji are used in a particular document. In my test fonts I have used large bold glyphs for the TAG Q and for the tag digits and a visible glyph for the CANCEL TAG. I found entering them from the Glyph Browser facility in the Serif Affinity Publisher (beta) OpenType-aware software program to be straightforward. I appreciate that the intention may be for QID emoji to be entered into a document typically from a cascading menu system, but that might not always be available in every application. Using large visible glyphs for just twelve tag characters is convenient for fontmaking and for use. In my opinion, using a compressed format simply adds a layer of complication for a relatively minimal overall saving of bytes. I believe it is best to keep the system as simple to use as possible.

Issue 2: Tag Base

I suggest having one standardized tag base, as that will allow for interoperability. Regarding the "first mover" effect that is mentioned if a variety of tag bases are used: that might sound fine if the parties involved are all large companies that meet as full members of Unicode Inc., but what if, say, a small European company is the first mover and some time later a large American corporation decides that it is not willing to allow any implied recognition of the small European company and does something different? Suppose then that Unicode Inc. is put in the position of deciding which tag base is to be used for that QID emoji. Unicode Inc. might then find itself in a very awkward situation, particularly if either of the two businesses felt that it was in the right, either on the basis of being first or on the basis of having a very much larger share of the market. What happens to interoperability if both of the businesses carry on using the base character of their choice? That is just one scenario for one QID emoji; if that or other issues arose over the choice of base character for many QID emoji, there could be much confusion. Although having a fallback character could be helpful in some circumstances, it could also introduce uncertainty and confusion over the meaning that the author of a document intended. I therefore suggest that a single standardized base character be used.

Issue 3: Sequences

It seems to me that it would be better to use the first method suggested and keep it all within UTS #51. It is an added facility that would thus be clearly explained within UTS #51. This would make it easier to understand for people learning the system and for those who are not within the central group of people habitually working with the documents.

Issue 4: Registry

The idea of the registry is attractive and could be useful. Yet what would the term "in use" mean? For example, suppose that one Wednesday afternoon, and on a few other occasions, some university students have a go at specifying some QID items and designing and producing some QID emoji, complete with some fonts, maybe of such things as a statue that is on campus, or a few statues from the local town, send some messages to a few newsgroups, add some images to a website, have a good learning experience and some fun doing so, and then move on to other things, though they might go back to it later. Would Unicode include those QID emoji in the registry? One of the students might have learned of the registry and sent them in requesting inclusion. If the university were in the United Kingdom, the fonts might also have been deposited at the British Library under legal deposit. Would they then count for inclusion in the registry? For Unicode Inc. to have a registry, it would need to decide whether to include absolutely every QID emoji ever produced or to have a threshold of some kind, and having a threshold means that there may be edge cases. Maintaining such a registry could be a lot of work. However, maybe the registry could be in the form of a wiki hosted by Unicode Inc.: people could register their own emoji and a structured list could emerge (for example: animals, dogs), with Unicode Inc. keeping a watchful eye on it, maybe with a moderation system for contributions to avoid major problems such as someone trying to wipe everything.

Issue 5: Limiting the RGI emoji tag sequences set additions

What is the issue here? The whole point of QID emoji is that it allows anyone to encode any emoji they choose. The RGI list is really for the manufacturers of equipment and does not affect people just having a go and enjoying themselves. I liken this to font provision on major platforms. Manufacturers supply a selection of fonts to enable people to do lots of things. However, the underlying mechanism of how fonts are handled means that if someone wants to buy a licence for another font from a small business and use it on his or her computer, the underlying architecture of the computer system allows that font to be added and used. Indeed, the way that Windows 10 works is that if I make a font myself, not as a commercial venture, just as a sort of hobby, using a fontmaking program, then I can install that font and use it in a desktop publishing program to produce PDF documents. I can then publish the PDF documents on the web and send them to the British Library for legal deposit. It could have been otherwise if the computer systems had been designed differently, with only the fonts for a particular program being usable with that program; then the chances of me making my own fonts and using them would possibly have been non-existent. So, to me, QID emoji is like me, or anyone, being able to produce and use, interoperably, an emoji of my own specification, just as I can produce a font of my own specification. A registry that helps consumers, so that they can rely on knowing that if they buy a device from one manufacturer they can use the emoji on it to communicate with people who use a device from another manufacturer, is fine, even good. However, please take care in designing such a system that there is not effectively a block on interoperability with other QID emoji that are not in that list. For example, I have suggested that there could be fonts that have a glyph for just one QID emoji. Unicode Inc. could helpfully encourage manufacturers to include software in their devices such that if a QID emoji not in the RGI list is received, a search is made on the internet for such a font rather than just flagging that the particular QID emoji is not supported.

William Overington Saturday 25 May 2019
Date/Time: Tue May 28 18:18:36 CDT 2019
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: Ascia symbol disunification
Proposal L2/19-091 proposes to encode two symbols to represent an "ascia", one left-facing and another right-facing. The ad hoc, however, concluded that a disunification of the two was not merited, on the following rationale: "While examples are provided in running printed text, there is no contrastive use showing distinct semantics of the two. In our view, only one symbol needs to be encoded, until contrastive use in text demonstrating the need for two symbols is provided." This, however, is not a good reason for unification. Usage of both variants of the symbol in the same source is attested in figures 10, 14, 15, 19 and 23 at least (some in the same inscription, even). This is no coincidence, because the custom of preferring one layout or orientation over another is information that may be of interest to historians, epigraphists and philologists. Perhaps at this moment it isn't important to distinguish them, but the same was said of many other scribal practices. It is clear that the inscriptions bearing both versions did so for aesthetic reasons; unifying both orientations makes it more likely that such information would be lost. It is a question of foresight: if only one symbol is encoded now and the committee changes its mind later, then fonts may not agree on their preferred orientation for the symbol (since the committee ruled them to be equivalent). This would result in headaches for font developers and users alike. While this scenario is unlikely to cause major problems, another possibility is that this decision is taken as a precedent for unifying reversed variants of characters when they should not be unified, instead of as a warning against overunification. For these reasons I maintain that the symbols should be disunified.
Date/Time: Tue May 28 19:09:47 CDT 2019
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: Cross Patty vs Maltese Cross
In proposal L2/19-076 Everson makes the case for two things: 1. Disunify two cross-like glyphs (cross patty and Maltese cross) that are currently unified under 2720. 2. Encode three more characters (two being cross patties with missing pieces and another for the "true Maltese cross", to clarify the confusion). While the ad hoc did not recommend accepting any new character, the committee accepted to encode the two cross patties with missing pieces. I am in favor of the committee's decision; however, there are three other steps that I would take: 1. Rename the two relevant characters to CROSS PATTY WITHOUT LEFT CROSSBAR and CROSS PATTY WITHOUT RIGHT CROSSBAR respectively. 2. Change the glyph of the Maltese cross to be that of the proposed "CROSS OF MALTA" in Everson's proposal (this will require a glyph erratum notice). 3. Encode a regular CROSS PATTY character with the glyph of a "true" cross patty, as expressed in the proposal. The rationale for doing 1 is that these names are less ambiguous than the current ones, since the main difference of these characters from a regular cross patty is that they LACK pieces, not that they have extra ones. For 2 and 3, it is the fact that the creators of the dingbats mistook either the name or the glyph of the character, and such a mistake has been passed down into the Unicode code charts, when such a situation is not necessary. Asking the creator of the font would be somewhat useful; unfortunately Hermann Zapf died in 2015, and so cannot be contacted to clarify his intent or give his take. A possible replacement for such an action would be to contact the International Typeface Corporation and/or other friends of Zapf for their take. Type foundries can also be asked whether the change of glyph would affect them much. This differs from Everson's proposal in that it reduces the confusion of having two very similarly named characters, at the cost of bothering font developers. This is a better solution, because while glyphs are subject to subsequent corrections, a character name cannot be changed (hence the awkward formal alias system), and doing 2 while not doing 3 still benefits citizens and historians of Malta who do not want the confusion to continue. Doing 3 is still justified based on the attestations provided in the proposal. This would mean that both glyphs would be represented, and one would not need to find "true" Maltese crosses in running text, as the ad hoc asked. So everybody wins in the end. Evidence of the non-identity of both crosses (although sufficiently demonstrated in the proposal) is further illustrated by the Wikipedia articles on them both: https://en.wikipedia.org/wiki/Maltese_cross https://en.wikipedia.org/wiki/Cross_patt%C3%A9e These note their different origins and connotations, as well as references to the 2720 character.
Date/Time: Tue May 28 19:58:14 CDT 2019
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: Retraction on my opinion of the nature of the THORN WITH DIAGONAL STROKE
In document L2/17-326 I presented my feedback on two encoding proposals, one on the thorn with diagonal stroke (L2/17-236) and another for Tironian letters (L2/17-300). Later, Michael Everson and Andrew West presented a revised document focusing on just the main casing pair of the Tironian letters (L2/19-172). This resulted in an updated response from my part (L2/19-199), where I changed my mind and agreed with them that it makes complete sense to have an orthographic casing pair, but still disagreed on the exact encoding model. Since then I have exchanged correspondence with Everson on the subject (West won't answer my emails, though, and Everson has not responded to my request to add him to the chain); we are still in disagreement, and while Everson provided some points I could address in an updated proposal, he stopped responding to me. However, in that chain I committed myself to submitting a contact form clarifying my updated view on the THORN WITH DIAGONAL STROKE, hence this document. Funnily enough, my change of opinion was prompted not by Everson's input but by the insight of Peter Stokes in document L2/18-242 (it has just taken me this long to express my change of opinion), in which he confirms that there is indeed a consistent pattern of glyph distinction between Old Norse and Old English, and that it is quite likely that scholarly publications would want to include text in both languages, so the unification would be problematic for them. This effectively removes the last valid criticism against its encoding, since there is plenty of precedent for glyphic variants being disunified due to the preferences of distinct language communities. Even if this semantic distinction was not created in the Middle Ages, the convention is still rather old, and the unification would still remain problematic for medieval transcriptions.
Date/Time: Tue Jun 4 13:17:54 CDT 2019
Name: William Overington
Report Type: Feedback on an Encoding Proposal
Opt Subject: Emoji and Colour L2/19-203 and L2/18-198
In L2/19-203 there is section 2.9, Color. In 2018 I submitted the document L2/18-198. This was included in the agenda document L2/18-182 as item E.1.7.1 but is not mentioned in the minutes document L2/18-183. So when the Unicode Technical Committee considers section 2.9 of L2/19-203, could it please consider whether the idea presented in L2/18-198 offers, in its opinion, a better solution for the way to encode colour for emoji? William Overington Tuesday 4 June 2019
Date/Time: Tue Jun 4 18:37:44 CDT 2019
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: Feedback on Proposed QID Emoji Mechanism
I wanted to inform the UTC of some critical issues concerning the proposed update to UTS #51 allowing emoji to be encoded as Wikidata QID tag sequences. I fully agree with the feedback Andrew West provided in April (cf. https://www.unicode.org/L2/L2019/19124-pubrev.html) but there are additional points he did not address.

== Duplicate Encoding ==

If any object or concept with an associated QID can be represented as a tag sequence, then that includes every object or concept that has already been encoded as a regular emoji. Mount Fuji is both the character U+1F5FB (🗻) and the sequence Q39231 (🆔); strawberries are both the character U+1F353 (🍓) and the sequence Q14458220 (🆔). Unicode exists to transmit information in a uniformly agreed‐upon format, so there must never be two different sequences of codepoints representing the exact same concept unless that difference can be folded away through normalisation. All QID sequences would be valid and official by default even if nothing supported them; the model could not work any other way. For the QID proposal to be usable, the standard would need to explicitly disallow sequences that correspond to existing emoji (which requires a steadily updated database linking all emoji to their QIDs), thus any corporation or private person implementing such duplicate sequences would be operating outside of the specifications.

This problem already exists to a lesser extent for emoji flags because some regions are listed as part of both ISO 3166‐1 and 3166‐2, such as American Samoa, which is either AS (🇦🇸) or US‐AS (🏴). It could be useful to list such duplications in the standard to prevent overly eager implementations from accidentally supporting two instances of the same flag at once.

== Stability ==

If any object or concept with an associated QID can be represented as a tag sequence, then no such object or concept can ever be encoded as a regular emoji. The QID for almonds is Q184357. Therefore, the almond emoji already exists (🆔) and just needs to be implemented by vendors. This means an almond character or ZWJ sequence can never be added to Unicode Emoji in the future or else there would be duplicate encoding again. It isn’t even possible to add emoji for things that *don’t* have an associated QID because there is always the possibility of one being created in the future. You cannot fundamentally change the canonically correct representation of a piece of text without invalidating all prior versions in the process. That is why the Unicode standard guarantees absolute stability for many important properties.

If, say, Apple decided to support an almond emoji as a QID sequence to “test the waters” so to speak, then people with Apple devices would use this emoji just like any other; they would include it in e‐mails, post it on Twitter or Facebook, send it to their friends on Android phones and so on. And unlike private‐use characters, these sequences are officially part of the standard; they exist in the public independently of any private agreements. If it turns out that the almond emoji is popular enough to consider it for inclusion in Unicode, then its only possible representation would be as that specific QID sequence because plenty of data containing that sequence already exists; there could never be a character called ALMOND because it would either break existing data by replacing the QID sequence in usage, or create a situation where the same emoji is encoded twice in mutually incompatible ways. Searching and indexing any file containing emoji would become potentially impossible.

The QID mechanism would mean that no emoji character could ever be added to Unicode again, no ZWJ sequence could ever be approved, and no existing character could ever be emojified because QID sequences already cover all of them, or could cover them in the future. This includes the entire list of candidates for Emoji 13.

== Fallback Display and Accessibility ==

The fallback behaviour of QID sequences, like for all emoji tag sequences, is worthless. The tag characters are invisible by design and were encoded for language‐based font variant substitution, something that the Unicode standard considers unnecessary for understanding the meaning of a text, so taking these very same characters and making them the sole carriers of semantic content in emoji sequences was an inappropriate idea from the start. A user could have full font support for all characters in a given sequence and still remain completely unaware that such a sequence was even received because its only visible component is the tag base – the only part that does not carry any information.

The recommendations for how invalid or unsupported tag sequences should be displayed have been part of UTS #51 for as long as the concept of tag sequences itself, but not a single font or text renderer has implemented any of them. The closest vendors have gotten is newer versions of Android displaying unknown regional flags as white flags with superimposed question marks, which is still useless but slightly less so, because at least there is a way to differentiate flag sequences from the plain WAVING BLACK FLAG. This issue would be amplified by QID sequences because they would also have a meaningless tag base in addition to an undetectable tag spec. At the very least 🏴 actually is a flag even if it doesn’t say which one, but 🆔 is nothing at all to the end user. The general public would have to learn that seeing SQUARED ID in a message they received probably means that their conversation partner used an emoji that their own device does not support, an emoji that could depict literally anything at all with little to no clues as to its identity. There isn’t even any way to differentiate one unsupported QID emoji from another.

Using different tag bases depending on the entity in question is also problematic, as was already discussed in the review notes for the UTS #51 update. The whole point of relying on Wikidata was to ensure that every concept would have one and only one canonical identifier; allowing variable bases or just picking whatever base is implemented first as the official one (the “first mover” approach) contradicts this paradigm completely. Furthermore, if there exists an emoji that can serve as acceptable fallback for a QID sequence, then the emoji represented by that sequence is probably so similar to the bare tag base that it doesn’t need to be added anyway.

Screen readers would choke on QID sequences as well. With regional flags, an advanced screen reader could in theory read out the tag spec of unrecognized sequences to give the user some general idea of what was transmitted (“Flag: G, B, N, I, R”) and maybe some people would even recognise the region code, although I am not aware of any such software presently supporting this approach. A really advanced tool could even have an internal look‐up table for region codes (or just a subset of popular region codes) to read out the region’s name even if font support does not exist. With QID sequences, however, this would not work because QIDs are meaningless on their own. Hearing “Q184357” read out loud would probably spark more confusion than just leaving out the emoji entirely, and a database with tens of millions of entries means that creating a sensible subset of items to support would be very difficult.
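For reference, this is roughly how a QID emoji tag sequence would be composed under the draft mechanism discussed in this feedback (a minimal sketch; it assumes U+1F194 SQUARED ID as the tag base, matching the 🆔 examples above, although the choice of tag base is itself one of the open questions in the draft):

    def qid_tag_sequence(qid, base='\U0001F194'):
        # Tag base, then TAG characters spelling the QID (U+E0020..U+E007E
        # shadow printable ASCII), terminated by U+E007F CANCEL TAG.
        assert qid.startswith('Q') and qid[1:].isdigit()
        tags = ''.join(chr(0xE0000 + ord(c)) for c in qid)
        return base + tags + '\U000E007F'

    seq = qid_tag_sequence('Q39231')  # the Wikidata item for Mount Fuji
    print(' '.join('U+{:04X}'.format(ord(c)) for c in seq))
    # U+1F194 U+E0051 U+E0033 U+E0039 U+E0032 U+E0033 U+E0031 U+E007F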
Date/Time: Wed Apr 10 14:56:44 CDT 2019
Name: Andrey
Report Type: Error Report
Opt Subject: tr51
Ed Note: This has already been addressed in the UTS working draft. There is no open PRI at this time for UTS #51.
Hi. Could you give some explanation about the emoji EBNF and regex? Why do the regex and EBNF use the '+' quantifier instead of '*'? A basic emoji consisting of a single code point will never match this regex:

\p{RI} \p{RI}
| \p{Emoji}
  ( \p{EMod}
  | \x{FE0F} \x{20E3}?
  | [\x{E0020}-\x{E007E}]+ \x{E007F}
  )?
  (\x{200D} \p{Emoji}
    ( \p{EMod}
    | \x{FE0F} \x{20E3}?
    )?
  )+

possible_emoji := flag_sequence | zwj_element ( (\x{200D} zwj_element)+ | tag_modifier)
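To make the reported quantifier issue concrete, here is a toy illustration (a minimal sketch: the pattern below is a simplified stand-in using the literal characters 'E', 'M' and 'Z' for an emoji, an emoji modifier and ZWJ; it is not the real property-based UTS #51 regex):

    import re

    # Simplified shape of the reported pattern: E (M)? ( Z E (M)? )+
    # The trailing '+' requires at least one ZWJ element after the first emoji.
    draft = re.compile(r'E(M)?(ZE(M)?)+')
    # The same shape with '*' also accepts a lone emoji.
    fixed = re.compile(r'E(M)?(ZE(M)?)*')

    print(bool(draft.fullmatch('E')))    # False: a single "emoji" fails with '+'
    print(bool(fixed.fullmatch('E')))    # True:  it matches once '+' becomes '*'
    print(bool(draft.fullmatch('EZE')))  # True:  ZWJ sequences match either way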
Date/Time: Wed May 8 07:11:34 CDT 2019
Name: Shinyu Murakami
Report Type: Error Report
Opt Subject: Line breaking should be possible between alphanumeric and fullwidth opening punctuation
I had reported this issue to Chromium Bugs, see <https://bugs.chromium.org/p/chromium/issues/detail?id=827538>, and got the answer "this is a UAX #14 issue".

http://www.unicode.org/reports/tr14/tr14-43.html#LB30

> LB30 Do not break between letters, numbers, or ordinary symbols and opening or closing parentheses.
>
> (AL | HL | NU) × OP
>
> CP × (AL | HL | NU)
>
> The purpose of this rule is to prevent breaks in common cases where a part of a word appears between delimiters—for example, in “person(s)”.

I think the problem is that the OP in this rule includes all opening punctuation. The solution would be to divide OP into OP1 and OP2, where OP1 includes normal "(" and "[" and OP2 includes fullwidth "（", "［", "「", etc. (East_Asian_Width F, W and H), and to change the LB30 rule "(AL | HL | NU) × OP" to "(AL | HL | NU) × OP1".
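A quick way to see the proposed split is to check General_Category and East_Asian_Width directly (a minimal sketch using Python's unicodedata; the OP1/OP2 names are the reporter's proposal, not existing UAX #14 classes):

    import unicodedata

    def proposed_op2(ch):
        # Opening punctuation (gc=Ps) whose East_Asian_Width is F, W or H,
        # i.e. the subset the report suggests LB30 should allow breaks before.
        return (unicodedata.category(ch) == 'Ps'
                and unicodedata.east_asian_width(ch) in ('F', 'W', 'H'))

    for ch in '([（［「':
        print('U+{:04X}'.format(ord(ch)), unicodedata.east_asian_width(ch), proposed_op2(ch))
    # U+0028 Na False, U+005B Na False, U+FF08 F True, U+FF3B F True, U+300C W True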
Date/Time: Tue May 28 12:07:33 CDT 2019
Name: William Overington
Report Type: Feedback on an Encoding Proposal
Opt Subject: On the validity of an encoding of a QID emoji as mentioned in L2/19-203 Working Draft for Proposed Update UTS #51, Unicode Emoji
On the validity of an encoding of a QID emoji as mentioned in L2/19-203, Working Draft for Proposed Update UTS #51, Unicode Emoji.

In L2/19-203, Working Draft for Proposed Update UTS #51, Unicode Emoji, there is, in section C.2, the following.

quote
A sequence of TAG characters corresponding a Q followed by a sequence of one or more digits, corresponding to a valid Wikidata QID representing a depictable object.
end quote

and also the following

quote
A subset of QIDs are associated with entities that would be valid for emoji. For example, risk management (Q189447) and this (Q3109046) would not be valid. Of those that are valid, Wikidata may not have associated images for the referenced entity, and such images would rarely — if ever — be appropriate for use as images for emoji.
end quote

I suggest that there should not be that restriction and that all QID items should be valid for QID emoji and thus for interchange and interoperability in a plain text environment. Some may never be used, yet I am thinking that to state that some "would not be valid" would be a decision that could restrict progress and the implementation and beneficial application of new ideas in the future. There is also the practical problem of how such a rule could be precisely specified.

Also, the word 'depict' as defined in the Oxford English Dictionary seems to mean that a QID emoji of each QID item would be valid under the quoted definition from the L2/19-203 document. https://en.oxforddictionaries.com/definition/depict

This seems to come back to the issue of whether emoji can be of abstract designs rather than just of physical objects. In my opinion, restricting emoji to images of physical objects is unnecessary and undesirable, as it would limit creativity and opportunities for communication of ideas. In my opinion the expression of ideas using abstract designs is an important part of human culture. As it happens, when we were discussing the possibility of abstract emoji some time ago on the public mailing list, I produced glyphs for "this" and for "that", as a gentleman had indirectly suggested the possibility. They are about 60% of the way down the following web page. http://www.users.globalnet.co.uk/~ngo/abstract_emoji.htm

I accept that "this" as in "this and that" is not the same as "this" as used in some computer languages, yet maybe, just maybe, a glyph for "this" used in that context could be like my design for a glyph for "this" with a large round dot, say in green, added in the lower right corner, so as to indicate a dot as used in listing the name of an object in some computer programming languages.

Restricting which QID items could be emoji also restricts the possibility of using the QID page data for text to speech. For example, risk management (Q189447) already has text in three languages. The encoding of abstract items as QID items and thus as QID emoji could help communication through the language barrier, including, possibly very helpfully, in emergency situations.

I have devised a glyph for risk management. The glyph is of a red jagged shape enclosed within a yellow rounded shape for the colourful version; the monochrome version is of a solid jagged shape enclosed within the outline of a rounded shape. The shapes are something like those in the following article. https://en.wikipedia.org/wiki/Bouba/kiki_effect

William Overington Tuesday 28 May 2019
Date/Time: Mon Jul 1 11:23:31 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Code point labels vs. names in UTS #18
Section 2.5 of UTS #18 recommends supporting code point labels like \p{name=private-use-E000}. It allows supporting aliases not in the UCD, but warns that they may clash with future official names or aliases. There is nothing preventing the encoding of a character with the name PRIVATE-USE-E000, so the same warning should apply to code point labels. Alternatively, there could be a stability guarantee that no character’s name will ever look like a code point label.
Date/Time: Mon Jul 1 11:50:10 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Problems with hyphens in character names in UTS #18
Section 2.5 of UTS #18 says “Name matching rules follow Matching Rules from [UAX44].” UAX44-LM2 says to ignore all medial hyphens except the one in U+1180. However, section 2.5.1 says to ignore hyphens when matching names for \N, with three exceptional pairs: U+0F68 vs. U+0F60, U+0FB8 vs. U+0FB0, and U+116C vs. U+1180, “where an extra test shall be made for the presence or absence of a hyphen”. Is it intentional that \p{name} and \N use different fuzzy match rules? For example, \p{name=TIBETAN LETTER-A} matches U+0F68 TIBETAN LETTER A because, per UAX44-LM2, a medial hyphen is equivalent to a space; \N{TIBETAN LETTER-A} matches U+0F60 TIBETAN LETTER -A because it contains a hyphen. This seems confusing. For another example, \p{name=TIBETAN MARK BKA SHOG YIG MGO} matches nothing because the hyphen in U+0F0A TIBETAN MARK BKA- SHOG YIG MGO is not medial; \N{TIBETAN MARK BKA SHOG YIG MGO} does match U+0F0A because the hyphen is ignored. Also, “an extra test shall be made for the presence or absence of a hyphen” is unclear. Would \N{TIBETAN LET-TER A} match U+0F60 TIBETAN LETTER -A because a hyphen is present? There are more than three pairs of characters whose names differ only by a hyphen: there are also U+11A00 vs. U+11A29, U+11A50 vs. U+11A7A, and U+11C8F vs. U+11C88.
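To make the discrepancy easier to see, here is a rough illustration of the UAX44-LM2 loose-matching fold that \p{name} is defined to use (a minimal sketch, not a validated implementation; it approximates "medial hyphen" as a hyphen-minus directly between two alphanumeric characters and special-cases U+1180):

    import re

    def uax44_lm2_fold(name):
        # Keep the medial hyphen only for HANGUL JUNGSEONG O-E (U+1180).
        keep_hyphen = name.upper().replace(' ', '') == 'HANGULJUNGSEONGO-E'
        if not keep_hyphen:
            name = re.sub(r'(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])', '', name)
        return re.sub(r'[\s_]+', '', name).lower()

    # These fold to the same key, which is why \p{name=TIBETAN LETTER-A}
    # matches U+0F68 TIBETAN LETTER A under UAX44-LM2:
    print(uax44_lm2_fold('TIBETAN LETTER A') == uax44_lm2_fold('TIBETAN LETTER-A'))  # True
    # The hyphen in U+0F0A is not medial (it precedes a space), so it survives:
    print(uax44_lm2_fold('TIBETAN MARK BKA- SHOG YIG MGO'))  # tibetanmarkbka-shogyigmgo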
Date/Time: Tue May 7 16:29:23 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Saurashtra C2-conjoining forms
The Saurashtra section says “An exception to the non-occurrence of complex consonant clusters is the conjunct ksa, formed by the sequence <U+A892, U+A8C4, U+200D, U+A8B0>. [...] If necessary, U+200D ZERO WIDTH JOINER may be used to force conjunct behavior.” That implies that the syllable “kra” in the old-fashioned style would be formed by the sequence <ka, virama, ZWJ, ra>. That is the opposite of the usual Indic practice (see PR #37) where a C2-conjoining form (as in Saurashtra) is formed by <ZWJ, virama, C2>. I recommend using <ZWJ, virama>. I know of only two Saurashtra fonts: Pagul and Noto Sans Saurashtra. Neither really supports C2-conjoining forms. (Pagul supports one but it doesn’t even use ZWJ at all. Noto supports only a couple specific syllables.) So I wouldn’t worry about breaking any existing text. Since “kṣa” is an atomic conjunct, either order could work. Using <ka, ZWJ, virama, ṣa> seems more consistent with other syllables, but using <ka, virama, ZWJ, ṣa> would be more compatible with previous versions of TUS and would allow <ka, ZWJ, virama, ṣa> to request a non-atomic “kṣa”, so I recommend keeping “kṣa” as it is. In any case, I recommend documenting all of this explicitly.
Date/Time: Sat Apr 27 20:56:17 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Bad advice about precomposed Egyptian hieroglyphs
The section on Egyptian Hieroglyph Format Controls says “Some Egyptian hieroglyphs with complex structures have previously been encoded as single characters. When glyphs for these single characters are available in the font, the precomposed hieroglyphs should be used instead of complex sequences of hieroglyphs with appropriate joining controls”. This makes the encoding of Egyptian hieroglyphs depend on the choice of font, which is inappropriate for plain text. The standard should not include the phrase “When glyphs for these single characters are available in the font”.
Date/Time: Thu May 9 14:28:21 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: More Identifier_Type=Technical characters
The following should have Identifier_Type=Technical for consistency with other phonetic symbols.

• U+0560 ARMENIAN SMALL LETTER TURNED AYB
• U+0588 ARMENIAN SMALL LETTER YI WITH STROKE
• U+A78E LATIN SMALL LETTER L WITH RETROFLEX HOOK AND BELT
• U+A7AF LATIN LETTER SMALL CAPITAL Q
• U+A7BA LATIN CAPITAL LETTER GLOTTAL A
• U+A7BB LATIN SMALL LETTER GLOTTAL A
• U+A7BC LATIN CAPITAL LETTER GLOTTAL I
• U+A7BD LATIN SMALL LETTER GLOTTAL I
• U+A7BE LATIN CAPITAL LETTER GLOTTAL U
• U+A7BF LATIN SMALL LETTER GLOTTAL U
• U+A7FA LATIN LETTER SMALL CAPITAL TURNED M
Date/Time: Thu May 9 15:17:02 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Letter ra in Pali in Shan State
L2/10-345 says that in Pali as written in Shan State, the letter ra is U+AA79 MYANMAR SYMBOL AITON TWO. UTN #11 says nothing about that, implying that ra is actually the similar-looking U+101B MYANMAR LETTER RA. Which code point should be used for that ra? If U+AA79 should, then it should have gc=Lo and InSC=Consonant. If the glyph shown in L2/10-345 is just a stylistic variant of U+101B, then no action is needed.
Date/Time: Thu May 9 16:11:03 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Ambiguous precomposed hieroglyphic quadrats with format controls
The section on Egyptian Hieroglyph Format Controls says to use precomposed hieroglyphs when available. One of the examples in Table 11-2 is that 13217 is preferred over 13216:13216:13216. How should a quadrat of four stacked 13216s be encoded: 13216:13216:13216:13216, 13216:13217, or 13217:13216? It is ambiguous. I think the advice should be to prefer precomposed hieroglyphs when they are the entire quadrat, but not to use precomposed hieroglyphs in a complex cluster with format controls.
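For concreteness, the three candidate encodings in this example can be written out with the format controls (a sketch under the assumption that ":" in the example denotes U+13430 EGYPTIAN HIEROGLYPH VERTICAL JOINER):

    VJ = '\U00013430'      # EGYPTIAN HIEROGLYPH VERTICAL JOINER (":" in the example)
    SINGLE = '\U00013216'  # the single sign in Table 11-2's example
    TRIPLE = '\U00013217'  # its precomposed three-high stack

    # Three plausible encodings of four vertically stacked 13216s:
    candidates = [
        SINGLE + VJ + SINGLE + VJ + SINGLE + VJ + SINGLE,  # 13216:13216:13216:13216
        SINGLE + VJ + TRIPLE,                              # 13216:13217
        TRIPLE + VJ + SINGLE,                              # 13217:13216
    ]
    for c in candidates:
        print(' '.join('U+{:04X}'.format(ord(ch)) for ch in c))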
Date/Time: Sun May 19 18:19:24 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Identifier_Type of U+10A0D KHAROSHTHI SIGN DOUBLE RING BELOW
U+10A0D KHAROSHTHI SIGN DOUBLE RING BELOW has Identifier_Type=Technical|Exclusion. It is not a technical character. It should just have Identifier_Type=Exclusion.
Date/Time: Thu May 23 05:27:36 CDT 2019
Name: Mike FABIAN
Report Type: Error Report
Opt Subject: The Emoji ZWJ Sequence “people holding hands”,
which appeared in Emoji-12.0/Unicode-12.0 has “10.0” in the comments in
emoji-zwj-sequences.txt
The emoji ZWJ sequence 1F9D1 200D 1F91D 200D 1F9D1 “people holding hands” first appeared in https://www.unicode.org/Public/emoji/12.0/emoji-sequences.txt; it is not in https://www.unicode.org/Public/emoji/11.0/emoji-sequences.txt. So apparently this was added in 12.0. But https://www.unicode.org/Public/emoji/12.0/emoji-sequences.txt contains 10.0 in the comments:

$ grep "people holding hands" emoji-zwj-sequences.txt
1F9D1 200D 1F91D 200D 1F9D1 ; Emoji_ZWJ_Sequence ; people holding hands # 10.0 [1]
1F9D1 1F3FB 200D 1F91D 200D 1F9D1 1F3FB ; Emoji_ZWJ_Sequence ; people holding hands: light skin tone # 10.0 [1]
... ETC ...
1F9D1 1F3FF 200D 1F91D 200D 1F9D1 1F3FF ; Emoji_ZWJ_Sequence ; people holding hands: dark skin tone # 10.0 [1]
$

That seems to be a mistake.
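A check of this kind can be automated (a minimal sketch; it assumes local copies of the 11.0 and 12.0 emoji-zwj-sequences.txt files and the "sequence ; type ; name # version [count]" comment layout shown in the grep output above):

    import re

    def zwj_sequence_versions(path):
        # Map each ZWJ sequence (as a space-separated code point string)
        # to the version number given in its trailing comment.
        versions = {}
        with open(path, encoding='utf-8') as f:
            for line in f:
                if not line.strip() or line.startswith('#'):
                    continue
                fields, _, comment = line.partition('#')
                seq = fields.split(';')[0].strip()
                m = re.match(r'\s*E?(\d+\.\d+)', comment)
                versions[seq] = float(m.group(1)) if m else None
        return versions

    # old = zwj_sequence_versions('emoji-zwj-sequences-11.0.txt')
    # new = zwj_sequence_versions('emoji-zwj-sequences-12.0.txt')
    # Sequences new in 12.0 whose comment claims an earlier version:
    # print({s: v for s, v in new.items() if s not in old and v and v < 12.0})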
Date/Time: Mon Jul 1 12:50:01 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Underdefined code point label syntax
Subsection “Code Point Labels” in section 4.8 of The Unicode Standard says “code point labels are constructed by using a lowercase prefix derived from the code point type, followed by a hyphen-minus and then a 4- to 6-digit hexadecimal representation of the code point.” This is a mite ambiguous. May the hexadecimal representation use lowercase letters or fullwidth characters, since they match \p{Hex_Digit}? Is the hexadecimal representation allowed to have extra leading zeros, e.g. <control-000009>? (This matters because code point labels are part of UTS #18 syntax.)
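As an illustration of one possible reading of that text (a minimal sketch; the choice of uppercase hex digits and of no padding beyond four digits is exactly the kind of detail the report says the wording leaves unspecified):

    def code_point_label(prefix, cp):
        # Construct a code point label per one reading of TUS section 4.8:
        # lowercase type prefix, hyphen-minus, then the code point as 4 to 6
        # uppercase hex digits with no extra leading zeros.
        assert prefix in ('control', 'reserved', 'noncharacter',
                          'private-use', 'surrogate')
        return '{}-{:04X}'.format(prefix, cp)

    print(code_point_label('private-use', 0xE000))  # private-use-E000
    print(code_point_label('control', 0x0009))      # control-0009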
Date/Time: Thu Jul 4 23:01:44 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Misleading description of Manichaean ligatures
The Manichaean section says “Manichaean has two obligatory ligatures for sadhe followed by yodh or nun”, but they are not obligatory and they are not always ligatures. See https://github.com/googlefonts/noto-fonts/issues/1550 for details and evidence.
Date/Time: Mon Jul 8 02:10:57 CDT 2019
Name: Ivan Timokhin
Report Type: Error Report
Opt Subject: Inconsistency in name derivation rules ranges
Ed Note: This has already been fixed for next version.
There appears to be a mismatch between the ranges for name derivation rules listed in Table 4-8 of the Standard v12.1 (p. 185, https://www.unicode.org/versions/Unicode12.1.0/ch04.pdf) and the contents of UnicodeData.txt (https://www.unicode.org/Public/12.1.0/ucd/UnicodeData.txt). Namely, Table 4-8 contains the ranges 4E00..9FEA for "CJK UNIFIED IDEOGRAPH" and 17000..187EC for "TANGUT IDEOGRAPH", whereas the corresponding ranges in UnicodeData.txt end at 9FEF and 187F7 respectively. Furthermore, the code charts actually contain entries for the rest of these ranges, and, for what it's worth, ICU appears to report names generated by the corresponding rules for these characters. This, together with the disclaimer at the beginning of Chapter 4, suggests to me that it is Table 4-8 that is incorrect.
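The ranges in question can be read straight out of UnicodeData.txt, which is how the 9FEF and 187F7 endpoints above can be verified (a minimal sketch assuming a local copy of the 12.1.0 file; the "<Name, First>"/"<Name, Last>" range markers are part of the published file format):

    def ranges_from_unicodedata(path):
        # Collect the First/Last code point ranges declared in UnicodeData.txt.
        ranges, pending = {}, None
        with open(path, encoding='utf-8') as f:
            for line in f:
                cp, name = line.split(';')[:2]
                if name.endswith(', First>'):
                    pending = (name[1:-8], int(cp, 16))
                elif name.endswith(', Last>') and pending:
                    ranges[pending[0]] = (pending[1], int(cp, 16))
                    pending = None
        return ranges

    # first, last = ranges_from_unicodedata('UnicodeData.txt')['CJK Ideograph']
    # print(hex(first), hex(last))  # expected: 0x4e00 0x9fef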
Date/Time: Mon Jul 15 10:31:38 CDT 2019
Name: Ken Lunde
Report Type: Error Report
Opt Subject: Kanbun block property versus implementation issue
With regard to the 16 characters in the Kanbun (漢文) block, the last 14 of them, U+3192 through U+319F, have the <super> (superscript) property and include a compatibility decomposition (NFKC and NFKD) to a CJK Unified Ideograph: https://www.unicode.org/charts/PDF/U3190.pdf

These characters are referred to as "kaeriten" (返り点), and are used to annotate Chinese texts (aka Kanbun). They have been in Unicode from the beginning (Version 1.0), and are not included in any JIS standard, other than JIS X 0221, which is effectively a clone of ISO/IEC 10646. Very few non-Japanese fonts include glyphs for these characters, because their use is Japanese-specific. However, the glyphs for these characters in the vast majority of Japanese fonts are provided at full size, with the expectation that the layout engine reduce them to half size and position them appropriately. Some typefaces, such as the Hiragino families and Kozuka Mincho, provide generic glyphs for these characters that do not vary by weight, but other typefaces include glyphs that do vary by weight. Their glyphs may or may not be identical to those of their corresponding CJK Unified Ideographs, but all such known implementations use separate GIDs (Glyph IDs) for them.

These characters are similar to Kenten (圏点) characters, whose glyphs are also provided at full size with the same expectation of the renderer. Adobe InDesign, from Version 1.0J, supports the typesetting of Kenten characters, such as U+2022 • BULLET, U+25B2 ▲ BLACK UP-POINTING TRIANGLE, U+25B3 △ WHITE UP-POINTING TRIANGLE, U+25C9 ◉ FISHEYE, U+25CB ○ WHITE CIRCLE, U+25CE ◎ BULLSEYE, U+25CF ● BLACK CIRCLE, U+25E6 ◦ WHITE BULLET, U+FE45 ﹅ SESAME DOT, and U+FE46 ﹆ WHITE SESAME DOT, and is bundled with a dedicated font that provides their glyphs (at full size).

In terms of Unicode, the Kanbun subsection of Section 18.1, Han, of the Core Specification (page 720) states only the following about this particular block: "This block contains a set of Kanbun marks used in Japanese texts to indicate the Japanese reading order of classical Chinese texts. These marks are not encoded in any other current character encoding standards but are widely used in literature. They are typically written in an annotation style to the left of each line of vertically rendered Chinese text. For more details, see JIS X 4051."

JIS X 4051:2004 is still the latest version of that referenced standard, which was reconfirmed on 2018-10-22, and JLREQ, which implemented much of the functionality of that standard, provides no information about Kanbun. Section 5 of JIS X 4051 (pp 35 through 44) covers Kanbun, and states the following in Section 5.4: 返り点，送り仮名及び読み仮名の文字サイズ: 返り点，送り仮名及び読み仮名の文字サイズは，漢文の漢字の文字サイズの 1/2 とする。処理系定義により，返り点，送り仮名及び読み仮名は，漢文の漢字の文字サイズの 1/2 以下としてもよい。 (In English: Character size of kaeriten, okurigana and yomigana: the character size of kaeriten, okurigana and yomigana shall be one-half the character size of the Kanbun kanji; as an implementation-defined matter, they may also be one-half that size or smaller.)

The above statement seems very clear that the glyphs for kaeriten (返り点), along with those for okurigana (送り仮名) and yomigana (読み仮名), are to be scaled to one-half size (or smaller). See Noto CJK Issue #159 for some spirited discussion, which includes some unfortunate misunderstandings on my part (facepalm): https://github.com/googlefonts/noto-cjk/issues/159

My question to the UTC, and to character property experts in particular, is whether the normative <super> property and associated decomposition, which are in conflict with how virtually all Japanese fonts supply the glyphs for the Kanbun block at full size with the expectation that they be scaled and positioned as appropriate, are an issue in a practical sense, or at some level. I am asking because we're facing what I consider to be two effective non-starters here: 1) changing the property value; and 2) changing hundreds of Japanese fonts. I do plan to change the typefaces under my control, meaning the Noto CJK (and Source Han) families and Kozuka Gothic, to use generic glyphs for these characters that do not vary by weight, but otherwise have no plans to adjust their size or positioning.
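The property data in question can be inspected directly (a small sketch using Python's unicodedata, shown only to make the <super> decompositions under discussion concrete):

    import unicodedata

    # U+3192..U+319F are the kaeriten discussed above; each carries a <super>
    # compatibility decomposition to a full-size CJK Unified Ideograph,
    # so NFKC/NFKD fold them to those full-size characters.
    for cp in range(0x3192, 0x31A0):
        ch = chr(cp)
        print('U+{:04X} {} -> {}'.format(cp, unicodedata.name(ch),
                                         unicodedata.decomposition(ch)))
    # e.g. U+3192 IDEOGRAPHIC ANNOTATION ONE MARK -> <super> 4E00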
Date/Time: Mon Jul 15 17:48:53 CDT 2019
Name: Jaemin Chung
Report Type: Feedback on an Encoding Proposal
Opt Subject: Some incorrect radical-stroke values in Ext G
Some radical-stroke values in Extension G are incorrect. http://unicode.org/L2/L2019/19227-n5100r2-10646-6th-ed-cd3-chart.pdf

U+30059: 7.11 → 7.12
U+300E3: 12.8 → 12.9
U+3010E: 15.15 → 15.13

If reordering of characters is still possible, these pairs need to be swapped:

U+30059 and U+3005A
U+300E3 and U+300E4
Date/Time: Mon Jul 15 20:10:14 CDT 2019
Name: Eiso Chan
Report Type: Public Review Issue
Opt Subject: Comment on the RS for U+3010E (GZ-4571201) in WG2 N5100R2
There is no need to change the RS value for U+3010E (GZ-4571201), because we have confirmed in the IRG that the stroke count of the component 争 is counted as that of 爭; they are a unifiable case.
Date/Time: Tue Jul 16 23:01:39 CDT 2019
Name: William T. Nelson
Report Type: Error Report
Opt Subject: U+2DCE7 glyph issue
The glyph for U+2DCE7 𭳧 (KC-06432) in the CJK Extension F code chart renders incorrectly in some applications on macOS, including Preview, Safari, and Firefox. This glyph renders correctly on iOS (all apps) and in Chrome for macOS. Here's a screenshot: https://wtnelson.s3-us-east-2.amazonaws.com/unicode/U_2DCE7_glyph.png

Addendum: I flipped through the charts and found more cases. Here are all of them:

block  code_point  char_ref
U      U+6C08      K0-6E7D
U      U+860B      V1-6568
B      U+25706     UCS2003
B      U+25990     UCS2003
B      U+2620F     UCS2003
B      U+26286     UCS2003
B      U+2822D     UCS2003
B      U+28F17     UCS2003
B      U+29F4A     UCS2003
E      U+2CC56     GXC-3023.15
F      U+2DCE7     KC-06432

Thanks, William
Date/Time: Tue Jul 16 12:27:04 CDT 2019
Name: Jaemin Chung
Report Type: Feedback on an Encoding Proposal
Opt Subject: More radical-stroke value errors in Ext G
I found more radical-stroke value errors in Extension G.

U+30A12: 113.12 → 113.10 (should be moved between U+30A0F and U+30A10)
U+30A3F: 115.15 → 115.16 (U+30A3F and U+30A40 should be swapped)
U+30AD0: 119.12 → 119.11 (no reordering needed)
U+31173: 188.11 → 188.12 (no reordering needed)
U+311CB: 195.21 → 195.23 (U+311CB and U+311CC should be swapped)
Date/Time: Tue Jul 23 14:00:41 CDT 2019
Name: David Corbett
Report Type: Error Report
Opt Subject: Identifier_Type of IPA characters
Some characters that are actually in common use as IPA symbols have Identifier_Type = Uncommon_Use. They should have Technical but not Uncommon_Use.

• U+0253 LATIN SMALL LETTER B WITH HOOK
• U+0254 LATIN SMALL LETTER OPEN O
• U+0256 LATIN SMALL LETTER D WITH TAIL
• U+0257 LATIN SMALL LETTER D WITH HOOK
• U+025B LATIN SMALL LETTER OPEN E
• U+0260 LATIN SMALL LETTER G WITH HOOK
• U+0263 LATIN SMALL LETTER GAMMA
• U+0268 LATIN SMALL LETTER I WITH STROKE
• U+026F LATIN SMALL LETTER TURNED M
• U+0272 LATIN SMALL LETTER N WITH LEFT HOOK
• U+0275 LATIN SMALL LETTER BARRED O
• U+0280 LATIN LETTER SMALL CAPITAL R
• U+0283 LATIN SMALL LETTER ESH
• U+0288 LATIN SMALL LETTER T WITH RETROFLEX HOOK
• U+0289 LATIN SMALL LETTER U BAR
• U+028A LATIN SMALL LETTER UPSILON
• U+028B LATIN SMALL LETTER V WITH HOOK
• U+0292 LATIN SMALL LETTER EZH
• U+0294 LATIN LETTER GLOTTAL STOP
Date/Time: Tue Jul 23 14:09:43 CDT 2019
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: Incorrect script code in UAX 24
Unicode® Standard Annex #24, UNICODE SCRIPT PROPERTY, contains the phrase "so it is assigned an scx set value of {Hira Kata}". "Kata" is an incorrect script code. The ISO 15924 script code for Katakana is "Kana", which is already used correctly in the table above the incorrect use.
Date/Time: Sun Jul 7 13:41:43 CDT 2019
Name: Ken Martin
Report Type: Submission (FAQ, Tech Note, Case Study)
Opt Subject: LOINC codes Unicode emoji Skin Type
I'm not sure if you are aware, but LOINC recently released their new set of codes. I had submitted Unicode emoji skin tone modifiers and Fitzpatrick Skin Type questions to LOINC. The Fitzpatrick questions will probably be released in December, but the Unicode emoji skin tone modifier codes are out. I believe these are the first LOINC codes for emoji. I looked at the emoji for pain scales, but the faces appeared to differ in the literature. Perhaps this is a project you can do, since pain emoji are clinically useful. You can view the complete skin tone emoji codes in LOINC, but here are the codes:

89843-7 Unicode Emoji skin tone modifier
PREFERRED ANSWER LIST (LL5071-7)
Source: Unicode, Inc.

SEQ# 1: Light skin tone
  Unicode: 1F3FB EMOJI MODIFIER FITZPATRICK TYPE-1-2
  Description: Emoji Modifier Fitzpatrick Type-1-2
  Answer ID: LA29279-9

SEQ# 2: Medium-light skin tone
  Unicode: 1F3FC EMOJI MODIFIER FITZPATRICK TYPE-3
  Description: Emoji Modifier Fitzpatrick Type-3
  Answer ID: LA29280-7

SEQ# 3: Medium skin tone
  Unicode: 1F3FD EMOJI MODIFIER FITZPATRICK TYPE-4
  Description: Emoji Modifier Fitzpatrick Type-4
  Answer ID: LA29281-5

SEQ# 4: Medium-dark skin tone
  Unicode: 1F3FE EMOJI MODIFIER FITZPATRICK TYPE-5
  Description: Emoji Modifier Fitzpatrick Type-5
  Answer ID: LA29282-3

SEQ# 5: Dark skin tone
  Unicode: 1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6
  Description: Emoji Modifier Fitzpatrick Type-6
  Answer ID: LA29283-1