L2/22-018

Comments on Public Review Issues
(September 26, 2021 - January 18, 2022)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of January 18, 2022, since the previous cumulative document was issued prior to UTC #169 (October 2021).

Contents:

The links below go directly to open PRIs and to feedback documents for them, as of January 18, 2022.

Issue Name Feedback Link
441 Proposed Update UAX #29, Unicode Text Segmentation (feedback)
440 Proposed Update UTS #10, Unicode Collation Algorithm (feedback)
439 Proposed Update UAX #50, Unicode Vertical Text Layout (feedback) No feedback at this time
438 Proposed Update UAX #44, Unicode Character Database (feedback)
437 Proposed Update UAX #38, Unicode Han Database (Unihan) (feedback) No feedback at this time
436 UTS #37 Unicode Ideographic Variation Database (feedback) No feedback at this time
435 Unicode Emoji 15.0 Provisional Candidates (feedback)
434 CLDR Person Name Formatting (feedback)
427 Proposed Update UTS #18, Unicode Regular Expressions (feedback)

The links below go to locations in this document for feedback.

Feedback routed to Unihan ad hoc for evaluation
Feedback routed to Script ad hoc for evaluation
Feedback routed to Properties & Algorithms ad hoc for evaluation
Feedback routed to Emoji SC for evaluation
Feedback routed to Editorial Committee for evaluation
Other Reports

Feedback routed to Unihan ad hoc for evaluation

Date/Time: Fri Oct 8 20:34:45 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: Recommended addition to UAX #38

I would like to apologize if in the past I wasn't as helpful to the Unihan
group as I should have been.

This is a more succinct recommendation to the Unihan group: add the
number of entries for each field to the description boxes of the document in
question: https://unicode.org/reports/tr38/. Since Dr. Lunde has already
compiled most of the entry counts in this document:
https://docs.google.com/spreadsheets/d/1_ad7Z9qqMONlK5SUfNjaSXSNIaG-0POQKTlUqCvxdhI/edit#gid=559817095
it would be trivial to add the most up-to-date counts in new versions of
UAX #38. This would be convenient for users who might not want to consult
different documents for that information. The counts give users a sense of
how large each field is, and comparing them across future versions would
also reflect the growth of the database.

Date/Time: Sat Nov 20 04:41:17 CST 2021
Name: Eiso Chan
Report Type: Error Report
Opt Subject: Unihan Database

The kMandarin value for U+266E8 𦛨 is lao. Maybe we need to modify it to láo
based on the corresponding Traditional variant U+6725 朥. This character is
very common in Teochow-Swatow Min-dialects for the local food 𦛨饼, but it’s
a pity that it has not been included in TGH.

Date/Time: Sat Nov 27 22:35:35 CST 2021
Name: Jerry Rossignuolo
Report Type: Error Report
Opt Subject: 19227-n5100r-10646-6th-ed-cd3-chart.pdf

Hello,

I see you have the radical "⺜" interpreted as the radical "sun". ⺜ is used
as a variant of 日 (sun) in the 新华字典部首 (XinHuaZiDian BuShou). Yet not all
variants of a BuShou per GF 0011-2009 are of the same radical.

Most of the material I am finding lists the radical ⺜ as a variant of 冃 with
the meaning of cap or hat. I believe this dates back to the Shuowen Jiezi
(说文解字) dictionary. http://www.shuowen.org/?bushou=%E5%86%83

I also believe ⺜ as a radical is named 冒字头 which further has me thinking
this radical has the meaning of cap or hat.

Yet, I am not sure. Could you clarify this for me? I ran
across this while working on a Chinese language learning tool and need to
correctly identify the meaning of ⺜.

Thanks,
Jerry Rossignuolo

Date/Time: Sat Dec 18 19:28:56 CST 2021
Name: Richard Hsieh
Report Type: Error Report
Opt Subject: CJK chart 4E30 mixed up

4E30 丰 has HB1-A4A5 and T1-4464, which are the Traditional Chinese
character. The rest of the characters are the Simplified Chinese
characters. They are two different characters and cannot be mixed up.
I could not come up with the Traditional character for the name of a
person and other things, because software that carries both at
the same time cannot tell them apart and places the Simplified
Chinese character in place of the Traditional Chinese character.

Date/Time: Tue Jan 4 07:17:52 CST 2022
Name: Andrew Christopher West
Report Type: Other Document Submission
Opt Subject: CJK Ext. H U+31682 (UK-10989)

L2-21/053 "Additional repertoire for a future version of Unicode
(post Unicode 14.0)" lists the proposed code chart for CJK Unified
Ideographs Extension H. The character at U+31682 (UK-10989) is indexed as
Radical 40 plus one residual stroke, and placed first under radical 40
(宀). This is clearly wrong because: 1) the character does not include
radical 40; and 2) the total stroke count is 9. I suggest changing back to
the original proposed index of radical 25 plus 7 residual strokes, and
reordering after U+31455.

Date/Time: Thu Jan 6 20:28:54 CST 2022
Name: Ken Lunde
Report Type: Error Report
Opt Subject: Unihan Database

The kTotalStrokes property value of U+2AB8F 𪮏 (⿰手思) should be changed 
from 12 to 13, because its indexing radical is composed of four strokes, 
not three.

Date/Time: Thu Jan 6 07:53:34 CST 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: Unihan_Readings.txt

I want to report some mistakes in the Unihan Database definitions
(kDefinition): “from from” (instead of “from”), “disturbe”
(instead of “disturb”), “pon your mind” (instead of “on your
mind”), “thron” (instead of “thorn”), “flaten” (instead of “flatten”), “name
name” (instead of “name”), “chrysanthemun”
(instead of “chrysanthemum”), “purpurca” (instead of “purpurea”), “force fo
arms” (instead of “force of arms”), “phtholein”
(instead of “phthalein”), “the the” (instead of “the”), “ber eaten”
(instead of “be eaten”), “bured” (instead of “buried”), “askewd”
(instead of “askew”), “foll of cloth” (instead of “roll of
cloth”), “smilling” (instead of “smiling”), “without friends or relativ”
(instead of “without friends or relatives”), “longtum”
(instead of “longum”), “themedia forskali” (instead of “Themeda
forskalii”), “circium” (instead of “Cirsium”), “bracenia”
(instead of “Brasenia”), “artemesia” (instead of “artemisia”), “corp of a
bird” (instead of “crop of a bird”), “stellariana”
(instead of “stelleriana”), “eumenes polifomis” (apparently should
be “Eumenes pomiformis”, but is this definition actually
correct??), “loquatious” (instead of “loquacious”), “interprete”
(instead of “interpret”), “liesure” (instead of “leisure”), “mischevious”
(instead of “mischievous”), “fy” (instead of “fry”), “incorruptable”
(instead of “incorruptible”), “repse” (instead of “repose”).

The following words should be
capitalized: “sanskrit”, “buddhist”, “daoist”, “pekinese”, “persian”.
Scientific names are capitalized as well, so this should be
corrected: “malus”, “canis”, “ursus”, “rubia”, “plantago”, “piper”, “caryopteris”,
“hydropyrum”, “artemisia
stelleriana”, “gracilaria”, “vitis”, “valeriana”, “pteris”, “ligusticum”, “allium”,
“cyperus”, “lophanthus”, “arca”, “libellulidae”, “vipera”, “brachyura”, “acrida”,
“cosmopsaltria”, “parasilurus”, “spheroides”, “coryphaena”, “pagrosomus”, “treron”,
“grus”.

The following are questionable; I am unsure what they are supposed to
mean: “leucacene”, “suffle”.

Date/Time: Mon Jan 10 18:45:29 CST 2022
Name: Jaemin Chung
Report Type: Error Report
Opt Subject: Defect report on USourceData.txt

In USourceData.txt, some IDSes have semicolons in them. This is bad 
because a semicolon is already used as a delimiter.

Here are the IDSes with semicolons:

UTC-03134;B;U+28559;162.9;;⿺辶⿹&P7-03;刀;UTCDoc L2/17-204;;13 12;3
UTC-03143;WS-2017;;145.12;;⿳⿱&H5-01;冖石衣;UTCDoc L2/17-204;;18;1
UTC-03156;WS-2017;;94.8;;⿱&H8-01;犬;UTCDoc L2/17-204;;12;1
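
The effect of the stray semicolons can be illustrated with a minimal sketch. It assumes that a consumer splits each record naively on ";" and that a well-formed record has ten fields, which is the count these three lines would have without the extra semicolon inside the IDS. The helper name overlong_records is hypothetical.

```python
# Assumption: ten fields per record, i.e. the count each line below
# would have if the IDS field contained no semicolon of its own.
EXPECTED_FIELDS = 10

LINES = [
    "UTC-03134;B;U+28559;162.9;;⿺辶⿹&P7-03;刀;UTCDoc L2/17-204;;13 12;3",
    "UTC-03143;WS-2017;;145.12;;⿳⿱&H5-01;冖石衣;UTCDoc L2/17-204;;18;1",
    "UTC-03156;WS-2017;;94.8;;⿱&H8-01;犬;UTCDoc L2/17-204;;12;1",
]

def overlong_records(lines, expected=EXPECTED_FIELDS):
    """Return (record id, field count) for lines that split into too many fields."""
    bad = []
    for line in lines:
        fields = line.split(";")
        if len(fields) > expected:
            bad.append((fields[0], len(fields)))
    return bad

for record_id, count in overlong_records(LINES):
    print(record_id, "splits into", count, "fields")
```

Under these assumptions each of the three records splits into eleven pieces, so every field after the IDS is shifted by one position.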

Feedback routed to Script ad hoc for evaluation

Date/Time: Wed Sep 29 14:19:34 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: On the apparently arbitrary exclusion of the play button as a distinct character in the face of duplicates

The relevant symbols discussed are old (dating at least to the period when
cassette players were popular); they were later adopted in so many
contexts that they could be said to be universal representations of their
respective functions.

Naturally, since they were (and still are) so important, most of them were
assigned a Unicode code point in the "Miscellaneous Technical" block, with
some being apparently duplicated. Here I proceed to discuss those:

⏴⏵⏶⏷ (23F4-23F7) and 🞀🞁🞂🞃 (1F780-1F783):

  Both of them seem to serve the same purpose, with the second set having
  the term "ISOSCELES RIGHT TRIANGLE" being applied instead of
  simply "TRIANGLE" to distinguish them. Both sets are isosceles and have
  right angles so the differences in name are not helpful. In practice, it
  seems like the first set tends to have consistent advance width with
  padding at all sides, while the other set tends to have a tight advance
  width with respect to the glyph, which means that the up and down arrows
  end up slightly wider than the left and right ones. If this is the "true"
  difference between them, then the name chosen does not reflect that, and
  it is unclear why they couldn't be unified anyway. i.e. why was it
  important to have both sets?

23F9 ⏹ BLACK SQUARE FOR STOP, 25A0 ■ BLACK SQUARE, 25FC ◼ BLACK MEDIUM
SQUARE, 2B1B ⬛ BLACK LARGE SQUARE,  2BC0 ⯀ BLACK SQUARE CENTRED and 1F532 🔲
BLACK SQUARE BUTTON:

  Out of all of them, the most generic is 25A0. Perhaps it was disunified
  into 23F9 because in user interfaces it is important that all buttons
  have the same width, while 25A0 was free to lack padding at both sides.
  2BC0, the "centered" one, forms part of a set in which "centered" just
  means that the figures have consistent padding at both sides. The last
  character is disunified on account of its different function in UIs, where
  it has a dual and represents a selected or unselected button. Similarly,
  while disunifying on account of size makes sense, either 25FC or 2B1B could
  have been used for the "stop" function if only one of them had been
  declared to be it.

23FA ⏺ BLACK CIRCLE FOR RECORD, 25CF ● BLACK CIRCLE, 26AB ⚫ MEDIUM BLACK
CIRCLE and 1F534 🔴 LARGE RED CIRCLE:

  A similar situation to the "stop" symbol applies to the "record" one, with
  one caveat: the symbol is often shown in red. With this in
  mind, not only does it make sense to disunify it from 25CF, it also makes
  sense to disunify it from 26AB and 1F534, on account of the stability of
  their colors. So there are no problematic disunifications here, except
  maybe 25CF ● and 2022 • BULLET, but that is independent of the issue at
  hand.

The only symbol NOT to be disunified was the "play" symbol, the closest
matches being 25B6 ▶ BLACK RIGHT-POINTING TRIANGLE and 2BC8 ⯈ BLACK MEDIUM
RIGHT-POINTING TRIANGLE CENTRED. It makes little sense to disunify the
symbols already discussed, but not this one. Whatever rationale applied to
the other characters should also apply to this one.

I therefore highly recommend encoding a new symbol. Its glyph would
harmonize well with the other symbols, since it can have a smaller glyph
and the necessary padding at the same time. Disunification also has the
benefit of allowing fonts to depict the symbols inside an enclosure by
default, since that is what users often expect.

I suggest the name BLACK RIGHT-POINTING TRIANGLE FOR PLAY or BLACK
RIGHT-POINTING EQUILATERAL TRIANGLE FOR PLAY. If a separate document needs
to be written for it I would gladly do so.

Date/Time: Wed Oct 13 15:00:17 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: Suggestion on the encoding of Latin Theta

In the document L2/21-206, it is suggested to encode a Latin casing pair for
Theta. The difference from the Greek pair Θθ (0398 and 03B8) is that the
capital form always has a horizontal stroke that touches both sides of the
letter, while the Greek letter can have a shorter stroke with its own
serifs. Another similar pair is the Latin Ɵɵ (019F and 0275); the
difference from this pair is that its lowercase is at x-height, while the
orthography requires a tall glyph like the proper Greek small theta.
Encoding a new pair is problematic, since phonetic notations already use
the Greek code point (03B8) for the same sound. I propose some possible
solutions.

1. Use the Greek pair: In order to force the preferred glyph for Latin-based
orthographies, an SVS can be added, called "latin form" or "long stroke
form". This would mean that the default glyph is still what the Greek users
expect, and the small Theta remains untouched.

2. Use the Latin barred o pair: Similarly, in order to force the preferred
glyph on the lowercase, an SVS can be added, called "theta form", "tall
form" or "elongated form". This has the benefit of keeping the text
completely Latin. Characters that are confusable with others, but only in
certain contexts, are not new.

(Deciding between 1 and 2 depends on what the users prefer in case the
default glyph has to be displayed: either an uppercase with a shorter
stroke or a shorter lowercase.)

3. Just encode a small Latin Theta and make it an alternate lowercase to
019F: Such a solution is my least preferred one, since it has the same
downsides as just encoding a new Latin pair.

4. Bite the bullet and encode the new pair: It wouldn't be the first time
confusable characters are disunified due to problematic casing relations.

All other letters in the document are acceptable, but I would rename the
first pair as LATIN CAPITAL/SMALL LETTER REVERSED GLOTTAL STOP. They should
be disunified on the same basis of the regular glottal stop pair.

Date/Time: Tue Nov 16 12:31:51 CST 2021
Name: Jack Varanelli
Report Type: Error Report
Opt Subject: Unicode request for legacy Malayalam

To whom it may concern:

I am a student with no real position in Unicode. However, I noticed that
U+0272 and the character proposed for U+1DF27 in this
document [https://www.unicode.org/L2/L2021/21156-legacy-malayalam.pdf] have
the same name (LATIN SMALL LETTER N WITH LEFT HOOK). This has been added
to the Unicode Pipeline, so I am left to assume its inclusion is planned.
Knowing this may cause confusion, I'd advise a name change to the proposed
character, if possible.

Apologies if this was intentional and an oversight on my part.

Sincerely,
Jack Varanelli

Feedback routed to Properties & Algorithms ad hoc for evaluation

Date/Time: Sat Sep 18 15:50:43 CDT 2021
Name: David Corbett
Report Type: Error Report
Opt Subject: Mistake about U+0953 and U+0954

Chapter 12 says “Because U+0953 and U+0954 are not intended to be used 
with the Devanagari script, they have no explicit property values for 
Indic_Positional_Category and Indic_Syllabic_Category”, but that is 
not true. They both still have the explicit Indic_Positional_Category 
value of Top.

Date/Time: Fri Oct 15 17:59:27 CDT 2021
Name: Yannick Duchêne
Report Type: Error Report
Opt Subject: UAX29

Referring to version 13, unless I’m wrong, the sample at line #1725 of 
WordBreakTest.txt, exposes a case of a grapheme being broken.

The test sequence is: ALetter RI ZWJ RI RI ALetter
As graphemes, I believe it is: (ALetter) (RI ZWJ) (RI RI) (ALetter)
But the sample says, as words, it is: (ALetter) (RI ZWJ RI) (RI) (ALetter)
The third grapheme, (RI RI), is broken in two parts, its first RI goes 
to one word and its second RI, to another word.


The comment is correct about the rule applied, so maybe this is an
unintended effect of the rules for word boundaries or for grapheme
boundaries in UAX #29. It may not be intended, since §6 says “The
other default boundary specifications never break within grapheme clusters”.

Date/Time: Mon Oct 25 15:20:07 CDT 2021
Name: Gary Wade
Report Type: Other Document Submission
Opt Subject:

Originally submitted against CLDR at https://unicode-org.atlassian.net/browse/CLDR-15118 

If there is a better avenue, please provide a direct link as I saw no other
appropriate place to do so.

Persian digits (U+06F0-U+06F9) are not considered Arabic Numbers in UnicodeData.txt

Values for U+06F0 to U+06F9 are considered to be European Numbers rather
than Arabic Numbers, and so based on a bidi property lookup, these are not
considered to be "RTL-weak" for lack of a better phrase like values U+0660
to U+0669, and so some algorithms will always consider them to
be "LTR-weak".

06F0;EXTENDED ARABIC-INDIC DIGIT ZERO;Nd;0;EN;;0;0;0;N;EASTERN ARABIC-INDIC DIGIT ZERO;;;;
06F1;EXTENDED ARABIC-INDIC DIGIT ONE;Nd;0;EN;;1;1;1;N;EASTERN ARABIC-INDIC DIGIT ONE;;;;
06F2;EXTENDED ARABIC-INDIC DIGIT TWO;Nd;0;EN;;2;2;2;N;EASTERN ARABIC-INDIC DIGIT TWO;;;;
06F3;EXTENDED ARABIC-INDIC DIGIT THREE;Nd;0;EN;;3;3;3;N;EASTERN ARABIC-INDIC DIGIT THREE;;;;
06F4;EXTENDED ARABIC-INDIC DIGIT FOUR;Nd;0;EN;;4;4;4;N;EASTERN ARABIC-INDIC DIGIT FOUR;;;;
06F5;EXTENDED ARABIC-INDIC DIGIT FIVE;Nd;0;EN;;5;5;5;N;EASTERN ARABIC-INDIC DIGIT FIVE;;;;
06F6;EXTENDED ARABIC-INDIC DIGIT SIX;Nd;0;EN;;6;6;6;N;EASTERN ARABIC-INDIC DIGIT SIX;;;;
06F7;EXTENDED ARABIC-INDIC DIGIT SEVEN;Nd;0;EN;;7;7;7;N;EASTERN ARABIC-INDIC DIGIT SEVEN;;;;
06F8;EXTENDED ARABIC-INDIC DIGIT EIGHT;Nd;0;EN;;8;8;8;N;EASTERN ARABIC-INDIC DIGIT EIGHT;;;;
06F9;EXTENDED ARABIC-INDIC DIGIT NINE;Nd;0;EN;;9;9;9;N;EASTERN ARABIC-INDIC DIGIT NINE;;;;

It was noted that these digits are not considered Arabic digits, but since
their names literally have the word "Arabic" in them, this seems incorrect;
consider also by that same logic the HANIFI ROHINGYA DIGIT and RUMI digits
which are considered in this class.

Date/Time: Tue Oct 26 14:05:04 CDT 2021
Name: Gary L. Wade
Report Type: Error Report
Opt Subject: UnicodeData.txt

Values for U+06F0 to U+06F9 are considered to be European Numbers (EN) rather
than Arabic Numbers (AN) for the bidi class, and so based on a bidi
property lookup, these are not considered to be "RTL-weak" for lack of a
better phrase like values U+0660 to U+0669, and so some algorithms will
always consider them to be "LTR-weak".  Since these digits are used in
Persian, which is an RTL language, these should also have the bidi class of
AN just like the HANIFI ROHINGYA DIGIT and RUMI digits.

To see the difference between how these digits are laid out unexpectedly,
Apple's TextEdit app on the Mac running under US English can be used to
enter these with the appropriate Arabic and Persian keyboards on separate
lines with a space between each digit:

1. Launch TextEdit on macOS under US English locale
2. Choose Arabic keyboard
3. Type each digit with a space between each (1, space, 2, space, etc.); notice the RTL direction to lay out the text
4. Press the return key to enter a new line
5. Choose Persian keyboard
6. Type each digit with a space between each; notice the LTR direction is used to lay out the text

This software and much more expect to use the properties in UnicodeData.txt
for the bidi algorithm, and adding an override in each app to make Persian
digits RTL goes against its purpose.

06F0;EXTENDED ARABIC-INDIC DIGIT ZERO;Nd;0;EN;;0;0;0;N;EASTERN ARABIC-INDIC DIGIT ZERO;;;;
06F1;EXTENDED ARABIC-INDIC DIGIT ONE;Nd;0;EN;;1;1;1;N;EASTERN ARABIC-INDIC DIGIT ONE;;;;
06F2;EXTENDED ARABIC-INDIC DIGIT TWO;Nd;0;EN;;2;2;2;N;EASTERN ARABIC-INDIC DIGIT TWO;;;;
06F3;EXTENDED ARABIC-INDIC DIGIT THREE;Nd;0;EN;;3;3;3;N;EASTERN ARABIC-INDIC DIGIT THREE;;;;
06F4;EXTENDED ARABIC-INDIC DIGIT FOUR;Nd;0;EN;;4;4;4;N;EASTERN ARABIC-INDIC DIGIT FOUR;;;;
06F5;EXTENDED ARABIC-INDIC DIGIT FIVE;Nd;0;EN;;5;5;5;N;EASTERN ARABIC-INDIC DIGIT FIVE;;;;
06F6;EXTENDED ARABIC-INDIC DIGIT SIX;Nd;0;EN;;6;6;6;N;EASTERN ARABIC-INDIC DIGIT SIX;;;;
06F7;EXTENDED ARABIC-INDIC DIGIT SEVEN;Nd;0;EN;;7;7;7;N;EASTERN ARABIC-INDIC DIGIT SEVEN;;;;
06F8;EXTENDED ARABIC-INDIC DIGIT EIGHT;Nd;0;EN;;8;8;8;N;EASTERN ARABIC-INDIC DIGIT EIGHT;;;;
06F9;EXTENDED ARABIC-INDIC DIGIT NINE;Nd;0;EN;;9;9;9;N;EASTERN ARABIC-INDIC DIGIT NINE;;;;
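
The property values quoted above can be checked with a short sketch using Python's standard unicodedata module, whose data comes from whichever UCD version is bundled with the interpreter. The helper name bidi_classes is hypothetical.

```python
import unicodedata

def bidi_classes(first, last):
    """Return the set of Bidi_Class values for the inclusive code point range."""
    return {unicodedata.bidirectional(chr(cp)) for cp in range(first, last + 1)}

# Arabic-Indic digits, as produced by the Arabic keyboard in the steps above.
print("U+0660..U+0669:", bidi_classes(0x0660, 0x0669))
# Extended Arabic-Indic digits, as produced by the Persian keyboard.
print("U+06F0..U+06F9:", bidi_classes(0x06F0, 0x06F9))
```

The first range reports AN and the second reports EN, matching the fifth field of the UnicodeData.txt lines listed in the report.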

Date/Time: Thu Nov 25 10:22:52 CST 2021
Name: Giacomo Catenazzi
Report Type: Error Report
Opt Subject: NameAliases.txt

The C0 chart (https://www.unicode.org/charts/PDF/U0000.pdf) uses the
abbreviation EM for 0x19, but in NameAliases.txt only EOM is listed as an
abbreviation. Because EM is used in various ISO, ANSI, and ECMA standards
(e.g. ECMA-48; the C0 table is also linked in ECMA-6 [ISO 646]), I think
NameAliases.txt should also include a third line:

0019;END OF MEDIUM;control
0019;EOM;abbreviation
0019;EM;abbreviation    <- NEW LINE HERE

Note: the name 'EM' seems available in Unicode.

BTW, it seems EOM was previously used in the first version of ASCII as the
abbreviation of 0x03 instead of ETX (as end of message), according to
Wikipedia and the scanned version. EM would avoid that confusion, and it is
more widely used (you use it on the C0 chart).
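
As a rough illustration of how a consumer would see the proposed change, the sketch below parses the three lines quoted above, assuming the semicolon-separated "code point;alias;type" layout shown there. The third line of SAMPLE is the proposed addition, not a line currently in the file, and parse_name_aliases is a hypothetical helper.

```python
from collections import defaultdict

# The first two lines are quoted from the report; the third is the PROPOSED line.
SAMPLE = """\
0019;END OF MEDIUM;control
0019;EOM;abbreviation
0019;EM;abbreviation
"""

def parse_name_aliases(text):
    """Map each code point to a list of (alias, type) pairs."""
    aliases = defaultdict(list)
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        code_point, alias, alias_type = line.split(";")
        aliases[int(code_point, 16)].append((alias, alias_type))
    return dict(aliases)

def abbreviations(aliases, code_point):
    """Return only the aliases of type 'abbreviation' for one code point."""
    return [a for a, t in aliases.get(code_point, []) if t == "abbreviation"]

print(abbreviations(parse_name_aliases(SAMPLE), 0x19))
```

With the proposed line present, both EOM and EM would be reported as abbreviations for U+0019.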

Date/Time: Wed Dec 1 10:57:02 CST 2021
Name: J. S. Choi
Report Type: Error Report
Opt Subject: UAX44-LM2 medial-hyphen clarification

The UAX44-LM2 rule defines “medial hyphen” as a “hyphen occurring immediately 
between two letters”; however, it does not clarify whether a medial hyphen 
also may be between a letter and a numeral. For example, if the answer is 
yes, then “VARIATION SELECTOR 15” and “VARIATION_SELECTOR_15” would match 
“VARIATION SELECTOR-15”, and if the answer is no, then they would not match.
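
The two readings can be compared with a minimal sketch. This is not the normative UAX44-LM2 algorithm: it ignores case, spaces, underscores, and medial hyphens only, omits the documented exception for U+1180 HANGUL JUNGSEONG O-E, and handles ASCII names only. The function loose_key and its flag digits_count_as_letters are hypothetical names introduced here to switch between the two interpretations.

```python
import re

def loose_key(name, digits_count_as_letters):
    """Build a comparison key under one reading of 'medial hyphen'.

    If digits_count_as_letters is True, a hyphen between a letter and a
    digit is treated as medial and removed; otherwise it is kept.
    """
    cls = "A-Za-z0-9" if digits_count_as_letters else "A-Za-z"
    # Remove hyphens that sit directly between two characters of the class.
    without_medial = re.sub(rf"(?<=[{cls}])-(?=[{cls}])", "", name)
    # Then ignore spaces, underscores, and case.
    return re.sub(r"[ _]", "", without_medial).lower()

FORMAL = "VARIATION SELECTOR-15"
for candidate in ("VARIATION SELECTOR 15", "VARIATION_SELECTOR_15"):
    for flag in (True, False):
        matches = loose_key(candidate, flag) == loose_key(FORMAL, flag)
        print(candidate, "| digits as letters:", flag, "| matches:", matches)
```

Under the first reading both candidates match the formal name; under the second the hyphen before "15" survives and neither matches.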

Feedback routed to Emoji SC for evaluation

Date/Time: Thu Oct 14 10:43:00 CDT 2021
Name: John B
Report Type: Other Question
Opt Subject: New unicode character unclear?

Hello,

At the new emoji site, there is one listed: https://unicode.org/emoji/charts-14.0/emoji-released.html 

#16 is 1FAF1 200D 1FAF2 (left hand, zero width joiner, right hand) -- I'd love to know 
what this final single emoji is, or if there is any information available about that. Is that a bug?

Date/Time: Mon Nov 8 15:04:37 CST 2021
Name: David Corbett
Report Type: Error Report
Opt Subject: UTS #51

What happens when an emoji_zwj_sequence overlaps a
text_presentation_sequence? It is not clear what to do when a text
presentation selector appears at the end of an emoji zwj sequence. For
example, how should <U+1F408, U+200D, U+2B1B, U+FE0E> be rendered?

• The same as <U+1F408, U+200D, U+2B1B>
• The same as <U+1F408, U+FE0E, U+2B1B, U+FE0E>
• The same as <U+1F408, U+2B1B, U+FE0E>

UTS #51 says “A text presentation selector breaks an emoji zwj sequence,
preventing characters on either side from displaying as a single image. The
two partial sequences should be displayed as separate images, each with
presentation style as specified by any presentation selectors present, or
by default style for those emoji that do not have any variation selectors.”
Taken literally, that means <U+1F408, U+200D, U+2B1B, U+FE0E> is
split into two sequences, <U+1F408, U+200D, U+2B1B> and an empty
sequence, so the whole thing should be rendered the same as <U+1F408,
U+200D, U+2B1B>. That is probably not what was intended.
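
The literal reading described above can be sketched as follows, assuming that "breaks" means splitting the code point sequence at each U+FE0E and discarding the selector. The helper split_at_text_selector is hypothetical and is not taken from UTS #51.

```python
ZWJ = 0x200D
TEXT_SELECTOR = 0xFE0E  # U+FE0E VARIATION SELECTOR-15

def split_at_text_selector(code_points):
    """Split a code point sequence at each U+FE0E, dropping the selector itself."""
    parts, current = [], []
    for cp in code_points:
        if cp == TEXT_SELECTOR:
            parts.append(current)
            current = []
        else:
            current.append(cp)
    parts.append(current)
    return parts

SEQUENCE = [0x1F408, ZWJ, 0x2B1B, TEXT_SELECTOR]
for part in split_at_text_selector(SEQUENCE):
    print([f"U+{cp:04X}" for cp in part])
```

Under this assumption the first part is the complete zwj sequence and the second part is empty, which is the outcome the report calls probably unintended.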

Feedback routed to Editorial Committee for evaluation

Date/Time: Sat Sep 18 10:48:29 CDT 2021
Contact: noneed (at) example.com
Name: Jackie
Report Type: Error Report
Opt Subject:

Note: Fake return address was supplied, so cannot contact submitter.

Hi again,

The code charts ( https://www.unicode.org/charts/ ) each should include a standard 
key to the symbols used (e.g., →, ~, ※, etc.). Nothing I see on the code chart PDFs 
defines these symbols or even links to a definition of them.

I looked around and found ( https://www.unicode.org/charts/About.html#Conventions ), 
but I usually access the code charts from pages that contain no link to that page, 
and some are saved locally.

Thank you!

Date/Time: Sat Sep 18 15:44:27 CDT 2021
Name: David Corbett
Report Type: Error Report
Opt Subject: Mistakes in definition D56

Definition D56 in chapter 3 says “Combining character sequences involving a
variation selector (which is both default_ignorable and a combining mark),
consist of only the base character followed by a single variation
selector”, but that is not true. U+1031 MYANMAR VOWEL SIGN E is not a base
character, but it does have a defined variation sequence. Also, you could
have a sequence like <U+0030, U+FE0F, U+20E3>, which does not consist
of *only* the base character followed by a single variation selector: it
consists of the base, the variation selector, and another mark.
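
Both counterexamples can be checked against the General_Category values exposed by Python's standard unicodedata module. The sketch treats any character whose General_Category starts with "M" as a combining mark, which is a simplification of the definitions in chapter 3; the helper names are hypothetical.

```python
import unicodedata

def is_combining_mark(ch):
    """True if the character's General_Category is Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")

def categories(text):
    """Return the General_Category of each character in order."""
    return [unicodedata.category(ch) for ch in text]

# U+1031 MYANMAR VOWEL SIGN E is a mark, so it is not a base character.
print("U+1031 is a combining mark:", is_combining_mark("\u1031"))

# <U+0030, U+FE0F, U+20E3>: base, variation selector, then another mark.
print(categories("\u0030\uFE0F\u20E3"))
```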

Date/Time: Mon Sep 27 18:48:22 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: On the Tiddu mark and Virama+Repha of Tulu-Tigalari

This is a response to the following document:
https://www.unicode.org/L2/L2021/21210-tulu-tigalari.pdf 

On page 41, section 8.2, it explains the function of the mark and even
compares it to a "caret". Currently the dotted circle in the
representative glyph suggests this is a combining sign, but it is my
opinion that this should be treated similarly to the caret: a zero-advance
graphical indicator. This is because the sign is meant to be an
after-the-fact addition to the text, which means it should not affect the
original spacing of the text at all; this includes vowel signs that apply
below the base. If the current model is used, the rendering of the script
would become more complicated than it already is. This change would also
make it easier to display it in more situations, like after whitespace or
non-letters. The general category of it would be 'Po' and the CCC would be
0. This change of properties would also disambiguate it from other
characters like 208A ₊ SUBSCRIPT PLUS SIGN and 031F ◌̟ COMBINING PLUS SIGN
BELOW.

I would also like to suggest encoding one more character to reproduce the
behavior on page 34, where the Virama and the Repha can fuse despite
not being adjacent in the sequence. Instead, I propose encoding another
character called TULU-TIGALARI VIRAMA WITH REPHA. This would reduce the
complexity necessary to input this character. It can have the same
properties as the Virama and be placed at 113DE, so no characters need to
be shifted from their current positions.

Date/Time: Fri Oct 1 14:05:49 CDT 2021
Name: David McCreedy
Report Type: Error Report
Opt Subject: The Unicode Standard, Version 14.0.0

FYI: Section 15.15 of The Unicode Standard still lists the old Ahom block range 
end (Ahom: U+11700–U+1173F) instead of the 14.0 updated range end (U+1174F) at 
https://www.unicode.org/versions/Unicode14.0.0/ch15.pdf#G95570.  Refer to the 
"11700..1174F; Ahom" line in http://www.unicode.org/Public/UNIDATA/Blocks.txt 
for confirmation.  Thanks.

Date/Time: Fri Oct 1 16:13:29 CDT 2021
Name: Peter Constable
Report Type: Error Report
Opt Subject: Kayah Li code chart / NamesList.txt

Note: This has already been taken into account in the Unicode 15.0 nameslist draft.

In the Kayah Li names list, the following vowel letters are listed under the 
subhead "Consonants":

A922 ꤢ KAYAH LI LETTER A
A923 ꤣ KAYAH LI LETTER OE
A924 ꤤ KAYAH LI LETTER I
A925 ꤥ KAYAH LI LETTER OO

In NamesList.txt, the @Vowels subhead follows A925, but should be moved up to follow A921.

Date/Time: Wed Oct 6 14:29:53 CDT 2021
Name: Eduardo Marín Silva
Report Type: Other Document Submission
Opt Subject: Pending errata notices


This is a reminder that certain glyph corrections lack an errata notice, despite
being recommended by the Script Ad-Hoc. Only the first document precedes UTC #169.
My intention is to avoid the accidental omission of these by having them documented
together for reference.

  Canadian Syllabics: https://www.unicode.org/L2/L2021/21141-ucas-revisions.pdf  
  	(limited to the 3 yellow highlighted characters)
  Old Turkic: https://www.unicode.org/L2/L2021/21153-n5163-old-turkic-glyph.pdf 
  Khitan Small Script: https://www.unicode.org/L2/L2021/21182-khitan-mods.pdf 
  Sundanese: https://www.unicode.org/L2/L2021/21221-three-sundanese-chars.pdf 

Date/Time: Sat Nov 6 14:45:05 CDT 2021
Name: Jens Maurer
Report Type: Error Report
Opt Subject: NamesList.txt

https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt 

says, in particular,

# Note that no formal name alias for the ISO 6429 "BELL" is
# provided for U+0007, because of the existing name collision
# with U+1F514 BELL.

0007;ALERT;control
0007;BEL;abbreviation

Yet, https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt 

says

0007	<control>
	= BELL

which (according to section 24.1 of the Unicode standard) introduces
the normative alias BELL. However, that is not desired according to the
comment in NameAliases.txt.

Date/Time: Sat Nov 6 14:50:58 CDT 2021
Name: Jens Maurer
Report Type: Error Report
Opt Subject: NamesList.txt

https://www.unicode.org/Public/14.0.0/ucd/NameAliases.txt 

says

000A;LINE FEED;control
000A;NEW LINE;control
000A;END OF LINE;control

meaning that all three aliases are intended to be normative aliases 
per section 4.8 of the Unicode standard.

However, https://www.unicode.org/Public/UCD/latest/ucd/NamesList.txt 
says

000A	<control>
	= LINE FEED (LF)
	= new line (NL)
	= end of line (EOL)

meaning that "new line" and "end of line" are not presented as
normative aliases in CodeCharts.pdf (because they are not uppercase).

(The same situation appears for other control characters that have 
more than one alias.)

Date/Time: Mon Nov 8 11:00:48 CST 2021
Name: Peter Constable
Report Type: Error Report
Opt Subject: UAX #44

In 5.2, the description for Extended_Pictographic says,

"Note: This property is used in the regex definitions for the Default Grapheme 
Cluster Boundary Specification in UAX #29, Unicode Text Segmentation [UAX29], 
as well as for the definition ED-4 in UTS #51, Unicode Emoji [UTS51]."

It fails to mention use in LB30b that was added to UAX #14 in Unicode 14.

Date/Time: Tue Nov 23 16:53:54 CST 2021
Name: Jonathan Yavner
Report Type: Error Report
Opt Subject: UAX #14

"If U+2061 CAUTION SIGN had been used, which also looks like an 
exclamation point inside a triangle, ..."

But U+2061 is actually "FUNCTION APPLICATION", which has no appearance.

The text should read "U+2621 CAUTION SIGN".

This error was introduced in version 19 (dated 2006-08-22) and 
has lain there in plain sight ever since.
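
The two code points can be compared directly with Python's standard unicodedata module, as a quick check of the character names cited above.

```python
import unicodedata

# The code point printed in UAX #14 and the one the report says was intended.
for code_point in (0x2061, 0x2621):
    print(f"U+{code_point:04X}", unicodedata.name(chr(code_point)))
```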

Date/Time: Wed Nov 24 14:16:50 CST 2021
Name: Petr Viktorin
Report Type: Error Report (UTR #39)
Opt Subject:

Section 4, Confusable Detection, in UTR #39 refers to Section 2.9.1,
Backward Compatibility, in Unicode Technical Report #36.
The correct section number for "Backward Compatibility" is 2.10.1.

See:
 https://www.unicode.org/reports/tr39/#Confusable_Detection 
 https://www.unicode.org/reports/tr36/#Backwards_Compatibility 

Similar errors appear in 5.2 Restriction-Level Detection, 6 Development 
Process, 6 Development Process, and 3.1 General Security Profile for Identifiers of UTR#39

Date/Time: Sun Dec 5 00:48:41 CST 2021
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: Core specification

The introduction of chapter 16 of the Unicode Standard, “Southeast Asia” states 
“The scripts of Southeast Asia are written from left to right.”

This statement is not correct for all scripts of Southeast Asia; 
Hanifi Rohingya is written from right to left.

Date/Time: Sun Jan 2 06:27:41 CST 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UAX #14 and UAX#44


In the Unicode Standard Annex #14, it is said about hyphenation that
in “German and Swedish, a consonant is sometimes doubled”. I suggest
changing “German” to “pre-reform German orthography” because nowadays no
consonant is struck out compared to the hyphenated form,
e.g., “Schifffahrt” is written with three fs even when unhyphenated
(pre-reform: “Schiffahrt”, hyphenated “Schiff- / fahrt”).

Also, UAX #14 contains the doublings “the the” and “by by”.

UAX #44 contains the mistakes “stabiity”
(instead of “stability”), “inadvertant”
(instead of “inadvertent”), “definining”
(instead of “defining”), “discunifications”
(instead of “disunifications”), “compatiblity” (instead of “compatibility”)
and “"TU-" (kIRG_TSource0 prefix, or 'VU-" (kIRG_VSource0 pefix”
(instead of “"TU-" (kIRG_TSource0) prefix, or "VU-"
(kIRG_VSource0) prefix”).

Date/Time: Thu Jan 6 12:26:50 CST 2022
Name: John Hudson
Report Type: Error Report
Opt Subject:

Page 488 in the Bengali section of chapter 12 (South and Central Asia-I) of
TUS discusses Jihvamuliya and Upadhmaniya in ligatures with following
consonant letters, hopefully making it clear to shaping engine implementers
that these character sequences should be treated as clusters for shaping
purposes. A similar discussion with examples is missing from the Devanagari
section of the same chapter.

The Devanagari and Bengali handling of Jihvamuliya and Upadhmaniya are
graphically distinct but functionally identical, and this should be
reflected in parallel discussions, perhaps with added explicit statements
that these sequences should be processed as clusters.

Date/Time: Fri Jan 7 16:22:34 CST 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UAX #42

UAX #42 contains the following mistakes: “the the” (instead of “the”), 
“intented” (instead of “intended”), “inheritence” (instead of “inheritance”), 
“accross” (instead of “across”), “attribues” (instead of “attributes”), 
“representedy” (instead of “represented”).

Date/Time: Sun Jan 16 10:47:31 CST 2022
Contact: ivanpan3@gmail.com
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UTS #51

UTS #51 contains the following mistakes: "a emoji" (instead of "an emoji"), 
"existing existing" (instead of "existing"), "color which is" (instead of 
"color is"), "should taken" (instead of "should be taken"), "is all a perfectly 
legitimate" (instead of "is all perfectly legitimate"), "user‘s" (instead of 
"user’s", note the apostrophe), "any any" (instead of "any"), "“us’" (instead 
of "“us”"), "”demon“" (instead of "“demon”", note the quotation marks).

In some occurrences of "[CLDR]", the closing bracket is part of the link text.

Error Reports

Date/Time: Mon Jan 17 20:02:43 CST 2022
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: ScriptExtensions.txt

The proposal for the Tai Le script, L2/01-369, describes the use of 
five “existing nonspacing diacritics in the UCS” as tone marks in 
an older orthography of the script. Apparently this refers to the 
following characters from the Combining Diacritical Marks block:
U+0300 COMBINING GRAVE ACCENT
U+0301 COMBINING ACUTE ACCENT
U+0307 COMBINING DOT ABOVE
U+0308 COMBINING DIAERESIS
U+030C COMBINING CARON

The Script_Extensions property values of these characters in 
Unicode 14.0 do not indicate their use in the Tai Le script. 
They should.