Comments on Public Review Issues

L2/22-063

(January 19 - April 11, 2022)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of April 11, 2022, since the previous cumulative document was issued prior to UTC #170 (January 2022).

Issue	Name	Feedback Link
450	Proposed Update UAX #31 Unicode Identifier and Pattern Syntax	(feedback)
449	Proposed Update UAX #9, Unicode Bidirectional Algorithm	(feedback) No feedback at this time
448	Proposed Update UAX #41, Common References for Unicode Standard Annexes	(feedback) No feedback at this time
447	Proposed Update UAX #24, Unicode Script Property	(feedback) No feedback at this time
446	Proposed Update UAX #14, Unicode Line Breaking Algorithm	(feedback)
445	Proposed Update UAX #45, U-source Ideographs	(feedback)
444	Proposed Update UAX #34, Unicode Named Character Sequences	(feedback) No feedback at this time
443	Unicode Emoji 15.0 Draft Candidates	(feedback)
442	Unicode 15.0 Alpha Review	(feedback)
441	Proposed Update UAX #29, Unicode Text Segmentation	(feedback)
440	Proposed Update UTS #10, Unicode Collation Algorithm	(feedback)
439	Proposed Update UAX #50, Unicode Vertical Text Layout	(feedback) No feedback at this time
438	Proposed Update UAX #44, Unicode Character Database	(feedback)
437	Proposed Update UAX #38, Unicode Han Database (Unihan)	(feedback) No feedback at this time
434	CLDR Person Name Formatting	(feedback)
427	Proposed Update UTS #18, Unicode Regular Expressions	(feedback)

The links below go to locations in this document for feedback.

Feedback routed to CJK & Unihan Group for evaluation [CJK]
Feedback routed to Script ad hoc for evaluation [SAH]
Feedback routed to Properties & Algorithms Group for evaluation [PAG]
Feedback routed to Emoji SC for evaluation [ESC]
Feedback routed to Editorial Committee for evaluation [EDC]
Other Reports

Feedback routed to CJK & Unihan Group for evaluation [CJK]

Date/Time: Sun Feb 6 09:46:29 CST 2022
Name: Ken Lunde
Report Type: Error Report
Opt Subject: UAX #45 USourceData.txt errors

Per the USourceData.txt file for Unicode Version 14.0, the following 11
U-Source ideographs have a status value of G, but are not included in
Extension G (their code point fields are also blank, which is what flagged
them):

UTC-01024;G;;79.7;;⿰圼殳;UTCDoc L2/12-333 56;;11;2
UTC-01161;G;;118.10;;⿳𥫗⿰工口木;UTCDoc L2/12-333 193;;16;1
UTC-01166;G;;152.6;;⿳亠䇂豕;UTCDoc L2/12-333 198;;13;4
UTC-01220;G;;32.11;;⿰土畢;UTCDoc L2/15-177 19;;14;2
UTC-01244;G;;85.9;;⿰氵⿳⺊彐龰;UTCDoc L2/15-177 43;;12;2
UTC-01256;G;;85.16;;⿰氵⿵門⿱土必;UTCDoc L2/15-177 55;;19;2
UTC-01257;G;;85.17;;⿰𤀤殳;UTCDoc L2/15-177 56;;20;3
UTC-01272;G;;86.11;;⿱炏冏;UTCDoc L2/15-177 71;;15;4
UTC-01276;G;;86.13;;⿱𤇾冏;UTCDoc L2/15-177 75;;17;4
UTC-01301;G;;167.9;;⿰金朐;UTCDoc L2/15-177 100;;17;3
UTC-01304;G;;167.11;;⿰金⿱谷心;UTCDoc L2/15-177 103;;19;3
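
The check that flagged these records can be sketched in Python. This is only a minimal sketch, assuming the first three semicolon-separated fields are the U-source identifier, the status, and the code point, as in the records quoted in this report. The function name `missing_code_points` is hypothetical, and the two sample lines are the UTC-01024 record as it stands and as it would read with a code point filled in.

```python
def missing_code_points(lines, status="G"):
    """Return identifiers whose status matches but whose code point field is blank."""
    flagged = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(";")
        if len(fields) < 3:
            continue  # not a data record in the assumed layout
        identifier, record_status, code_point = fields[0], fields[1], fields[2]
        if record_status == status and not code_point:
            flagged.append(identifier)
    return flagged


sample = [
    "UTC-01024;G;;79.7;;⿰圼殳;UTCDoc L2/12-333 56;;11;2",
    "UTC-01024;U;U+6BC0;79.7;;⿰圼殳;UTCDoc L2/12-333 56;;11;2",
]
print(missing_code_points(sample))
```

Only the first sample line is reported, because the second carries a code point and a different status.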

I determined that the following six were unified with existing CJK Unified
Ideographs per the IRG, and their status and code point fields should
therefore be changed as follows:

UTC-01024;U;U+6BC0;79.7;;⿰圼殳;UTCDoc L2/12-333 56;;11;2
UTC-01161;U;U+7BC9;118.10;;⿳𥫗⿰工口木;UTCDoc L2/12-333 193;;16;1
UTC-01166;B;U+27C4F;152.6;;⿳亠䇂豕;UTCDoc L2/12-333 198;;13;4
UTC-01220;F;U+2D3EC;32.11;;⿰土畢;UTCDoc L2/15-177 19;;14;2
UTC-01244;B;U+23D8F;85.9;;⿰氵⿳⺊彐龰;UTCDoc L2/15-177 43;;12;2
UTC-01304;B;U+28B02;167.11;;⿰金⿱谷心;UTCDoc L2/15-177 103;;19;3

See:

https://hc.jsecs.org/irg/ws2015/app/?find=UTC-01024 
https://hc.jsecs.org/irg/ws2015/app/?find=UTC-01161 
https://hc.jsecs.org/irg/ws2015/app/?find=UTC-01166 
https://hc.jsecs.org/irg/ws2015/app/?find=UTC-01220 
https://hc.jsecs.org/irg/ws2015/app/?find=UTC-01244 
https://hc.jsecs.org/irg/ws2015/app/?find=UTC-01304 

A six-ideograph horizontal extension proposal can therefore be submitted.

The remaining five seem to have been withdrawn from IRG Working Set 2015, so
I suggest that their status fields be changed to N so that they can be
considered for re-submission in the future:

UTC-01256;N;;85.16;;⿰氵⿵門⿱土必;UTCDoc L2/15-177 55;;19;2
UTC-01257;N;;85.17;;⿰𤀤殳;UTCDoc L2/15-177 56;;20;3
UTC-01272;N;;86.11;;⿱炏冏;UTCDoc L2/15-177 71;;15;4
UTC-01276;N;;86.13;;⿱𤇾冏;UTCDoc L2/15-177 75;;17;4
UTC-01301;N;;167.9;;⿰金朐;UTCDoc L2/15-177 100;;17;3

That is all.

Date/Time: Sun Feb 13 18:30:55 CST 2022
Name: Paul Masson
Report Type: Error Report
Opt Subject: Unihan

U+4F3C is most commonly pronounced sì, but kMandarin for this character is
still given as shì in version 14. Shouldn't this be changed, or at least
both pronunciations given?

Date/Time: Sun Feb 13 18:38:19 CST 2022
Name: Paul Masson
Report Type: Error Report
Opt Subject: Unihan

U+78D7 formerly had a kPhonetic value of 269, which was changed in version 14
to 1157*. The character clearly does not belong to this group. I would
suggest it be given a kPhonetic of 269* since I cannot locate it in Casey.

In fact the entire phonetic group 1157 is far too large compared to Casey.
There are 263 characters alone with kPhonetic 1157*. This appears to be a
major error for characters not in Casey that were assigned to the same
group regardless of phonetics. Someone really needs to figure out when this
batch was added and why.

Please feel free to follow up with me on phonetic group 1157. Thank you.

Date/Time: Tue Feb 15 23:37:26 CST 2022
Name: Eiso Chan
Report Type: Error Report
Opt Subject: Radical Errors

Please update the kRSUnicode for U+3B3A as below. I have mentioned this issue in IRGN2239.

U+3B3A	74.9

U+2D15F is a variant of U+8352, as the Moji Joho project shows, and it is similar
to U+2E3BB, so the best radical should be #140. It would be better to change the RS
information or to add a secondary RS for it.

U+2D15F	140.6
or
U+2D15F	23.7 140.6

Date/Time: Sat Mar 12 04:49:28 CST 2022
Name: Edward
Report Type: Error Report
Opt Subject:

I found an issue in the Unihan Database. Some kTotalStrokes values of characters
with the radical 邑 or 阜 may be wrong. For example, the kTotalStrokes value of U+2B545
𫕅 is 10, while that of U+2CBC0 𬯀 is 9. The radical 阝 has 2 strokes in the blocks from CJKUI
to CJK-ExtD, while it has 3 strokes in the blocks from CJK-ExtE to CJK-ExtG. I wonder
whether this is wrong. In other words, the stroke count of 阝 has been 3 since Unicode® 8.0.0
was published.
That's all.

Date/Time: Tue Mar 15 05:51:07 CDT 2022
Name: Andrew West
Report Type: Error Report
Opt Subject: CJK Ext B code chart

There are two Vietnam ideographs with identical shape but different source 
references for two different CJK unified ideographs:
VN-058B6 at U+58B6 is ⿰土達; G and H glyphs are also ⿰土達
VN-2143F at U+2143F is also ⿰土達; but G and H glyphs are ⿰土逹 (one stroke less)
I think the V source glyph for U+2143F should be modified to match the 
G and H glyphs for U+2143F (and to distinguish it from the V glyph for U+58B6).

Date/Time: Mon Apr 11 09:24:29 CDT 2022
Name: Jaycee Carter
Report Type: Error Report
Opt Subject: Unihan_IRGSources.txt and CJK Unified Ideographs code chart

This is to report an error relating to CJK character stroke counts:

U+5954: kRSUnicode is currently 37.6. This should be 37.5.
U+595F: kRSUnicode is currently 37.9. This should be 37.8.

kTotalStrokes is correct for both characters.
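
One way to sanity-check corrections like these is to confirm that the radical's stroke count plus the residual strokes in kRSUnicode equals kTotalStrokes. The following is a rough sketch under stated assumptions: the helper name `residual_is_consistent` is hypothetical, and the stroke counts used (3 for radical 37, totals of 8 and 11) are assumed for illustration and are not quoted from the report or read from the Unihan data files.

```python
# Assumed stroke count for radical 37 only; a real check would cover all radicals.
RADICAL_STROKES = {37: 3}


def residual_is_consistent(rs_unicode, total_strokes, radical_strokes=RADICAL_STROKES):
    """True if radical strokes + residual strokes equals the total stroke count."""
    radical, _, residual = rs_unicode.partition(".")
    radical_number = int(radical.rstrip("'"))  # drop any apostrophe marker on the radical
    return radical_strokes[radical_number] + int(residual) == total_strokes


# (current kRSUnicode, proposed kRSUnicode, assumed kTotalStrokes)
cases = {
    "U+5954": ("37.6", "37.5", 8),
    "U+595F": ("37.9", "37.8", 11),
}
for code_point, (current, proposed, total) in cases.items():
    print(code_point,
          "current consistent:", residual_is_consistent(current, total),
          "proposed consistent:", residual_is_consistent(proposed, total))
```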

Feedback routed to Script ad hoc for evaluation [SAH]

Date/Time: Wed Jan 26 10:38:02 CST 2022
Name: Halbast Abdullah
Report Type: Other Document Submission
Opt Subject: Kurdish language problems with the Arabic Script

Hi, I wanted to comment on the Arabic script in Unicode. Central Kurdish
uses the Arabic script, and there is a problem: we have words that contain للە,
and you can already see that it is automatically rendered as (Allah) in
Arabic. Kurdish words like (گوللە، کەللە، کوللە), which have nothing to
do with (Allah), are messed up by this automatic change; these words
should be written without the Shaddah and the little Elif. One solution would be
to let us choose whether we want للە with the Shaddah and Elif or
not. We, by which I mean the Central Kurdish language community, would appreciate it if you
could review this and fix it.
Thanks!

Date/Time: Mon Feb 7 15:25:04 CST 2022
Name: Elango Cheran
Report Type: Error Report
Opt Subject: ScriptExtensions.txt

I am a speaker of Tamil, and I notice that the data for the Script_Extensions 
property marks both danda and double danda (U+0964 DEVANAGARI DANDA and U+0965 
DEVANAGARI DOUBLE DANDA) as having the `Taml` script code in the extensions. 
I have consulted with people in the community, and none of us have ever observed 
or are aware of any use cases (neither modern uses nor otherwise) that use these 
characters.

If indeed there are no documented usages, then the above association in 
ScriptExtensions.txt would be a bug.
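
A quick way to inspect such an association is to parse the file and list the code points that carry a given script code. The sketch below assumes the usual UCD layout of `code point or range ; values # comment`. The function name `script_extensions` is hypothetical, and the sample lines are invented and abbreviated for illustration; they are not copied from ScriptExtensions.txt.

```python
def script_extensions(lines):
    """Map each code point to the set of script codes listed for it."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop trailing comments
        if not data:
            continue
        cp_field, scripts_field = (part.strip() for part in data.split(";", 1))
        first, _, last = cp_field.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        for cp in range(start, end + 1):
            table.setdefault(cp, set()).update(scripts_field.split())
    return table


# Invented, abbreviated sample lines in the assumed file layout.
sample = [
    "# invented sample, not real data",
    "0964..0965    ; Deva Taml # invented for illustration",
    "0966          ; Deva      # invented for illustration",
]
table = script_extensions(sample)
print([f"U+{cp:04X}" for cp in sorted(table) if "Taml" in table[cp]])
```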

Date/Time: Thu Mar 3 19:17:44 CST 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Unhelpful advice about U+0F35 and U+0F37

Re the Tibetan marks U+0F35 and U+0F37, chapter 13 says “If they are treated 
as normal combining marks, they can be entered into the text following the 
vowel signs in a stack”. Should they be treated as normal combining marks? 
If not, where should they appear in a stack? The standard should clearly 
specify how to use these code points, and not give such diffident advice.

Date/Time: Thu Mar 3 19:44:29 CST 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: How to encode multiple Tibetan vowels at the same height?

Section 13.4 discusses Tibetan stacks with multiple vowel signs. Usually,
when there are multiple vowel signs above the base, they are rendered from
bottom to top. In what order should they be encoded when they are rendered
side by side? In particular, which of <U+0F68, U+0FA0, U+0F80,
U+0F72> and <U+0F68, U+0FA0, U+0F72, U+0F80> is the right encoding
for the stack with U+0F80 to the left of U+0F72?
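
Whether normalization already settles the order can be checked with Python's standard `unicodedata` module. This is a small sketch, not an answer to the question: it only prints the canonical combining class of each code point in the two sequences above and tests whether the sequences become identical under NFC. The result depends on the Unicode version bundled with the Python interpreter in use.

```python
import unicodedata

first = "\u0F68\u0FA0\u0F80\u0F72"   # <U+0F68, U+0FA0, U+0F80, U+0F72>
second = "\u0F68\u0FA0\u0F72\u0F80"  # <U+0F68, U+0FA0, U+0F72, U+0F80>

for label, text in (("first", first), ("second", second)):
    classes = [unicodedata.combining(ch) for ch in text]
    normalized = unicodedata.normalize("NFC", text)
    print(label, classes, [f"U+{ord(ch):04X}" for ch in normalized])

same_after_nfc = (
    unicodedata.normalize("NFC", first) == unicodedata.normalize("NFC", second)
)
print("identical after NFC:", same_after_nfc)
```

If the two sequences stay distinct after normalization, the choice between them has to come from the text of the standard rather than from canonical ordering.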

Date/Time: Sun Mar 13 16:30:37 CDT 2022
Name: David Corbett
Report Type: Error Report
Opt Subject: IndicPositionalCategory.txt

The Kayah Li vowel signs U+A926..U+A92A should have Indic_Positional_Category = Top.

Date/Time: Wed Mar 16 13:13:48 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Order of Indic cantillation marks

What should the relative order be between above- and below-base marks with 
Indic_Syllabic_Category=Cantillation_Mark and post-base marks like visarga? 
Microsoft’s Indic shaper expects such Vedic marks at the end of the cluster, 
but Microsoft’s USE expects them to be mixed in with other marks, meaning 
they precede post-base marks.

Date/Time: Sun Apr 10 09:14:28 CDT 2022
Name: Tuğrul Çavdar
Report Type: Other Question
Opt Subject: Regarding to “2021-04-21 Application for Adding Letters to Old Turkish Alphabet — Gökbey Uluç”

Dear Unicode Consortium,

Regarding to “2021-04-21 Application for Adding Letters to Old Turkish
Alphabet — Gökbey Uluç”:
https://www.unicode.org/L2/L2021/21081-old-turkish-add.pdf

The Old Turkic alphabet has no letters corresponding to the consonants
F, H, V, J, C because none of these sounds existed in the era when Old
Turkic was used (before A. D. 840). Also, the vowels “O/U” are written with the same
letter, 𐰆, and “Ö/Ü” likewise with the same letter, 𐰇. The current letters of the Old
Turkic table defined in https://unicode.org/charts/PDF/U10C00.pdf are
correct.

Gökbey has fabricated new letters for F, H, V, J, C on his own to write
today’s Turkish with Old Turkic letters. He also uses the letter
“10C0A 𐰊 OLD TURKIC LETTER YENISEI AB” for the vowel “O”, and the
“uş/ush” letter (one of the two letters in his proposal) for the vowel “Ö”.

He also uses the letter
“10C06 𐰆 OLD TURKIC LETTER ORKHON O/U” for the vowel “U” only, and the letter
“10C07 𐰇 OLD TURKIC LETTER ORKHON OE/UE” for the vowel “Ü” only.

His fabricated alphabet is
http://2.bp.blogspot.com/-RGZcnZW6bos/VISzObbuwpI/AAAAAAAACeM/0mnoucBfr30/s1600/cagdas-turk-damgalari.png

(from his blog: http://kokturukce.blogspot.com/2011/04/yeni-damgalar-yeni-yaz-duzeni.html)

He has proposed these two letters in order to use them for “Ö” and “AH”
rather than for “UŞ/USH” and “IÇ/ICH”. In this way he plans to use his
fabricated alphabet on digital platforms.

There are many variations of Old Turkic letters as can be seen in:
http://www.tamga.org/2014/12/farkl-dillerdeki-kitablarda-kokturuk.html

It is impossible to produce codes for all variations. The current letters in
Old Turkic table of Unicode.Org are correct and the direction of the
current iç/ich: 𐰱 is correct.

For your information.

Yours sincerely,

Assoc. Prof. Tuğrul Çavdar, Ph. D.
Karadeniz Technical University
Trabzon, Turkey

Feedback routed to Properties & Algorithms Group for evaluation [PAG]

Date/Time: Wed Jan 19 03:46:02 CST 2022
Name: Reini Urban
Report Type: Error Report
Opt Subject: tr31-latest

TR31 Security Bugs (UCD Versions 1-14)
======================================

1. U+FF00..U+FFEF not as ID
---------------------------

Most of the U+FF00..U+FFEF Full and Halfwidth letters incorrectly have the
`ID_Start` or `ID_Continue` property, and likewise the XID properties.
They should not, because they are confusable with the normal
characters in the base planes.  E.g. LATIN A-Z are indistinguishable
from Ａ..Ｚ, LATIN a-z from ａ..ｚ, likewise for the Katakana ｦ..ｯ and
ｱ..ﾝ, and the Hangul ﾠ..ﾾ, ￂ..ￇ, ￊ..ￏ, ￒ..ￗ ￚ..ￜ halfwidth letters.

This is a security risk, especially for TR39. TR39 provides Identifier Type
properties to exclude insecure identifiers, but I cannot find any
other type property to set these U+FF21..U+FFDC IDs to, than
`Not_XID`. Thus the `ID_Start`/`ID_Continue` property should be
deleted for all of them. If they are not identifiable, they should not
be marked as such.
Since XID's are guaranteed stable and nobody cares yet about TR39, I 
would accept a new TR39 Identifier Type property Confusable, or just 
set the Not_XID property there for these.
But really, defects, esp. security defects should be fixed.
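
The effect can be observed in Python, whose `str.isidentifier` follows the language's XID-based identifier definition. The sketch below is an illustration only: it groups candidate identifiers by their NFKC form to show that a fullwidth letter is accepted as an identifier and collapses onto its ASCII counterpart. The helper name `nfkc_collisions` is hypothetical.

```python
import unicodedata
from collections import defaultdict


def nfkc_collisions(identifiers):
    """Group identifiers that become identical under NFKC normalization."""
    groups = defaultdict(list)
    for name in identifiers:
        groups[unicodedata.normalize("NFKC", name)].append(name)
    return {key: names for key, names in groups.items() if len(names) > 1}


names = ["A", "\uFF21", "total"]  # U+FF21 is FULLWIDTH LATIN CAPITAL LETTER A
print([name.isidentifier() for name in names])
collisions = nfkc_collisions(names)
print({key: [f"U+{ord(n):04X}" for n in group] for key, group in collisions.items()})
```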

2. Medial letters in `ID_Start`, not `ID_Continue`
--------------------------------------------------

DerivedCoreProperties lists all of the Arabic and Thai MEDIAL letters,
which are part of identifiers, in `ID_Start`, not in `ID_Continue`. Only
the Combining marks are in `ID_Continue`.  Thus all unicode-aware
parsers incorrectly accept all MEDIAL letters in the start
position. They should only be allowed in the `ID_Continue` position,
and parsers should disallow them in the end positions for identifiers.

All the other medial letters (Myanmar, Canadian Aboriginal, Ahom,
Dives Akuru) are not part of Recommended Scripts, so they do not
affect TR39 security. But since almost nobody but Java, cperl and Rust
honors TR39, it still affects most parsers.

Other medial exceptions are noted in TR31 at 2.4 Specific Character
Adjustments, but the tables DerivedCoreProperties and TR39 Identifier
tables and thus all user parsers are wrong.
< https://www.unicode.org/reports/tr31/#Specific_Character_Adjustments >

Date/Time: Sat Feb 19 16:32:04 CST 2022
Name: Karl Williamson
Report Type: Error Report
Opt Subject: UCD

U+1F8A0: LEFTWARDS BOTTOM-SHADED WHITE ARROW   🢠
U+1F8A1: RIGHTWARDS BOTTOM SHADED WHITE ARROW   🢡

While not technically an error, the names of these symmetric characters are 
asymmetric.  One has a HYPHEN-MINUS between BOTTOM and SHADED and the other 
has a SPACE.  It would be helpful to add a Name Alias to one or the other.
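
The asymmetry is easy to confirm with Python's standard `unicodedata` module. This is a minimal sketch; it only compares the two names as reported above, once as they are and once with the HYPHEN-MINUS replaced by a SPACE.

```python
import unicodedata

left = unicodedata.name("\U0001F8A0")
right = unicodedata.name("\U0001F8A1")
print(left)
print(right)

# Swap the direction word and compare, first as-is, then ignoring the hyphen.
mirrored = left.replace("LEFTWARDS", "RIGHTWARDS")
print("names mirror each other:", mirrored == right)
print("mirror when hyphen is ignored:", mirrored.replace("-", " ") == right)
```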

Date/Time: Thu Feb 24 03:26:53 CST 2022
Name: Martin J. Dürst
Report Type: Website Problem
Opt Subject: Case Charts

In the case charts at https://www.unicode.org/charts/case/,
together with Yusuke Endoh, a fellow Ruby committer, I discovered a 
problem: the charts list the lowercased version of U+0130 (İ) as U+0069 (i). 
This is the simple case mapping; the full case mapping is U+0069 U+0307 
at https://www.unicode.org/Public/UCD/latest/ucd/SpecialCasing.txt, 
line 69. The case charts don't say anything about simple vs. full case 
mappings. They should say something (it's unclear to me at the moment 
exactly what they should say, because it's unclear to me exactly what they do).
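
For comparison, Python's `str.lower` applies the full case mapping, so the difference from the chart entry is visible directly. This is a small illustration in Python 3, not a statement about how the case charts are generated.

```python
capital_i_with_dot = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
lowered = capital_i_with_dot.lower()

# Show the result as code points so the combining mark is visible.
print([f"U+{ord(ch):04X}" for ch in lowered])
print("single code point:", len(lowered) == 1)
```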

Date/Time: Mon Mar 28 18:33:50 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Vai line breaking

This is feedback on L2/22-080. Another script with line breaks between orthographic 
syllables is Vai. The description in chapter 19 indicates that most Vai letters 
should have lb=ID, and U+A60B and U+A60C should have lb=BA. The “h-” characters might be ID or BA.

Feedback routed to Emoji SC for evaluation [ESC]

Date/Time: Sun Mar 6 21:15:52 CST 2022
Name: Fake Unicode
Report Type: Error Report
Opt Subject: emoji-list.html & emoji-test.txt

Per [https://twitter.com/roozbehp/status/1500663503316602882] It would be better to 
categorize the emoji 🍄 1f344 MUSHROOM under subcategory "plant-other" rather than 
as "food-vegetable", since all vendors show it as an inedible poisonous toadstool 
[ref: https://emojipedia.org/mushroom/].

Date/Time: Sat Apr 9 07:14:54 CDT 2022
Name: Matthias Reitinger
Report Type: Error Report
Opt Subject: emoji-test.txt

The data file https://www.unicode.org/Public/emoji/14.0/emoji-test.txt
(Date: 2021-08-26, 17:22:23 GMT) contains these 10 code point sequences
with the status "unqualified":

1F441 FE0F 200D 1F5E8 ; unqualified # 👁️‍🗨 E2.0 eye in speech bubble
1F575 FE0F 200D 2642 ; unqualified # 🕵️‍♂ E4.0 man detective
1F575 FE0F 200D 2640 ; unqualified # 🕵️‍♀ E4.0 woman detective
1F3CC FE0F 200D 2642 ; unqualified # 🏌️‍♂ E4.0 man golfing
1F3CC FE0F 200D 2640 ; unqualified # 🏌️‍♀ E4.0 woman golfing
26F9 FE0F 200D 2642 ; unqualified # ⛹️‍♂ E4.0 man bouncing ball
26F9 FE0F 200D 2640 ; unqualified # ⛹️‍♀ E4.0 woman bouncing ball
1F3CB FE0F 200D 2642 ; unqualified # 🏋️‍♂ E4.0 man lifting weights
1F3CB FE0F 200D 2640 ; unqualified # 🏋️‍♀ E4.0 woman lifting weights
1F3F3 FE0F 200D 26A7 ; unqualified # 🏳️‍⚧ E13.0 transgender flag

I believe these code point sequences should be "minimally-qualified"
instead.

The Unicode® Technical Standard #51 Revision
21 <https://www.unicode.org/reports/tr51/tr51-21.html> defines these
terms:

> ED-17a. qualified emoji character — An emoji character in a string
that (a) has default emoji presentation or (b) is the first character in
an emoji modifier sequence or (c) is not a default emoji presentation
character, but is the first character in an emoji presentation sequence.

> ED-18. fully-qualified emoji — A qualified emoji character, or an emoji
sequence in which each emoji character is qualified.

> ED-18a. minimally-qualified emoji — An emoji sequence in which the
first character is qualified but the sequence is not fully qualified.

> ED-19. unqualified emoji — An emoji that is neither fully-qualified nor
minimally qualified.

Each of the sequences in question is an emoji zwj sequence (ED-16) with two
elements.

The first element of each sequence is an emoji presentation sequence
(ED-9a), where the emoji character is not a default emoji presentation
character. Therefore the first character is a qualified emoji character
according to ED-17a (c).

The second element of each sequence is a single emoji character that does
not have default emoji presentation. It is therefore not a qualified emoji
character according to ED-17a.

So the emoji zwj sequence contains one qualified emoji character (the first
emoji character) and one non-qualified emoji character (the second emoji
character).

According to ED-18 the sequence is not a fully-qualified emoji, because not
every emoji is qualified.

But the sequence is minimally-qualified according to ED-18a, as the first
emoji character is qualified, but the sequence is not fully-qualified.

Therefore the listed sequences should be marked as minimally-qualified in
the emoji-test.txt data file.
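
The argument above can be sketched as a small classifier. This is a simplified illustration under stated assumptions, not the procedure used to generate emoji-test.txt: every code point other than ZWJ, U+FE0F and the emoji modifiers is treated as an emoji character, and the set of default emoji presentation characters is left empty because, per this report, none of the characters in the listed sequences has that property. The names `qualification` and `DEFAULT_EMOJI_PRESENTATION` are hypothetical.

```python
ZWJ = 0x200D
VS16 = 0xFE0F  # emoji presentation selector

# Assumption: empty for the sequences in this report; a fuller checker would
# fill this set from the Emoji_Presentation property.
DEFAULT_EMOJI_PRESENTATION = frozenset()


def is_emoji_modifier(cp):
    return 0x1F3FB <= cp <= 0x1F3FF


def qualification(sequence):
    """Classify a space-separated hex sequence per ED-17a, ED-18, ED-18a, ED-19."""
    code_points = [int(token, 16) for token in sequence.split()]
    qualified = []  # one flag per emoji character
    for i, cp in enumerate(code_points):
        if cp in (ZWJ, VS16) or is_emoji_modifier(cp):
            continue  # simplification: these are not scored themselves
        following = code_points[i + 1] if i + 1 < len(code_points) else None
        qualified.append(
            cp in DEFAULT_EMOJI_PRESENTATION                             # ED-17a (a)
            or (following is not None and is_emoji_modifier(following))  # ED-17a (b)
            or following == VS16                                         # ED-17a (c)
        )
    if not qualified:
        raise ValueError("no emoji characters found")
    if all(qualified):
        return "fully-qualified"
    if qualified[0]:
        return "minimally-qualified"
    return "unqualified"


for line in ("1F441 FE0F 200D 1F5E8", "1F441 FE0F 200D 1F5E8 FE0F", "1F441 200D 1F5E8"):
    print(line, "->", qualification(line))
```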

Feedback routed to Editorial Committee for evaluation [EDC]

Date/Time: Mon Jan 17 01:46:41 CST 2022
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: Core Specification

Note: This report has already been addressed by the Editorial Committee in a 15.0 draft.

Section 24.1 of the Core Specification, Character Names List, describes 
the Dashed Box Convention: "DashedBoxConvention. There are a number of 
characters in the Unicode Standard which in normal text rendering have 
no visible display, or whose only effect is to modify the display of 
other characters in proximity to them."

Since Unicode 6.0, the dashed box convention has also been applied to 
characters with Indic syllabic category Consonant_Preceding_Repha. 
Such characters are always rendered visibly; the dashed box is used 
to indicate that they require reordering to after the following base 
character.

Date/Time: Thu Jan 27 15:49:01 CST 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UAX #24 and UAX #31

UAX #24 contains the mistakes “GREEK LETTER SMALL LETTER OMICRON” (instead 
of “GREEK SMALL LETTER OMICRON”) and “in provided in” (instead of “is provided 
in”). A period is missing after “can be classified by script”.

UAX #31 contains “an definition” (instead of “a definition”) and possibly 
some misplaced spaces (search for “ ,” and “ .”).

Date/Time: Tue Feb 1 01:51:59 CST 2022
Name: Vikki McDonough
Report Type: Error Report
Opt Subject: Unicode 14.0 "Optical Character Recognition" code chart

In the code chart for the Optical Character Recognition block, the reference
glyph for character U+2447, OCR AMOUNT OF CHECK, is misshapen.  The
vertical bar in the middle of the glyph should be centered vertically; if
we take the lower-left rectangle as glyph-component A, the vertical bar as
glyph-component B, and the upper-right rectangle as glyph-component C, and
designate the height of the upper and lower edges of each component as
hU([A/B/C]) and hL([A/B/C]), respectively, {hU(A)-hL(B)} should equal
{hU(B)-hL(C)}.  However, in the reference glyph for this character in the
official Optical Character Recognition code chart, the vertical bar is too
high up, and {hU(B)-hL(C)} is much greater than {hU(A)-hL(B)}.

This error has been present since at least Unicode 3.0 (the earliest Unicode
version for which an archived copy of the Optical Character Recognition
code chart is retrievable from the Wayback Machine).

Code chart containing the error:
https://www.unicode.org/charts/PDF/U2440.pdf ("Optical Character
Recognition; Range: 2440–245F")

Archived Unicode 3.0 code chart demonstrating a lower bound on the length of
time this error has been present:
https://web.archive.org/web/20010603000706/http://www.unicode.org/charts/PDF/U2440.pdf 

Example of an E-13B-based font showing the correct form of this glyph:
https://commons.wikimedia.org/wiki/File:MICR_char.svg 
(high-resolution version:
https://upload.wikimedia.org/wikipedia/commons/thumb/5/58/MICR_char.svg/2560px-MICR_char.svg.png)

Date/Time: Sat Feb 12 13:44:53 CST 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UTR #17

I suggest the following corrections in UTR #17:

"O'Reilley" → "O'Reilly"
"graphic character glyphic identifier" → "graphic character global identifier"
"Graphic Character Set Glyphic Identifier" → "Graphic Character Global Identifier"
"UTF32-LE" → "UTF-32LE"
"an single" → "a single"
"sets, where for example," → "sets where, for example,"
"UTF-16 ," → "UTF-16,"
"(“character set” )" → "(“character set”)"
"3.0,..." → "3.0, ..."
"CCS's" → "CCSes"
"UDC's" → "UDCs"
"UAX# 29" → "UAX #29"
"Compression. [BOCU]." → "Compression [BOCU]."

Date/Time: Sun Feb 13 10:55:14 CST 2022
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: UTR #23

UTR #23 contains the following minor mistakes:

“An code” (instead of “A code”),
“comparsion” (instead of “comparison”),
“applies as” (instead of “apply as”),
“an encoded characters” (instead of “an encoded character”),
“properties the” (instead of “properties of the”),
“Unicode Character database” (instead of “Unicode Character Database”),
“For example 'Character Property', becomes” (instead of “For example, 'Character Property' becomes”).

A space is missing in “results.Proceeding”. I also suggest changing “values, 
(other than the default value)” to “values (other than the default value),”.

The comma here can be deleted:

“accessed,”,
“a property, with”,
“way, is”,
“input, is”,
“for, is”.

Date/Time: Thu Mar 3 20:55:52 CST 2022
Name: David Corbett
Report Type: Error Report
Opt Subject: Chapter 9

The glyphs for positional forms of U+0886 ARABIC LETTER THIN YEH in 
chapter 9 look identical to those for U+064A ARABIC LETTER YEH. They should be thin.

Date/Time: Thu Mar 17 19:31:23 CDT 2022
Name: Martin J. Dürst
Report Type: Error Report
Opt Subject: Unicode version 14.0.0, section 5.4

This is not really an error, but a place where language could be improved.
Section 5.4 of Unicode 14.0.0
(https://www.unicode.org/versions/Unicode14.0.0/ch05.pdf) contains the
following:

```
Because the ranges are disjoint, each code unit in well-formed UTF-16 must
meet one of only three possible conditions:
• A single non-surrogate code unit, representing a code point between 0 and
D7FF16 or between E00016 and FFFF16
• A leading surrogate, representing the first part of a surrogate pair
• A trailing surrogate, representing the second part of a surrogate pair
```

The wording here is a bit strange. "Condition" seems to require "It is ..."
in each of the bulleted items. Either add "It is " to each bullet, or
change the preceding text to say "it is one of the following three".
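
The three cases in the quoted passage correspond to a simple range test on each 16-bit code unit. The following is a minimal sketch in Python; the function names `classify_code_unit` and `utf16_code_units` are hypothetical and chosen for illustration.

```python
def classify_code_unit(unit):
    """Return which of the three cases a 16-bit UTF-16 code unit falls into."""
    if not 0 <= unit <= 0xFFFF:
        raise ValueError("not a 16-bit code unit")
    if 0xD800 <= unit <= 0xDBFF:
        return "leading surrogate"
    if 0xDC00 <= unit <= 0xDFFF:
        return "trailing surrogate"
    return "non-surrogate"  # 0..D7FF or E000..FFFF


def utf16_code_units(text):
    """Encode text as big-endian UTF-16 and return its code units as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


units = utf16_code_units("A\U0001F600")
print([f"{unit:04X}" for unit in units])
print([classify_code_unit(unit) for unit in units])
```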

Date/Time: Fri Mar 18 20:28:10 CDT 2022
Name: Eduardo Marín Silva
Report Type: Public Review Issue
Opt Subject: 442

On the code charts for Cyrillic Extended-D some of the characters use the
Greek letterforms (of Delta and Phi respectively) rather than the Cyrillic
ones (of be and ef respectively). These are: 1E031, 1E042, 1E052 &
1E060. The latter two are just the subscript version of the former two,
with the same issue.

Date/Time: Sun Apr 10 08:59:51 CDT 2022
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Armenian left half ring

Section 7.6 “Armenian” says “There is no left half ring in Armenian. Unicode
character U+0559 is not used. It appears that this character is a duplicate
character, which was encoded to represent U+02BB MODIFIER LETTER TURNED
COMMA, used in Armenian transliteration. U+02BB is preferred for this
purpose.” Via https://en.wiktionary.org/wiki/%D5%99 I found
http://www.nayiri.com/imagedBook.jsp?id=1&printPage=10 which shows a
left half ring (or turned apostrophe) being used in the Armenian script in
a book on Armenian dialects. Should this character be encoded as U+0559 or
as U+02BB? The standard should explain which to use in the Armenian script,
because the standard is currently wrong or at least misleading.

Date/Time: Mon Apr 11 17:49:00 CDT 2022
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: IndicSyllabicCategory.txt

The file IndicSyllabicCategory.txt has a category Brahmi_Joining_Number, 
which contains only the Brahmi numbers U+11052..U+11065. The documentation 
for that category in the same file says "similar to Number in that in can 
be used as vowel-holders like Consonant_Placeholder, but may also be joined 
by a Number_Joiner of the same script, e.g. in Brahmi".

This contradicts the core specification, section 14.1, which says "the 
numerals U+11052 brahmi number one through U+11065 brahmi number one 
thousand and their ligatures formed with U+1107F brahmi number joiner 
are not used as vowel carriers".

Other Reports

(None at this time.)
