Comments on Public Review Issues

L2/21-169

Comments on Public Review Issues
(July 20 - Sept 25, 2021)

The sections below contain links to permanent feedback documents for the open Public Review Issues, as well as other public feedback received as of September 25, 2021, since the previous cumulative document was issued prior to UTC #168 (July 20, 2021).

Issue	Name	Feedback Link
427	Proposed Update UTS #18, Unicode Regular Expressions	(feedback)	UTC
426	Proposed Update UTR #53, Unicode Arabic Mark Rendering	(feedback) No new feedback at this time	UTC

Feedback routed to Unihan ad hoc for evaluation

Date/Time: Tue Jul 20 00:42:55 CDT 2021
Name: Jerome Alan Rossignuolo
Report Type: Other Question, Problem, or Feedback
Opt Subject: Missing Fundamental Chinese BuShou Radicals in UNICODE

Hello,

I am doing research into a learning tool for Simplified Chinese. I find it
exceedingly curious that of the 280 Primary and Associated Indexing
Components (部首) as standardized by GB13000.1 Chinese Character Component
Standard (see http://ling.whu.edu.cn/law/002/2016-04-20/1307.html) and used
by the XinHuaZiDian (新华字典), the most widely used dictionary in China, four
are missing from UNICODE! That is, there are 276 of these encoded in
UNICODE and four have no UNICODE encoding!

(A small note: the terms Indexing Component and Radical are commonly but
imprecisely interchanged. Most people, even fluent Chinese speakers, will
incorrectly refer to what are actually Indexing Components as Radicals.)

Moreover, it is obviously not just me faced with the difficulties of not
being able to type these Indexing Components. I have scoured the Internet
looking for information on them and cannot find any reference to them in
the 100s of documents that do reference the 276 others. These four are
simply left out of virtually all material, although you can find them in a
scant few places as images. They are in the PDF files of GB13000.1, in the
printed XinHuaZiDian dictionary, and in the official XinHuaZiDian mobile
app.

These Associated Indexing Components are still in use. Why are they missing
from UNICODE? The Indexing Components are the fundamental building blocks
of Chinese. In a way, they are somewhat analogous to our alphabet. It is
like missing the lowercase letter q from UNICODE. The difference is
that one can still type a Chinese character in UNICODE that includes the
missing Indexing Component. Still, it is exceedingly odd that these cannot be
typed, since they need to be typed in any article referencing the Indexing
Components in the Chinese language used by billions of people. Their absence
from the Internet is testament that this is clearly causing people trouble.

Since I cannot type them, I refer you to the missing Indexing
Components in the XinHuaZiDian. They are [50], [55], [68], and [145]. These
can also be found in GB13000.1.

If you need additional clarification or information, please contact me at
[redacted]. I can send you a photo of the missing Indexing
Components.

Sincerely,
Jerome Rossignuolo

Date/Time: Mon Aug 2 17:35:50 CDT 2021
Name: William He
Report Type: Error Report
Opt Subject: Error in definition of U+8561 蕡

Hello,

The definition of U+8561 蕡 is listed as "hemp seeds; plant with abundant". 
Something seems to be missing from this definition. Perhaps it means to say 
"plant with abundant fruit" or something to that effect.

Thanks,
William

Date/Time: Thu Aug 12 09:45:13 CDT 2021
Name: Ken Lunde
Report Type: Error Report
Opt Subject: kSpoofingVariant errors?

The latest versions of the JIS X 0208 and JIS X 0213 standards explicitly state 
that U+53F1 叱 is a variant of U+20B9F 𠮟, and can be used in implementations 
that support only the former standard. This is reflected in their kJoyoKanji 
property values:

U+53F1 kJoyoKanji U+20B9F
U+20B9F kJoyoKanji 2010

The following are the kSpoofingVariant property values for these two ideographs:

U+53F1 kSpoofingVariant U+20B9F
U+20B9F kSpoofingVariant U+53F1

Based on their treatment in Jōyō Kanji and the JIS standards as explicit 
variants, perhaps they should instead have kZVariant property values:

U+53F1 kZVariant U+20B9F
U+20B9F kZVariant U+53F1

Of course, this is not urgent, and should be considered for Unicode Version 15.0.

Regards...

-- Ken

Date/Time: Mon Sep 20 23:43:06 CDT 2021
Name: Eiso Chan
Report Type: Error Report
Opt Subject: 4 missing UK glyphs in Unicode, 14.0.0

The following UK glyphs are missing in Unicode, 14.0.0.

UK-02830 for U+238A7
UK-02849 for U+2A909
UK-01320 for U+2B92E
UK-01422 for U+2E66E

Date/Time: Thu Sep 23 15:18:26 CDT 2021
Name: Jaemin Chung
Report Type: Website Problem
Opt Subject: Unihan Database contents search feature update suggestion

I suggest that the Unihan Database contents search feature be updated.
http://unicode.org/charts/unihansearch.html 

1. It should not be limited to the definition, Cantonese, Mandarin, Tang, 
	Japanese on/kun, and Korean (Yale).
2. There should be something like a "match whole words only" feature. Someone 
	searching for the reading "han" may not want "chang".
3. For Mandarin, tone numbers no longer work because the Unihan DB now uses 
	tone marks. So the "jing3" example on the main search page should be changed.
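
The difference between substring and whole-word matching raised in point 2 can be sketched in a few lines of Python. This is only an illustration: the function name `find_readings` and the sample readings are hypothetical, and nothing here reflects how the Unihan search page is actually implemented.

```python
import re

def find_readings(query, readings, whole_word=False):
    """Return the readings that match `query`.

    With whole_word=False the match is a plain substring test;
    with whole_word=True the query must sit between word boundaries.
    """
    if whole_word:
        pattern = re.compile(r"\b" + re.escape(query) + r"\b")
        return [r for r in readings if pattern.search(r)]
    return [r for r in readings if query in r]

# Hypothetical sample readings, not taken from the Unihan Database.
sample = ["han", "chang", "han gan", "shan"]
print(find_readings("han", sample))                   # substring match
print(find_readings("han", sample, whole_word=True))  # whole-word match
```

With the substring test, "chang" and "shan" are returned for the query "han"; with the word-boundary test, only readings containing "han" as a separate word are returned.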

Date/Time: Thu Sep 23 23:54:38 CDT 2021
Name: Jaemin Chung
Report Type: Error Report
Opt Subject: U+6F55 and U+23C98

The following kSimplifiedVariant and kTraditionalVariant values should be added 
to the Unihan Database.

U+6F55	kSimplifiedVariant	U+23C98
U+23C98	kTraditionalVariant	U+6F55

(U+6F55 is 潕 and U+23C98 is 𣲘)

Date/Time: Fri Sep 24 00:26:38 CDT 2021
Name: Jaemin Chung
Report Type: Error Report
Opt Subject: U+44E8 & U+7F43 and U+6C84 & U+6F90


Here are additional kSimplifiedVariant and kTraditionalVariant values that 
should be added to the Unihan Database.

U+44E8	kTraditionalVariant	U+7F43
U+7F43	kSimplifiedVariant	U+44E8
(U+44E8 is 䓨 and U+7F43 is 罃)

U+6C84	kTraditionalVariant	U+6F90
U+6F90	kSimplifiedVariant	U+6C84
(U+6C84 is 沄 and U+6F90 is 澐)
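
Proposed pairs like the ones above can be checked mechanically for reciprocity. The following Python sketch assumes the simple three-field, tab-separated layout shown in these reports with a single value per line; actual Unihan data files may carry multiple values or annotations in the value field, which this sketch does not handle. The helper names are hypothetical.

```python
# Inverse property for each variant property (pairing assumed for this sketch).
INVERSE = {
    "kSimplifiedVariant": "kTraditionalVariant",
    "kTraditionalVariant": "kSimplifiedVariant",
}

def parse_lines(text):
    """Parse 'U+XXXX<tab>property<tab>U+YYYY' lines into a set of triples."""
    triples = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, prop, target = line.split("\t")
        triples.add((source, prop, target))
    return triples

def missing_inverses(triples):
    """Return the inverse triples that are absent from `triples`."""
    missing = set()
    for source, prop, target in triples:
        inverse = (target, INVERSE[prop], source)
        if inverse not in triples:
            missing.add(inverse)
    return missing

# The values proposed in the two reports above.
proposed = "\n".join([
    "U+6F55\tkSimplifiedVariant\tU+23C98",
    "U+23C98\tkTraditionalVariant\tU+6F55",
    "U+44E8\tkTraditionalVariant\tU+7F43",
    "U+7F43\tkSimplifiedVariant\tU+44E8",
    "U+6C84\tkTraditionalVariant\tU+6F90",
    "U+6F90\tkSimplifiedVariant\tU+6C84",
])
print(missing_inverses(parse_lines(proposed)))
```

An empty result means every proposed mapping has its reverse mapping in the same set.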

Feedback routed to Script ad hoc for evaluation

Date/Time: Mon Aug 2 10:58:51 CDT 2021
Name: Rod Lockwood
Report Type: Other Question, Problem, or Feedback
Opt Subject: Superscripted Ordinal Suffixes

Because you did not make a complete superscript set of the Latin alphabet, there 
is no way to create the superscripted ordinal suffixes st, nd, rd, or th without 
changing the font.

Date/Time: Wed Sep 8 04:10:47 CDT 2021
Name: Brian Sullender
Report Type: Error Report
Opt Subject: Different glyph with the same combination of Code Points

In the code chart document "Arabic Presentation Forms-A", I have found
what appears to be either a typo or an error in the specification.

The presentation Code Points FC03 and FBF9 are different glyphs with the
same Code Point combination, and both are of the "isolated" form.

I don't know anything about these languages, but this looks wrong.

Found the problem when running an algorithm to import the Presentation Code
Points from the documents into a lookup table.

Date/Time: Thu Sep 9 02:50:37 CDT 2021
Name: Brian Sullender
Report Type: Error Report
Opt Subject: Different glyphs with the same Code Points

I recently reported an error in the document "Arabic Presentation Forms-A"
about 2 conflicting presentation code points. I wanted to inform you that there
are 2 other code points that conflict with each other. They are FBFA and
FC68; both have the same combination of code points, 0626 and 0649, and both
are "final" form presentation code points. I haven't found any others.
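
The kind of check described in these two reports can be approximated with Python's standard `unicodedata` module, which exposes the decomposition mapping of each character. This is a sketch, not the reporter's algorithm: the function name is hypothetical, and the results depend on the Unicode version bundled with the Python interpreter.

```python
import unicodedata
from collections import defaultdict

def duplicate_decompositions(first=0xFB50, last=0xFDFF):
    """Group code points in [first, last] by decomposition mapping and
    return only the mappings shared by more than one code point."""
    groups = defaultdict(list)
    for cp in range(first, last + 1):
        mapping = unicodedata.decomposition(chr(cp))
        if mapping:  # an empty string means there is no decomposition
            groups[mapping].append(cp)
    return {m: cps for m, cps in groups.items() if len(cps) > 1}

# Print each shared mapping with the names of the code points that use it.
for mapping, cps in sorted(duplicate_decompositions().items()):
    print(mapping)
    for cp in cps:
        print(f"  U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Printing the character names next to each shared mapping shows how the standard distinguishes code points that decompose to the same sequence.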

Feedback routed to Properties & Algorithms ad hoc for evaluation

Date/Time: Fri Aug 6 16:34:05 CDT 2021
Name: Peter Constable
Report Type: Other Question, Problem, or Feedback
Opt Subject: UTS #39 data file default property values

UTC #168 discussed enhancements to use of @missing lines to indicate default property values. 
Coincidentally, I notice that the Identifier_Type and Identifier_Status data files for UTS #39 
do not use the @missing convention to indicate default values at all. Rather, each has a prose 
statement (not machine readable) describing default values. Moreover, each has two separate 
statements.

If UTC is going to be enhancing mechanisms for machine-readable default property values, it 
should consider incorporating the same mechanisms into all data files where relevant.
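
For readers unfamiliar with the convention, the following Python sketch shows what makes an `@missing` line machine readable where a prose statement is not. It handles only the simplest `range; value` form, and the sample file content is hypothetical rather than copied from the UTS #39 data files.

```python
import re

# Matches lines such as "# @missing: 0000..10FFFF; Restricted".
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-F]{4,6})\.\.([0-9A-F]{4,6})\s*;\s*(.+?)\s*$"
)

def parse_missing(lines):
    """Return (first, last, default_value) tuples from @missing lines."""
    defaults = []
    for line in lines:
        m = MISSING_RE.match(line)
        if m:
            defaults.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return defaults

# Hypothetical file content for illustration only.
sample = [
    "# Any code point not listed has a default value (prose, not parseable).",
    "# @missing: 0000..10FFFF; Restricted",
    "0027          ; Allowed    # 1.1        APOSTROPHE",
]
print(parse_missing(sample))
```

A parser can pick up the default from the second line without any special knowledge of the file, whereas the prose comment on the first line has to be read by a person.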

Date/Time: Mon Aug 30 03:46:13 CDT 2021
Name: Anne van Kesteren
Report Type: Error Report
Opt Subject: ToASCII does not account for trailing dots

If you invoke https://www.unicode.org/reports/tr46/#ToASCII with VerifyDnsLength 
set to true it seems you cannot pass a domain such as `example.org.` (note the 
trailing dot) even though that is a valid domain.

Credit: Gijs Kruitbosch.
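
The behavior described can be illustrated with a small Python sketch of a DNS length check alone. It is not the UTS #46 ToASCII algorithm; it assumes the commonly stated limits of 1 to 63 characters per label and 1 to 253 characters for the whole name excluding the root label, and `allow_root_label` is a hypothetical switch added for illustration.

```python
def verify_dns_length(domain, allow_root_label=False):
    """Check DNS length limits on an already-ASCII domain name.

    Each label must be 1 to 63 characters long, and the whole name,
    excluding an optional trailing root dot, 1 to 253 characters long.
    `allow_root_label` is a hypothetical option for this sketch.
    """
    if allow_root_label and domain.endswith("."):
        domain = domain[:-1]  # drop the empty root label and its dot
    if not 1 <= len(domain) <= 253:
        return False
    return all(1 <= len(label) <= 63 for label in domain.split("."))

print(verify_dns_length("example.org"))
print(verify_dns_length("example.org."))
print(verify_dns_length("example.org.", allow_root_label=True))
```

Without special handling, splitting `example.org.` on dots yields an empty final label, which fails the per-label minimum length; this is the situation the report describes.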

Date/Time: Thu Sep 9 06:59:02 CDT 2021
Name: Mickey Rose
Report Type: Error Report
Opt Subject: incorrect grammar in UTS #18: Character Classes with Strings

The auxiliary grammar presented in 2.2.1 Character Classes with Strings
(https://unicode.org/reports/tr18/#Character_Ranges_with_Strings) does not
generate the examples given further on.

Here are some of the examples (within character class):
  [a-z\q{x\u{323}}]
  [a-z ñ \q{ch} \q{ll} \q{rr}]

And here is the grammar:
  ITEM := "\q{" (CODE_POINT (SP CODE_POINT)*)? "}"
  SP   := \u{20}

The grammar suggests that a single SP is required between individual
CODE_POINTs, which, if true, would be confusing: for example, [\q{c h}].

Besides, this ITEM production is supposed to be embedded in CHARACTER_CLASS
grammar (https://unicode.org/reports/tr18/#character_ranges) which
already allows and ignores whitespace:

  >> Whitespace is allowed between any elements, but to simplify the presentation the many occurrences of sequences of spaces (" "*) are omitted.


So I believe what was actually intended is this:
  ITEM := "\q{" CODE_POINT2* "}"

(with whitespace allowed by virtue of being embedded within CHARACTER_CLASS
grammar)

In this scenario [\q{aa ch}] is equivalent to [\q{aach}].


Alternatively, if SP is intended to separate whole strings inside \q{}, then
you need to allow multiple CODE_POINTs without SP between them:

  ITEM := "\q{" CODE_POINT2* (SP CODE_POINT2+)* "}"

In this scenario [\q{aa ch}] is equivalent to [\q{aa}\q{ch}]. But then it
would be very confusing that only \u{20} would act as separator, while
other whitespace like \u{09} wouldn't.

In either case, some examples with spaces inside \q{...} should be given for
clarification.
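
The two readings discussed in this report can be made concrete with a short Python sketch. It is an illustration under stated assumptions, not an implementation of the UTS #18 grammar: the function names are hypothetical, only `\u{...}` escapes and literal characters are recognized inside `\q{...}`, and whitespace other than U+0020 is simply dropped in both readings.

```python
import re

# One element inside \q{...}: a \u{hex} escape, or any single non-space character.
ELEMENT_RE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}|(\S)")

def decode(chunk):
    r"""Turn a chunk such as x\u{323} into the string it denotes."""
    out = []
    for m in ELEMENT_RE.finditer(chunk):
        out.append(chr(int(m.group(1), 16)) if m.group(1) else m.group(2))
    return "".join(out)

def strings_if_whitespace_ignored(body):
    r"""Reading 1: whitespace is insignificant, so \q{aa ch} is one string."""
    return [decode("".join(body.split()))]

def strings_if_space_separates(body):
    r"""Reading 2: U+0020 separates strings, so \q{aa ch} is two strings."""
    chunks = [chunk for chunk in body.split(" ") if chunk]
    return [decode(chunk) for chunk in chunks] or [""]

print(strings_if_whitespace_ignored("aa ch"))  # same as \q{aach}
print(strings_if_space_separates("aa ch"))     # same as \q{aa}\q{ch}
```

The same input therefore denotes either one four-character string or two two-character strings, depending on which reading the specification intends.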

Feedback routed to Emoji SC for evaluation

(None at this time.)

Feedback routed to Editorial Committee for evaluation

Date/Time: Fri Jul 23 03:04:50 CDT 2021
Name: Liang Hai
Report Type: Error Report
Opt Subject: Obscure statement in section 12.1, Devanagari: Rendering Devanagari

R10 (rule 10) in the subsection “Rendering Devanagari” of the Core Spec’s section 12.1, Devanagari:

> Other modifying marks, in particular bindus and svaras, … The relative placement 
> of these marks is horizontal rather than vertical; the horizontal rendering order may 
> vary according to typographic concerns.

It is unclear what “relative placement of these marks is horizontal” and “horizontal rendering 
order may vary” mean.

Date/Time: Fri Aug 6 17:02:10 CDT 2021
Name: Peter Constable
Report Type: Other Question, Problem, or Feedback
Opt Subject: bad links in UTS #46

Note: This has already been fixed in Unicode 14.0, for UTS #46 and UTS #39.

In the references section of UTS #46, several of the links to IETF documents are bad: 
some are simply links to anchors within UTS #46 itself (e.g., the references for IDNA 2003); 
and some external links are broken (unstable ietf.org URLs?; e.g., links for RFCs 5890, 5891, 
5893, 5894).

The following is an example URL that works (for RFC 5890): https://www.rfc-editor.org/info/rfc5890.

Date/Time: Mon Aug 23 09:11:43 CDT 2021
Name: Marc Lodewijck
Report Type: Public Review Issue
Opt Subject: PRI #433: Typos in NamesList

Let me point out the following typos in NamesList-14.0.0d11.txt:

002F	SOLIDUS
	= slash,forward slash, virgule  ## space missing after the comma

133CC	EGYPTIAN HIEROGLYPH W024
	* phonogramm 'nw'  ## the m is doubled

133CD	EGYPTIAN HIEROGLYPH W024A
	* monogramm 'nw(n)' or 'nww'  ## the m is doubled

133E4	EGYPTIAN HIEROGLYPH Z001
	* semogram index
	* classifier 'single'
	* not to be confuse with 133FA  ## should read 'confused'

Date/Time: Thu Aug 26 10:27:21 CDT 2021
Name: r12a
Report Type: Error Report
Opt Subject: Phonetic typo in Arabic section

On page 394 of v14 the long Kurdish u is described in phonetic notation as

u:   i.e.
u U+0075: LATIN SMALL LETTER U
: U+003A: COLON

whereas it should be

uː   i.e.
u U+0075: LATIN SMALL LETTER U
ː U+02D0: MODIFIER LETTER TRIANGULAR COLON
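
Because the two characters look nearly identical in many fonts, listing code points is a reliable way to tell them apart. A minimal Python sketch using the standard `unicodedata` module (the helper name `describe` is hypothetical):

```python
import unicodedata

def describe(text):
    """List each character of `text` as 'U+XXXX: NAME'."""
    return [f"U+{ord(ch):04X}: {unicodedata.name(ch)}" for ch in text]

print(describe("u:"))       # with the ASCII colon
print(describe("u\u02D0"))  # with the IPA length mark
```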

Note: The Editorial Committee has already reviewed feedback above this line, as of 2021/09/02.

Date/Time: Wed Sep 15 08:12:31 CDT 2021
Name: Angus Patrick
Report Type: Error Report
Opt Subject: Moon phases Naming

Dear Unicode Emojis,

I have put the type of message as "Error Report". I have done this because I
believe the moon phases emojis (I'm looking at Nos. 947-954) are
erroneously named.

They are labelled waning crescent, waxing gibbous, etc., when these are not
true descriptions in large parts of the world. For example, the one
labelled "waning crescent moon" looks like a waxing crescent moon when
looked at from the Southern Hemisphere.

This makes even less sense to people who live close to the equator: in
tropical zones this same crescent appears to lie on its side.

You might say that only a small proportion of the world's population lives
in the Southern Hemisphere, but I think it would be unfair, and maybe
discriminatory, to ignore their point of view.

I suggest that these emojis be renamed to more generic names to avoid being
offensive.

Sincerely

Angus Patrick

Date/Time: Wed Sep 15 10:03:17 CDT 2021
Name: Giacomo Catenazzi
Report Type: Error Report
Opt Subject: Missing number in codepoint in Kana Extended-B

On page 758 of the Unicode Standard 14.0.0, the sub-chapter title states
"Kana Extended-B: U+AFF0-U+1AFFF", but it should be 
"Kana Extended-B: U+1AFF0-U+1AFFF".

Note: this is an addition (new text) of Unicode 14.0.0

Other Reports

Date/Time: Thu Aug 12 12:26:46 CDT 2021
Name: Assam Association Delhi
Report Type: Error Report
Opt Subject: “Bengali and Assamese” script

Note: This item was directed to the Unicode Consortium staff and has been responded to by the executive director.

The Unicode Consortium
P.O. Box 391476
Mountain View, CA 94039-1476 U.S.A.
+1-408-401-8915

Dear Sir,

Unicode is a game changer in today’s cyberspace, enabling thousands
of languages to interact electronically. However, a few
languages, such as Assamese (India), may still face some issues, which may
kindly be addressed:

a. In the Code Charts at http://www.unicode.org/charts/, the “Bengali and
Assamese” script appears on the home page as one of the South Asian Scripts,
but on the linked page, https://www.unicode.org/charts/PDF/U0980.pdf, it
appears only as “Bengali” (instead of “Bengali and Assamese”). It is
therefore requested that the linked page be updated to read “Bengali and
Assamese” instead of “Bengali” alone.

b. It is felt that merging the Assamese script with Bengali in the same code
chart may create problems in the future, namely:

i. As Assamese characters are not placed in order in the above code chart,
sorting of Assamese words through software would be difficult due to their
disruptive positions.

ii. Transliteration/translation may become difficult due to the lack of a
separate identity for the Assamese script.

iii. The Assamese language may face incompatibility issues in AI, robotics, etc.
for the above two reasons.

If the above fears are well founded, it is requested that appropriate action
be taken for the safety of the Assamese language. Meanwhile, we express our
keenness to work jointly with you to resolve such issues, if any.

Yours sincerely

Dibyojit Dutta
General Secretary, 
Assam Association Delhi


Copy to: a. Secretary, Ministry of Electronics & IT, Govt of India, New Delhi
