L2/23-159

Comments on Public Review Issues
(April 6 - July 4, 2023)

The sections below contain links to permanent feedback documents for the open Public Review Issues, as well as other public feedback received as of July 4, 2023. They cover the period since the previous cumulative document, which was issued prior to UTC #175 (April 2023).

Contents:

The links below go directly to open PRIs and to feedback documents for them, as of July 4, 2023.

Issue Name Feedback Link
482 Proposed Draft UTR #56, Unicode Cuneiform Sign Lists (feedback) No feedback at this time
481 Proposed Update UAX #42, Unicode Character Database in XML (feedback) No feedback at this time
480 Unicode 15.1.0 Beta (feedback)
479 Proposed Update UTS #53, Unicode Arabic Mark Rendering (feedback) No feedback at this time
478 Proposed Update UTS #46, Unicode IDNA Compatibility Processing (feedback) No feedback at this time
477 Proposed Update UTS #10, Unicode Collation Algorithm (feedback) No feedback at this time
476 LDML (UTR#35) Part 7: Keyboards (feedback)
475 Proposed Update UTS #18, Unicode Regular Expressions (feedback) No feedback at this time
474 Draft UTS #55, Unicode Source Code Handling (feedback)
471 Proposed Update UTS #51, Unicode Emoji (feedback)
470 Proposed Update UAX #24, Unicode Script Property (feedback) No feedback at this time
469 Proposed Update UAX #29, Unicode Text Segmentation (feedback)
468 Proposed Update UAX #45, U-source Ideographs (feedback)
467 Proposed Update UAX #38, Unicode Han Database (Unihan) (feedback)
465 Proposed Update UAX #44, Unicode Character Database (feedback)
464 Proposed Update UAX #41, Common References for Unicode Standard Annexes (feedback) No feedback at this time
463 Proposed Update UTS #39, Unicode Security Mechanisms (feedback)
462 Proposed Update UAX #31, Unicode Identifiers and Syntax (feedback)
461 Proposed Update UAX #14, Unicode Line Breaking Algorithm (feedback)
460 Proposed Update UAX #9, Unicode Bidirectional Algorithm (feedback)

The links below go to locations in this document for feedback.

Feedback routed to CJK & Unihan Group for evaluation [CJK]
Feedback routed to Script ad hoc for evaluation [SAH]
Feedback routed to Properties & Algorithms Group for evaluation [PAG]
Feedback routed to Emoji SC for evaluation [ESC]
Feedback routed to Editorial Committee for evaluation [EDC]
Other Reports

 


Feedback routed to CJK & Unihan Group for evaluation [CJK]

Date/Time: Mon May 22 18:07:09 CDT 2023
ReportID: ID20230522180709
Name: Paul Masson
Report Type: Error Report
Opt Subject: kPhonetic for U+645E


This character appears on p.350 of Casey but is not in a phonetic group. It
appears that the appropriate one is 842. This was not added in version 15.
Thank you.

Date/Time: Mon May 22 18:07:42 CDT 2023
ReportID: ID20230522180742
Name: Paul Masson
Report Type: Error Report
Opt Subject: kPhonetic for U+773E


This character is a variant of U+8846. It appears in Casey in the same group
324. Please add this entry to your database. Thank you.

Date/Time: Mon May 22 18:08:30 CDT 2023
ReportID: ID20230522180830
Name: Paul Masson
Report Type: Error Report
Opt Subject: kPhonetic for U+78D7


This character had a kPhonetic value of 269 in version 13, which was changed
in version 14 to 1157*. It disappeared from the database in version 15 when
the latter group was radically pruned, as needed to occur. Please add the
correct entry to the database. Thank you.


Feedback routed to Script ad hoc for evaluation [SAH]

Date/Time: Mon Apr 17 17:12:14 CDT 2023
ReportID: ID20230417171214
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Feedback on L2/23-102

On page 3, the glyph for LATIN SMALL LETTER R WITH LEFT TIE in the code chart 
is a ligature of U+0279 LATIN SMALL LETTER TURNED R and U+0072 LATIN SMALL 
LETTER R. However, that does not match any of the attestations of this character 
in any of the figures in this proposal. Instead, they all consistently make it 
look like U+0072 LATIN SMALL LETTER R with a preceding diagonal stroke. The 
Unicode code chart glyph should match the attested glyphs.

Date/Time: Thu May 18 11:42:42 CDT 2023
ReportID: ID20230518114242
Name: Charlotte Buff
Report Type: Other Document Submission
Opt Subject: On the name of KHITAN SMALL SCRIPT CHARACTER-18CFF

The proposed character U+18CFF KHITAN SMALL SCRIPT CHARACTER-18CFF
(cf. L2/23-065) which was recently accepted for a future version of the
standard is not a normal character of the Khitan small script, but instead
acts as a placeholder for characters that have been lost or are illegible.
I propose changing its name to KHITAN SMALL SCRIPT LOST SIGN to reflect
that special purpose.

Unlike Han or Tangut ideographs, the names of the characters in the Khitan
Small Script block are all explicitly defined in UnicodeData.txt, so I do
not think it is strictly necessary for U+18CFF to also follow the same
algorithmic naming scheme – unless of course some internal tool I am
unaware of requires it, in which case this proposal can be discarded.

Date/Time: Mon Jun 26 12:48:27 CDT 2023
ReportID: ID20230626124827
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Name of U+1CE07

U+1CE07 TOP RIGHT BLACK LEFT-POINTING SMALL TRIANGLE (approved for Unicode 16.0) 
has a glyph in the top left of the cell, according to L2/21-235R. Shouldn’t it be 
named TOP LEFT BLACK LEFT-POINTING SMALL TRIANGLE?

Date/Time: Mon Jun 26 21:54:09 CDT 2023
ReportID: ID20230626215409
Name: David Corbett
Report Type: Other Document Submission
Opt Subject: Feedback on L2/23-147

U+1E6FE TAI YO SYMBOL MEUANG represents “mương”, according to L2/22-289. 
The nucleus “ươ” /ɨə/ is ASCIIfied “UEA” in U+1E6EA TAI YO LETTER UEA. 
Therefore, U+1E6FE should be named “TAI YO SYMBOL MUEANG”.

Date/Time: Mon Jul 03 12:42:20 CDT 2023
ReportID: ID20230703124220
Name: Little Miss MOSFET
Report Type: Error Report
Opt Subject: Duployan Bloc Errors at U1BC00.pdf

Dear Unicode Consortium,

For years, evidently with no report, the Duployan code block has been SNAFU.
I’m merely a user for many years of Duployan to write Chinukwawa, not a
profound techie, so please forgive any issues with my submittal, but I’d
like to press on these errors.

https://www.unicode.org/charts/PDF/U1BC00.pdf 

Lists the Unicode Duployan characters as currently drafted in the standard.
Each character which contains a little arrow is incorrect.  *There are no
little arrows in Duployan.*  These were evidently included by mistake, as
the proposal to include this block described the characters’ kerning
direction using these little arrows.  *These were obviously not intended to
be part of the standard.*  The little arrows describe the characters as
they link together and direction of writing in the unicode inclusion
proposal.  *They are not, nor ever have been, a part of these characters.*
These little arrows should be deleted.

Furthermore: Duployan script works sort of like Arabic when written.  It has
a complex kerning which moves left to right and top to bottom.  There is as
yet no functional font for Duployan, and apparently no description of how
these characters link up in Unicode, though this was described in the
proposal.  *That is to say, the state of Duployan as specified by Unicode
is incomplete and unusable.*  I’d very much like to see this resolved
eventually.  If you need any additional info, please contact me as above.

Thanks,
Little Miss MOSFET

Date/Time: Mon Jul 03 12:52:12 CDT 2023
ReportID: ID20230703125212
Name: Little Miss MOSFET
Report Type: Error Report
Opt Subject: PS on Duployan Bloc Errors - Inclusion Proposal Document

https://www.unicode.org/L2/L2010/10272r-duployan.pdf 

This is Van Anderson’s proposal.  From the textual examples, you can see
that the arrows were not meant to be part of the standard, as they are used
by the author to describe the direction of a character’s writing and
rotation for linkage.  None of the primary sources use these little arrows.
They are not part of Duployan, but an erroneous Unicode Consortium
artifact.  But because of their inclusion, fonts which include Duployan
usually copy these little arrows.

Feedback routed to Properties & Algorithms Group for evaluation [PAG]

Date/Time: Tue May 02 07:23:16 CDT 2023
ReportID: ID20230502072316
Name: Charlotte Buff
Report Type: Other Document Submission
Opt Subject: Text segmentation properties of Kirat Rai vowel signs

The vowel signs of the Kirat Rai script, which has been accepted for a
future version of the Unicode Standard based on proposal document
L2/22-043R, are slated to be implemented as spacing, stand-alone
characters (gc=Lo) rather than as combining or spacing marks. While not
explicitly stated, this would likely result in them being assigned the
Grapheme_Cluster_Break property value Other (GCB=XX). Three of these vowel
signs – AI, O, and AU – are visually sequences of other vowel signs and
have therefore been given canonical decomposition mappings:

	U+16D68 ≡ <U+16D67, U+16D67>	AI ≡ <E, E>
	U+16D69 ≡ <U+16D63, U+16D67>	O  ≡ <AA, E>
	U+16D6A ≡ <U+16D69, U+16D67>	AU ≡ <O, E>

These properties, however, do not maintain canonical equivalence. The vowel
signs in question would be one grapheme cluster each in NFC, but two
grapheme clusters each in NFD. This is forbidden by UAX #29, which states
in section 2, “Conformance”:

	»A boundary exists in text not normalized in form NFD if and only if
	 it would occur at the corresponding position in NFD text.«
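
The mismatch described above can be illustrated with a minimal sketch. It
assumes, as the report does, that every Kirat Rai vowel sign would receive
GCB=Other, so that a boundary falls between any two code points. The
decomposition table is copied from the mappings listed above; the function
names are hypothetical and no Unicode library data is consulted.

```python
# Decomposition mappings copied from the report; everything else is illustrative.
DECOMP = {
    0x16D68: (0x16D67, 0x16D67),  # AI = <E, E>
    0x16D69: (0x16D63, 0x16D67),  # O  = <AA, E>
    0x16D6A: (0x16D69, 0x16D67),  # AU = <O, E>
}


def full_decompose(cps):
    """Apply the mappings recursively, as canonical decomposition does."""
    out = []
    for cp in cps:
        if cp in DECOMP:
            out.extend(full_decompose(DECOMP[cp]))
        else:
            out.append(cp)
    return out


def count_clusters_all_other(cps):
    """Assumption: all code points are GCB=Other, so there is a break
    between every pair and each code point is its own cluster."""
    return len(cps)


for cp in sorted(DECOMP):
    decomposed = full_decompose([cp])
    print(
        f"U+{cp:04X}: {count_clusters_all_other([cp])} cluster precomposed, "
        f"{count_clusters_all_other(decomposed)} clusters decomposed"
    )
```

Because the mapping for AU refers to O, which itself decomposes, the sketch
expands AU to <AA, E, E>. In every case the cluster count of the decomposed
form differs from that of the precomposed form under the stated assumption.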

There are several possible approaches for resolving this issue:

1) Reclassify Kirat Rai vowel signs as spacing, combining marks

	A minimal solution that preserves canonical equivalence for both
	legacy and extended grapheme clusters would involve U+16D67
	KIRAT RAI VOWEL SIGN E and U+16D68 KIRAT RAI VOWEL SIGN AI
	being changed to Grapheme_Cluster_Break=Extend
	(GCB=EX). Though not strictly necessary, it would then also
	make sense to change their General_Category value to
	Spacing_Mark (gc=Mc). 
	
	This approach may not be desirable because it would prevent vowel
	signs E and AI from being used in isolation; they would
	always forcibly “glue” themselves to the preceding character
	such as a space or a punctuation mark and potentially cause
	problems for the text renderer. The stand-alone nature of
	the Kirat Rai vowel signs was quite a deliberate choice
	because of the similarities to the New Tai Lue script.

2) Invent new GCB rules for these vowel signs

	The text segmentation algorithm would need to be amended to make
	Kirat Rai vowel signs similar in nature to Hangul Jamo –
	forming grapheme clusters with each other in certain
	configurations, but not with unrelated characters. For
	minimal impact, the new rule should be limited to the
	interaction between vowel signs E, AA, and O followed
	directly by vowel sign E, which covers all three
	decomposition mappings. It could look something like this:
	
		[\u{16d63}\u{16d67}\u{16d69}] × \u{16d67}
	
	Note that U+16D67 occurs on both sides of the rule because it is
	both the leading and the trailing codepoint in the
	decomposition mapping of U+16D68.
	
	This approach is probably a cleaner solution because it gets rid of
	the problem without changing anything about the general
	nature of the script, but it also introduces a unique edge
	case into an otherwise quite straightforward algorithm for
	the sake of just a handful of characters.

3) Change decompositions from canonical to compatibility

	There is no requirement for compatibility decompositions to preserve
	the text segmentation boundaries of their source strings. In
	practice, users of the script would always encounter the
	vowel signs in precomposed form because NFKC and NFKD are
	generally not used on the front end, while search and
	collation algorithms would still be able to recognise the
	weak equivalence.
	
	However, it is questionable whether using mere compatibility
	equivalence for sequences that are truly identical in every
	sense is appropriate, especially in the context of security.

4) Do not encode compound vowel signs as separate characters

	The characters U+16D68..U+16D6A would be removed from the Kirat Rai
	repertoire altogether and the only way to represent vowel
	signs AI, O, and AU would be through the use of sequences.
	Perhaps named character sequences could be defined as well
	if deemed useful.
	
	This approach would circumvent the entire issue without side
	effects, but is also clearly the least desirable for actual
	users of the script who consider these vowel signs to be
	linguistic units regardless of their glyphic appearance. I
	do not think this would be an acceptable solution in practice.

5) Encode the vowel signs as atomic characters without decomposition mappings

	This approach is the worst one in my view as it would necessitate
	the creation of dreaded Do Not Use tables for the Kirat Rai
	script, which goes against everyone’s interests. I strongly
	recommend against this solution.
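
As a rough illustration of approach 2 above, the following sketch applies
the single no-break rule quoted there on top of a "break everywhere"
default. It is not the UAX #29 algorithm: all other rules are omitted, every
code point is assumed to be GCB=Other, and the function and constant names
are hypothetical. It checks only the three decompositions listed in the
report, not arbitrary surrounding context.

```python
AA, E, AI, O, AU = 0x16D63, 0x16D67, 0x16D68, 0x16D69, 0x16D6A

# Left-hand side of the proposed rule: [AA E O] x E
NO_BREAK_BEFORE_E = {AA, E, O}


def clusters(cps):
    """Split a list of code points into clusters.

    Default: break between every pair of code points (assumed GCB=Other).
    Exception: no break before E when the previous code point is AA, E or O.
    """
    result = []
    for cp in cps:
        if result and cp == E and result[-1][-1] in NO_BREAK_BEFORE_E:
            result[-1].append(cp)
        else:
            result.append([cp])
    return result


# Precomposed form and fully decomposed form of each compound vowel sign.
forms = {
    "AI": ([AI], [E, E]),
    "O": ([O], [AA, E]),
    "AU": ([AU], [AA, E, E]),
}

for name, (composed, decomposed) in forms.items():
    print(name, len(clusters(composed)), len(clusters(decomposed)))
```

With the rule in place, each precomposed sign and its decomposed sequence
both form a single cluster, while a vowel sign E after an unrelated
character such as a space still starts a new cluster.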

Date/Time: Tue Jun 13 08:12:45 CDT 2023
ReportID: ID20230613081245
Name: Jae Woong Lee
Report Type: Error Report
Opt Subject:


Hello,

I am using unicode 9.0 with mysql 8.0 database.
collation name: utf8mb4_0900_ai_ci
I can't get the desired result when I compare the Korean string using unicode 9.0.
unicode 9.0 considers separated characters and combined characters as the same thing.

ex)
- 요 = 요 -> result True : correct
- 요 = ㅇㅛ -> result True : This is an invalid result.

But if I use other collations, utf8mb4_general_ci, utf8mb4_unicode_ci, 
I get the correct result.

ex)
- 요 = 요 -> result True : correct
- 요 = ㅇㅛ -> result False : correct

It seems that the Korean comparison method is different from 9.0. I'm
wondering why characters that look different to Koreans are called the same
in unicode 9.0. Is this by design or is it a bug and can it be fixed? I
contacted mysql, but they told me that it's not a mysql issue, but to
contact the unicode association because they used unicode 9.0 as it is.

-----------------------------------
[9 Jun 16:10] MySQL Verification Team
Hi,

You can observe that collating the constants changes the result.
You can try different COLLATE expressions.
Regarding the Korean language, we are not experts on this. 
We just implemented the UTF standard, to the last point.
Hence, you should contact the people that define Unicode standards.
Also, do not forget that two strings with different grapheme clusters 
can be considered identical, as per standard. There are many examples in 
the textbooks on this subject.

Not a bug.
-----------------------------------



Regards,
Jae.
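
The relationship between the two strings in the report above can be
inspected with Python's standard unicodedata module. This is only a sketch:
it does not reproduce MySQL's collation, and the variable and helper names
are hypothetical. It shows that the precomposed syllable and the pair of
compatibility jamo are different strings that reduce to the same conjoining
jamo once decomposed, which is one plausible reason why an accent- and
case-insensitive collation might treat them as equal.

```python
import unicodedata

syllable = "\uc694"           # HANGUL SYLLABLE YO
compat_jamo = "\u3147\u315b"  # HANGUL LETTER IEUNG + HANGUL LETTER YO


def show(s):
    """Render a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(c):04X}" for c in s)


# Canonical decomposition of the syllable into conjoining jamo.
nfd_syllable = unicodedata.normalize("NFD", syllable)
# Compatibility decomposition of the two stand-alone letters.
nfkd_jamo = unicodedata.normalize("NFKD", compat_jamo)

print("syllable, NFD:     ", show(nfd_syllable))
print("compat jamo, NFKD: ", show(nfkd_jamo))
print("equal as written:  ", syllable == compat_jamo)
print("equal decomposed:  ", nfd_syllable == nfkd_jamo)
```

A binary comparison of the strings as written reports them as different,
whereas a comparison after compatibility decomposition does not, so the
outcome depends on which level of difference the chosen collation ignores.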



Feedback routed to Emoji SC for evaluation [ESC]

(None at this time.)


Feedback routed to Editorial Committee for evaluation [EDC]

Date/Time: Thu Jun 01 05:25:23 CDT 2023
ReportID: ID20230601052523
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Core Specification

The Lao chapter in the Core Spec is missing any information on spacing. I believe at 
minimum we need to copy some of the information from the Thai section or refer to 
the Thai section about spacing.

This came to light because of a comment made by Norbert Lindenberg that suggested to 
me U+200B is also used in Lao. But there is no such reference in the Core Spec.

Other Reports

(None at this time.)