L2/24-223

Comments on Public Review Issues
(July 7 - October 24, 2024)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of October 24, 2024, since the previous cumulative document was issued prior to UTC #180 (July 2, 2024).

Contents:

The links below go directly to open PRIs and to feedback documents for them, as of October 24, 2024.

Issue Name Feedback Link
508 Proposed Update UAX #38, Unicode Han Database (Unihan) (feedback)

The links below go to locations in this document for feedback.

Feedback routed to CJK & Unihan Working Group for evaluation [CJK]
Feedback routed to Script Encoding Working Group for evaluation [SEW]
Feedback routed to Properties & Algorithms Working Group for evaluation [PAG]
Feedback routed to Emoji Standard & Research Working Group for evaluation [ESR]
Feedback routed to Editorial Working Group for evaluation [EDC]
Other Reports

 


Feedback routed to CJK & Unihan Working Group for evaluation [CJK]

Date/Time: Sun Jul 07 18:01:08 CDT 2024
ReportID: ID20240707180108
Name: Paul Masson
Report Type: Error Report
Opt Subject: Variants for U+6784 构

This character is listed as its own simplified and traditional variant. 
That is just simply wrong.

Date/Time: Sun Jul 07 21:44:35 CDT 2024
ReportID: ID20240707214435
Name: Paul Masson
Report Type: Error Report
Opt Subject: Variants for U+5978 奸

This character is listed as its own simplified and traditional variant. 
That is just simply wrong.

Date/Time: Mon Jul 08 19:14:05 CDT 2024
ReportID: ID20240708191405
Name: Paul Masson
Report Type: Error Report
Opt Subject: Variants for U+575B 坛

This character is listed as its own simplified and traditional variant. 
That is just simply wrong.
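
The three reports above all describe the same pattern, which could be searched for mechanically. A minimal sketch, assuming the tab-separated layout of Unihan_Variants.txt (code point, property, space-separated values); the sample lines are modeled on the reports and are not copied from the data file, and the function name self_variants is hypothetical:

```python
from collections import defaultdict

# Illustrative sample only: the first two lines mimic what the reports describe.
SAMPLE = (
    "U+6784\tkSimplifiedVariant\tU+6784\n"
    "U+6784\tkTraditionalVariant\tU+6784\n"
    "U+8AE9\tkSimplifiedVariant\tU+2C8F2\n"
    "U+2C8F2\tkTraditionalVariant\tU+8AE9\n"
)

def self_variants(text):
    """Return code points that list themselves under both variant properties."""
    seen = defaultdict(set)
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cp, prop, values = line.split("\t", 2)
        if prop in ("kSimplifiedVariant", "kTraditionalVariant") and cp in values.split():
            seen[cp].add(prop)
    return sorted(cp for cp, props in seen.items() if len(props) == 2)

print(self_variants(SAMPLE))  # ['U+6784']
```

Whether such self-references are errors or intentional is a question for the working group; the sketch only finds them.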

Date/Time: Thu Aug 08 23:14:55 CDT 2024
ReportID: ID20240808231455
Name: Eduardo Marín Silva
Report Type: Public Review Issue
Opt Subject: On the addition of an extra kStrange entry

Character U+3106C has an entry for the kStrange Unihan property, but only for 
being "Stroke Heavy". The fact that its top and bottom rows do not stretch 
to fill the character cell makes it, in my opinion, also a good candidate for the 
"Unusual Arrangement or Structure" category.

Date/Time: Sat Aug 10 13:51:55 CDT 2024
ReportID: ID20240810135155
Name: M
Report Type: FAQ Suggestion
Opt Subject: Unihan Database Properties (kGB7)

I wonder why only 42 characters have the property kGB7

Date/Time: Mon Aug 26 14:59:02 CDT 2024
ReportID: ID20240826145902
Name: Ken Lunde
Report Type: Error Report
Opt Subject: Unihan database error

The kRSUnicode and kTotalStrokes property values for U+23D92 𣶒 are both incorrect. 
Instead of being 85.8 and 11 respectively, their property values should be 2.8 and 9, 
respectively.

Date/Time: Wed Sep 11 18:35:02 CDT 2024
ReportID: ID20240911183502
Name: Ryusei Yamaguchi
Report Type: Error Report
Opt Subject: the code charts for Unicode version 16.0.0

In the code charts for Unicode version 16.0.0, the glyphs in the J column for the 
following characters do not match the corresponding J-source codes.

"code point","kIRG_JSource","glyph in the chart"
"U+2D0B2","JMJ-059372","MJ068097"
"U+2D4F1","JMJ-059505","MJ068098"
"U+2EA41","JMJ-060341","MJ068100"

Date/Time: Sat Sep 21 06:42:45 CDT 2024
ReportID: ID20240921064245
Name: Andrew West
Report Type: Error Report
Opt Subject: Unihan_Variants.txt

In Unihan_Variants.txt there are these two entries:

U+8AE9	kSimplifiedVariant	U+2C8F2
U+2C8F2	kTraditionalVariant	U+8AE9

However, U+8AE9 諩 is a variant form of U+8B5C 譜, and does not simplify to
U+2C8F2 𬣲. On the other hand, the correct traditional mapping for U+2C8F2 𬣲
is U+8A81 誁.

Therefore remove these two entries:
U+8AE9	kSimplifiedVariant	U+2C8F2
U+2C8F2	kTraditionalVariant	U+8AE9

And add these three entries:
U+8A81	kSimplifiedVariant	U+2C8F2
U+8AE9	kSemanticVariant	U+8B5C
U+2C8F2	kTraditionalVariant	U+8A81

Date/Time: Sat Sep 21 07:07:19 CDT 2024
ReportID: ID20240921070719
Name: Andrew West
Report Type: Error Report
Opt Subject: Unihan_Variants.txt

In Unihan_Variants.txt there are these two entries:

U+292CC	kSimplifiedVariant	U+31071
U+31071	kTraditionalVariant	U+292CC

However, U+292CC 𩋌 (⿰革易) does not simplify to U+31071 𱁱, which is the
simplified form of U+292EC 𩋬 (⿰革昜).

Therefore remove these two entries:
U+292CC	kSimplifiedVariant	U+31071
U+31071	kTraditionalVariant	U+292CC

And add these two entries:
U+292EC	kSimplifiedVariant	U+31071
U+31071	kTraditionalVariant	U+292EC
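
The entries proposed in the two reports above come in reciprocal pairs (a kSimplifiedVariant entry matched by a kTraditionalVariant entry in the other direction). A minimal sketch of a consistency check, assuming that such reciprocity is expected; this is an assumption made for illustration, not a rule stated in these reports, and the function name unreciprocated is hypothetical:

```python
# Entries proposed for addition in the two reports above.
PROPOSED = [
    ("U+8A81", "kSimplifiedVariant", "U+2C8F2"),
    ("U+8AE9", "kSemanticVariant", "U+8B5C"),
    ("U+2C8F2", "kTraditionalVariant", "U+8A81"),
    ("U+292EC", "kSimplifiedVariant", "U+31071"),
    ("U+31071", "kTraditionalVariant", "U+292EC"),
]

def unreciprocated(entries):
    """Return (source, target) pairs whose reverse mapping is missing."""
    simp = {(s, t) for s, p, t in entries if p == "kSimplifiedVariant"}
    trad = {(s, t) for s, p, t in entries if p == "kTraditionalVariant"}
    # A simplified mapping (a, b) expects a traditional mapping (b, a), and vice versa.
    missing = {(a, b) for a, b in simp if (b, a) not in trad}
    missing |= {(a, b) for a, b in trad if (b, a) not in simp}
    return sorted(missing)

print(unreciprocated(PROPOSED))  # []
```

Other properties, such as kSemanticVariant, are ignored by this check.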

Date/Time: Sun Oct 06 02:56:51 CDT 2024
ReportID: ID20241006025651
Name: Philippe Verdy
Report Type: Error Report
Opt Subject: /Public/UCD/latest/ucd/CJKRadicals.txt

Note: The editors have evaluated and responded to this report. No further UTC action is necessary.

There's a missing entry in CJKRadicals.txt (https://www.unicode.org/Public/UCD/latest/ucd/CJKRadicals.txt) for
an 'unencoded' CJK radical present in the composition of unified ideographs
for modern Chinese, and only represented by the CJK unified ideograph
U+9FBA (龺).

It is an additional variant of the Kangxi radical 159 U+2F9E (⾞), i.e. a
narrowed form of Unified ideograph 𠦝 (U+2099D) used on the left side.

  ...
  158; 2F9D; 8EAB
  159; 2F9E; 8ECA
  159'; 2ECB; 8F66
+ 159''; ; 9FBA
  160; 2F9F; 8F9B
  ...

It should be listed, just like the three other unencoded non-Kangxi CJK
radicals for variants of Kangxi radicals:

  ...
  181'; 2EDA; 9875
  182; 2FB5; 98A8
  182'; 2EDB; 98CE
* 182''; ; 322C4
  183; 2FB6; 98DB
  ...
  207; 2FCE; 9F13
  208; 2FCF; 9F20
* 208''; ; 9F21
  209; 2FD0; 9F3B
  ...
  211; 2FD2; 9F52
  211'; 2EEE; 9F7F
  211''; 2EED; 6B6F
  212; 2FD3; 9F8D
  212'; 2EF0; 9F99
  212''; 2EEF; 7ADC
* 212'''; ; 31DE5
  213; 2FD4; 9F9C
  213'; 2EF3; 9F9F
  213''; 2EF2; 4E80
  ...
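
The listing above follows a three-field, semicolon-separated layout: radical number, CJK radical code point (possibly empty), and unified ideograph. A minimal sketch of how the entries with an empty second field could be found, assuming that layout and using only lines quoted in this report; the function name parse_radicals is hypothetical:

```python
# Lines taken from the excerpt above (without the "*" markers).
SAMPLE = """\
159; 2F9E; 8ECA
159'; 2ECB; 8F66
182''; ; 322C4
208''; ; 9F21
212'''; ; 31DE5
"""

def parse_radicals(text):
    """Parse 'number; radical; ideograph' lines; an empty radical becomes None."""
    rows = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and padding
        if not line:
            continue
        number, radical, ideograph = (field.strip() for field in line.split(";"))
        rows.append((number, radical or None, ideograph))
    return rows

# Radical variants that have no encoded CJK radical character of their own.
unencoded = [(n, i) for n, r, i in parse_radicals(SAMPLE) if r is None]
print(unencoded)
```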

Additional question:

Shouldn't these four non-Kangxi CJK radicals (159'', 182'', 208'', 212''')
be encoded? For example, in the existing block 2E80-2EFF CJK Radicals
Supplement (where U+2E9A and U+2EF4-2EFF are still unassigned)?

And then shouldn't the existing IDS (for composite ideographs using them)
be updated in Unihan to preferably use these four new radicals
(where appropriate), rather than their associated unified ideographs
(respectively U+9FBA, U+322C4, U+9F21, U+31DE5)?

All this should be done within the existing framework for better
radical-stroke indexes, which will use the new properties added in the
recently released Unicode 16.0.

Feedback routed to Script Encoding Working Group for evaluation [SEW]

Date/Time: Mon Jul 15 14:29:00 CDT 2024
ReportID: ID20240715142900
Name: David Corbett
Report Type: Error Report
Opt Subject: L2/24-182

The problem statement in L2/24-182 says a few things about U+20DD in fonts that are not true.

> The problem with using U+20DD is that it cannot adjust the advance width of the character 
that it encloses, with the consequence that without manual spacing or kerning it will overstrike 
a preceding character.

U+20DD, like any character, can adjust the advance widths of other characters using contextual 
positioning.

> The only non-manual solution would be the impractical one of creating a specialty font with 
a substitution character for every combination of IPA letter and ◌⃝,

A font does not need to use ligature substitutions as this sentence claims. Positioning the 
circle is akin to kerning which is already common in fonts and can be automated.

> with internal anchor points for diacritics that would now need to be input after the circle.

Diacritics could be input before or after the circle. Contextual positioning of U+20DD can 
easily skip intervening marks.

I am not against the proposal itself, but the proposal should not use these reasons as 
motivation. The proposal seems to be saying it would be hard to implement a font with the 
proper rendering using U+20DD, so the solution is two new characters with which it would 
also be hard to implement the proper rendering. One good reason for the new characters is 
that they can encircle multiple bases, which U+20DD can’t (since U+034F was changed to 
not support this use case).

Date/Time: Mon Jul 01 11:45:31 CDT 2024
ReportID: ID20240701114531
Name: Guru Prasad
Report Type: Public Review Issue
Opt Subject: 502

Tulu Tigalari adopted for modern Tulu & manuscripts 
Followup suggestion to 
L2/22-068 Apr 15, 2022 response to L2/22-075 

Issue with changing consonant addition using halanth used in all Indic languages to 
a new symbol like a variance of wingding suggested in L2/22-031 and response L2/22-068.

1. Kindly consider using Nukta or another Unicode assignment for Visible Virama, leaving 
the current invisible virama as is, so that legacy documents, typing, and transliteration 
to and from Tulu-Tigalari continue to work with ease. 

Date/Time: Thu Jul 18 15:26:37 CDT 2024
ReportID: ID20240718152637
Report Type: Public Review Issue [SEW]
Name: Philippe Verdy
Opt Subject: 502

Note: This has already been fixed in a subsequent draft.

Minor editorial issue:

The following grouping is used in the current beta charts and name lists for the 
Garay Block (in Unicode 16.0 Draft Public Review):
  ; Marks
  10D6A GARAY CONSONANT GEMINATION MARK
  10D6B GARAY COMBINING DOT ABOVE
  10D6C GARAY COMBINING DOUBLE DOT ABOVE
  ; Punctuation and reduplication mark
  10D6D GARAY CONSONANT NASALIZATION MARK
  10D6E GARAY HYPHEN
  10D6F GARAY REDUPLICATION MARK

However 10D6D GARAY CONSONANT NASALIZATION MARK should be under "Marks" (like 
10D6A GARAY CONSONANT GEMINATION MARK), not under "Punctuation and reduplication mark"

Date/Time: Fri Jul 19 09:48:15 CDT 2024
ReportID: ID20240719094815
Name: Philippe Verdy
Report Type: Public Review Issue [SEW]
Opt Subject: 502

Representative glyph for 18CFF (KHITAN SMALL SCRIPT CHARACTER-18CFF)

The current draft chart indicates that this is representing a missing or
illegible character (this is then intended for long term usage in encoded
texts, for reference, rather than inserting "educated guesses").

However the representative glyph for now just shows a basic square, which
looks too much like "tofu" (used when there's no font available, and where an
alternate representation using graphics could be used, e.g. on the web), or
like a regular geometric shape.

We have more than enough regular rectangular shapes in Unicode. Let's not abuse
it for something intended to be unreadable/obscure. Older terminal
protocols used black squares or checkerboards, or patterns, or some
bordered or hollow question mark.

My opinion is that this glyph should better be some irregular (not purely
rectangular) shape (e.g. with some missing corners), like a partially burnt
paper sheet, and with dashed or dotted borders possibly filled with
irregular checkerboard or pseudo-random dots or strokes (not near the
damaged corners/borders where they could be bolder or could simulate a
shadowing effect).

A question mark (possibly rotated or mirrored) may also be added on top of
that shape.

Another good glyph could be a backward slanted mirrored question mark,
hollowed, or inverted inside a "warning triangle" or some irregular dotted
rectangle (possibly not fully closed, with a missing corner at the bottom
right). It should however adopt the ideographic metrics of other Khitan
letters.

We should be more imaginative, while avoiding visual confusion with other
regular characters (from any script or set of symbols).

Date/Time: Fri Aug 09 15:44:45 CDT 2024
ReportID: ID20240809154445
Name: Eduardo Marín Silva
Report Type: Public Review Issue
Opt Subject: On the newly approved Greek characters

Recently new Greek letters and modifier letters were approved for phonetic
notation to be included in the Latin Extended G block (next to related IPA
characters). I advise that these characters be reassigned to the Greek and
Coptic block, as well as possibly the Greek Extended block, in the following
way: the three letters with palatal hook can be placed in the 0380-0382 and
the two modifier letters can go in 1F7E-1F7F or alternatively in
0378-0379. 

While placing Greek letters along with Latin letters has been done before,
that block was under the generic name of Phonetic Extensions, named that
way precisely because letters of different scripts could occupy it. 

While the risk of confusion is minor, I don't believe it's worth breaking
with precedent when a more elegant solution is available.

The modifier letters in particular, are bound to have a larger demand due to
them being superscript versions of letters in the basic Greek alphabet. So
it would be quite odd to find them in a Latin specific block that is not
even in the BMP. 

Date/Time: Fri Aug 09 15:54:24 CDT 2024
ReportID: ID20240809155424
Name: Eduardo Marín Silva
Report Type: Public Review Issue
Opt Subject: On the newly approved Hiragana ligature

Three Kana Ligatures have been approved and assigned into one of the Kana
Extension blocks. I advise that the Hiragana Digraph Koto be reassigned to
3040 in the main Hiragana block. While I would suggest the same for the
Katakana ligatures, unfortunately the Katakana block is fully occupied. 

Date/Time: Wed Aug 14 06:16:38 CDT 2024
ReportID: ID20240814061638
Name: Charlotte Buff
Report Type: Error Report
Opt Subject: L2/24-080

U+1AE9 was accepted for a future version under the name COMBINING LEFT ANGLE 
CENTERED ABOVE (cf. 179-C58). For consistency with existing character names 
(which use British spelling), the name should be spelled COMBINING LEFT ANGLE 
*CENTRED* ABOVE instead.

Feedback routed to Properties & Algorithms Working Group for evaluation [PAG]

Date/Time: Wed Jul 31 03:01:40 CDT 2024
ReportID: ID20240731030140
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UAX #14: Unicode Line Breaking Algorithm

https://www.unicode.org/reports/tr14/#CJ 
Version: Unicode 15.1.0
Date: 2023-08-15
Revision: 51

Location:
5.1 Description of Line Breaking Properties
CJ: Conditional Japanese Starter

Problematic text:
CSS Text Level 3 (which supports Japanese line layout) defines three distinct values for its line-break behavior:
• strict, typically used for long lines
• normal (CSS default), the behavior typically used for books and documents
• loose, typically used for short lines such as in newspapers

Possible correction:
Delete "(CSS default)".

Explanation:
In CSS, at least in the current CSS Text Level 3 Candidate Recommendation,
and the latest CSS Text Level 4 Working Draft, the default line-break
behavior is not "normal". It is "auto", which basically means the browser
can do whatever it wants by default. Indeed, my Firefox by default does not
break before small hiragana. It does when "line-break: normal" is
explicitly specified.

https://www.w3.org/TR/css-text-3/#line-break-property 
https://www.w3.org/TR/2024/WD-css-text-4-20240529/#line-break-property 

Date/Time: Wed Jul 31 08:12:26 CDT 2024
ReportID: ID20240731081226
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UAX #14: Unicode Line Breaking Algorithm

https://www.unicode.org/reports/tr14/#LB9 
Version: Unicode 15.1.0
Date: 2023-08-15
Revision: 51

Location: 6.1 Non-tailorable Line Breaking Rules
[LB9] "Treat X (CM | ZWJ)* as if it were X (where X is any line break class except BK, CR, LF, NL, SP, or ZW)."
[LB12] "GL ×"

Problem:
U+034F COMBINING GRAPHEME JOINER is in Mn, but its line breaking class is GL, not CM. 
This causes unexpected behavior when CGJ is used in the middle of a combining character sequence.

Take the following two sequences:
(1) <u, COMBINING DIAERESIS, EM DASH>
(2) <u, CGJ, COMBINING DIAERESIS, EM DASH>
In (1), a line break is allowed before EM DASH (which has line breaking class B2).
In (2), LB9 applies with CGJ taking the place of X, then LB12 kicks in to forbid a line break before the EM DASH.
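
The two sequences can be traced with a deliberately reduced model. A minimal sketch that considers only LB9 and LB12 and treats every other boundary as a break opportunity; the class table is hard-coded from the classes named in this report rather than read from LineBreak.txt, and the function names are hypothetical:

```python
import unicodedata

# Line break classes as stated in the report (assumed, not looked up).
LB_CLASS = {"u": "AL", "\u0308": "CM", "\u034F": "GL", "\u2014": "B2"}
NO_ABSORB = {"BK", "CR", "LF", "NL", "SP", "ZW"}

def resolve_lb9(classes):
    """Collapse X (CM | ZWJ)* into X, as in LB9; returns one class per unit."""
    out = []
    for c in classes:
        if c in ("CM", "ZWJ") and out and out[-1] not in NO_ABSORB:
            continue  # absorbed into the preceding unit
        out.append(c)
    return out

def break_before_last(text):
    """True if a break is allowed before the final unit, using LB9 and LB12 only."""
    units = resolve_lb9([LB_CLASS[ch] for ch in text])
    return units[-2] != "GL"  # LB12: GL x (no break after GL)

print(unicodedata.category("\u034F"))           # Mn
print(break_before_last("u\u0308\u2014"))       # True:  sequence (1)
print(break_before_last("u\u034F\u0308\u2014")) # False: sequence (2)
```

A full implementation applies many more rules; the sketch only shows how the diaeresis is absorbed into the CGJ, so that LB12 then applies to the EM DASH.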

How I came up with the example: Section 23.2 "Layout Controls" of the
Unicode Standard explicitly mentions the use of CGJ in German text to make
a distinction between u-umlaut (which is sorted like <u,e>) and
u-diaeresis (which is sorted like “u” with a secondary weight). The
distinction is purely for collation and it doesn't make sense for such CGJ
to affect line breaking behavior after the umlaut/diaeresis.

This is impossible to solve without separating CGJ in a different line
breaking class from NBSP (currently both are GL). To see this, observe that
in sequence (2) above, if NBSP were used in place of CGJ, the suppression
of the line break before EM DASH is exactly the expected behavior.

This is also impossible to solve by tailoring, as CM and GL are
non-tailorable classes, and LB9 and LB12 are non-tailorable rules.

While at it, I will also point out a typo:
[LB10] "Treat any remaining CM or ZWJ as it if were AL."
In this definition, the order of "it" and "if" should be reversed.

Date/Time: Thu Aug 01 09:18:31 CDT 2024
ReportID: ID20240801091831
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UAX #14: Unicode Line Breaking Algorithm

https://www.unicode.org/reports/tr14/#LB15b 
Version: Unicode 15.1.0
Date: 2023-08-15
Revision: 51

Location: LB15a, LB15b

I found the following document which describes these new rules:
https://www.unicode.org/L2/L2023/23063-break-quot-mark.pdf 

Reading through it, it seems that the inclusion of WJ and SY in LB15b
(but not in LB15a) might have been accidental, and not really intended by
the author. Perhaps it is an artifact of importing the rules from another
representation.

Regarding WJ, it seems strange that SP×Pf×WJ, i.e. that WJ should
act-at-a-distance across the quotation mark. If somebody actually used WJ
after Pf, they probably intended to prevent a line break to the right of
Pf, not to the left. Yes, such WJ is redundant in the current version of
the algorithm, but implementations deviate (especially Far Eastern
implementations tend to allow line breaks much more often), so the WJ might
be there in the text for a valid real-world reason. Given that SP×Pf×WJ
doesn't seem to have any merit for French (somebody able to type WJ in
French could just type <SP,WJ,Pf>, after all), I believe WJ should
not be included in LB15b. Including it in LB15b penalizes a user who is
mindful about their line breaks (explicitly using WJ), for the sake of
somebody who is not careful enough to put the WJ at the correct place.

Regarding SY, the slash »/« is often used in Unix paths, such as »/usr/bin«.
I am not familiar with the particulars of French usage, but does it occur «
comme ça »/ frequently enough (without a space before the slash) to merit
inclusion in LB15b? If it does, then it probably also occurs with the same
frequency /« comme ça », so it doesn't make sense to include it in LB15b
but not in LB15a.

If WJ and SY are included in LB15b purely for a technical reason (to ease
implementations using a particular kind of software), and that reason is
important enough to merit complicating the user-facing semantics of WJ,
then this should probably be stated in the text.

Date/Time: Mon Aug 05 05:53:22 CDT 2024
ReportID: ID20240805055322
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UAX #14: Unicode Line Breaking Algorithm

https://www.unicode.org/reports/tr14/#Examples 
Version: Unicode 15.1.0
Date: 2023-08-15
Revision: 51

Location: 8.2 Examples of Customization, Example 7

Problematic text:

The tailoring can be accomplished by first segmenting the text into grapheme
clusters according to the rules defined in UAX #29, and then finding line
breaks according to the default line break rules, as follows: After
applying the mandatory line break rules, give each grapheme cluster the
line breaking class of its first code point.

Explanation:

This text was changed recently to avoid recommending a non-conforming tailoring:

https://www.unicode.org/L2/L2022/22244-utc173-properties-recs.pdf 

I agree that with this change the UAX no longer formally contradicts itself,
but it still doesn't mean the approach gives sensible results.

Here is an example of misbehavior if the wording of the problematic text is
taken at face value:

<U+1112,U+1161,U+11AB, U+1100,U+1173,U+11AF> (literally: 한글)

These are two Korean syllables, each composed of three code points: a
leading consonant, a vowel, and a trailing consonant. Segmenting into
grapheme clusters will produce two clusters, one for each syllable. If, as
the text suggests, we give each cluster the line breaking class of its
first code point, this would give each cluster the incorrect line breaking
class JL (the class for leading consonants) instead of the correct H3
(the class for three-component syllables). Since the line breaking
algorithm does not allow line breaks between leading consonants, there will
be no line breaks in the entire sequence.

Now these are just two Korean syllables, so the missed line breaking
opportunity between them may not matter, but the same logic holds for an
arbitrarily long sequence of Korean syllables, potentially forbidding any
line breaks in a long run of Korean text.
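
The Korean example can be reproduced with a small model. A minimal sketch, assuming hand-segmented grapheme clusters, line break classes for the Hangul Jamo block only, and a pair table covering just the Korean syllable rule (LB26 in UAX #14 revision 51); all other boundaries are treated as break opportunities, and the names used are hypothetical:

```python
import unicodedata

# Two grapheme clusters, segmented by hand for this sketch.
CLUSTERS = ["\u1112\u1161\u11AB", "\u1100\u1173\u11AF"]

def jamo_class(ch):
    """Line break class for the Hangul Jamo block only (assumed ranges)."""
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "JL"
    if 0x1160 <= cp <= 0x11A7:
        return "JV"
    if 0x11A8 <= cp <= 0x11FF:
        return "JT"
    raise ValueError(f"not a conjoining jamo: U+{cp:04X}")

# Class pairs that must not be separated, per the Korean syllable rule.
NO_BREAK = {
    ("JL", "JL"), ("JL", "JV"), ("JL", "H2"), ("JL", "H3"),
    ("JV", "JV"), ("JV", "JT"), ("H2", "JV"), ("H2", "JT"),
    ("JT", "JT"), ("H3", "JT"),
}

def break_allowed(before, after):
    return (before, after) not in NO_BREAK

# The tailoring as worded: each cluster takes the class of its first code point.
first_classes = [jamo_class(cluster[0]) for cluster in CLUSTERS]
print(first_classes)                  # ['JL', 'JL']
print(break_allowed(*first_classes))  # False: no break between the syllables
print(break_allowed("H3", "H3"))      # True: what the syllable class would give
print(unicodedata.normalize("NFC", "".join(CLUSTERS)))
```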

Another possible example of misbehavior is a sequence of several Emoji
flags, e.g. <RI,RI, RI,RI>. Segmenting into grapheme clusters will
group together pairs of Regional Indicators, then giving each pair the line
breaking class RI will result in prohibition of line breaks between
pairs-of-pairs. This is probably not what was intended.

I have not worked out the details for cases of
Grapheme_Cluster_Break=Prepend, but they should probably be verified, and
then again for each new update of UAX #29, because the segmentation logic
tends to get more and more complicated over the years.

In summary, I think it is better not to mislead the reader that it is a
simple matter to tailor the line breaking algorithm to work sensibly on
grapheme cluster boundaries. Either a complete working solution should be
offered, or the reader should be warned of the existence of potential
problems.

Date/Time: Mon Aug 05 06:23:35 CDT 2024
ReportID: ID20240805062335
Name: Rossen Mikhov
Report Type: Error Report
Opt Subject: UAX #14: Unicode Line Breaking Algorithm

https://www.unicode.org/reports/tr14/#Examples 
Version: Unicode 15.1.0
Date: 2023-08-15
Revision: 51

Location: 8.2 Examples of Customization, Example 7

I would like to add to the feedback that I submitted on this topic a few
minutes ago.

Maybe a workable approach would be:

1. Run both the segmentation algorithm and the line breaking algorithm in
parallel, unmodified.

2. Delete the line breaking opportunities that happen to fall within
grapheme clusters.

If 2. deletes a non-tailorable line breaking opportunity (produced by rules
LB2-LB12), then this means the problem is impossible to solve in the first
place.

It would be nice to also verify that it is impossible for 2. to delete too
many line breaking opportunities, producing long runs of legitimate text
without line breaks.
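
The two steps can be sketched as a filter over break positions. A minimal sketch, assuming both algorithms have already been run and each returns a list of string offsets; the offsets in the example and the function name filter_line_breaks are hypothetical:

```python
def filter_line_breaks(line_breaks, cluster_boundaries):
    """Keep line break opportunities that coincide with grapheme cluster boundaries.

    Returns (kept, dropped); a dropped non-tailorable opportunity would signal
    the unsolvable case described above.
    """
    boundaries = set(cluster_boundaries)
    kept = [pos for pos in line_breaks if pos in boundaries]
    dropped = [pos for pos in line_breaks if pos not in boundaries]
    return kept, dropped

# Hypothetical offsets: the opportunity at 4 falls inside a cluster spanning 3..6.
kept, dropped = filter_line_breaks([3, 4, 6], [0, 3, 6])
print(kept, dropped)  # [3, 6] [4]
```

The dropped list is returned separately so that a caller can check whether any of the removed opportunities came from non-tailorable rules.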

Date/Time: Thu Aug 08 21:51:58 CDT 2024
ReportID: ID20240808215158
Name: Marcel Schneider
Report Type: Error Report
Opt Subject: TUS

Hello,

The Unicode Standard misadvises about composing custom vulgar fractions, as it 
recommends breaking spaces to separate integers and vulgar fractions. It even 
recommends U+200B:

“If the fraction is to be separated from a previous number, then a space can 
be used, choosing the appropriate width (normal, thin, zero width, and so on). 
For example, 1 + thin space + 3 + fraction slash + 4 is displayed as 1¾.”
https://www.unicode.org/versions/Unicode15.0.0/UnicodeStandard-15.0.pdf#page=302&zoom=100,0,400 

Although it was intended to be no-break, the Unicode THIN SPACE U+2009 is 
breaking. So is the ZERO-WIDTH SPACE U+200B, but by design.

The text of TUS is all the more inadequate as there is no space between the 
integer and the precomposed fraction.

I’d suggest changing this to:

A preceding integer part must be separated from the digits composing the 
fraction. This can be achieved using any of U+200C ZERO WIDTH NON-JOINER, 
U+2060 WORD JOINER, U+202F NARROW NO-BREAK SPACE, or another no-break 
character of the appropriate width.

I noted this already on 2023-08-31T0736+0200 and came across it again now 
while documenting source code and keyboard layouts.

Best regards,

Marcel Schneider
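
The sequence suggested in this report can be assembled as follows. A minimal sketch, assuming U+202F NARROW NO-BREAK SPACE as the separator; the function name custom_fraction is hypothetical. Python's unicodedata module does not expose the Line_Break property, so the decomposition tags printed at the end are only a hint about the intended no-break behavior, not a line breaking class:

```python
import unicodedata

def custom_fraction(integer, numerator, denominator, separator="\u202F"):
    """Build integer + separator + numerator + FRACTION SLASH + denominator."""
    return f"{integer}{separator}{numerator}\u2044{denominator}"

text = custom_fraction(1, 3, 4)
print([f"U+{ord(ch):04X}" for ch in text])

# Decomposition tags from UnicodeData.txt for the two spaces discussed above.
print(unicodedata.decomposition("\u202F"))  # <noBreak> 0020
print(unicodedata.decomposition("\u2009"))  # <compat> 0020
```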

Date/Time: Fri Aug 09 21:37:02 CDT 2024
ReportID: ID20240809213702
Name: Robert Thomson
Report Type: Error Report
Opt Subject: Unicode Standard Annex #42


With respect to UAX #42 for Unicode version 15.1.0 at
https://www.unicode.org/reports/tr42/#d1e3008 viewed 2024-08-10, I believe
there are a couple of minor errors:

In section 4.4.2 Name properties, the character name has a pattern option
of <control>.  None of the codepoints have that pattern, and I
believe that with revision 9 and the introduction of the name alias pattern
there is no longer the requirement to include "|(<control>)" in the
character name pattern.

[name pattern, 12] = 
  character-name = xsd:string { pattern="([A-Z0-9 #\-\(\)]*)|(<control>)" }


If you should agree with the previous conclusion then Section 12 contains an
example fragment that is also in error

<char cp="001F" age="1.1" na="&lt;control&gt;" na1="UNIT SEPARATOR"
            gc="Cc" bc="S" lb="CM"/>
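
The two alternatives of the quoted pattern can be tried out directly. A minimal sketch, assuming Python's re syntax is close enough to XSD regular expressions for this pattern (XSD patterns are implicitly anchored, which fullmatch imitates); the function name which_alternative is hypothetical:

```python
import re

# Pattern copied from the character-name definition quoted above.
NAME_PATTERN = re.compile(r"([A-Z0-9 #\-\(\)]*)|(<control>)")

def which_alternative(name):
    """Report which branch of the pattern accepts the name, or None if neither."""
    match = NAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    return "control" if match.group(2) is not None else "name"

print(which_alternative("UNIT SEPARATOR"))  # name
print(which_alternative("<control>"))       # control
print(which_alternative("unit separator"))  # None
```

If the second branch were removed as suggested, a value such as na="<control>" in the example fragment would no longer match the pattern.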

Date/Time: Thu Sep 19 09:19:51 CDT 2024
ReportID: ID20240919091951
Name: Malo
Report Type: Error Report
Opt Subject: MathClass


As of Unicode 15, in MathClass documents
(https://www.unicode.org/Public/math/revision-15/*), the character U+22A5 ⊥
UP TACK is classified as a Relation (R). This is contradictory with its use
as a value (class N for Normal) in many fields such as logic and type
theory (where it is often referred to as "bot," or "bottom"). In fact,
U+22A4 ⊤ DOWN TACK ("top"), which is used along with bot in those fields, is
classified as Normal (N).

This is likely due to a confusion with the homoglyphic perpendicular symbol
(U+27C2 ⟂ PERPENDICULAR), which is correctly classified as a Relation
(R). It is this exact difference between bot being used as a value and the
perpendicular sign being used as a relation that led to the introduction
of those two distinct characters in Unicode, according to this 2003 draft:
https://www.unicode.org/L2/L2003/03194-math-letterlike.pdf.

As a final note, bot was initially properly classified as Normal (N) in
Unicode 9
(https://www.unicode.org/Public/math/revision-09/MathClass-9.txt), but this
changed with Unicode 11. If this change was intentional, I think this
oddity deserves a comment in the MathClass files to inform the reader that
this is not a mistake, and a short explanation.

Date/Time: Mon Oct 21 14:42:36 CDT 2024
ReportID: ID20241021144236
Name: Huáng Jùnliàng
Report Type: Error Report
Opt Subject: UTS #18

In section 1.2.5, there is a table containing General Category Property
values and three starred entries, Any, Assigned and ASCII. Although there is a
note that starred entries in the table are not part of the enumeration of
General_Category values, it may still be a little bit confusing as one
browser engine maintainer interprets[1] that ASCII belongs to General
Category:

> Yes, but that means that they are not part of the enumeration of values
 and not that they don't belong to that category. I.e. they are not listed
 as being part of that categories in UnicodeData.txt.

Can we improve the text and/or the table layout to clarify that Any,
Assigned and ASCII are not General_Category property values?

[1]: https://issues.chromium.org/u/0/issues/373759990#comment5
 

Feedback routed to Emoji Standard & Research Working Group for evaluation [ESR]

Date/Time: Thu Jul 18 18:33:41 CDT 2024
ReportID: ID20240718183341
Name: Peter G Constable
Report Type: Public Review Issue
Opt Subject: 496

Note: This report is about a proposed update and the error has been fixed in the released version.

I recognize this is a late report, but I just noticed this typo in PU UTS #51, 
in section 2.6. In revision 26 (2024-6-26), the first sentence of section 2.6 has 
the following (revised) wording:

"There are several emoji that depict more than one person interacting. When 
implemented with a choice or genders or skin tones, special handling is 
required on a case-by-case basis."

The phrase "choice or genders or skin tones" appears to have a typo: I assume 
what is intended is "choice of genders or skin tones".

Feedback routed to Editorial Working Group for evaluation [EDC]

Date/Time: Fri Jul 26 02:41:07 CDT 2024
ReportID: ID20240726024107
Name: Werner Lemberg
Report Type: Error Report [EDC]
Opt Subject: NamesList.txt

As discussed in the thread starting at

  https://corp.unicode.org/pipermail/unicode/2024-July/010976.html 

it turned out that the two characters

  1D132   MUSICAL SYMBOL QUARTER TONE SHARP
  1D133   MUSICAL SYMBOL QUARTER TONE FLAT

are not accidentals but *pitch modifiers*, to be added to the left of an
accidental (or a note without an accidental) and indicating that the pitch
of the given note has to be raised or lowered by a quarter tone,
respectively.  The provided scans in the discussion confirm this usage.

In other words, these two characters should be put into a separate section
`@ Pitch modifiers` or something like that.

Date/Time: Thu Aug 08 09:06:48 CDT 2024
ReportID: ID20240808090648
Name: Lucas
Report Type: Error Report
Opt Subject: Multiple

The Latin Letters D, K, L, N and R as used in Livonian, Old-Prussian,
Latvian and Romanian (all around the Baltic area) are supposed to have a
comma underneath, and not a cedilla. I have not found a single source that
needs these letters with an actual cedilla, other than errors caused by
you, Unicode. According to Wikipedia these letters were mistakenly encoded
with a cedilla by Unicode in the early nineties, and Unicode claims
these errors cannot be fixed (even though, in general, the computer world
is all about bugfixing). These letters should not combine with 0327, but
with 0326, as you probably know, since the font used in your charts shows a
proper comma accent. The Calibri fonts I designed also use comma
accents.

Your Unicode-bugs are the cause of many fonts actually using cedillas
instead of comma accents. Your bug has also caused the recent DIN 91379
Norm to include sequences for these letters combined with 0326 comma
accent, instead of using the existing Unicodes of the precomposed letters.

If you, for whatever reason, refuse to fix the bugs introduced by your
predecessors, then at least add notes to ALL of these 10 codepoints, in
your charts, that this was a historic mistake, and that the accents should
actually look like free floating comma accents (0326) and not cedillas
(0327). 

1E10 Ḑ LATIN CAPITAL LETTER D WITH CEDILLA (0044 + 0327)
1E11 ḑ LATIN SMALL LETTER D WITH CEDILLA (0064 + 0327)
0136 Ķ LATIN CAPITAL LETTER K WITH CEDILLA (004B + 0327)
0137 ķ LATIN SMALL LETTER K WITH CEDILLA (006B + 0327)
013B Ļ LATIN CAPITAL LETTER L WITH CEDILLA (004C + 0327)
013C ļ LATIN SMALL LETTER L WITH CEDILLA (006C + 0327)
0145 Ņ LATIN CAPITAL LETTER N WITH CEDILLA (004E + 0327)
0146 ņ LATIN SMALL LETTER N WITH CEDILLA (006E + 0327)
0156 Ŗ LATIN CAPITAL LETTER R WITH CEDILLA (0052 + 0327)
0157 ŗ LATIN SMALL LETTER R WITH CEDILLA (0072 + 0327)

ASAP please, thank you.
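
The decompositions listed in this report can be checked against the data shipped with Python. A minimal sketch using the standard unicodedata module; the helper with_comma_below is hypothetical and only builds the base letter plus U+0326 sequence that the report describes:

```python
import unicodedata

# The ten code points listed in the report.
CEDILLA_LETTERS = [0x1E10, 0x1E11, 0x0136, 0x0137, 0x013B,
                   0x013C, 0x0145, 0x0146, 0x0156, 0x0157]

def with_comma_below(cp):
    """Return the base letter of cp followed by U+0326 COMBINING COMMA BELOW."""
    base = unicodedata.normalize("NFD", chr(cp))[0]
    return base + "\u0326"

for cp in CEDILLA_LETTERS:
    decomposed = unicodedata.normalize("NFD", chr(cp))
    print(f"{cp:04X}", [f"{ord(ch):04X}" for ch in decomposed],
          [f"{ord(ch):04X}" for ch in with_comma_below(cp)])
```

The comma-below sequences stay as two code points under NFC, so they are not canonically equivalent to the precomposed letters.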

Date/Time: Sat Aug 31 22:24:11 CDT 2024
ReportID: ID20240831222411
Name: Guillaume Fortin-Debigaré
Report Type: Error Report
Opt Subject: Unicode 15.1.0 Core Specifications - Chapter 22 Symbols

Note: This error has been fixed in the Unicode 16.0 core spec.

Table 22-5 "Mathematical Operators Disunified from Punctuation" lists the incorrect 
Unicode code point for the SOLIDUS character in the second row of the left column. 
It should be 002F instead of 003F.

Date/Time: Sat Sep 07 05:14:42 CDT 2024
ReportID: ID20240907051442
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: U0000.pdf

A minor slip: In U0000.pdf, the following is shown with two right single quotation 
marks (they are not ASCII apostrophes!) instead of a left and a right one:

  for ’Greek question mark’

Date/Time: Wed Sep 11 04:07:14 CDT 2024
ReportID: ID20240911040714
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject: U2100.pdf

There are two issues with the informative aliases “first transfinite
cardinal (countable)”, “second transfinite cardinal (the continuum)”, “third
transfinite cardinal (functions of a real variable)” and “fourth
transfinite cardinal” for the characters U+2135 (ALEF SYMBOL), U+2136
(BET SYMBOL), U+2137 (GIMEL SYMBOL) and U+2138 (DALET SYMBOL),
respectively.

1) Aleph is used together (!) with 0, 1, … as an index to indicate
cardinalities of well-ordered infinite sets (in ascending order).
(Without an index, it is apparently sometimes used for the cardinality of
the continuum, not the first transfinite cardinal!) Beth and gimel are also
used with an index (you can look up the definition), while daleth does not
have an established meaning and was apparently just included in LaTeX so
that it can be used in an ad-hoc manner. (Even if there is someone out
there who uses the characters as the aliases indicate, that would be an
idiosyncrasy that does not deserve mention in the only alias.)

2) That the cardinality of the continuum is the second transfinite cardinal
amounts to the continuum hypothesis, which is known to be independent of
the set theory ZFC, and among those set theorists who have a belief either
way, it seems like most believe it to be false.

Date/Time: Wed Sep 11 05:14:22 CDT 2024
ReportID: ID20240911051422
Name: Ivan Panchenko
Report Type: Error Report
Opt Subject:

Two further remarks:

1) The reference glyph for U+3388 and that for U+3389 have an
italicized “cal” for the calorie. This unit symbol should not be
italicized. While the glyphs are not normative, it would be great if this
could be corrected; an italic mu (in glyphs of the chart) has already been
corrected to an upright one.

2) The character U+2263 (≣ STRICTLY EQUIVALENT TO) is found under the
subhead “Relations”. I think it would be more appropriate to put it
under “Logical operator” (for comparison: U+2227) because it stands for a
connective in modal logic. See here:
https://corp.unicode.org/pipermail/unicode/2022-July/010231.html 

Date/Time: Tue Sep 24 06:11:31 CDT 2024
ReportID: ID20240924061131
Name: Ben Harris
Report Type: Error Report
Opt Subject: The Unicode® Standard Version 16.0 – Core Specification

A piece of text has been lost in the translation to HTML for Unicode 16.  In
Unicode 15.1.0, this text appears:

"So for example, the representation of the number 12,346 in the traditional
 system would be by a sequence of CJK ideographs with numeric values as
 follows: <one, ten-thousand, two, thousand, three, hundred, four, ten,
 six>."

That is, the example is "one, ten-thousand, two, thousand, three, hundred,
four, ten, six", surrounded by less-than and greater-than signs.

In Unicode 16.0.0, at
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-22/#G46185,
the same sentence reads:

"So for example, the representation of the number 12,346 in the traditional
 system would be by a sequence of CJK ideographs with numeric values as
 follows: ."

That is, the entire text within and including the less-than and greater-than
signs has vanished.  The HTML source shows that the text does actually
appear in the source, but the less-than sign has not been properly escaped
and so is interpreted as markup by browsers.

This makes me suspect that there may be other similar problems elsewhere in
the standard.  I haven't (yet) made any attempt at looking for them.
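
The failure mode described above can be reproduced with a short Python sketch. It is an illustration under stated assumptions, not the toolchain used to build the Core Specification: `TextOnly` and `visible_text` are hypothetical helper names, and the standard library's `html.parser` stands in for a browser.

```python
import html
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect only character data, roughly what a browser would display."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def visible_text(markup):
    """Return the text content of an HTML fragment, with tags dropped."""
    parser = TextOnly()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


raw = "follows: <one, ten-thousand, two, thousand, three, hundred, four, ten, six>."

# Unescaped: the angle-bracketed run is parsed as a start tag and vanishes.
print(repr(visible_text(raw)))

# Escaped with html.escape: the brackets become &lt; and &gt; and survive.
print(repr(visible_text(html.escape(raw))))
```

A check of this kind, run over the generated HTML, is one possible way to look for the other occurrences the report suspects.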

Date/Time: Fri Oct 04 11:39:10 CDT 2024
ReportID: ID20241004113910
Name: Malo
Report Type: Error Report
Opt Subject: The Unicode® Standard Version 16.0 Core Specification

Section 24.1.9 of the Unicode® Standard Version 16.0 Core Specification
(https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-24/#G3725)
includes a sample character list which contains a mistake: 212B Å ANGSTROM
SIGN is incorrectly marked as having the canonical mapping 00C5 Å angstrom
sign, instead of 00C5 Å latin capital letter a with ring above. Note that
this error is not present in the corresponding chart
(https://www.unicode.org/charts/PDF/U2100.pdf).
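
The mapping in question can be confirmed with a minimal Python sketch, assuming the `unicodedata` tables bundled with the interpreter; the variable names are illustrative only.

```python
import unicodedata

angstrom = "\u212B"  # ANGSTROM SIGN

# Canonical decomposition as recorded in the Unicode Character Database.
mapping = unicodedata.decomposition(angstrom)

# Normalizing to NFC replaces the singleton with its canonical equivalent.
target = unicodedata.normalize("NFC", angstrom)
target_name = unicodedata.name(target)

print(f"U+212B {unicodedata.name(angstrom)} -> {mapping} {target_name}")
```

The name reported for the target is that of U+00C5, which is what the sample list should show in place of "angstrom sign".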

Date/Time: Sun Oct 06 16:19:03 CDT 2024
ReportID: ID20241006161903
Name: Jim DeLaHunt
Report Type: Error Report
Opt Subject: www.unicode.org/versions/latest/

Passing on a social media comment about the page at
https://www.unicode.org/versions/latest/ . Reader visits the page wanting
to find the Core Spec (this can generalise to other parts of the Unicode
Standard such as UTRs). Reader expects that the page will contain links to
the parts of the core spec which they seek. Instead, the page describes the
differences between the latest version of TUS and the previous version. I
suggest adding a section to the top of this page, describing "The current
version of The Unicode Standard is 16.0.0. It consists of a Core
Specification (link), some Code Charts (link), etc." Then put the current
content under a heading like "Differences from previous version of the
Standard".

The present set of links, especially the unnumbered list of links under "B.
Technical Overview", might make the reader hope they link to the parts of
the Standard, but in fact they link to subheadings below which describe
changes. It would be better for the list of links at the top of the page to
point to the parts of the latest version of The Unicode Standard, as implied
by the URL.

Original social media post:
https://cosocial.ca/@timbray/113170595870924709 , by Tim Bray of XML fame.
Relayed by Jim DeLaHunt. The explanation above is mine, not Tim's. He may
submit his own Error Report in his own words.

Date/Time: Thu Oct 24 10:04:37 CDT 2024
ReportID: ID20241024100437
Name: Sridatta A
Report Type: Error Report
Opt Subject: Corrections to Unicode chapter of Tulu-Tigalari

In chapter 15
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-15/#G71814
the text reads:
“Tulu-Tigalari is a historic script attested in a large number of manuscripts 
from Karnataka and northern Kerala dating to as early as 1300 CE. It was used 
to write Sanskrit, Tulu, and Malayalam,”
This should be corrected to have Kannada instead of Malayalam.

In Figure 15-5, the glyph shown is that of ju rather than chu.

Other Reports