Public Review Issues

Accumulated Feedback on PRI #502

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Wed Apr 24 12:13:30 CDT 2024
ReportID: ID20240424121330
Name: Ned Holbrook
Report Type: Error Report
Opt Subject: Unicode 16.0 Core Spec [EDC]

Table 12-38 is missing a couple of space characters, namely in “0D310D31” and “0D2A0D31”.

I would also note in passing that it is somewhat jarring to see just how
many ways there are of formatting sequences in this chapter: Table 12-32
lists code points separated by commas in angle brackets, Table 12-35 lists
code points separated by commas with no angle brackets, Table 12-37 lists
code points interspersed with descriptions, Tables 12-38 and 12-39
list code points separated by spaces, and Table 12-40 has parallel lists of
descriptions and code points separated by commas. While I would not assume
a single format is best for every purpose, it does seem that there could be
more consistency in this chapter at least.

Date/Time: Sat May 04 17:43:05 CDT 2024
ReportID: ID20240504174305
Name: Alexander Kunde
Report Type: Error Report
Opt Subject: Bamum Supplement Block [SEW]

This concerns seemingly faulty character names in the Bamum Supplement
Block. Presuming that the phonetic readings (that served as the basis for
the character names) as given in the underlying proposal (N3597, L2/09-102
with N3523, L2/09-106) are correct and following the conventions specified
therein (on p. 3), there are seemingly typos in the following character
names:

1680B "MAEMBGBIEE" for MAEMGBIEE (məmgbie)
16881 "PUNGAAM" for PUNGGAAM (puŋgaam)
1688E "NGOM" for NGGOM (ŋgɔm)
168DC "SETFON" for SHETFON (ʃɛtfɔn)
16963 "MBAA SEVEN" for SAMBA (samba)
1697D "NGOP" for NGGOP (ŋgɔp).

For two further characters, 16839 FIRI ("firʼi") and 16A24 NI ("nʼi"), the
phonetic source form contains an apostrophe, for which however no
conversion is indicated. Might those not be either, resp., FIR-I and N-I,
or, if the apostrophe is a variant for ʔ, FIRQI and NQI?

Note that, writing from Germany, I myself can't actually read the script nor
speak the underlying language and have no connections to the user
community. I merely noticed discrepancies between the columns (phonetic vs.
en vs. fr) in the indicated proposal (pp.21 ff.).

Date/Time: Tue May 07 18:02:20 CDT 2024
ReportID: ID20240507180220
Name: Markus Scherer
Report Type: Error Report
Opt Subject: TUS table 4-5 Primary Numeric Ideographs [EDC]


Eric Muller noticed that TUS table 4-5 shows U+5146 with the value 1,000,000,000,000 (10,000 × 10,000 × 10,000)
which since Unicode 15.1 is no longer the Numeric_Value of that code point.

See
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-4/#G138783 
https://www.unicode.org/Public/15.0.0/ucd/extracted/DerivedNumericValues.txt 
https://www.unicode.org/Public/15.1.0/ucd/extracted/DerivedNumericValues.txt 

The kPrimaryNumeric value is
1000000 1000000000000
(with two values separated by a space).
The first one of these is the Numeric_Value.

Also, kPrimaryNumeric has data for 20 code points, while table 4-5 shows only 17.
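
As a rough illustration of the point about the two-valued field, the Python sketch below splits a kPrimaryNumeric field on whitespace and takes the first value. The function name `numeric_value_from_primary` is hypothetical, the input string is the one quoted in this report, and this is not a description of how the UCD tooling actually derives Numeric_Value.

```python
def numeric_value_from_primary(field: str) -> int:
    """Return the first whitespace-separated value of a kPrimaryNumeric field.

    Assumption (taken from the report): when the field carries two values,
    the first one is the Numeric_Value.
    """
    values = [int(v) for v in field.split()]
    if not values:
        raise ValueError("empty kPrimaryNumeric field")
    return values[0]


# Field quoted in the report for U+5146.
field_5146 = "1000000 1000000000000"
print(numeric_value_from_primary(field_5146))
```

The second value in the field is the 10,000 × 10,000 × 10,000 figure that table 4-5 still shows.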

Date/Time: Wed May 08 07:23:49 CDT 2024
ReportID: ID20240508072349
Name: Mikhail Merkuryev
Report Type: Error Report
Opt Subject: Supplemental Arrows-C [SEW]


Change the Egyptological arrows 1F8C0 and 1F8C1 from hollow to simple. As
L2/23-185 says, the arrows don’t need to be hollow.

Date/Time: Wed May 15 10:35:38 CDT 2024
ReportID: ID20240515103538
Name: Mikhail Merkuryev
Report Type: Public Review Issue
Opt Subject: DoNotEmit.txt [PAG]

DoNotEmit.txt: Add, under “Discouraged” or “Preferred spelling”, the
decompositions of the Cyrillic letters known to me:

Ёё Йй Ўў

(Cyrillic capital/small letter Io, Cyrillic capital/small letter Short I,
Cyrillic capital/small letter Short U)

e.g.
0415 0308 → 0401   # Cyrillic capital letter Ie + combining diaeresis → Cyrillic capital letter Io

Maybe others, but I don’t know.

Її (Cyrillic capital/small letter Yi) is tricky and I don’t know what to do:
the decomposed form is discouraged in modern Ukrainian text, but may be
allowed in Old Slavonic.

Rationale: most Cyrillic fonts do not position combining marks properly, and
the common breve has a shape different from the Cyrillic one. Also, these
four letters in their modern shape are really distinct entities.
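
For reference, the decomposed/precomposed pairs for the three letter pairs listed in this report can be derived from their canonical decompositions. The Python sketch below does this with the standard `unicodedata` module. The helper names `hex_seq` and `decomposition_pairs` are hypothetical, and the printed layout only mimics the "→" example in the report, not the actual DoNotEmit.txt syntax.

```python
import unicodedata

# Precomposed Cyrillic letters named in the report (Io, Short I, Short U).
LETTERS = "\u0401\u0451\u0419\u0439\u040E\u045E"


def hex_seq(s: str) -> str:
    """Format a string as space-separated 4-digit hex code points."""
    return " ".join(f"{ord(ch):04X}" for ch in s)


def decomposition_pairs(letters: str):
    """Yield (decomposed, precomposed) strings using canonical decomposition (NFD)."""
    for ch in letters:
        yield unicodedata.normalize("NFD", ch), ch


for decomposed, precomposed in decomposition_pairs(LETTERS):
    name = unicodedata.name(precomposed)
    print(f"{hex_seq(decomposed)} → {hex_seq(precomposed)}   # {name}")
```

Because these are canonical decompositions, NFC normalization already maps each decomposed sequence to its precomposed letter.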

Date/Time: Mon May 20 12:39:57 CDT 2024
ReportID: ID20240520123957
Name: Ben Scarborough
Report Type: Public Review Issue
Opt Subject: 502 [PAG]

Note: This report duplicates report #ID20240112220043 filed against PRI #489, and will be handled there.

DoNotEmit.txt currently includes the following line:

0149; 02BC 006E; Deprecated # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE; MODIFIER LETTER APOSTROPHE, LATIN SMALL LETTER N

The character in question, U+0149 LATIN SMALL LETTER N PRECEDED BY
APOSTROPHE, has had the Deprecated property since Unicode 5.2.0. According
to L2/08-287, the character was deprecated because its decomposition used
the wrong apostrophe character: RIGHT SINGLE QUOTATION MARK is the
preferred character for Afrikaans, not MODIFIER LETTER APOSTROPHE.

The line in DoNotEmit.txt should use the preferred string instead of
U+0149's compatibility decomposition. The line should be changed to:

0149; 2019 006E; Deprecated # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE; RIGHT SINGLE QUOTATION MARK, LATIN SMALL LETTER N
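
The difference between the current and the proposed line can be checked mechanically. The Python sketch below parses the two lines quoted in this report and compares the second field with the decomposition mapping of U+0149 from the standard `unicodedata` module. The field names `source`, `replacement`, and `kind` are this sketch's own reading of the three semicolon-separated fields, not names taken from DoNotEmit.txt.

```python
import unicodedata
from typing import NamedTuple


class DoNotEmitEntry(NamedTuple):
    source: str       # first field (hex code points), assumed to be the discouraged form
    replacement: str  # second field (hex code points)
    kind: str         # third field, e.g. "Deprecated"


def parse_line(line: str) -> DoNotEmitEntry:
    """Parse one data line, ignoring everything after the '#' comment marker."""
    data = line.split("#", 1)[0]
    fields = [f.strip() for f in data.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected 3 fields, got {len(fields)}")
    return DoNotEmitEntry(*fields)


def replacement_is_decomposition(entry: DoNotEmitEntry) -> bool:
    """True if the replacement equals the decomposition mapping of a single-code-point source."""
    if " " in entry.source:
        return False
    parts = unicodedata.decomposition(chr(int(entry.source, 16))).split()
    if parts and parts[0].startswith("<"):  # drop a tag such as "<compat>"
        parts = parts[1:]
    return parts == entry.replacement.split()


current = parse_line("0149; 02BC 006E; Deprecated # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE")
proposed = parse_line("0149; 2019 006E; Deprecated # LATIN SMALL LETTER N PRECEDED BY APOSTROPHE")
print(replacement_is_decomposition(current), replacement_is_decomposition(proposed))
```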

Date/Time: Tue May 21 18:27:20 CDT 2024
ReportID: ID20240521182720
Name: Erik Carvalhal Miller
Report Type: Public Review Issue
Opt Subject: 502 [EDC]

Note: This has been fixed in a subsequent draft of the core spec.

Chapter 22, §22.7.4 [https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-22/#G78435],
¶5 (“A set of ASCII digits 0 through 9…”): “ASCI” → “ASCII”, “charcter” → “character” (in “Outlined
uppercase Latin letters and ASCI digits from the European charcter set for the Sharp MZ-series machines…”).

Date/Time: Wed May 22 04:02:17 CDT 2024
ReportID: ID20240522040217
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 502 [PAG]

I propose adding Duployan (Dupl) to the Script_Extensions for the following
code points based on annotations in the names list for the Duployan block,
the contents of UTN #37, “Duployan Shorthand”, and the original encoding
proposal for Duployan, L2/10-272r2:

	U+00B7 MIDDLE DOT
	U+0300 COMBINING GRAVE ACCENT
	U+0301 COMBINING ACUTE ACCENT
	U+0304 COMBINING MACRON
	U+0306 COMBINING BREVE
	U+0307 COMBINING DOT ABOVE
	U+0308 COMBINING DIAERESIS
	U+030A COMBINING RING ABOVE
	U+0323 COMBINING DOT BELOW
	U+0324 COMBINING DIAERESIS BELOW
	U+0331 COMBINING MACRON BELOW
	U+2E3C STENOGRAPHIC FULL STOP

Duployan for Romanian also makes use of U+00B0 DEGREE SIGN in numerical
contexts, though as this character is in common use in a variety of writing
systems and has no explicit Script_Extensions as of now there would likely
be little benefit to specifically listing just Duployan.

Date/Time: Thu May 23 09:34:28 CDT 2024
ReportID: ID20240523093428
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 502 [PAG]

DoNotEmit.txt currently includes the following line:

	13217; 13216 13430 13216 13430 13216; Precomposed_Hieroglyph # EGYPTIAN HIEROGLYPH N035A; EGYPTIAN HIEROGLYPH N035, EGYPTIAN HIEROGLYPH VERTICAL JOINER, EGYPTIAN HIEROGLYPH N035, EGYPTIAN HIEROGLYPH VERTICAL JOINER, EGYPTIAN HIEROGLYPH N035

However, section 11.4.3 of the core spec specifically states:

»For example, U+13217 𓈗 EGYPTIAN HIEROGLYPH N035A apparently could
 be represented by the sequence <13216, 13430, 13216, 13430,
 13216>. However, this compound sign is considered a single
 entity in Ancient Egyptian by Egyptologists, because the compound
 sign conveys a function that is not covered by the meaning of its
 individual parts. As a result, the atomic character U+13217 should
 be used.«

I do not know which representation is actually the preferred one, so either
this DoNotEmit entry or this section of the core spec should be removed.
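
Whichever representation turns out to be preferred, normalization will not reconcile the two, since U+13217 has no canonical decomposition. The Python sketch below illustrates this with the standard `unicodedata` module. The helper name `canonically_equivalent` is hypothetical.

```python
import unicodedata

ATOMIC = "\U00013217"  # EGYPTIAN HIEROGLYPH N035A
# Sequence quoted in the report: N035, VERTICAL JOINER, N035, VERTICAL JOINER, N035
SEQUENCE = "\U00013216\U00013430\U00013216\U00013430\U00013216"


def canonically_equivalent(a: str, b: str) -> bool:
    """Two strings are canonically equivalent if their NFD forms are identical."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


print(canonically_equivalent(ATOMIC, SEQUENCE))
```

Text that mixes the two forms therefore stays distinct for comparison and searching unless the data file or the core spec settles on one of them.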

Date/Time: Thu May 23 09:57:09 CDT 2024
ReportID: ID20240523095709
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 502 [PAG]

The two new CJK strokes, U+31E4 CJK STROKE HXG and U+31E5 CJK STROKE SZP,
currently have no explicit Script_Extensions. They should be given the
property value “Hani” like all the other CJK strokes (U+31C0..U+31E3).

Date/Time: Sat Jun 01 11:53:22 CDT 2024
ReportID: ID20240601115322
Name: Sridatta A
Report Type: Public Review Issue
Opt Subject: 502 [EDC]

The Tirhuta chapter of the Core Specification needs updating. It says:
“ and in the Narayani and Janakpur zones of Nepal. ”
Nepal has not used zones as administrative divisions
since 2015.

According to the current classification, Maithili is mainly
spoken in Madhesh and Koshi provinces.

https://en.m.wikipedia.org/wiki/Maithili_language

Date/Time: Thu Jun 06 17:22:34 CDT 2024
ReportID: ID20240606172234
Name: Debbie Anderson
Report Type: Public Review Issue
Opt Subject: 502 [SEW]

I checked with the Egyptologists and they confirmed the currently commented out 
Standardized Variants should remain commented out, but one additional sequence 
should ALSO be commented out:
1333B FE00; rotated 90 degrees; # EGYPTIAN HIEROGLYPH U007

Date/Time: Sat Jun 08 19:25:17 CDT 2024
ReportID: ID20240608192517
Name: Jules Bertholet
Report Type: Public Review Issue
Opt Subject: 502 [EDC]

From §5.8.2 of the core spec <https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-5/#G21129>:

> This is a paragraph with a line separator at this point,
>
> causing the word “causing” to appear on a different line, but not causing the typical 
> paragraph indentation, sentence breaking, line spacing, or change in flush (right, center, 
> or left paragraphs).

However, the paragraph in question actually uses a paragraph separator, not a
line separator. `</p>` should be replaced with `<br>` in the HTML.

Date/Time: Wed Jun 19 07:31:03 CDT 2024
ReportID: ID20240619073103
Name: Vaishnavi Murthy Yerkadithaya
Report Type: Public Review Issue
Opt Subject: 502 [EDC/Charts]

Editorial Note: Please refer to https://www.unicode.org/cgi-bin/GetDocumentLink?L2/24-149
for detailed comments on https://www.unicode.org/charts/PDF/Unicode-16.0/U160-11380.pdf

Date/Time: Wed Jun 19 09:19:47 CDT 2024
ReportID: ID20240619091947
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 502 [PAG]

Currently, U+19DA NEW TAI LUE THAM DIGIT ONE has Line_Break=Complex_Context
while all the other digit characters of the New Tai Lue script
(U+19D0..U+19D9) have Line_Break=Numeric. For consistency, I propose
changing U+19DA to Line_Break=Numeric as well.

Date/Time: Mon Jun 24 14:43:20 CDT 2024
ReportID: ID20240624144320
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 502 [SEW]

Unicode 16.0 will have 5 characters with Indic syllabic category
Consonant_Preceding_Repha. Such characters represent non-spacing marks, but
are encoded in phonetic order before the consonant on top of which they’re
rendered, and therefore have general category Lo.

The representative glyphs for such characters in the code charts and, where
shown, in the core specification have an enclosing dashed box to reflect
their unusual properties.

There’s an inconsistency in what’s shown inside that box: Most
representative glyphs show the repha glyph by itself, but the one for
Tulu-Tigalari shows the repha glyph on top of a dotted circle.

I think showing the repha mark on top of a dotted circle actually makes
sense.

Affected characters:
0D4E ; Consonant_Preceding_Repha # Lo MALAYALAM LETTER DOT REPH
113D1 ; Consonant_Preceding_Repha # Lo TULU-TIGALARI REPHA
11941 ; Consonant_Preceding_Repha # Lo DIVES AKURU INITIAL RA
11D46 ; Consonant_Preceding_Repha # Lo MASARAM GONDI REPHA
11F02 ; Consonant_Preceding_Repha # Lo KAWI SIGN REPHA

Sources:
https://www.unicode.org/Public/draft/UCD/charts/CodeCharts.pdf 
https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-17/#G41865

Date/Time: Tue Jun 25 04:59:29 CDT 2024
ReportID: ID20240625045929
Name: Richard Ishida
Report Type: Public Review Issue
Opt Subject: 502 [PAG]

The Do Not Emit data file contains the following lines.

---
ب ٔ; ࢡ; Hamza_Form # ARABIC LETTER BEH, ARABIC HAMZA ABOVE; ARABIC LETTER BEH WITH HAMZA ABOVE
ح ٔ; ځ; Hamza_Form # ARABIC LETTER HAH, ARABIC HAMZA ABOVE; ARABIC LETTER HAH WITH HAMZA ABOVE
ر ٔ; ݬ; Hamza_Form # ARABIC LETTER REH, ARABIC HAMZA ABOVE; ARABIC LETTER REH WITH HAMZA ABOVE
---

These mappings are valid for orthographies that use the atomic character as
a letter of the alphabet, but they are not appropriate for Kashmiri, which
uses the hamza as a vowel diacritic, not as an ijam.

See
https://r12a.github.io/scripts/arab/ks.html#non_canonical 
https://r12a.github.io/scripts/arab/homographs.html#nehomographs 

Although the hamza is not a tashkil, the distinction made here follows the
logic in the standard related to ijam vs tashkil usage. See
https://r12a.github.io/scripts/arab/homographs.html#ijam_tashkil 

Having special rules for just a few, arbitrary combinations of hamza and
base in Kashmiri is likely not only to lead to inconsistency in encoding,
leading to failures in searching and other operations, but it is also a
recipe for confusion for users. Note that all other uses of the vowel hamza
above a base character in Kashmiri have no corresponding ijam (and if
there's a possibility that atomic characters for these pairings may be
created for other languages in the future this adds further complexity).

It seems to me that one solution to this would be to add some sort of
qualification, by language, for these entries. Or perhaps it would be
helpful to make these combinations canonically equivalent and remove them
from Do Not Emit.  Users would then be able to type the items either way,
and end up with compatible text.
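
To make the current situation concrete: the three base + hamza sequences above are not canonically equivalent to the atomic characters, unlike, for example, alef with hamza above. The Python sketch below checks this with the standard `unicodedata` module. The helper name `composes_under_nfc` is hypothetical.

```python
import unicodedata

HAMZA_ABOVE = "\u0654"  # ARABIC HAMZA ABOVE


def composes_under_nfc(base: str, mark: str = HAMZA_ABOVE) -> bool:
    """True if NFC turns base + mark into a single precomposed code point."""
    return len(unicodedata.normalize("NFC", base + mark)) == 1


# ALEF + HAMZA ABOVE composes canonically to U+0623.
print("0627", composes_under_nfc("\u0627"))

# BEH, HAH, REH + HAMZA ABOVE stay as two code points under NFC.
for base in "\u0628\u062D\u0631":
    print(f"{ord(base):04X}", composes_under_nfc(base))
```

As things stand, text typed with the sequences and text typed with U+08A1, U+0681, or U+076C will not compare equal after normalization.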

Date/Time: Wed Jun 26 12:31:03 CDT 2024
ReportID: ID20240626123103
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 502 [EDC]

Note: This has been fixed in a subsequent draft of the names list for 16.0.

I propose adding cross references between U+2BFA ⯺ UNITED SYMBOL and 
U+1CC88 𜲈 TWO RINGS ALIGNED HORIZONTALLY because of their similar appearance.

Date/Time: Wed Jun 26 13:09:26 CDT 2024
ReportID: ID20240626130926
Name: Peter Constable
Report Type: Public Review Issue
Opt Subject: 489 [EDC]

Note: This has been fixed in a subsequent draft of the names list for 16.0.

In the code chart for Garay 
(https://www.unicode.org/charts/PDF/Unicode-16.0/U160-10D40.pdf), the names
list has a subhead "Punctuation and reduplication mark" immediately before
U+10D6D GARAY CONSONANT NASALIZATION MARK. That character would fit better
within the scope of the preceding subhead, "Marks". Proposed change: move
the "Punctuation..." subhead after U+10D6D.

Date/Time: Sat Jun 29 15:24:10 CDT 2024
ReportID: ID20240629152410
Name: Charlotte Buff
Report Type: Public Review Issue
Opt Subject: 502 [SEW]

The following characters currently have Script=Common and Script_Extensions={Common}:

	U+16EB RUNIC SINGLE PUNCTUATION
	U+16EC RUNIC MULTIPLE PUNCTUATION
	U+16ED RUNIC CROSS PUNCTUATION

I could not find any mention anywhere in the Unicode Standard of these characters being 
used in any script besides Runic, though it is possible they may be. At the very least 
Runic (Runr) should be added to their Script_Extensions.

Date/Time: Tue Jul 02 15:32:12 CDT 2024
ReportID: ID20240702153212
Name: Karl Pentzlin
Report Type: Public Review Issue
Opt Subject: 502 Unicode 16.0.0 Beta [EDC/Charts]


During a discussion of some symbol characters (L2/23-152) at the ongoing
SC2/WG2 meeting in Prague, there were some misunderstandings because, looking
at the Unicode code charts only, it was not obvious which characters are in
fact emoji.

Thus, it seems advisable to provide easily accessible information in the code
charts on whether

— a character is "emoji by default", i.e. listed in emoji-sequences.txt as
  Basic_Emoji, but without FE0F in the first column,

— or a character is "selectable as emoji" by the variation selector U+FE0F,
  i.e. listed in emoji-sequences.txt as Basic_Emoji, together with FE0F in
  the first column.

I had mailed this to Asmus Freytag as the author of the Unibook software. In
his answer, he recommended that I outline the problem in a response to the
Unicode 16.0 beta review (however, I do not wish to rush anyone into
discussing this issue before Unicode 17). As he wrote, this would focus on
the use case of not being able to tell something that so fundamentally
affects the identity of a character from looking at the code charts,
particularly as, for emoji, the representative glyph in the code chart lacks
the relevance that it has for other characters and may, in fact, be
misleading. It can be noted that the code charts already indicate those
characters for which there is a standardized variant, and for which,
therefore, the sole representative glyph may not give the full information.
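
The two cases distinguished in the list above can be told apart mechanically from emoji-sequences.txt. The Python sketch below is one possible reading of that rule. The function name `classify_basic_emoji` and the two sample lines are illustrative assumptions written in the general layout of that file, and should be checked against the actual data before any use.

```python
def classify_basic_emoji(lines):
    """Map code point -> 'emoji by default' or 'selectable as emoji'.

    Only Basic_Emoji lines are considered. A first field ending in FE0F is
    treated as 'selectable as emoji'; a single code point or a range without
    FE0F is treated as 'emoji by default'.
    """
    result = {}
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != "Basic_Emoji":
            continue
        first = fields[0]
        if ".." in first:
            start, end = (int(x, 16) for x in first.split(".."))
            for cp in range(start, end + 1):
                result[cp] = "emoji by default"
            continue
        cps = [int(x, 16) for x in first.split()]
        if len(cps) == 2 and cps[1] == 0xFE0F:
            result[cps[0]] = "selectable as emoji"
        elif len(cps) == 1:
            result[cps[0]] = "emoji by default"
    return result


# Illustrative sample lines (assumed layout; not a copy of the data file).
SAMPLE = [
    "231A..231B    ; Basic_Emoji  ; watch..hourglass done",
    "00A9 FE0F     ; Basic_Emoji  ; copyright",
]

for cp, status in sorted(classify_basic_emoji(SAMPLE).items()):
    print(f"U+{cp:04X}: {status}")
```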