L2/21-125

Comments on Public Review Issues
(April 22 - July 20, 2021)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of July 20, 2021, since the previous cumulative document was issued prior to UTC #168 (July 27, 2021).

Contents:

The links below go directly to open PRIs and to feedback documents for them, as of July 20, 2021.

Issue Name Feedback Link
433 Unicode 14.0.0 Beta (feedback)
432 Proposed Update UAX #50, Unicode Vertical Text Layout (feedback) No feedback at this time
431 Proposed Update UAX #42, Unicode Character Database in XML (feedback)
430 Proposed Update UTS #51, Unicode Emoji (feedback)
429 Proposed Update UTS #46, Unicode IDNA Compatibility Processing (feedback)
427 Proposed Update UTS #18, Unicode Regular Expressions (feedback)
426 Proposed Update UTR #53, Unicode Arabic Mark Rendering (feedback) No new feedback since last meeting
425 Proposed Update UTS #10, Unicode Collation Algorithm (feedback) No feedback at this time
424 Proposed Update UAX #31 Unicode Identifier and Pattern Syntax (feedback) No feedback at this time
423 Proposed Update UTS #39 Unicode Security Mechanisms (feedback)
422 Proposed Update UAX #9, Unicode Bidirectional Algorithm (feedback) No feedback at this time
421 Proposed Update UAX #38, Unicode Han Database (Unihan) (feedback)
420 Proposed Update UAX #45, U-source Ideographs (feedback) No new feedback since last meeting
419 Proposed Update UAX #44, Unicode Character Database (feedback) No new feedback since last meeting
417 Proposed Update UAX #29, Unicode Text Segmentation (feedback)
416 Proposed Update UAX #14, Unicode Line Breaking Algorithm (feedback) No feedback at this time
415 Proposed Update UTR #23, The Unicode Character Property Model (feedback) No feedback at this time
408 QID Emoji (feedback)

The links below go to locations in this document for feedback.

Feedback routed to Unihan ad hoc for evaluation
Feedback routed to Script ad hoc for evaluation
Feedback routed to Properties & Algorithms ad hoc for evaluation
Feedback routed to Emoji SC for evaluation
Feedback routed to Editorial Committee for evaluation
Other Reports

 


Feedback routed to Unihan ad hoc for evaluation

(None at this time.)


Feedback routed to Script ad hoc for evaluation

Date/Time: Sun Jun 13 19:07:29 CDT 2021
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: On the proposed Book Pahlavi encoding model

This is a response to this document:
https://www.unicode.org/L2/L2021/21090-book-pahlavi-model.pdf called
L2/21-090. I would like to mention that I find the proposed model mostly
appropriate and I would like to commend the work of all the
contributors.

I only have three suggestions:

A) Encode the double curled tooth (samekh) as a separate character, to be
consistent with the other "regular" teeth. On page 13, the author
mentions: 

  "The [double tooth] is encoded as a separate character in order to enable
   typographical support for different representations of aleph-heth in
   initial, medial, and final position." followed by...

  "The ligature of [triple tooth] aleph-heth+gimel-daleth-yodh is encoded as
   an atomic character in order to enable typographical support for
   different representations of it, as compared to [double tooth]"

This, in my opinion, merits treating them separately. But it also means that
the double and triple regular teeth have separate characters, while the
double curled tooth doesn't. In my opinion it's better to treat all teeth alike,
because not only does that mean that the "regular samekh letter" can be
treated as a unit, but it also extends the benefits of treating the
double and triple tooth atomically to the double curled tooth. If
necessary, decomposition sequences can be added to the double and triple
variants of the teeth, making them pre-composed characters. The name of the
character can be "double curled tooth-samekh".

B) Treat most of the contextual forms as a sequence of two characters. On
page 12, 10 contextual forms of 6 different letters are proposed as atomic
characters. I believe that the "short waw-nun-ayin-resh" and the "final
pe-sadhe" have enough technical justification for encoding, so this section
does not concern them. The rest of them are "bellied" variants of other
letters like zayin and lamedh, each with "half" and "full" bellies. In my
opinion this is unnecessarily redundant, given that the separate bellies
are going to be encoded separately anyway. These could be easily rendered
by a sequence of the base letter and the desired belly
(e.g. "zayin" + "half belly character" or "zayin" + "full belly
character").

C) Change the name of the belly characters by adding the "full" prefix as
appropriate. As stated before, the belly primitives are encoded
atomically, and the rendering requirements necessitate a "half
belly" variant apart from the "full belly". None of that is an issue, and I
must say it is quite an ingenious solution; I would just change the name of
the "bellies" to "full bellies", which reduces the chance of confusion,
since someone who reads the word "belly" in isolation can't know
if it refers to all bellies in general or just the ones that aren't halved.
It also has the effect of associating the concepts of a "belly full of
food" and "half filled belly of food", strengthening their relations and
identities. The names would therefore read:

  Full Belly
  Half Belly
  Full Straight Belly
  Half Straight Belly
  Full Curled Belly
  Half Curled Belly

In summary, my changes would add one more character, remove 8 other
characters and rename 3 other characters. I hope for the Book Pahlavi
script to be accepted soon, for Unicode 15. My dearest wishes: Eduardo.

Date/Time: Tue Jun 15 13:45:10 CDT 2021
Name: David Corbett
Report Type: Feedback on an Encoding Proposal
Opt Subject: Comment on L2/21-107

L2/21-107 proposes “that spacing superscript й, ў, ҫ, ҙ etc. [...] be 
typeset with diacritics”. Because U+04AB CYRILLIC SMALL LETTER ES WITH 
DESCENDER and U+0499 CYRILLIC SMALL LETTER ZE WITH DESCENDER are encoded 
without decompositions, if modifier letter versions of them are attested, 
shouldn’t the modifier letter versions be encoded without decompositions too?
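
The distinction drawn above can be checked against the Unicode Character Database. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the helper name `has_decomposition` is hypothetical and only for illustration. It reports whether each base letter mentioned in the comment carries a decomposition mapping.

```python
import unicodedata

def has_decomposition(ch: str) -> bool:
    """Return True if the character has a decomposition mapping in the UCD."""
    return unicodedata.decomposition(ch) != ""

# The four base letters named in the comment, written as escapes:
# U+0439 SHORT I, U+045E SHORT U, U+04AB ES WITH DESCENDER, U+0499 ZE WITH DESCENDER
for ch in ("\u0439", "\u045e", "\u04ab", "\u0499"):
    mapping = unicodedata.decomposition(ch) or "(none)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {mapping}")
```

The first two letters decompose into a base letter plus U+0306 COMBINING BREVE, while the two letters with descender have no decomposition, which is the asymmetry the comment points to.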

Feedback routed to Properties & Algorithms ad hoc for evaluation

(None at this time.)


Feedback routed to Emoji SC for evaluation

Date/Time: Fri Jun 18 09:57:23 CDT 2021
Name: Charlotte Buff
Report Type: Other Question, Problem, or Feedback
Opt Subject: Implications of new emoji proposal guidelines on Extended_Pictographic property

The new guidelines for submitting emoji proposals, published 15th April,
contain the following caveat:

	»Submissions proposing to emojify existing Unicode 
	characters will not be accepted.«

Does this mean that no already existing character that isn’t an emoji now is
ever going to receive emoji status in the future, or merely that the UTC
will not consider requests specifically asking for the emojification of
existing characters, but that such emojifications may still take place
through other processes?

If the former, this new policy has interesting implications for the
Extended_Pictographic property. Extended_Pictographic was originally
created to future‐proof the line breaking and text segmentation behaviour
of ZWJ sequences. By preemptively assigning Extended_Pictographic=True to
non‐emoji characters with emoji‐like qualities – the implication being that
said characters could one day become emoji themselves – even systems that
haven’t kept up with the latest emoji release would still be able to handle
new ZWJ sequences correctly.

However, if characters are now locked into their emojiness the moment they
are encoded, this aspect of the property has become obsolete. There
currently exist over 600 characters with Extended_Pictographic=True but
Emoji=False. Under a strict interpretation of the new guidelines, they
should be excluded from the Extended_Pictographic set going forward since
they can never become emoji anyway, and according to definitions ED‑15a and
ED‑16 in UTS #51, only emoji can formally be part of ZWJ sequences – ZWJ
sequences being the sole application of the Extended_Pictographic property.
The practice of marking unassigned ranges of codepoints reserved for future
emoji use as Extended_Pictographic would continue as usual.
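
The set discussed above (Extended_Pictographic=True but Emoji=False) can be computed from a file in the `emoji-data.txt` format. The following is only a sketch: the parser name `parse_emoji_data` is hypothetical, and `SAMPLE` is a short hand-written excerpt in that format for illustration, not the real data file, so the counts it yields say nothing about the actual figure of over 600 characters.

```python
def parse_emoji_data(text: str) -> dict[str, set[int]]:
    """Parse 'codepoint(s) ; Property # comment' lines into property -> code point set."""
    props: dict[str, set[int]] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        cps, prop = (field.strip() for field in line.split(";"))
        first, _, last = cps.partition("..")   # single code point or a range
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        props.setdefault(prop, set()).update(range(lo, hi + 1))
    return props

# Hand-written illustrative excerpt (assumption), not the published file.
SAMPLE = """\
1F000..1F003 ; Extended_Pictographic # mahjong tiles
1F004        ; Emoji
1F004        ; Extended_Pictographic
1F600        ; Emoji
1F600        ; Extended_Pictographic
"""

props = parse_emoji_data(SAMPLE)
pictographic_only = props["Extended_Pictographic"] - props["Emoji"]
print(sorted(f"U+{cp:04X}" for cp in pictographic_only))
```

Run against the published data file instead of `SAMPLE`, the same set difference would list the characters affected by a strict reading of the new guidelines.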

While most characters in the intersection of Extended_Pictographic=True and
Emoji=False would make for poor emoji candidates, there are a few symbols
in there that could prove to be popular with users, so it is not unlikely
that the UTC will receive proposals for emoji that pretty much already
exist in Unicode. However, the new wording seems to suggest that in such
cases, an entirely new character would be encoded regardless. While I
personally think that emoji presentation is an immensely unfortunate
property for a codepoint to have, I also believe that it goes against the
spirit and purpose of the Unicode Standard to encode two separate versions
of the exact same abstract character just because they are expected to be
displayed with different fonts.

Feedback routed to Editorial Committee for evaluation

Date/Time: Mon Jun 14 16:23:25 CDT 2021
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: On the response of the editorial committee on my suggested modifications

This is a response to document L2/21-106: https://www.unicode.org/L2/L2021/21106-u14-annotation-resp.pdf 

I would like to begin by expressing my gratitude and delight at the answer
of the editorial committee. I hope this can serve as an opportunity for
greater engagement between me and the body in the future.

I see the proposed inclusions as a great compromise between my proposal and
the status quo. Some mistakes are either a wrong glyph being used or a
copy-paste error. Ignoring those, I would only like to make a few further
suggestions.

EXCLAMATION MARK: Add a reference to the upcoming 'MEDIEVAL EXCLAMATION
MARK'

NUMBER SIGN: Add a note next to one of the informative aliases "= octothorpe
(originating from Bell Labs the spelling of this alias is inconsistent)"
source: https://en.wikipedia.org/wiki/Number_sign#Names_of_the_character 

DOLLAR SIGN: Mention the fact that it is often used as a generic currency
sign.

AMPERSAND: Tweak the wording of the bullet note "• originally a ligature of
the letter 'e' and 't' from the Latin 'et'" Also retain the reference
to '2227 LOGICAL AND'

COMMA: Add a reference to its ancestor '2E12 HYPODIASTOLE'

FULL STOP: Retain the reference to '00B7 MIDDLE DOT'

QUESTION MARK: Remove the references to 2048 and 2049, given that they are
redundant given the presence of the reference to '2047 DOUBLE QUESTION MARK'.
Also, add a reference to the upcoming 'MEDIEVAL QUESTION MARK'

COMMERCIAL AT: Add a bullet note saying "• originally used for an archaic
unit of weight in Spain, called 'arroba'"

LATIN CAPITAL LETTER C: Remove the reference to the 'CYRILLIC CAPITAL LETTER
ES', that letter is considered to be part of the basic alphabet. Including
it would necessitate adding references to all other Greek and Cyrillic
homoglyphs (like the lunate sigma symbol at 03F9), however excluding the
letters of the basic alphabet seems to be a good compromise. A small
exception can be made for the capital letter iota, listed under the capital
I, since it completes the set of references nicely. Also I suggest keeping
the reference to '2104 CENTRE LINE SYMBOL', but not repeat it under the
capital L.

LATIN CAPITAL LETTER P: Retain the reference to '214A PROPERTY LINE SYMBOL',
but don't repeat it under capital L

LATIN SMALL LETTER E: Retain the reference to 'AB32 LATIN SMALL LETTER
BLACKLETTER E'

LATIN SMALL LETTER F: Retain the references to the letters '0192 LATIN SMALL
LETTER F WITH HOOK' and 'AB35 LATIN SMALL LETTER LENIS F' and '03DC GREEK
LETTER DIGAMMA'

LATIN SMALL LETTER L: Retain the reference to '01C0 LATIN LETTER DENTAL
CLICK' but don't repeat it under capital I

LATIN SMALL LETTER O: Retain the reference to 'AB3D LATIN SMALL LETTER
BLACKLETTER O'

LATIN SMALL LETTER R: Retain the references to 'AB47 LATIN SMALL LETTER R
WITHOUT HANDLE' and 'AB4B LATIN SMALL LETTER SCRIPT R' (do note that the
suggestions under the small letters e, f, l, o and r, are made due to their
confusability).

TILDE: Retain the reference to '301C WAVE DASH'.

BROKEN BAR: Mention the fact, that the BROKEN BAR was originally an
allograph of '007C VERTICAL LINE' Source:
https://en.wikipedia.org/wiki/Vertical_bar#Solid_vertical_bar_vs_broken_bar 

MULTIPLICATION SIGN: Reword the informative alias to say "= Cartesian
product (z notation)"

LATIN CAPITAL LETTER O WITH STROKE: Add a reference to the upcoming 'LATIN
CAPITAL LETTER OLD POLISH O' since this character was the typical
replacement up until now

LATIN SMALL LETTER SHARP S: Add a reference to the upcoming 'LATIN SMALL
LETTER MIDDLE SCOTS S'

LATIN SMALL LETTER AE: Add a reference to '1D6B LATIN SMALL LETTER UE'

LATIN SMALL LETTER O WITH STROKE: Add a reference to the upcoming 'LATIN
SMALL LETTER OLD POLISH O' since this character was the typical replacement
up until now

LATIN SMALL LETTER THORN: Add a reference to the upcoming 'LATIN SMALL
LETTER DOUBLE THORN'

That would be all.

Date/Time: Wed Jun 16 00:20:22 CDT 2021
Name: Neal Raulerson
Report Type: Error Report
Opt Subject: Correction in Standard p.126 D93b a.

Instead of:
"a. the initial subsequence of a well-formed code unit sequence..."

I think it is supposed to be:
"a. the initial subsequence of an ill-formed code unit sequence..."

It makes more sense that way. Please let me know, thanks!

Date/Time: Sun Jun 27 12:38:08 CDT 2021
Name: Alexei Chimendez
Report Type: Error Report
Opt Subject: Use of CANCEL TAG in emoji flags

UTS #51 allows for the interchange of various flags through "emoji tag
sequences", specified as: an emoji character or sequence, followed by one
or more component characters from the block Tags, and terminated with the
character CANCEL TAG.

In the Unicode Standard, sec. 23.9 reads:

> There are two uses of cancel tag. To cancel a tag value of a particular
 type, prefix the cancel tag character with the tag identification
 character of the appropriate type. [...] To cancel any tag values of any
 type that may be in effect, use cancel tag without a prefixed tag
 identification character.

Continuing, it specifies:

> Inserting a bare cancel tag in places where only the language tag needs
 to be canceled could lead to unanticipated side effects if this text were
 to be inserted in the future into a text that supports more than one tag
 type.

However, the use of CANCEL TAG in flags is, in effect, a "bare cancel tag",
because it is not preceded by a tag identification character (it is only
preceded by tag component characters). The presence of an emoji flag in a
text may thus inadvertently cause the canceling of all applicable tags.

While the Standard currently only specifies one kind of tag (the language
tag, which is "strongly discouraged"), the use of CANCEL TAG in emoji flags
may cause issues if other kinds of tags are introduced in the future, or
for applications or protocols that make use of "private use" tags to signal
in-band information.

The simplest solution is to change the wording in sec. 23.9 to read:

> To cancel any tag values of any type that may be in effect, use cancel
 tag without a prefixed tag identification character or other tag
 character.

With this change, the CANCEL TAG character in the sequence

> U+1F3F4 U+E0066 U+E006F U+E006F U+E007F

has no effect and is ignored, while in the sequence

> U+1F3F4 U+66 U+6F U+6F U+E007F

the CANCEL TAG character will cancel all tags. This change prevents the
inadvertent canceling behavior of emoji tag sequences as described above.
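
The proposed wording can be sketched as a small classifier. This is an illustration under stated assumptions, not text from the Standard or UTS #51: the function name `cancel_tag_effect` and its return strings are hypothetical, the tag character range U+E0020..U+E007E is the one UTS #51 uses for tag sequences, and U+E0001 LANGUAGE TAG is treated as the only tag identification character.

```python
LANGUAGE_TAG = 0xE0001      # tag identification character for language tags
TAG_FIRST = 0xE0020         # first tag character allowed in a tag sequence
TAG_LAST = 0xE007E          # last tag character allowed in a tag sequence
CANCEL_TAG = 0xE007F

def cancel_tag_effect(cps: list[int]) -> str:
    """Classify the final CANCEL TAG of a sequence under the proposed wording."""
    if not cps or cps[-1] != CANCEL_TAG:
        raise ValueError("sequence must end with CANCEL TAG")
    prev = cps[-2] if len(cps) >= 2 else None
    if prev == LANGUAGE_TAG:
        return "cancels language tag"        # prefixed by a tag identification character
    if prev is not None and TAG_FIRST <= prev <= TAG_LAST:
        return "terminator only"             # prefixed by another tag character
    return "cancels all tags"                # bare CANCEL TAG

flag_like = [0x1F3F4, 0xE0066, 0xE006F, 0xE006F, 0xE007F]
plain_text = [0x1F3F4, 0x66, 0x6F, 0x6F, 0xE007F]
print(cancel_tag_effect(flag_like))   # terminator only
print(cancel_tag_effect(plain_text))  # cancels all tags
```

The two calls correspond to the two sequences given above: in the first the CANCEL TAG follows tag characters and has no canceling effect, in the second it is bare.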

Date/Time: Fri Jul 2 18:12:11 CDT 2021
Name: Mark Roberts
Report Type: Problems / Feedback about website
Opt Subject: Em and En Dash and Space

Would you please consider adding a Q&A on this page:
https://www.unicode.org/faq/punctuation_symbols.html 
 
Question:  Do the widths of the en dash and en space need to be half 
the widths of the em dash and em space?
Answer: (I believe the answer is yes--historically it has been.)

Although this PDF
https://www.unicode.org/charts/PDF/U2000.pdf 
implies that the en space is half an em space, it makes no mention of the 
relationship of an en dash to an em dash.  Furthermore, if an en dash is 
supposed to be half an em dash, the glyphs in that same PDF show the 
en dash drawn slightly wider than half an em dash.

I really hope you will address this issue.  It comes up frequently with 
font designers.

Thank you.

Date/Time: Tue Jul 6 23:53:00 CDT 2021
Name: J Andrew Lipscomb
Report Type: Public Review Issue
Opt Subject: 14.0.0β issues

(Note: This report is actually about document L2/21-106, not the 14.0 beta.)

These are all in the text accompanying the code charts for Basic Latin and the Latin-1 Supplement.
1. (.) Canadian syllabics full stop is 166E, not 16EE.
2. (:) Tricolon is 205D, not 295D.
3. (C) Degree Celsius is 2103, not 2013.
4. Sections on \, °, x, X, q, and ß have stray text.
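
The three code point corrections in items 1 to 3 can be cross-checked against character names. The following is a minimal sketch, assuming Python's standard `unicodedata` module; the table `CORRECTIONS` and the helper `name_matches` are hypothetical names introduced for illustration.

```python
import unicodedata

# (expected character name, code point given in this report, code point cited in L2/21-106)
CORRECTIONS = [
    ("CANADIAN SYLLABICS FULL STOP", 0x166E, 0x16EE),
    ("TRICOLON", 0x205D, 0x295D),
    ("DEGREE CELSIUS", 0x2103, 0x2013),
]

def name_matches(name: str, cp: int) -> bool:
    """Return True if the code point's character name equals the expected name."""
    return unicodedata.name(chr(cp), "") == name

for name, reported, cited in CORRECTIONS:
    print(f"{name}: U+{reported:04X} matches={name_matches(name, reported)}, "
          f"U+{cited:04X} matches={name_matches(name, cited)}")
```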

Other Reports

Date/Time: Thu Jul 8 20:20:24 CDT 2021
Name: Paul Holder
Report Type: Other Question, Problem, or Feedback
Opt Subject: Date encoding

Since there is forever a fight over the "correct" way to encode/display a date, 
it seems like Unicode should standardize it.  This way a user application can 
encode a date in a specified way, and user agents can display it in whatever 
way an end user feels motivated.