L2/18-274

Comments on Public Review Issues
(July 24 - Sept 14, 2018)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of September 14, 2018, since the previous cumulative document was issued prior to UTC #156 (July 2018). Some items in the Table of Contents do not have feedback here.

Contents:

The links below go directly to open PRIs and to feedback documents for them, as of September 14, 2018.

Issue  Name                                           Feedback Link
379    Draft UAX #44, Unicode Character Database      (feedback) No feedback at this time
378    Draft UTR #53, Unicode Arabic Mark Rendering   (feedback) No feedback at this time

The links below go to locations in this document for feedback.

Feedback to UTC / Encoding Proposals
Feedback on UTRs / UAXes
Error Reports
Other Reports

Note: The Feedback on Encoding Proposals section this time includes feedback on:
L2/15-121R2  L2/17-373R  L2/18-282


Feedback to UTC / Encoding Proposals

Date/Time: Wed Sep 5 13:33:05 CDT 2018
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal L2/18-282
Opt Subject: Encoding model for a newly proposed character

Document: L2/18-282 proposes a new character for the Adlam script, arguing
that none of the currently encoded characters satisfy the needs of the
Adlam script. The required character:

* Does not induce word or line break opportunities.
* Has General_Category Lm.
* Has a straight (not hooked) glyph.
* Has right-to-left directionality.
* Has a "transparent" joining class.

I agree that these requirements are sufficient to merit the separate
encoding of a new character. However, the currently proposed character has
the script property Adlam, when it can be argued that other orthographies
may require such a character in the future, similar to the way ARABIC
TATWEEL was encoded only for the Arabic script but quickly saw its use
expand to other right-to-left joining scripts.

As such, I suggest encoding this character as a generic character with the
Common script property and the name RIGHT-TO-LEFT MODIFIER LETTER
APOSTROPHE; to reflect its new nature, it should be encoded at U+061D, the
last unassigned slot in the Arabic block, with the annotation "used for
Adlam", and a proper entry should be made for it in the Script_Extensions
data.
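
For illustration, the corresponding Script_Extensions entry might look like
the following line in the format of ScriptExtensions.txt (hypothetical,
since both the code point assignment and the name are only proposed here):

    061D          ; Adlm # Lm       RIGHT-TO-LEFT MODIFIER LETTER APOSTROPHE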

Date/Time: Mon Sep 10 05:12:05 CDT 2018
Name: John Knightley
Report Type: Feedback on an Encoding Proposal
Opt Subject: Response to Proposal to Encode Two Vietnamese Alternate Reading Marks by Lee Collins

In Proposal to Encode Two Vietnamese Alternate Reading Marks by Lee Collins
(WG2 N4915, L2/17-373R), a somewhat simplified picture is painted of the
two reading marks. For example, the document gives the impression that the
reading marks are always placed on the left, but elsewhere the same author
discusses the variant reading mark 个 "cá nháy" being present as the top
part of U+2B89A (V4-4078) 𫢚, which is formed from 个 over 衣; see
https://hc.jsecs.org/irg/ws2017/app/index.php?id=05027 (also in the recent
IRG document http://appsrv.cse.cuhk.edu.hk/~irg/irg/irg51/IRGN2309VietnamReview.pdf).
The addition of the reading mark in this case also requires merging the
bottom stroke of the reading mark with the top stroke of the lower part.
This complex behaviour is one reason why encoding this reading mark as a
combining character would not be appropriate; continuing the existing
practice of encoding via CJK unified ideographs would be best.

Some similar problems exist with the other proposed reading mark, as shown,
for example, in figure 4 of the UK response, where the reading mark
combines with 外 and two strokes are merged.

The above, and other omissions, show that the existing proposal is not
mature and should not be allowed to proceed.

Feedback on UTRs / UAXes

Date/Time: Sun Jul 29 09:28:10 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Misleading phrasing about HYPHEN-MINUS in character names

UAX #34 says “The rule UAX34-R3 specifies that only medial HYPHEN-MINUS
characters are ignored in comparison. It is possible for a HYPHEN-MINUS
character to occur in initial position (following a SPACE) in a word in a
Unicode character name.” That makes it sound like the only possible
positions for HYPHEN-MINUS are medial and initial, but it can also be final.
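
For concreteness, a quick Python sketch that lists character names
containing a word-final HYPHEN-MINUS (illustrative only; it relies on
Python's unicodedata module reflecting the UCD of the Python build). It
finds, for example, U+0F0A TIBETAN MARK BKA- SHOG YIG MGO:

    import unicodedata

    # Scan all assigned code points for names in which a word ends with
    # HYPHEN-MINUS, i.e. the hyphen is in final position within the word.
    for cp in range(0x110000):
        try:
            name = unicodedata.name(chr(cp))
        except ValueError:
            continue  # unassigned or unnamed code point
        if any(word.endswith('-') for word in name.split(' ')):
            print(f'U+{cp:04X}  {name}')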

Date/Time: Sat Jul 21 14:45:52 CDT 2018
Name: Karl Williamson
Report Type: Error Report
Opt Subject: Traditional and Simplified Han in UTS 39

Below is <1c9f4bf5-7589-a5a6-ddd2-dd4de4e5d0a0@ix.netcom.com>, in which
Asmus lays out why a passage from UTS 39 should be retracted.

The full excerpt from the UTS reads:

    Mark Chinese strings as “mixed script” if they contain both simplified (S)
    and traditional (T) Chinese characters, using the Unihan data in the
    Unicode Character Database [UCD].

    The criterion can only be applied if the language of the string is known
    to be Chinese. So, for example, the string “写真だけの結婚式” is Japanese,
    and should not be marked as mixed script because of a mixture of S and T
    characters. Testing for whether a character is S or T needs to be based
    not on whether the character has an S or T variant, but whether the
    character is an S or T variant.
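
For concreteness, a minimal sketch in Python of the criterion quoted above,
assuming two precomputed sets derived from Unihan (a character *is* a
simplified variant if it has a kTraditionalVariant entry, and *is* a
traditional variant if it has a kSimplifiedVariant entry); the names here
are placeholders, not anything defined by UTS 39:

    # SIMPLIFIED and TRADITIONAL are sets of characters that *are* S or T
    # variants per Unihan, not characters that merely *have* such variants.
    def is_mixed_st(text: str, simplified: set, traditional: set) -> bool:
        has_s = any(ch in simplified for ch in text)
        has_t = any(ch in traditional for ch in text)
        return has_s and has_t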
   

There are several issues with this.

First and foremost, the definition of S and T variants is not something
that is universally agreed upon. The .cn, .hk and .tw registries use a
definition of S and T variants that disagrees with the Unihan data in many
particulars. Therefore, using the Unihan data would result in false
positives (and false negatives).

Second, there are many characters that are variants acceptable under both
the "S" and "T" labels. You only have to look at the published Label
Generation Rulesets (or IDN tables) for these domains to see many examples.
And, as mentioned above, you cannot reverse-engineer these tables from the
Unihan data.

Third, the same domains mentioned have a policy of delegating up to three
labels to the same applicant: a "traditional" label, a "simplified" label,
and a mixed label matching the spelling of the label in the original
application (for situations where a mixed label is appropriate). In other
words, certain mixed labels are seen as appropriate.

Fourth, the Chinese ccTLDs all have a robust policy of preventing any
other mixed label that is a variant of the three from being allocated to an
unrelated party. If you "know" that the language has to be Chinese because
the domain is under such a ccTLD, then the check is superfluous. Other
registries are not known to have similar policies, so for them additional
spoof detection may be useful; however, it is precisely in those cases that
it is impossible to know whether a label is intended to be in the Chinese
language.

Fifth, generally the only thing that can be ascertained is that a label is
*not* in Chinese: by virtue of having Kana or Hangul characters mixed in.
However, the reverse is not true. You will find labels registered under .jp
that do not contain Hiragana or Katakana.

Sixth, for zones that are shared by different CJK languages, the state of
the art is to have a coordinated policy that prevents "random" variant
labels from coexisting in the registry. An example of this kind of effort is
being developed for the root zone. By definition, for the root zone, there
is no implied information about the language context, unlike the case for
the second level, where the presence of a ccTLD in the full domain name may
give a clue.

Seventh, attempting to determine whether a label is potentially valid based
on variant data (of any kind) is doomed, because actual usage is not limited
to "pure" labels. The variant mechanism is something that works differently
(in those registries that apply it): instead of looking at a single label,
the registry can implement "mutual exclusion". Once one variant label from a
given set has been delegated, all others are excluded (or in practice, all
but three, which are limited to the same applicant). Without access to the
registry data, you cannot predict which variants in a set are the "good
ones", and with access to the data, spoof labels are rejected and cannot be
registered.

In conclusion, my recommendation would be to retract this particular
passage.

A./

On 12/27/2017 1:31 PM, Karl Williamson via Unicode wrote:
    In UTS 39, it says that, optionally,

    "Mark Chinese strings as “mixed script” if they contain both simplified (S)
    and traditional (T) Chinese characters, using the Unihan data in the Unicode
    Character Database [UCD].

    "The criterion can only be applied if the language of the string is known to be Chinese."

    What does it mean for the language to "be known to be Chinese"? Is this
    something algorithmically determinable, or does it come from information
    about the input text that comes from outside the UCD?

    The example given shows some Hiragana in the text. That clearly indicates
    the language isn't Chinese. So in this example we can algorithmically rule
    out that it's Chinese.

    And what does Chinese really mean here?

Date/Time: Sat Aug 4 10:55:32 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Coptic Epact Numbers in UAX #31

Table 4 “Candidate Characters for Exclusion from Identifiers” of UAX #31 should 
list \p{block=Coptic_Epact_Numbers}. It is like the other “inappropriate technical 
blocks” in that its only XID_Continue character is XID_Continue because it is a 
combining mark, but it is only useful when used with the other characters in 
the block, which are not XID_Continue.
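
One way to check this claim (a sketch; it relies on Python identifiers
following the UAX #31 XID rules per PEP 3131, so the result depends on the
Unicode version of the Python build):

    import unicodedata

    # List the XID_Continue code points in the Coptic Epact Numbers block
    # (U+102E0..U+102FF); only the combining mark U+102E0 should qualify.
    for cp in range(0x102E0, 0x10300):
        ch = chr(cp)
        if ('a' + ch).isidentifier():
            print(f'U+{cp:04X}  gc={unicodedata.category(ch)}  '
                  f'{unicodedata.name(ch, "<unnamed>")}')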

Date/Time: Sat Aug 4 11:18:49 CDT 2018
Name: Manish Goregaokar
Report Type: Error Report
Opt Subject: IdentifierType.txt not consistent about Not_XID

https://www.unicode.org/Public/security/11.0.0/IdentifierType.txt
categorizes code points by their identifier type from UTS 39
(https://www.unicode.org/reports/tr39/#Identifier_Status_and_Type).

Types are allowed to overlap.

It seems like there's an inconsistency with Not_XID -- not all Not_XID
code points are tagged as such.

For example, U+0027 APOSTROPHE is just listed as Limited, but it should be
Limited Not_XID. The same goes for U+058A ARMENIAN HYPHEN and the other
punctuation characters there (except for MIDDLE DOT).
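
These code points can be confirmed as not XID_Continue with a quick sketch
(again relying on Python identifiers following the UAX #31 XID rules):

    # U+0027 APOSTROPHE and U+058A ARMENIAN HYPHEN are not XID_Continue,
    # so per UTS 39 they should also carry the Not_XID identifier type.
    for cp in (0x0027, 0x058A):
        print(f"U+{cp:04X} XID_Continue: {('a' + chr(cp)).isidentifier()}")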

Date/Time: Sat Aug 11 11:50:13 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Grapheme_Cluster_Break of U+1B35 BALINESE VOWEL SIGN TEDUNG

UAX #29 assigns some spacing marks Grapheme_Cluster_Break=Extend instead of
SpacingMark “for canonical equivalence”. I infer the following rule: any
code point which occurs as the non-first code point in a non-Hangul
canonical decomposition must have Grapheme_Cluster_Break=Extend. (It would
be nice to explicitly state this rule in the annex.) There is one exception:
the decomposition of U+1B40 is <U+1B3E, U+1B35> and yet U+1B35 BALINESE
VOWEL SIGN TEDUNG has Grapheme_Cluster_Break=SpacingMark.
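
The exception can be found mechanically; here is a sketch using Python's
unicodedata (Grapheme_Cluster_Break itself is not exposed there, so this
only enumerates the code points that the inferred rule covers):

    import unicodedata

    # Collect every code point that occurs as a non-first element of a
    # canonical (non-Hangul, non-compatibility) decomposition.
    non_first = set()
    for cp in range(0x110000):
        if 0xAC00 <= cp <= 0xD7A3:
            continue  # Hangul syllables decompose algorithmically
        d = unicodedata.decomposition(chr(cp))
        if d and not d.startswith('<'):  # '<' marks a compatibility tag
            non_first.update(int(p, 16) for p in d.split()[1:])

    # U+1B35 occurs second in U+1B40's decomposition <U+1B3E, U+1B35>,
    # yet it has Grapheme_Cluster_Break=SpacingMark rather than Extend.
    print(0x1B35 in non_first)  # True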

Date/Time: Sun Aug 12 18:42:31 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Context A2 in UAX #31 is too broad

UAX #31 defines context A2 for ZWNJ “in a conjunct context” as /$L $M* $V
$M₁* ZWNJ/. That allows ZWNJ at the end of an identifier, where it has no
visible effect. The regular expression should be /$L $M* $V $M₁* ZWNJ $L/
instead.

Defining the variables in terms of Indic_Syllabic_Category would minimize
false positives for the regex too.
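
To illustrate the difference, a sketch with rough Devanagari stand-ins for
$L, $M and $V (UAX #31 actually defines these via General_Category and
canonical combining class; $M₁ is omitted here for brevity):

    import re

    L    = '[\u0915-\u0939]'   # stand-in for $L: a consonant letter
    M    = '[\u093E-\u094C]*'  # stand-in for $M*: dependent vowel signs
    V    = '\u094D'            # stand-in for $V: virama
    ZWNJ = '\u200C'

    current  = re.compile(L + M + V + ZWNJ)      # /$L $M* $V ZWNJ/
    proposed = re.compile(L + M + V + ZWNJ + L)  # require a letter after ZWNJ

    print(bool(current.fullmatch('\u0915\u094D\u200C')))         # True: matches at end of identifier
    print(bool(proposed.fullmatch('\u0915\u094D\u200C')))        # False: trailing ZWNJ rejected
    print(bool(proposed.fullmatch('\u0915\u094D\u200C\u0937')))  # True: genuine conjunct context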

Error Reports

Date/Time: Sun Jul 22 22:17:55 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: U+166D CANADIAN SYLLABICS CHI SIGN

U+166D CANADIAN SYLLABICS CHI SIGN should have General_Category=So and
Terminal_Punctuation=No. It is a logogram for “Christ”, not a mark of
punctuation.

For example, http://www.evertype.com/standards/sl/a08.jpg shows the
beginning of the Epistle to Titus. “ᒋᓴᔅ ᙭” appears at the end of the first
line and in the middle of the eleventh. Comparing to an English translation
shows that they correspond to “Jesus Christ”, not “Jesus” with a punctuation
mark, and are not at “the end of textual units”.

Date/Time: Wed Jul 25 09:37:28 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Confusables for ARABIC MATHEMATICAL STRETCHED letters

The ARABIC MATHEMATICAL STRETCHED letters should be confusable not with the
basic Arabic letters but with the basic letters followed by alef. For
example, U+1EE61 should be confusable with the sequence ⟨U+0628 U+0627⟩.
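
In the format used by confusables.txt, the suggested entry for this example
might look like the following (a hypothetical line, not current data):

    1EE61 ;	0628 0627 ;	MA	# ARABIC MATHEMATICAL STRETCHED BEH → ARABIC LETTER BEH, ARABIC LETTER ALEF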

Date/Time: Thu Aug 2 19:18:33 CDT 2018
Name: Ken Lunde
Report Type: Error Report
Opt Subject: Suggested additional kRSUnicode/kRSKangXi property values for U+20063 𠁣 and U+200DB 𠃛

Similar to U+29C0B 𩰋 and U+29C0A 𩰊, which use 191.-5 as their kRSUnicode
and kRSKangXi property values, please consider adding 169.-4 as an
additional kRSUnicode and kRSKangXi property value for U+20063 𠁣 and
U+200DB 𠃛. These values are arguably more correct and, more importantly,
will make the characters *much* easier to find among the nearly 90K CJK
Unified Ideographs.
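
Expressed in the Unihan data file format, the requested additional values
would look like this (illustrative; each would be appended to the
character's existing field values):

    U+20063	kRSUnicode	169.-4
    U+20063	kRSKangXi	169.-4
    U+200DB	kRSUnicode	169.-4
    U+200DB	kRSKangXi	169.-4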

Date/Time: Tue Aug 7 15:00:48 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Hentaigana should have Identifier_Type=Obsolete

U+1B001 HIRAGANA LETTER ARCHAIC YE has Identifier_Type=Obsolete but the rest
of the hentaigana (U+1B002 to U+1B11E) have Identifier_Type=Recommended.
They are all obsolete.

Date/Time: Wed Aug 8 13:09:47 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Identifier_Type of U+05C7 HEBREW POINT QAMATS QATAN

U+05C7 HEBREW POINT QAMATS QATAN is a recent invention, so it should not
have Identifier_Type=Obsolete. (Identifier_Type=Uncommon_Use is still
appropriate.)

Date/Time: Wed Aug 8 13:34:14 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Inconsistent Identifier_Types for excluded scripts

Some old scripts have Identifier_Type=Exclusion|Obsolete whereas most just
have Identifier_Type=Exclusion. The former (Ogham, Runic, Old Italic,
Gothic, and Deseret) are not more obsolete than the latter, so all those
scripts should be Obsolete or none should.

Only the original Old Italic and Deseret repertoires have
Identifier_Type=Exclusion|Obsolete: the letters encoded later (U+1031F,
U+1032D..1032F, U+10426..10427, and U+1044E..1044F) have
Identifier_Type=Exclusion. This is even more inconsistent. (Coptic is
another script where some letters are Exclusion|Obsolete and some are
Exclusion, but the Obsolete ones really are obsolete, so no change is
necessary for Coptic.)

Date/Time: Wed Aug 8 20:43:44 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Inaccurate synthesis of U+11134 CHAKMA MAAYYAA

The section on Chakma in chapter 13 says “combinations of virama and maayyaa
following a consonant are not meaningful, as both kill the inherent vowel”.
Because maayyaa is also used as a gemination mark, the sequence ⟨maayyaa,
virama⟩ is meaningful.

Date/Time: Thu Aug 9 07:42:48 CDT 2018
Name: David Corbett
Report Type: Error Report
Opt Subject: Ambiguity in how to use Indic siyaq numerals

L2/15-121R2 describes a style of Indic siyaq numerals where a multiple of
lakhs or crores is rendered with glyphs resembling U+1EC95 INDIC SIYAQ
NUMBER TEN THOUSAND etc. instead of U+1EC7A INDIC SIYAQ NUMBER TEN etc. It
offers two possible representations: “This method of writing the tens of
lakhs may be mimicked by using the numbers for the ten thousands, whose
shapes resemble the modified tens. While this approach does not preserve the
semantic value of the number, it does offer a visual solution. [...] Another
method might be to produce the alternate display using contextual
substitutions in a font.” The Unicode Standard does not explain which
solution to use. It should, because font implementers are likely to read
the proposal, be confused, and create incompatible fonts.

I suggest using TEN THOUSAND for glyphs that look like TEN THOUSAND, even
when they mean 10, because the encoding of Indic siyaq numerals is in all
other respects glyph-based.

Other Reports

Date/Time: Wed Aug 22 12:57:24 CDT 2018
Name: Stephen Davis
Report Type: Error Report
Opt Subject: Cherokee Nation Will Return

I noticed you branched out the Latin Extended category to subgroup D. One
of the characters in that subgroup, U+A7AE «LATIN CAPITAL LETTER SMALL
CAPITAL I», looks remarkably like a syllable character in the Cherokee
block, U+13C6 «CHEROKEE LETTER QUA», and, as some might have pointed out in
the past, there are similar visual matchups in the ASCII Latin and Cyrillic
blocks (maybe even a mockup or two in the Mathematical Symbol categories).
I feel it would be appropriate to consider cross-referencing the characters
in the Cherokee block with their lookalikes in other blocks, so that future
font developers who decide to implement Cherokee won't have to "reinvent
the wheel" when just knocking off a corner or two will suffice. Thank you.

Date/Time: Sat Aug 18 23:24:57 CDT 2018
Name: Henrique Peron
Report Type: Other Question, Problem, or Feedback
Opt Subject: MS-DOS/IBM-DOS backwards compatibility

Good morning,

I have noticed that there is a block called "Latin Extended Additional"
with several precomposed Vietnamese accented letters.

As I understand it, those are provided for backward compatibility with old
DOS code pages and other 8-bit character sets, while nowadays Vietnamese
computer users prefer to compose such characters by typing the necessary
Latin letters followed by the necessary diacritics from the "Combining
Diacritical Marks" block.

However, when it comes to the Lithuanian language, such support is not
available in Unicode.

Lithuanian uses, along with ordinary Latin letters, nine extra precomposed
Latin letters with certain diacritical marks. However, Lithuanians
understand them as letters in their own right, like the Spanish "ñ", which
has its own position in the alphabet between "N" and "O" and is not
considered an accented letter.

In the DOS days, there were the code pages CP776, CP777 and CP778, which
provided what were called precomposed letters for "accented Lithuanian".

What is called "accented Lithuanian" is the regular set of Latin letters,
along with those aforementioned nine extra precomposed letters, eventually
carrying yet another diacritical mark. There are combinations like "LATIN
LETTER A WITH OGONEK AND ACUTE ACCENT" and "LATIN LETTER J WITH TILDE".

This poses a distinct situation: when combining the small letter "i" with
the acute, grave or tilde, the "i" must retain its tittle, unlike what
happens with "i" in other languages written with the Latin script.

Last but not least, there are a few other unusual characters on those three
code pages, seemingly also used in dictionaries.

There is also a Wikipedia page
(https://en.wikipedia.org/wiki/Lithuanian_accentuation) which provides more
information. It shows that, in the end, support for accented Lithuanian is
not needed only for backward-compatibility purposes.

Would the Unicode Consortium be interested in providing such support?

Date/Time: Mon Aug 27 08:16:37 CDT 2018
Name: François
Report Type: Other Question, Problem, or Feedback
Opt Subject: Ou ligature for greek

Hello,

I'm surprised that the character ȣ isn't included as a Greek character,
even if it's a ligature.

I know this character's inclusion proposal has been rejected, but couldn't
it be worth reconsidering?

Its use isn't that rare in modern Greek graffiti, or even in Greek
trademarks (https://papadopoulou.gr/) or in ancient Byzantine texts.

I can't see the rationale for including a character such as the ligature ff
for Latin (not perceived by anyone as different from ff) but not including
ȣ, which is clearly more different from ου.

Thank you!