Comments on Public Review Issues

L2/14-277

Public Feedback Rolled Forward from July 2014

The sections below contain feedback that was not fully reviewed in the July 2014 meeting.

Feedback on UTS #18

Date/Time: Wed May 28 15:54:40 CDT 2014
Name: Richard Wordingham
Report Type: Error Report UTS #18
Opt Subject: Definition of Unicode Set in Unicode Regular Expressions

Unicode Technical Standard #18 'Unicode Regular Expressions' Revision 17 refers to Unicode 
sets, but does not define them.  I have been told that the definition is meant to be taken 
from UTS#35, the LDML specification, and that there ought to be a cross-reference to that 
definition.

Section 1.3 of UTS#18 contains two examples, 
"[\p{L}--QW]" and "[\p{Assigned}--\p{Decimal Digit Number}--a-fA-Fａ-ｆＡ-Ｆ]", 
which appear not to conform to the LDML syntax.  Further details are given at 
http://unicode.org/cldr/trac/ticket/7507 .

Date/Time: Mon Jul 14 00:05:39 CDT 2014
Name: Karl Williamson
Report Type: Error Report
Opt Subject: UTS18 typo


The final line in Section 1.2 should be
\p{Script_Extensions=Katakana}
NOT \p{Script_Extensions=Hiragana}

Date/Time: Fri Jun 13 22:36:38 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Typo in paragraph 3.6 of UTS #18 Unicode Regular Expressions


Hello, In section "3.6 Context Matching"
http://www.unicode.org/reports/tr18/#Context_Matching there is a typo in the
table with examples: the last column of the last two rows contains a string
"ca not" which should be corrected to "cannot".

Thanks,
Dmitry S.

Feedback on UAX #31

Date/Time: Sat Jun 7 14:23:13 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Possible typo in UAX #31

Hello,
In http://www.unicode.org/reports/tr31/ clause R7 says:

"R7 Filtered Case-Insensitive Identifiers
To meet this requirement, an implementation shall specify either simple or full case 
folding, and adhere to the Unicode specification for that folding. Except for identifiers 
containing excluded characters, allowed identifiers must be in the specified Normalization Form."

Is a Normalization Form truly meant here or is it a case-folding form?

Thanks,
Dmitry S.

Date/Time: Wed Jun 11 18:50:32 CDT 2014
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: Inconsistency wrt/ variation selectors in UAX 31

Unicode Standard Annex 31, UNICODE IDENTIFIER AND PATTERN SYNTAX, is 
inconsistent in its description of variation selectors:

- Section 2.3 describes the risks associated with variation selectors 
(and other default-ignorable characters), and says “Variation selectors ... 
are not included in the default identifier syntax”, and “default-ignorable 
characters are normally excluded from Unicode identifiers”.

- Section 2, however, includes all nonspacing marks in ID_Continue, and 
does nothing to exclude variation selectors, which are nonspacing marks. 
And indeed, DerivedCoreProperties.txt does have the entries

180B..180D    ; ID_Continue # Mn   [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
FE00..FE0F    ; ID_Continue # Mn  [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
E0100..E01EF  ; ID_Continue # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
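
The classification of these code points can be checked locally. The following is a minimal sketch, assuming Python's standard unicodedata module, whose data version depends on the Python release in use. The helper name describe is hypothetical. The str.isidentifier() call applies Python's own identifier rules rather than UAX #31 directly, so it serves only as a rough proxy for ID_Continue.

```python
import unicodedata

# First code point of each variation selector range cited above.
SAMPLES = [0x180B, 0xFE00, 0xE0100]

def describe(cp):
    """Return (name, General_Category, proxy identifier check) for a code point."""
    ch = chr(cp)
    # Assumption: Python's identifier check is used only as a stand-in for
    # ID_Continue; it is Python's own profile, not the UAX #31 definition.
    accepted = ("a" + ch).isidentifier()
    return unicodedata.name(ch), unicodedata.category(ch), accepted

for cp in SAMPLES:
    name, category, accepted = describe(cp)
    print(f"U+{cp:04X} {name}: gc={category}, accepted after 'a': {accepted}")
```

A General_Category of Mn for each sample is what pulls these characters into ID_Continue under the Section 2 derivation.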

Feedback on UTS #10

Date/Time: Tue Jun 17 14:46:22 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Typo in UTS #10 Unicode Collation Algorithm


Hello,
There is a typo in section "3.8.1 Default Values" of UTS #10 Unicode Collation Algorithm 
(both 6.3.0 and 7.0.0): in the last sentence of the first paragraph it is written as follows:
"The unmarked characters will a3) equal to MIN3."
It seems that this should be corrected to the following: "The unmarked characters will 
have a3 equal to MIN3."

Thanks,
Dmitry S.

Date/Time: Wed Jun 18 15:40:40 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Possible error in UTS #10 Unicode Collation Algorithm

Hello,
in UTS #10 Unicode Collation Algorithm version 7.0.0 clause S2.1.2 
(http://www.unicode.org/reports/tr10/#S2.1.2) there seems to be an error
in a note below the clause:

"Note: A non-starter in a string is called blocked if there is another non-starter
of the same canonical combining class or zero between it and the last character of 
canonical combining class 0."

The "... non-starter of the same canonical combining class OR ZERO..." part seems 
erroneous to me because of the following:

1) UAX #15 http://www.unicode.org/reports/tr15/#Description_Norm defines non-starter 
as follows: "Most characters (including all non-combining marks) have a Canonical_Combining_Class 
value of zero, and are unaffected by the Canonical Ordering Algorithm. Such characters 
are referred to by a special term, starter. Only the subset of combining marks which have 
non-zero Canonical_Combining_Class property values are subject to potential reordering by 
the Canonical Ordering Algorithm. Those characters are called non-starters."

2) D107 Starter definition in the Unicode Standard: "D107 Starter: Any code point (assigned 
or not) with combining class of zero (ccc=0)."

These excerpts imply that a non-starter cannot have a Canonical_Combining_Class value of 
zero (ccc=0), contrary to what the note states.

Thanks,
Dmitry S.
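
The two definitions quoted in the report can be expressed directly in terms of the Canonical_Combining_Class property. This is a minimal sketch, assuming Python's standard unicodedata module; the function names are hypothetical. It also shows why "combining mark ... or zero" is not the same statement as "non-starter ... or zero": a combining mark may have ccc=0, while a non-starter by definition may not.

```python
import unicodedata

def is_starter(ch):
    """D107: a starter is any code point with combining class zero (ccc=0)."""
    return unicodedata.combining(ch) == 0

def is_non_starter(ch):
    """A non-starter has a non-zero Canonical_Combining_Class value."""
    return not is_starter(ch)

def is_combining_mark(ch):
    """General_Category M* (Mn, Mc, Me), independent of ccc."""
    return unicodedata.category(ch).startswith("M")

# U+0301 COMBINING ACUTE ACCENT, U+0327 COMBINING CEDILLA,
# U+093E DEVANAGARI VOWEL SIGN AA (a combining mark with ccc=0).
for ch in ["a", "\u0301", "\u0327", "\u093E"]:
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)} "
          f"starter={is_starter(ch)} mark={is_combining_mark(ch)}")
```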

Analysis of the above report by Ken Whistler, 2014/06/18:

O.k., yes, this *is* a problem in wording, and it is non-trivial to
fix.

The note in question goes at least back to Version 4.0 of UTS #10,
although its position in the text migrated a bit later on. In the
UTS #10 4.0 version, it is:

Note: A combining mark in a string is called blocked if there is 
another combining mark of the same canonical combining class or zero 
between it and the last character of canonical combining class 0.

right below Step 2 in Section 4.2. It logically refers to Step 2.1.2,
which is where the note was later moved.

Then a comedy of errors ensues. In later versions of the text,
the note was updated by replacing "combining mark" with "non-starter",
without adjusting the text "or zero" correctly.

But wait! It gets worse. This text, which was derived from the 4.0 version
of UAX #15, where it defined starter for normalization, was not then
adjusted for Corrigendum #5 (from February, 2005!), which inserted the
wording "or higher" in the definition of blocked in UAX #15. And disconnected
as it was, it then certainly did not follow the later move of all the
definitions related to normalization *out* of UAX #15 and into Chapter 3
of the core spec (as of Version 5.2.0). And when they went into Chapter 3,
the wording for "starter" was essentially unchanged, but the wording
for "blocked" got a complete overhaul.

So my conclusion is that all of the wording about starter and blocked in
UTS #10 needs a serious update, to make correct references to the
*current* definitions in Chapter 3, rather than using ad hoc, out-of-date
definitions from 2005 derived from a long-superseded version of UAX #15.
Doing *that* will require some significant work on this section of the
text.

--Ken

Feedback on Other UAXes

Date/Time: Thu Jun 19 11:18:19 CDT 2014
Name: Addison Phillips
Report Type: Error Report
Opt Subject: Bad example in Figure 2, UAX#15

Figure 2 in UAX#15 (Normalization Forms) contains examples of different types
of "compatibility equivalence". The second line in this table is for "breaking
differences" and shows the hyphen-minus character as the example. However, the
only example I can find in TUS or the UCD of a "breaking difference" that is a
case of compatibility decomposition (in fact, it is cited in Chapter 2 of TUS)
is between U+00A0 (non-breaking space) and regular space.

While it's really difficult to illustrate different kinds of space characters
in a table, perhaps using a placeholder ("NBSP", "(non-breaking space)", etc.)
might work? Or maybe add some attendant prose to explain the table?

Note: The term "breaking difference" appears nowhere else that I can find in
UAX15 or in the relevant sections of TUS related to compatibility
decomposition.
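
The difference between the two characters discussed above can be seen in their decomposition mappings. The following is a minimal sketch, assuming Python's standard unicodedata module; the helper name compat_tag is hypothetical.

```python
import unicodedata

def compat_tag(ch):
    """Return the compatibility decomposition tag (e.g. '<noBreak>'), or None."""
    mapping = unicodedata.decomposition(ch)
    if mapping.startswith("<"):
        return mapping.split()[0]
    return None

for ch in ["\u00A0", "-"]:  # NO-BREAK SPACE, HYPHEN-MINUS
    folded = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: tag={compat_tag(ch)}, "
          f"NFKC=U+{ord(folded):04X}")
```

Under this check, U+00A0 carries a noBreak mapping to U+0020, while HYPHEN-MINUS has no decomposition mapping and is unchanged by compatibility normalization.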

Date/Time: Sat Jun 21 19:05:39 CDT 2014
Name: Samuel Bronson
Report Type: Error Report
Opt Subject: UAX #11: refers to biwidth fonts as "legacy"

In UAX#11, you say:

>> An important class of fixed-width legacy fonts contains glyphs of just two widths,
>> with the wider glyphs twice as wide as the narrower glyphs.

I don't think it's correct to think of all such fonts as "legacy": such fonts tend to be 
popular with programmers, and I get the impression that, say, Japanese people usually like 
text to be typeset on a grid, too.

(Granted, the ones that make characters fullwidth *just* because they are encoded using 
two bytes in some encoding or other are a bit silly.)

If we could only get sensible wcwidth() values even for Latin/punctuation/math characters and 
make the fonts to match, we'd *really* have something ... say, making EM DASH perceptibly wider 
than HYPHEN-MINUS?
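
To illustrate the point about EM DASH, here is a minimal sketch of a two-width cell model driven by the East_Asian_Width property, assuming Python's standard unicodedata module. The function cell_width is hypothetical and much simpler than a real wcwidth(): it ignores combining marks and control characters, and it treats ambiguous-width characters as narrow.

```python
import unicodedata

def cell_width(ch):
    """Two-width model: Wide (W) and Fullwidth (F) take two cells, all else one."""
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

# HYPHEN-MINUS, EM DASH, a CJK ideograph, FULLWIDTH LATIN CAPITAL LETTER A
for ch in ["-", "\u2014", "\u65E5", "\uFF21"]:
    print(f"U+{ord(ch):04X} eaw={unicodedata.east_asian_width(ch)} "
          f"cells={cell_width(ch)}")
```

In this model EM DASH and HYPHEN-MINUS both occupy a single cell, which is the situation the report describes.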

Date/Time: Mon Jul 14 15:29:43 CDT 2014
Name: Markus Scherer
Report Type: Error Report
Opt Subject: UAX #38 kDefaultSortKey should distinguish traditional vs. simplified radicals

UAX #38 says:
2.1 Database design
kDefaultSortKey
"Bits 23-30 are the character’s KangXi radical number used [...] The difference 
between simplified and traditional radical is ignored."

This appears to be incorrect: The Han code chart
(http://www.unicode.org/charts/PDF/U4E00.pdf) shows that the forms of the
radicals are distinguished. For example, the characters with radical 120
(silk) are grouped together, and followed by the group of those with radical
120' (silk/C-simplified). See the chart at U+7CF8 and U+7E9F.

I expect that most if not all of the main Unihan block (4E00..9FFF) should
follow the kDefaultSortKey order. If this expectation is not intended to be
true, it should be documented for kDefaultSortKey. (I assume that possible
exceptions would be due to corrections of the Unihan data since the original
allocation.)

I suggest either restating the default sort key as something other than int
bit fields (with the added distinction), or else using unsigned int (32-bit)
or long (64-bit) bit fields, adding one bit for traditional (0) vs. simplified
(1).

Given the existing action items for kDefaultSortKey ([139-A19a], [139-A21], 
see http://www.unicode.org/review/pri266/feedback.html),
I suggest simplifying it as follows:

Use a 64-bit integer with a less dense and therefore less error-prone encoding:

Bits 20.. 0  code point (avoids complications re [139-A19a])
Bit  23      set to 0 if the code point is U+4E00..U+FFFF,
             else set to 1
             ([139-A21], UCA implicit weights BASE FB40 vs. FB80)
Bits 29..24  residual stroke count (0..63)
Bit  30      set to 0 if traditional radical form (e.g., 120),
             set to 1 if simplified (120')
Bits 39..32  radical number (1..214)
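
The proposed layout can be sketched as a packing function. This is an illustration only, not an implementation from UAX #38; the function name pack_sort_key and its parameter names are hypothetical, and the unassigned bits (21, 22 and 31) are left as zero.

```python
def pack_sort_key(code_point, radical, residual_strokes, simplified):
    """Pack the fields of the proposed 64-bit layout into one integer."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if not 1 <= radical <= 214:
        raise ValueError("radical number must be 1..214")
    if not 0 <= residual_strokes <= 63:
        raise ValueError("residual stroke count must be 0..63")
    # Bit 23: 0 for U+4E00..U+FFFF, else 1.
    block_bit = 0 if 0x4E00 <= code_point <= 0xFFFF else 1
    return (
        (radical << 32)              # bits 39..32
        | (int(simplified) << 30)    # bit 30
        | (residual_strokes << 24)   # bits 29..24
        | (block_bit << 23)          # bit 23
        | code_point                 # bits 20..0
    )

# The two characters cited from the code chart: radical 120 and 120'.
# The residual stroke counts passed here are illustrative values.
traditional = pack_sort_key(0x7CF8, 120, 0, simplified=False)
simplified = pack_sort_key(0x7E9F, 120, 0, simplified=True)
print(f"{traditional:#x} < {simplified:#x}: {traditional < simplified}")
```

Because the simplified bit sits above the stroke-count field, every traditional-form character of a radical sorts before every simplified-form character of the same radical, regardless of stroke count.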

Date/Time: Thu Jul 31 22:00:08 CDT 2014
Name: Markus Scherer
Report Type: Public Review Issue
Opt Subject: WD UTR #51 Unicode Emoji

The <title> says "UTS #51". It's not a UTS. Please change to "Working Draft UTR #51".

Section 1 Introduction is good, but I feel strongly that the section on Longer
Term Solutions should follow right after, rather than late in the document.

The document points to at least one doc in unicode.org/~scherer/ -- we should
copy that into a permanent location, for example reports/tr51/.

I suggest deleting 1.2 Goals. It duplicates some of the ToC; it says that the
material is subject to change (as usual); and the last sentence "This document
does not discuss..." should be merged into the Summary at the top which
partially contradicts it.

5 Sorting -- I am personally a bit skeptical about the need for sophisticated
sorting *among* symbols, including Emoji.

6 Searching -- this is useful information, but very different from "search" as
in UTS #10, for example, and it covers a variety of methods. This makes the
heading misleading. Please rename to "Input Methods" or "Selection Methods" or
similar.

Data charts: It would be useful to repeat the column headings once in a while,
at least in long, multi-column tables as in full-emoji-list.

Other Reports

Date/Time: Fri Jun 20 13:12:37 CDT 2014
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Glyph for U+1F44E THUMBS DOWN SIGN potentially wrong

The glyph for U+1F44E THUMBS DOWN SIGN may better show the back of the hand, as it's 
actually very hard to make such a gesture as shown.

Looking at the source glyphs at L2/09-027R2
(http://www.unicode.org/L2/L2009/09027r2-emoji-backgrnd.pdf),
it appears that the SoftBank glyph shows the back of the hand for
this character, while KDDI shows the front.

(From https://code.google.com/p/android/issues/detail?id=71948)

Date/Time: Tue Jun 24 09:22:05 CDT 2014
Name: Daniel Klein
Report Type: Other Question, Problem, or Feedback
Opt Subject: Normalisation of Indic scripts

Hi!

I was normalising some text into Form D with mixed Latin and Sinhala
characters and I was surprised that the Sinhala mark for "o" was decomposed
into "e" and "aa" (which is how it's typed on a Sinhala typewriter). I realise
that the character looks exactly like the other two combined but they don't
render the same as two characters (the combining ring is present) and have a
very different phonological meaning. e.g. කොළ (ක + ො + ළ) "kola" (green) &
කොළ (ක + ෙ + ‍ා + ළ) an impossible spelling (and probably pronunciation) of
"keaala" (no such word in Sinhala).

I checked on http://www.unicode.org/charts/normalization/chart_Sinhala.html
and noticed three other characters, too.

It seems to me the same as decomposing "d" into "cl" because they look the
same when combined. Also, "℅" and other related symbols do not become "c/o"
in Form D, only in Form KC. I'm not sure that these Sinhala characters should
ever be decomposed, even in Form KD, as doing so changes the spelling,
meaning, appearance and pronunciation of the words they appear in.

I had a quick look at Tamil and noticed the same thing. I would imagine that
this is the case for most Indic scripts in Unicode (almost all write "o" as a
combination of a preceding "e" and a following "aa").

Even more problematic is ෝ "oo" as ‍ා + ් never combine except with ‍ෙ. කෝ (ක
+ ෝ) vs කෝ (ක + ෙ + ා + ්).

If, however, you think I am wrong (there must have been a reason for doing it
this way), I would love to know the rationale. The only thing I can think of is
maintaining compatibility with proprietary encodings that have no separate
character for "o" and instead encode all characters as they appear visually.
That seems like a bad idea to me, as the text should be converted to Unicode
correctly in the first place.

Regards,

Daniel
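
The behaviour described in this report can be reproduced with a short script. This is a minimal sketch, assuming Python's standard unicodedata module; the helper name nfd_code_points is hypothetical.

```python
import unicodedata

def nfd_code_points(text):
    """Return the code points of the NFD form of text as hex strings."""
    return [f"{ord(c):04X}" for c in unicodedata.normalize("NFD", text)]

# Sinhala vowel signs o, oo, au and Tamil vowel sign o.
for ch in ["\u0DDC", "\u0DDD", "\u0DDE", "\u0BCA"]:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {nfd_code_points(ch)}")

# CARE OF has only a compatibility mapping, so Form D leaves it alone.
care_of = "\u2105"
print(unicodedata.normalize("NFD", care_of) == care_of)
print(unicodedata.normalize("NFKC", care_of))
```

The vowel signs are split because their decomposition mappings are canonical, whereas the mapping for "℅" is a compatibility mapping and applies only in the K forms.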

// Addendum, July 20:

Hi Rick,

I happened to find the following in NamesList.txt:
@ Two-part dependent vowel signs
@+ These vowel signs have glyph pieces which stand on both
sides of the consonant; they follow the consonant in logical order, and
should be handled as a unit for most processing.
0DDC SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA
= sinhala vowel sign o
: 0DD9 0DCF
0DDD SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA
= sinhala vowel sign oo
: 0DDC 0DCA
0DDE SINHALA VOWEL SIGN KOMBUVA HAA GAYANUKITTA
= sinhala vowel sign au
: 0DD9 0DDF

The important bit is "should be handled as a unit for most processing".
I believe that the current behaviour of normalising these characters
into their lookalikes goes against this statement.

Cheers,