The sections below contain feedback that was not fully reviewed in the July 2014 meeting.
UTS #18 Reports
Feedback on UAX #31
Feedback on UTS #10
Feedback on Other UAXes
Error Reports
Other Reports
Date/Time: Wed May 28 15:54:40 CDT 2014
Name: Richard Wordingham
Report Type: Error Report
UTS #18
Opt Subject: Definition of Unicode Set in Unicode Regular Expressions
Unicode Technical Standard #18 'Unicode Regular Expressions' Revision 17 refers to Unicode sets, but does not define them. I have been told that the definition is meant to be taken from UTS#35, the LDML specification, and that there ought to be a cross-reference to that definition. Section 1.3 of UTS#18 contains two examples, "[\p{L}--QW]" and "[\p{Assigned}--\p{Decimal Digit Number}--a-fA-Fa-fA-F]", which appear not to conform to the LDML syntax. Further details are given at http://unicode.org/cldr/trac/ticket/7507 .
Date/Time: Mon Jul 14 00:05:39 CDT 2014
Name: Karl Williamson
Report Type: Error Report
Opt Subject: UTS18 typo
The final line in Section 1.2 should be \p{Script_Extensions=Katakana} NOT \p{Script_Extensions=Hiragana}
Date/Time: Fri Jun 13 22:36:38 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Typo in paragraph 3.6 of UTS #18 Unicode Regular Expressions
Hello, In section "3.6 Context Matching" http://www.unicode.org/reports/tr18/#Context_Matching there is a typo in the table with examples: the last column of the last two rows contains a string "ca not" which should be corrected to "cannot". Thanks, Dmitry S.
Date/Time: Sat Jun 7 14:23:13 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Possible typo in UAX #31
Hello, In http://www.unicode.org/reports/tr31/ clause R7 says: "R7 Filtered Case-Insensitive Identifiers To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Except for identifiers containing excluded characters, allowed identifiers must be in the specified Normalization Form." Is a Normalization Form truly meant here or is it a case-folding form? Thanks, Dmitry S.
Date/Time: Wed Jun 11 18:50:32 CDT 2014
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: Inconsistency wrt/ variation selectors in UAX 31
Unicode Standard Annex 31, UNICODE IDENTIFIER AND PATTERN SYNTAX, is inconsistent in its description of variation selectors:
- Section 2.3 describes the risks associated with variation selectors (and other default-ignorable characters), and says “Variation selectors ... are not included in the default identifier syntax”, and “default-ignorable characters are normally excluded from Unicode identifiers”.
- Section 2, however, includes all nonspacing marks into ID_Continue, and does nothing to exclude variation selectors, which are nonspacing marks.
And indeed, DerivedCoreProperties.txt does have the entries
180B..180D ; ID_Continue # Mn [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
FE00..FE0F ; ID_Continue # Mn [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
E0100..E01EF ; ID_Continue # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
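The observation that variation selectors are nonspacing marks can be checked with a short sketch. It assumes Python's standard unicodedata module, whose data reflects whatever Unicode version ships with the interpreter (not necessarily 7.0.0); the helper name variation_selector_categories is hypothetical. It only inspects General_Category and says nothing about how any particular language implements identifiers.

```python
import unicodedata

# Code point ranges quoted from DerivedCoreProperties.txt in the report.
VARIATION_SELECTOR_RANGES = [
    (0x180B, 0x180D),    # Mongolian free variation selectors one..three
    (0xFE00, 0xFE0F),    # VARIATION SELECTOR-1..16
    (0xE0100, 0xE01EF),  # VARIATION SELECTOR-17..256
]

def variation_selector_categories():
    """Collect the General_Category values found across the quoted ranges."""
    categories = set()
    for first, last in VARIATION_SELECTOR_RANGES:
        for cp in range(first, last + 1):
            categories.add(unicodedata.category(chr(cp)))
    return categories

if __name__ == "__main__":
    # If every variation selector is a nonspacing mark, the set is {"Mn"},
    # which is why a rule of "all Mn are ID_Continue" sweeps them in.
    print(variation_selector_categories())
```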
Date/Time: Tue Jun 17 14:46:22 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Typo in UTS #10 Unicode Collation Algorithm
Hello, There is a typo in section "3.8.1 Default Values" of UTS #10 Unicode Collation Algorithm (both 6.3.0 and 7.0.0): in the last sentence of the first paragraph it is written as follows: "The unmarked characters will a3) equal to MIN3." It seems that this should be corrected to the following: "The unmarked characters will have a3 equal to MIN3." Thanks, Dmitry S.
Date/Time: Wed Jun 18 15:40:40 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Possible error in UTS #10 Unicode Collation Algorithm
Hello, in UTS #10 Unicode Collation Algorithm version 7.0.0 clause S2.1.2 (http://www.unicode.org/reports/tr10/#S2.1.2) there seems to be an error in a note below the clause: "Note: A non-starter in a string is called blocked if there is another non-starter of the same canonical combining class or zero between it and the last character of canonical combining class 0." The "... non-starter of the same canonical combining class OR ZERO..." part seems erroneous to me because of the following:
1) UAX #15 http://www.unicode.org/reports/tr15/#Description_Norm defines non-starter as follows: "Most characters (including all non-combining marks) have a Canonical_Combining_Class value of zero, and are unaffected by the Canonical Ordering Algorithm. Such characters are referred to by a special term, starter. Only the subset of combining marks which have non-zero Canonical_Combining_Class property values are subject to potential reordering by the Canonical Ordering Algorithm. Those characters are called non-starters."
2) D107 Starter definition in the Unicode Standard: "D107 Starter: Any code point (assigned or not) with combining class of zero (ccc=0)."
These excerpts imply that a non-starter cannot have a Canonical_Combining_Class value of zero (ccc=0), contrary to what the note states. Thanks, Dmitry S.
Analysis of the above report by Ken Whistler, 2014/06/18:
O.k., yes, this *is* a problem in wording, and it is non-trivial to fix. The note in question goes at least back to Version 4.0 of UTS #10, although its position in the text migrated a bit later on. In the UTS #10 4.0 version, it is: Note: A combining mark in a string is called blocked if there is another combining mark of the same canonical combining class or zero between it and the last character of canonical combining class 0. right below Step 2 in Section 4.2. It logically refers to Step 2.1.2, which is where the note was later moved. Then a comedy of errors ensues. In later versions of the text, the note was updated by replacing "combining mark" with "non-starter", without adjusting the text "or zero" correctly. But wait! It gets worse. This text, which was derived from the 4.0 version of UAX #15, where it defined starter for normalization, was not then adjusted for Corrigendum #5 (from February, 2005!), which inserted the wording "or higher" in the definition of blocked in UAX #15. And disconnected as it was, it then certainly did not follow the later move of all the definitions related to normalization *out* of UAX #15 and into Chapter 3 of the core spec (as of Version 5.2.0). And when they went into Chapter 3, the wording for "starter" was essentially unchanged, but the wording for "blocked" got a complete overhaul. So my conclusion is that all of the wording about starter and blocked in UTS #10 needs a serious update, to make correct references to the *current* definitions in Chapter 3, rather than using ad hoc, out-of-date definitions from 2005 derived from a long-superseded version of UAX #15. Doing *that* will require some significant work on this section of the text. --Ken
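To make the difference between the old note and the current wording concrete, here is a minimal sketch of a "blocked" test. It assumes the Chapter 3 definition reads roughly as: C is blocked from A if ccc(A) = 0 and some character B between A and C has ccc(B) = 0 or ccc(B) >= ccc(C). The function name is_blocked and its index-based signature are hypothetical, and the combining classes come from Python's unicodedata module rather than from any specific UCD version.

```python
import unicodedata

def is_blocked(text, a_index, c_index):
    """Return True if text[c_index] is blocked from text[a_index].

    Assumed definition: ccc(A) == 0 and there is a character B strictly
    between A and C with ccc(B) == 0 or ccc(B) >= ccc(C).
    """
    if not 0 <= a_index < c_index < len(text):
        raise ValueError("need 0 <= a_index < c_index < len(text)")
    ccc = unicodedata.combining  # Canonical_Combining_Class as an int
    if ccc(text[a_index]) != 0:
        return False
    ccc_c = ccc(text[c_index])
    between = text[a_index + 1:c_index]
    return any(ccc(b) == 0 or ccc(b) >= ccc_c for b in between)

if __name__ == "__main__":
    # U+0301 and U+0300 both have ccc=230; U+0323 has ccc=220.
    print(is_blocked("a\u0301\u0300", 0, 2))  # same class in between
    print(is_blocked("a\u0301\u0323", 0, 2))  # higher class in between
    print(is_blocked("a\u0323\u0301", 0, 2))  # lower class in between
```

The second call is the case added by the "or higher" wording from Corrigendum #5: a mark of class 230 sits between the starter and a mark of class 220, which the older "same canonical combining class or zero" note does not cover.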
Date/Time: Thu Jun 19 11:18:19 CDT 2014
Name: Addison Phillips
Report Type: Error Report
Opt Subject: Bad example in Figure 2, UAX#15
Figure 2 in UAX#15 (Normalization Forms) contains examples of different types of "compatibility equivalence". The second line in this table is for "breaking differences" and shows the hyphen-minus character as the example. However, the only example I can find in TUS or the UCD of a "breaking difference" that is a case of compatibility decomposition (in fact, it is cited in Chapter 2 of TUS) is between U+00A0 (non-breaking space) and regular space. While it's really difficult to illustrate different kinds of space characters in a table, perhaps using a placeholder ("NBSP", "(non-breaking space)", etc.) might work? Or maybe add some attendant prose to explain the table? Note: The term "breaking difference" appears nowhere else that I can find in UAX15 or in the relevant sections of TUS related to compatibility decomposition.
Date/Time: Sat Jun 21 19:05:39 CDT 2014
Name: Samuel Bronson
Report Type: Error Report
Opt Subject: UAX #11: refers to biwidth fonts as "legacy"
In UAX#11, you say:
>> An important class of fixed-width legacy fonts contains glyphs of just two widths, with the wider glyphs twice as wide as the narrower glyphs.
I don't think it's correct to think of all such fonts as "legacy": such fonts tend to be popular with programmers, and I get the impression that, say, Japanese people usually like text to be typeset on a grid, too. (Granted, the ones that make characters fullwidth *just* because they are encoded using two bytes in some encoding or other are a bit silly.) If we could only get sensible wcwidth() values even for Latin/punctuation/math characters and make the fonts to match, we'd *really* have something ... say, making EM DASH perceptibly wider than HYPHEN-MINUS?
Date/Time: Mon Jul 14 15:29:43 CDT 2014
Name: Markus Scherer
Report Type: Error Report
Opt Subject: UAX #38 kDefaultSortKey should distinguish traditional vs. simplified radicals
UAX #38 says: 2.1 Database design kDefaultSortKey "Bits 23-30 are the character’s KangXi radical number used [...] The difference between simplified and traditional radical is ignored." This appears to be incorrect: The Han code chart (http://www.unicode.org/charts/PDF/U4E00.pdf) shows that the forms of the radicals are distinguished. For example, the characters with radical 120 (silk) are grouped together, and followed by the group of those with radical 120' (silk/C-simplified). See the chart at U+7CF8 and U+7E9F. I expect that most if not all of the main Unihan block (4E00..9FFF) should follow the kDefaultSortKey order. If this expectation is not intended to be true, it should be documented for kDefaultSortKey. (I assume that possible exceptions would be due to corrections of the Unihan data since the original allocation.) I suggest either restating the default sort key as something other than int bit fields (with the added distinction), or else using unsigned int (32-bit) or long (64-bit) bit fields, adding one bit for traditional (0) vs. simplified (1). Given the existing action items for kDefaultSortKey ([139-A19a], [139-A21], see http://www.unicode.org/review/pri266/feedback.html) I suggest simplifying it as follows: Use a 64-bit integer with a less dense and therefore less error-prone encoding:
Bits 20..0: code point (avoids complications re [139-A19a])
Bit 23: set to 0 if the code point is U+4E00..U+FFFF, else set to 1 ([139-A21], UCA implicit weights BASE FB40 vs. FB80)
Bits 29..24: residual stroke count (0..63)
Bit 30: set to 0 if traditional radical form (e.g., 120), set to 1 if simplified (120')
Bits 39..32: radical number (1..214)
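The proposed 64-bit layout can be sketched as a small packing function. This is only an illustration of the bit positions listed in the report, not an implementation from UAX #38; the function name proposed_sort_key and its parameters are hypothetical, and the radical and stroke values in the example are assumptions made for illustration rather than values read from the Unihan database.

```python
def proposed_sort_key(code_point, radical, residual_strokes, simplified):
    """Pack the fields of the proposed 64-bit kDefaultSortKey layout.

    Bits 39..32 radical number, bit 30 simplified flag,
    bits 29..24 residual strokes, bit 23 block flag, bits 20..0 code point.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if not 1 <= radical <= 214:
        raise ValueError("radical number must be 1..214")
    if not 0 <= residual_strokes <= 63:
        raise ValueError("residual stroke count must be 0..63")
    # Bit 23 is 0 for U+4E00..U+FFFF and 1 for everything else.
    block_flag = 0 if 0x4E00 <= code_point <= 0xFFFF else 1
    return (
        (radical << 32)
        | (int(bool(simplified)) << 30)
        | (residual_strokes << 24)
        | (block_flag << 23)
        | code_point
    )

if __name__ == "__main__":
    # Assumed inputs: U+7CF8 as traditional radical 120 and U+7E9F as
    # simplified radical 120', each with a residual stroke count of 0.
    traditional = proposed_sort_key(0x7CF8, 120, 0, simplified=False)
    simplified = proposed_sort_key(0x7E9F, 120, 0, simplified=True)
    print(f"{traditional:#x} {simplified:#x}", traditional < simplified)
```

Because the simplified flag sits above the stroke-count field, every traditional-form character of a radical sorts before every simplified-form character of that radical, which matches the grouping described for the code chart.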
Date/Time: Thu Jul 31 22:00:08 CDT 2014
Name: Markus Scherer
Report Type: Public Review Issue
Opt Subject: WD UTR #51 Unicode Emoji
The <title> says "UTS #51". It's not a UTS. Please change to "Working Draft UTR #51". Section 1 Introduction is good, but I feel strongly that the section on Longer Term Solutions should follow right after, rather than late in the document. The document points to at least one doc in unicode.org/~scherer/ -- we should copy that into a permanent location, for example reports/tr51/. I suggest deleting 1.2 Goals. It duplicates some of the ToC; it says that the material is subject to change (as usual); and the last sentence "This document does not discuss..." should be merged into the Summary at the top which partially contradicts it. 5 Sorting -- I am personally a bit skeptical about the need for sophisticated sorting *among* symbols, including Emoji. 6 Searching -- this is useful information, but very different from "search" as in UTS #10, for example, and it covers a variety of methods. This makes the heading misleading. Please rename to "Input Methods" or "Selection Methods" or similar. Data charts: It would be useful to repeat the column headings once in a while, at least in long, multi-column tables as in full-emoji-list.
Date/Time: Fri Jun 20 13:12:37 CDT 2014
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Glyph for U+1F44E THUMBS DOWN SIGN potentially wrong
The glyph for U+1F44E THUMBS DOWN SIGN may better show the back of the hand, as it's actually very hard to make such a gesture as shown. Looking at the source glyphs at L2/09-027R2 (http://www.unicode.org/L2/L2009/09027r2-emoji-backgrnd.pdf), it appears that the SoftBank glyph shows the back of the hand for this character, while KDDI shows the front. (From https://code.google.com/p/android/issues/detail?id=71948)
Date/Time: Tue Jun 24 09:22:05 CDT 2014
Name: Daniel Klein
Report Type: Other Question, Problem, or Feedback
Opt Subject: Normalisation of Indic scripts
Hi! I was normalising some text into Form D with mixed Latin and Sinhala characters and I was surprised that the Sinhala mark for "o" was decomposed into "e" and "aa" (which is how it's typed on a Sinhala typewriter). I realise that the character looks exactly like the other two combined but they don't render the same as two characters (the combining ring is present) and have a very different phonological meaning. e.g. කොළ (ක + ො + ළ) "kola" (green) & කොළ (ක + ෙ + ා + ළ) an impossible spelling (and probably pronunciation) of "keaala" (no such word in Sinhala). I checked on http://www.unicode.org/charts/normalization/chart_Sinhala.html and noticed three other characters, too. It seems to me the same as decomposing "d" into "cl" because if you combine them they look the same. Also, "℅" does not become "c/o" in Form D, only in Form KC, as well as other related symbols. I'm not sure that these Sinhala characters should ever be decomposed, even in Form KD as it changes the spelling, meaning, appearance and pronunciation of the words they appear in. I had a quick look at Tamil and noticed the same thing. I would imagine that this is the case for most Indic scripts in Unicode (almost all write "o" as a combination of a preceding "e" and a following "aa"). Even more problematic is ෝ "oo" as ා + ් never combine except with ෙ. කෝ (ක + ෝ) vs කෝ (ක + ෙ + ා + ්). If, however, you think I am wrong (there must have been a reason for doing it this way) I would love to know the rationale. The only thing I can think of is to maintain compatibility with proprietary encodings that don't have a separate character for "o" but render all characters as they appear visually but this seems like a bad idea to me as the text should be converted to Unicode correctly in the first place. 
Regards, Daniel
Addendum, July 20: Hi Rick, I happened to find the following in NamesList.txt:
@ Two-part dependent vowel signs
@+ These vowel signs have glyph pieces which stand on both sides of the consonant; they follow the consonant in logical order, and should be handled as a unit for most processing.
0DDC SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA
  = sinhala vowel sign o
  : 0DD9 0DCF
0DDD SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA
  = sinhala vowel sign oo
  : 0DDC 0DCA
0DDE SINHALA VOWEL SIGN KOMBUVA HAA GAYANUKITTA
  = sinhala vowel sign au
  : 0DD9 0DDF
The important bit is "should be handled as a unit for most processing". I believe that the current behaviour of normalising these characters into their lookalikes goes against this statement. Cheers, Daniel
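The behaviour described in this report can be reproduced with a short sketch, assuming Python's standard unicodedata module (whose data version depends on the interpreter). The helper names code_points and normalized_code_points are hypothetical. It shows the canonical decompositions listed in the NamesList.txt excerpt, and that U+2105 changes only under the compatibility forms.

```python
import unicodedata

def code_points(text):
    """Format a string as space-separated hexadecimal code points."""
    return " ".join(f"{ord(ch):04X}" for ch in text)

def normalized_code_points(form, text):
    """Normalize text to the given form (NFC, NFD, NFKC, NFKD) and list code points."""
    return code_points(unicodedata.normalize(form, text))

if __name__ == "__main__":
    # Sinhala vowel sign o and oo: canonical decompositions, so NFD splits them.
    print(normalized_code_points("NFD", "\u0DDC"))
    print(normalized_code_points("NFD", "\u0DDD"))
    # NFC applied to the decomposed sequence recomposes the single vowel sign.
    print(normalized_code_points("NFC", "\u0DD9\u0DCF\u0DCA"))
    # U+2105 CARE OF has only a compatibility decomposition.
    print(normalized_code_points("NFD", "\u2105"))
    print(normalized_code_points("NFKD", "\u2105"))
```

Since the decompositions are canonical, the decomposed and precomposed spellings are treated as equivalent by normalization, and converting back to Form C restores the single code point.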
(Note: This came through the Unicode mail list:)
Date/Time: Thu Jul 10 12:09:22 CDT 2014
Name: Christian Lerch
Report Type: Error Report
Opt Subject: Coding error for age property in UCD
At least in versions 6.3.0 and 7.0.0 (haven't checked others) there is an inconsistent coding of the age property value "Unassigned" between the UCD file PropertyValueAliases.txt and the ucdxml XML files. In the former, the abbreviated name (2nd field) for the value "Unassigned" is given as "NA". In the latter, however, instead of age="NA" you find age="unassigned", which has no entry in PropertyValueAliases.txt.