The sections below contain comments received on the open Public Review Issues and other feedback as of February 06, 2012, since the previous cumulative document was issued prior to UTC #129 (October 2011). This document does not include feedback on moderated Public Review Issues from the forum that have been digested by the forum moderators; those are in separate documents for each of the PRIs. Gray items in the Table of Contents do not have feedback here.
182 Proposed Update UTS #18: Unicode Regular Expressions
207 Proposed Draft UTR #50, Unicode Properties for Vertical Text Layout (moderated)
208 Proposed Update UTR #36: Unicode Security Considerations
209 Proposed Update UTS #39: Unicode Security Mechanisms
Feedback on Encoding Proposals
Closed Public Review Issues
Other Reports
No feedback at this time.
See the relevant forum.
Date/Time: Thu Jan 12 04:47:44 CST 2012
Contact: gerv@mozilla.org
Name: Gervase Markham
Report Type: Public Review Issue
Opt Subject: UTR#36: clarity suggestions
I have a couple of suggestions for improving the clarity of UTR #36, in particular section 2.9: http://www.unicode.org/reports/tr36/proposed.html#Security_Levels_and_Alerts

"2. Highly Restrictive
All characters in each identifier must be from a single script, or from the combinations: ASCII + Han + Hiragana + Katakana; ASCII + Han + Bopomofo; or ASCII + Han + Hangul
No characters in the identifier can be outside of the Identifier Profile
Note that this level will satisfy the vast majority of Latin-script users.
3. Moderately Restrictive
Allow Latin with other scripts except Cyrillic, Greek, Cherokee
Otherwise, the same as Highly Restrictive"

My issues are:

A) It refers to ASCII as "a script"; this is confusing, and I assumed it was a typo for "Latin". I am told it is not. Therefore, it should be explicitly mentioned that this is intentional, and made clear how "ASCII" is defined in Unicode codepoint terms. (Is there a Unicode property for it?)

B) I am told that the intent of 3 is to allow Latin with any other _single_ script except Cyrillic, Greek or Cherokee - but this is not at all clear. I suggest using the following replacement text: "Allow Latin with any other single script except Cyrillic, Greek or Cherokee."

Hope that helps, Gerv
Date/Time: Mon Jan 30 12:50:28 CST 2012
Contact: patrick.jones@icann.org
Name: Patrick Jones
Report Type: Public Review Issue
Opt Subject: UTR #36: Unicode Security Considerations
(Note: Filed in Edcom TRAC for Mark) In the proposed update to UTR #36, some of the terminology and links need to be updated. The term for ICANN in the References section should be updated. The latest version of the IDN Guidelines is version 3.0, and can be found at http://www.icann.org/en/topics/idn/implementation-guidelines.htm. ICANN's informational page on IDNs is available at http://www.icann.org/en/topics/idn/. IDNA2008 is referred to in this document as a draft; UTR #36 should delete "draft" before the IDNA2008 specification. An additional informational RFC is RFC 5895, Mapping Characters for Internationalized Domain Names in Applications (IDNA) 2008, located at http://tools.ietf.org/html/rfc5895. You should also include RFC 6452, The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) - Unicode 6.0, located at http://tools.ietf.org/html/rfc6452. UTR #36 may also want to reference the work currently underway in ICANN's IDN Variant Project. The terminology section of the Integrated Issues Report (http://www.icann.org/en/topics/new-gtlds/idn-vip-integrated-issues-23dec11-en.pdf), while based on Unicode, also contains Terminology Used in Internationalization in the IETF, RFC 6365, and several additional terms introduced in the examination of the Variant Project Issues Reports. Please let me know if you need additional information. Best regards, Patrick Patrick L. Jones Sr. Mgr, Security IDN team ICANN
Date/Time: Mon Jan 30 13:00:15 CST 2012
Contact: patrick.jones@icann.org
Name: Patrick Jones
Report Type: Public Review Issue
Opt Subject: UTS #39: Unicode Security Mechanisms
(Note: Filed in Edcom TRAC for Mark) In the proposed update to UTS #39, some of the terminology and links need to be updated. As in the comments I submitted on UTR #36, in the references section at the bottom of the document, IDNA2008 is referred to in this document as a draft; UTS #39 should delete "draft" before the IDNA2008 specification. An additional informational RFC is RFC 5895, Mapping Characters for Internationalized Domain Names in Applications (IDNA) 2008, located at http://tools.ietf.org/html/rfc5895. You should also include RFC 6452, The Unicode Code Points and Internationalized Domain Names for Applications (IDNA) - Unicode 6.0, located at http://tools.ietf.org/html/rfc6452. UTS #39 may also want to reference the work currently underway in ICANN's IDN Variant Project. The terminology section of the Integrated Issues Report (http://www.icann.org/en/topics/new-gtlds/idn-vip-integrated-issues-23dec11-en.pdf), while based on Unicode, also contains Terminology Used in Internationalization in the IETF, RFC 6365, and several additional terms introduced in the examination of the Variant Project Issues Reports. This report also contains a section on visual similarity cases and whole-string issues which may be of use with UTS #39. Please let me know if you need additional information. Best regards, Patrick Patrick L. Jones Sr. Mgr, Security IDN team ICANN
Date/Time: Thu Nov 10 02:33:11 CST 2011
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Public Review Issue
Opt Subject: UAX#29: word breaks with hiragana and voiced marks
I'd like to renew an old feedback I made about word breaks with hiragana and voiced marks in a UAX#29 PRI in... 2007. Absolutely nobody seems to have replied to this feedback, and visibly some characters that are used in both hiragana and katakana are not treated as consistently as they should be (for example with differences between normal and halfwidth variants). See http://unicode.org/mail-arch/unicode-ml/y2007-m08/0091.html

Quoting the message:

This UAX treats KATAKANA specially, to avoid breaking between two Katakana letters, but still breaks between hiragana. However, this is probably not true for everything, notably in the sequence of a Hiragana letter and a voiced/semi-voiced mark:

U+309B (゛) KATAKANA-HIRAGANA VOICED SOUND MARK
U+309C (゜) KATAKANA-HIRAGANA SEMI-VOICED SOUND MARK

and possibly other characters currently listed in the Katakana value in table 3:

U+3031 (〱) VERTICAL KANA REPEAT MARK
U+3032 (〲) VERTICAL KANA REPEAT WITH VOICED SOUND MARK
U+3033 (〳) VERTICAL KANA REPEAT MARK UPPER HALF
U+3034 (〴) VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HALF
U+3035 (〵) VERTICAL KANA REPEAT MARK LOWER HALF
U+30A0 (゠) KATAKANA-HIRAGANA DOUBLE HYPHEN
U+30FC (ー) KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF70 (ー) HALFWIDTH KATAKANA-HIRAGANA PROLONGED SOUND MARK
U+FF9E (゙) HALFWIDTH KATAKANA VOICED SOUND MARK
U+FF9F (゚) HALFWIDTH KATAKANA SEMI-VOICED SOUND MARK

Do word breaks really occur between Hiragana letters and these marks coded after them (note that Hiragana letters are excluded from "ALetter" in table 3)? If not, then

(1) the list of characters above should better be listed under a separate value (say "ExtendKana"), and removed from Katakana in table 3.

(2) a new value "Hiragana" should be created for Hiragana letters in table 3, like this:
Katakana script="KATAKANA" (rewritten first row in table 3)
Hiragana script="HIRAGANA" (new inserted row in table 3)
ExtendKana (the list of characters above) (new row in table 3)

(3) the existing rule WB13 (Katakana × Katakana) should be rewritten equivalently as:
WB13. (Katakana | ExtendKana) × (Katakana | ExtendKana)

(4) the following subrules WB13a and WB13b rewritten equivalently as:
WB13a. (ALetter | Numeric | Katakana | ExtendKana | ExtendNumLet) × ExtendNumLet
WB13b. ExtendNumLet × (ALetter | Numeric | Katakana | ExtendKana)

(5) Another subrule should be added:
WB13c. (Hiragana | ExtendKana) × ExtendKana

No other change is needed, because word break will still occur either between two Hiragana letters, or after an ExtendKana and before a Hiragana letter, in the next rule:
WB14. Any ÷ Any

Or am I missing something?
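The rules proposed in this report can be sketched as a pairwise check over Word_Break classes. This is only an illustration, under the assumption that every character has already been assigned one of the class names used in the report (`Hiragana` and `ExtendKana` are the report's proposed values, not existing ones). The function name `proposed_no_break` is invented here, and rules WB1 through WB12 are deliberately ignored.

```python
# Classes that may appear on the left of WB13a, per the proposal.
WB13A_LEFT = {"ALetter", "Numeric", "Katakana", "ExtendKana", "ExtendNumLet"}
# Classes that may appear on the right of WB13b, per the proposal.
WB13B_RIGHT = {"ALetter", "Numeric", "Katakana", "ExtendKana"}
KANA = {"Katakana", "ExtendKana"}


def proposed_no_break(left: str, right: str) -> bool:
    """Return True if the proposed WB13..WB13c forbid a break between the classes."""
    if left in KANA and right in KANA:                          # WB13
        return True
    if left in WB13A_LEFT and right == "ExtendNumLet":          # WB13a
        return True
    if left == "ExtendNumLet" and right in WB13B_RIGHT:         # WB13b
        return True
    if left in {"Hiragana", "ExtendKana"} and right == "ExtendKana":  # WB13c
        return True
    return False                                                # WB14: Any ÷ Any


if __name__ == "__main__":
    for pair in [("Hiragana", "ExtendKana"), ("Hiragana", "Hiragana"),
                 ("ExtendKana", "Hiragana"), ("Katakana", "Katakana")]:
        print(pair, "no break" if proposed_no_break(*pair) else "break")
```

Under these assumptions the sketch reproduces the behaviour the report describes: a break remains between two Hiragana letters and between an ExtendKana and a following Hiragana letter, while a Hiragana letter stays attached to a following mark.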
Date/Time: Mon Nov 7 17:36:02 CST 2011
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: Bad @missing line in DerivedNumericValues.txt
DerivedNumericValues.txt has the following @missing line: # @missing: 0000..10FFFF; ; NaN It should be corrected to # @missing: 0000..10FFFF; NaN; ; NaN The format is documented as range;nv-as-decimal;nt-was-removed;nv-as-fraction and the current @missing line is missing the nv-as-decimal field, placing the NaN into the nt-was-removed field. This bug is in every version since 5.1.0 when the @missing field was first added. It is still in 6.1 beta (DerivedNumericValues-6.1.0d9.txt). Rather than (or in addition to) changing it here, it would be best to follow the suggestion in L2/11-358 "Parsing the UCD" A.1.a and add all @missing lines for properties with non-null defaults into PropertyValueAliases.txt, with consistent syntax. (See current examples in that file.) If DerivedNumericValues.txt does not get fixed, it should be noted in the errata and documented in the file's own header.
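The effect of the missing field can be seen by splitting both forms of the line on semicolons. The following is a minimal sketch, not an official parser; the helper `parse_missing_line` and the tuple `FIELD_NAMES` are names made up for this illustration, with the field order taken from the format quoted above.

```python
# Field names after the range, in the documented order:
# range;nv-as-decimal;nt-was-removed;nv-as-fraction
FIELD_NAMES = ("nv_as_decimal", "nt_was_removed", "nv_as_fraction")


def parse_missing_line(line: str):
    """Parse '# @missing: start..end; f1; f2; ...' into (start, end, fields)."""
    prefix = "# @missing:"
    if not line.startswith(prefix):
        return None
    parts = [p.strip() for p in line[len(prefix):].split(";")]
    start, _, end = parts[0].partition("..")
    return int(start, 16), int(end or start, 16), parts[1:]


current = parse_missing_line("# @missing: 0000..10FFFF; ; NaN")
proposed = parse_missing_line("# @missing: 0000..10FFFF; NaN; ; NaN")

if __name__ == "__main__":
    # zip() stops at the shorter sequence, so the current line
    # yields no nv_as_fraction value at all.
    print(dict(zip(FIELD_NAMES, current[2])))
    print(dict(zip(FIELD_NAMES, proposed[2])))
```

With the current line, a positional parser sees an empty nv-as-decimal field and reads NaN as the nt-was-removed field; with the proposed line, NaN lands in both numeric-value fields.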
Date/Time: Tue Nov 8 00:30:19 CST 2011
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: more on UCD @missing & L2/11-358
1. I cannot find a @missing line with the default value for gc=General_Category, neither in the UCD files nor in http://www.unicode.org/L2/L2011/11358-ucd-parsing/ExtraPropertyValueAliases.txt

2. One more comment on L2/11-358 "Parsing the UCD" A.1.a: I would really like to see all of the @missing lines in PropertyValueAliases.txt. Reason: Some properties (e.g., dt & nt) can be easily parsed from other files (e.g., UnicodeData.txt) which makes it pointless to parse their dedicated DerivedXyz.txt files except to get the @missing value. For backward compatibility, lines that are already elsewhere could be duplicated here. Some properties already have two @missing lines in the UCD (e.g., ea & lb).

3. [MOOT - Editorial Committee already handled point #3.]

4. L2/11-358 A.8 says "In PropertyValueAliases, all but ccc have the same field order. Not sure how to do this, but it would be less ugly to parse if it had the same format!" -> I recommend against changing this. ICU, and likely other implementations, has always ignored the ccc-specific syntax comment and treated the numeric values as the short names, and the short words as the "long names". The numeric value is used and listed practically everywhere anyway, especially since the "fixed" values do not have any names listed.

5. L2/11-358 B & C: As a parser implementer, I much prefer fewer files, each with range;property-name;value like in DerivedCoreProperties.txt, rather than property-specific files. And UnicodeData.txt does not fit the newer files' format but it's still pretty easy to parse so I don't see much need to change that at this point. FYI: For ICU, I just wrote a Python script that pre-parses the UCD (yes, yet another UCD parser) and generates a combined .txt file with all of the data relevant for ICU (using key-value pairs) so that I can then substantially simplify the binary-generating C code. Therefore, I have very recent experience writing yet another parser.

6. I don't see how the CaseFolding.txt "T" mappings can be represented in properties; C+S go into scf, C+F go into cf, but where do T mappings go? (ICU so far has parsed CaseFolding.txt without worrying about formal properties for its values. It's getting interesting when recasting the data into a different form. I don't see these in the UCD XML either.)
0049; T; 0131; # LATIN CAPITAL LETTER I
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
Maybe add tcf=Turkic_Case_Folding?

7. Similarly, there does not seem to be any way to express conditional case mappings from SpecialCasing.txt in formal properties. FYI: ICU has always just stored an "is conditional" bit for characters with Turkic case foldings or conditional case mappings, and the runtime code has hardcoded conditions and mappings corresponding to the data files.
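The question about the "T" mappings can be made concrete with a small sketch that buckets CaseFolding.txt lines by status. It is an illustration only, not ICU's parser: the function name `parse_case_folding` is invented here, `tcf` is just the property name suggested in the report (it is not an existing UCD property), and the sample data is limited to the two lines quoted in the report.

```python
SAMPLE = """\
0049; T; 0131; # LATIN CAPITAL LETTER I
0130; T; 0069; # LATIN CAPITAL LETTER I WITH DOT ABOVE
"""


def parse_case_folding(text: str) -> dict:
    """Group 'code; status; mapping; # name' lines into {status: {code: [mapping]}}."""
    by_status = {}
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        code, status, mapping = fields[0], fields[1], fields[2]
        by_status.setdefault(status, {})[int(code, 16)] = [
            int(cp, 16) for cp in mapping.split()
        ]
    return by_status


folding = parse_case_folding(SAMPLE)
# Per the report: C+S feed scf, C+F feed cf. T has no property to go to,
# so it is kept under the hypothetical name 'tcf'.
scf = {**folding.get("C", {}), **folding.get("S", {})}
cf = {**folding.get("C", {}), **folding.get("F", {})}
tcf = folding.get("T", {})

if __name__ == "__main__":
    print("scf:", scf, "cf:", cf, "tcf:", tcf)
```

For the two quoted lines, both `scf` and `cf` stay empty and only the hypothetical `tcf` bucket receives data, which is the gap the report points out.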
Date/Time: Mon Dec 19 06:33:37 CST 2011
Contact: ikeda@conversion.co.jp
Name: IKEDA Soji
Report Type: Public Review Issue
Opt Subject: Hangul tone marks
This was answered by the Ed Committee 2011/12/19
I realized that the general category of the hangul tone marks was changed from Mn (combining nonspacing) to Mc (combining spacing). I propose that they be Mn.

UnicodeData.txt of 6.0.0:
302E;HANGUL SINGLE DOT TONE MARK;Mn;224;NSM;;;;;N;;;;;
302F;HANGUL DOUBLE DOT TONE MARK;Mn;224;NSM;;;;;N;;;;;

UnicodeData-6.1.0d9.txt:
302E;HANGUL SINGLE DOT TONE MARK;Mc;224;L;;;;;N;;;;;
302F;HANGUL DOUBLE DOT TONE MARK;Mc;224;L;;;;;N;;;;;

They are tone marks used in Old Korean texts which consist of vertical lines. These dots are placed on the left side of each character, not between the characters. Analogously, in horizontal texts, the cedilla (U+0327) protrudes below the base character, but it is not regarded as spacing. Thank you.
Date/Time: Fri Dec 9 02:11:59 CST 2011
Contact: jan.nijtmans@gmail.com
Name: Jan Nijtmans
Report Type: Error Report
Opt Subject: Four characters have their TOTITLE character set to its default value
Consider the characters 01C5, 01C8, 01CB and 01F2 in UnicodeData-6.1.0d9.txt:

01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5
01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J;Lt;0;L;<compat> 004C 006A;;;;N;LATIN LETTER CAPITAL L SMALL J;;01C7;01C9;01C8
01CB;LATIN CAPITAL LETTER N WITH SMALL LETTER J;Lt;0;L;<compat> 004E 006A;;;;N;LATIN LETTER CAPITAL N SMALL J;;01CA;01CC;01CB
01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;01F2

They all have their TOTITLE entry set to the character itself. No other characters do that: It's the default value anyway. This change, which was made in Unicode 3 (in Unicode 2.1-update4 it was correct), is what caused Tcl bug 3444754, because Tcl's tooling was not adapted to it. See: https://sourceforge.net/tracker/?func=detail&atid=110894&aid=3444754&group_id=10894

However, because it is not consistent with other characters, which never list toupper/tolower/totitle entries pointing to themselves, I would like to report it here anyway, proposing to replace those 4 entries with:

01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;
01C8;LATIN CAPITAL LETTER L WITH SMALL LETTER J;Lt;0;L;<compat> 004C 006A;;;;N;LATIN LETTER CAPITAL L SMALL J;;01C7;01C9;
01CB;LATIN CAPITAL LETTER N WITH SMALL LETTER J;Lt;0;L;<compat> 004E 006A;;;;N;LATIN LETTER CAPITAL N SMALL J;;01CA;01CC;
01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;<compat> 0044 007A;;;;N;;;01F1;01F3;

Regards, Jan Nijtmans
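Entries of the kind described in this report can be located mechanically. The sketch below is a hypothetical helper (the name `self_mapped_fields` is invented here) that reports which of the three simple case-mapping fields of a UnicodeData.txt line name the character itself. It assumes the usual 15-field, semicolon-separated layout, with fields 12, 13 and 14 holding the uppercase, lowercase and titlecase mappings. It only detects self-references and takes no position on whether they are redundant, since that depends on how an empty titlecase field is defaulted.

```python
# Zero-based field positions of the simple case mappings in UnicodeData.txt.
CASE_FIELDS = (("upper", 12), ("lower", 13), ("title", 14))


def self_mapped_fields(line: str) -> list:
    """Return the names of case-mapping fields whose value equals the code point."""
    fields = line.rstrip("\n").split(";")
    code = fields[0]
    return [name for name, idx in CASE_FIELDS if fields[idx] == code]


# The 6.1.0d9 entry and the replacement proposed in the report, for U+01F2.
current = ("01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;"
           "<compat> 0044 007A;;;;N;;;01F1;01F3;01F2")
proposed = ("01F2;LATIN CAPITAL LETTER D WITH SMALL LETTER Z;Lt;0;L;"
            "<compat> 0044 007A;;;;N;;;01F1;01F3;")

if __name__ == "__main__":
    print(self_mapped_fields(current))
    print(self_mapped_fields(proposed))
```

Running the same check over a whole UnicodeData.txt file would be one way to confirm the claim that no other characters carry such self-references.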
Date/Time: Thu Dec 29 17:20:48 CST 2011
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Opt Subject: Potential issue with 6.1 NameAlias.txt
The proposed NameAlias.txt file omits 4 aliases that RL2.5 of UTS#18 says should be created. These all have parentheses in their names, so there is no danger of them accidentally being introduced as conflicting names. I don't know if the file should include these aliases that have long been called for in UTS#18. But I was surprised that they weren't there. The aliases are:

000A LINE FEED (LF)
000C FORM FEED (FF)
000D CARRIAGE RETURN (CR)
0085 NEXT LINE (NEL)

-------

The follow-up should be that the UTC clarifies that the UTS#18 specification was broken when it asked for support of the Unicode 1.0 name field. What was meant, and what has now been created, is that instead of literal support for a LONG (short) format, both the LONG and the short alias were intended to be individually supported. In addition, it could be pointed out that, given that parentheses don't enter into aliases, implementations are free to support this mixed format for compatibility with past bugs, without running the risk of introducing incompatibilities with future aliases. This could take the form of a note for R.L. 2.5 in a future revision of UTS#18. A./
Date/Time: Fri Jan 6 06:36:05 CST 2012
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: aliases in 6.1.release-cand.
I think that 1) The aliases should be listed thus: First the (one!?!) name used in the other datafiles, followed by other names. 2) Each alias name should be listed only once, no two (or more) identical (modulo the matching rules) names in the list.
Date/Time: Fri Jan 6 06:38:09 CST 2012
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Public Review Issue
Opt Subject: aliases in 6.1.release-cand.
For the diff files, one should have the principle of making them as small as possible, thus using "old" names rather than "new" names (aliases).
Date/Time: Thu Jan 26 12:17:35 CST 2012
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Error Report
Opt Subject: difference between UCA_Rules_SHORT.txt & FractionalUCA_SHORT.txt: prefixes vs. contractions
FractionalUCA_SHORT.txt has the following 4 weights conditional on *prefixes* (see the lines with the | symbol): ... 0141; [3D, 05, 8F][, D0 3D, 05] 006C | 00B7; [, DB A9, 05] 006C | 0387; [, DB A9, 05] 0140; [3D, 05, 05][, DB A9, 05] 004C | 00B7; [, DB A9, 05] 004C | 0387; [, DB A9, 05] 013F; [3D, 05, 8F][, DB A9, 05] ... UCA_Rules_SHORT.txt has *contractions* for these instead: <<< ㋏ / Td << l· = l· = ŀ <<< L· = L· = Ŀ The two representations should be equivalent. Therefore, these collation elements should rather be prefix-conditional in the rule form as well, as follows: ... (sequence of primary ignorables, up to the last one) << \u006C | \u00B7 = \u006C | \u0387 = \u004C | \u00B7 = \u004C | \u0387 and then expansions for the compatibility composites like this (or something equivalent) &\u006C\u00B7 = \u0140 &\u004C\u00B7 = \u013F Correspondingly, it would also be better to list these weights in FractionalUCA_SHORT.txt at the end of the primary ignorables rather than among the U+006C variations.
This feedback from John Cowan is carried forward from last time:
Date/Time: Thu Oct 27 02:23:23 CDT 2011
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/11-373 Proposal to encode Linguistic Doubt Marks in the UCS
The proposal says: "In theory, [the proposed COMBINING QUESTION MARK ABOVE and BELOW] could be considered as glyph variant[s] of the same underlying character. However, there is no precedent of a combining character which has no fixed placement relative to the base letter, and especially there is no combining class indicating such a placement variation." Cedilla is a combining mark below, but U+0123 LATIN SMALL LETTER G WITH CEDILLA is rendered with an inverted cedilla above, despite its decomposition into "g" and COMBINING CEDILLA (not *COMBINING INVERTED CEDILLA ABOVE, which does not exist). Similarly, the IPA does not distinguish between diacritics above and below, and leaves it up to font designers when to use exceptionally placed diacritics.
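The decomposition cited in this comment can be checked against the character database that ships with Python. This is a small sketch using the standard `unicodedata` module; it confirms only the data (the decomposition and the combining class), not how any font draws the glyph.

```python
import unicodedata

g_cedilla = "\u0123"  # LATIN SMALL LETTER G WITH CEDILLA

# Canonical decomposition as space-separated hex code points.
decomposition = unicodedata.decomposition(g_cedilla)

# Name and canonical combining class of the mark in that decomposition.
mark = chr(int(decomposition.split()[1], 16))
mark_name = unicodedata.name(mark)
mark_ccc = unicodedata.combining(mark)

if __name__ == "__main__":
    print(unicodedata.name(g_cedilla), "->", decomposition)
    print(mark_name, "ccc =", mark_ccc)
```

The data names COMBINING CEDILLA, an attached-below mark, even though typical fonts render this letter with the mark above, which is the precedent the comment refers to.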
Date/Time: Thu Jan 19 04:23:26 CST 2012
Contact: satai@akauri.com
Name: Alex Ostrovsky
Report Type: Feedback on an Encoding Proposal
Opt Subject: Encoding Georgian and Nuskhuri letters for Ossetian and Abkhaz
The document N3775 (L2/10-072, 2010-02-17) "Proposal for encoding Georgian and Nuskhuri letters for Ossetian and Abkhaz" proposes to add (among others) YN and AEN letters to the Mkhedruli chart of the Georgian block as well as to both Khutsuri charts (in the Georgian and the Georgian Supplement blocks). Since both Khutsuri YN and AEN letters are attested for Ossetian [Bible publications] only, the Khutsuri sections are named "Additional letters for Ossetian", while the corresponding Mkhedruli sections are called "Additional letters for Mingrelian and Svan" (YN) and "Additional letters for Ossetian and Abkhaz" (AEN). However, Georgian YN is used for Mingrelian, Svan and Abkhaz as well, and nowadays Khutsuri is used by the Georgian Orthodox Church. Thus, Khutsuri YN could be used in Mingrelian or Svan texts in the future, and this is much more probable than use of the Khutsuri YN letter for Ossetian. Because of the above, I would like to propose more neutral solutions: 1) either rename the "Additional letters for Ossetian" subhead to "Additional letters" for both the upper- and lower-case Khutsuri charts; 2) or split the "Additional letters for Ossetian" subhead into "Additional letters" with YN and "Additional letters for Ossetian" with AEN. Personally I would incline to the second solution, because it keeps things arranged better and eliminates the need for "reserved" codes in subheads. Thank you, Alex.
Date/Time: Thu Jan 26 13:41:45 CST 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: Proposal to Add Three Characters to UTR #45
Perhaps these should be encoded not as ideographs but as symbols, in the manner of circled and parenthesized ideographs?
Date/Time: Thu Jan 26 15:07:44 CST 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: Proposal to Encode Medieval East-Slavic Musical Notation in Unicode
The name TSEFAUT CLEF is dreadful. If it must be retained, at least make it CE-FA-UT CLEF. However, C-CLEF would be far better in my opinion.
Date/Time: Sun Feb 5 22:01:30 CST 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-072 Proposed UCD property: Script Identifier Status
Just to make sure this is fixed before it's frozen forever: it's "Aspirational", not "Asperational".
Date/Time: Sun Feb 5 22:19:17 CST 2012
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/12-067 Characters with Multiple Accents (e.g. Lithuanian), Recent Keyboard Standards, and Microsoft’s MSKLC
While it's true that the Windows dead-key model only allows the generation of a single diacritic, it is also true that MSKLC allows the creation of keys which directly generate Unicode combining characters. In this style, it would be possible to generate e-ogonek-acute by pressing the e key, the combining (not dead) ogonek key, and the combining (also not dead) acute key. This would send three characters to the application, which could then appropriately normalize them.
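The normalization step mentioned at the end of this comment can be sketched with Python's standard `unicodedata` module. The example assumes the application receives the three characters described (e, COMBINING OGONEK, COMBINING ACUTE ACCENT) and applies NFC; the variable names are illustrative only.

```python
import unicodedata

E = "e"
OGONEK = "\u0328"  # COMBINING OGONEK
ACUTE = "\u0301"   # COMBINING ACUTE ACCENT

# The key sequence from the comment: e, combining ogonek, combining acute.
typed = E + OGONEK + ACUTE
# The same marks typed in the opposite order.
typed_reversed = E + ACUTE + OGONEK

nfc = unicodedata.normalize("NFC", typed)
nfc_reversed = unicodedata.normalize("NFC", typed_reversed)

if __name__ == "__main__":
    for label, text in (("typed", typed), ("NFC", nfc), ("NFC reversed", nfc_reversed)):
        print(label, [f"U+{ord(c):04X}" for c in text])
```

Because the two marks have different combining classes, canonical reordering makes both typing orders normalize to the same string: the precomposed e with ogonek followed by a combining acute.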