The sections below contain comments received on the open Public Review Issues and other feedback as of January 26, 2010, since the previous cumulative document was issued prior to UTC #121 (November 2009).
150 Draft UTS #46: Unicode IDNA Compatible Preprocessing
151 Proposed Update UAX #44: Unicode Character Database
152 Proposed Update UAX #15: Unicode Normalization Forms
153 Proposal to Deprecate Five Character Properties Defined in UAX #44
154 Proposed Update UTR #36: Unicode Security Considerations
155 Proposed Update UTS #39: Unicode Security Mechanisms
Other Reports
Feedback on Encoding Proposals
Feedback on TUS 5.2 and Charts
Closed Public Review Issues
Date/Time: Thu Nov 19 20:07:34 CST 2009
Contact: andrewc@vicnet.net.au
Name: Andrew Cunningham
Report Type: Public Review Issue
Subject: New draft Unicode specifications for IDNA and Security
Lately I have been considering the implications for IDNA and security created by current IT practices for the Myanmar script. The history of the Myanmar script on computers has been checkered, and the prevalence of older computers and operating systems that cannot handle the complex rendering required for the Myanmar script has led to a generation of web developers using revamped glyph-based encodings that have been implemented to leverage the Unicode support of applications and operating systems.
Technically these encodings aren't Unicode, but they declare themselves as Unicode, and the average user is unlikely to understand the differences between these encodings and Unicode. This is the case with Burmese, and is increasingly the case with other minority Myanmar script languages. The most widely used font on Myanmar websites is Zawgyi, one of these pseudo-Unicode fonts.
And fonts like Zawgyi could create confusion, and pose security risks.
Date/Time: Wed Dec 2 11:12:21 CST 2009
Contact: mfadl@eg.ibm.com
Name: Marwa Aboulfadl
Report Type: Public Review Issue
Subject: comments/questions regarding UTS#46
150 Draft UTS #46: Unicode IDNA Compatibility Processing
A) Questions and concerns
I noticed that section 6.1 IDNA 2008 characters was removed in draft 5, but I preferred to send this comment (which applies to draft 4) as I am not sure whether the section was removed temporarily (for off-line modifications) or permanently. In this section, it is stated that the Arabic tatweel character will be removed. I have several questions regarding this:
a- What is the rationale behind this? I thought maybe it is because Tatweel can be confused with the dash in the Punycode prefix "xn--" and the "-" separator, but I noticed that the Hyphen-minus character is added even though it may cause the same confusion.
b- What I understood is that Tatweel will then be of type "disallowed"; correct me if I am wrong. Hence, users will not be able to register domain names containing this character, right? If not, then what is the impact? And what is the effect of this when converting a URL to Punycode format and on the Unicode --> Punycode --> Unicode roundtrip?
Would you please support your answers with examples?
B) Editorial, typos and grammatical mistakes
1- In "Introduction" section, remove the word "is" from "This is was a significant burden on people using other characters."
2- In 1.3.1 "Deviations" section, in Table 1, I expect the URL under "IDNA2008 result" for ZWNJ to be "http://xn--mgba3gch31f060k.com", the one in draft 5 version has a missing "x". Am I missing something?
C) Suggestions
1- Add more examples, especially for complex scripts (e.g. Arabic, Indic, ...). The impact on the languages using these scripts needs more elaboration.
2- Consider adding illustrative chart(s) to demonstrate all the stages through which the URL passes, starting from pure Unicode format until it reaches pure ASCII format (and backward to show the roundtrip). It would be very beneficial to indicate the mechanisms and processing involved with "lookup" and "display" in these stages as well.
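As a rough illustration of the Unicode --> Punycode --> Unicode roundtrip asked about in A(b), the sketch below uses the `punycode` codec from the Python standard library. The label is a made-up example containing U+0640 ARABIC TATWEEL, and the helper names `to_ace` and `from_ace` are hypothetical. It exercises only the Punycode encoding step and applies no IDNA2003, IDNA2008, or UTS #46 mapping or validity rules, so it says nothing about whether the label could be registered.

```python
# Hypothetical label: ARABIC LETTER SEEN, ARABIC TATWEEL, ARABIC LETTER LAM.
LABEL = "\u0633\u0640\u0644"

ACE_PREFIX = "xn--"


def to_ace(label):
    """Encode one label with Punycode and add the ACE prefix."""
    return ACE_PREFIX + label.encode("punycode").decode("ascii")


def from_ace(ace):
    """Strip the ACE prefix and decode the Punycode part back to Unicode."""
    if not ace.startswith(ACE_PREFIX):
        raise ValueError("not an ACE label")
    return ace[len(ACE_PREFIX):].encode("ascii").decode("punycode")


ace = to_ace(LABEL)
restored = from_ace(ace)

# Show the ASCII form and the code points recovered from it.
print(ace)
print([f"U+{ord(c):04X}" for c in restored])
```

At this level the tatweel survives the roundtrip like any other code point; any restriction on it would have to come from the IDNA validity checks applied before encoding.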
Date: Tue, 1 Dec 2009 16:47:59 -0800 (PST)
From: Asmus Freytag
Subject: Re: UAX #44: proposed update draft updated
Ken,
When you don't know where to look for a property, it can be hard to find (in this case, I needed to search, but I had the name).
Therefore, two suggestions:
1.) The UAX could really use another table:
Properties in alphabetical order with links to the main description in
section 5.1
==> in order to not blow up section 5.1, you could make that some form
of appendix (or simply a section at the end) and put a link to it below
each of the big tables.
2.) Clicking on "Property Table" in the TOC gets me to the small table
of property types.
==> suggest that you split section 5.1 into section 5.1 Property Types
and section 5.2 Property Table (and renumber later sections).
A./
Date/Time: Wed Jan 6 14:22:13 CST 2010
Contact: cfaerber@cfaerber.name
Name: Claus Färber
Report Type: Public Review Issue
Subject: UAX#15, Section 19
Hi, I've noticed two problems with section 19 of UAX#15:
I.
In section 19, the last paragraph (starting “It is straightforward…”) contradicts the following sections 19.2 through 19.5. It predates PRI#29 and should probably just be deleted.
II.
The algorithm in section 19.3 fails to reproduce the exact result of a pre-4.1 implementation:
- For (e.g.) the sequence U+1100 (ᄀ) HANGUL CHOSEONG KIYEOK + U+0300 (◌̀) COMBINING GRAVE ACCENT + U+1161 (ᅡ) HANGUL JUNGSEONG A + U+0323 (◌̣) COMBINING DOT BELOW, the algorithm of section 19.3 will reorder this in step 2 to: U+1100 (ᄀ) HANGUL CHOSEONG KIYEOK + U+1161 (ᅡ) HANGUL JUNGSEONG A + U+0323 (◌̣) COMBINING DOT BELOW + U+0300 (◌̀) COMBINING GRAVE ACCENT, which is then normalised in step 3 to U+AC00 (가) HANGUL SYLLABLE GA + U+0323 (◌̣) COMBINING DOT BELOW + U+0300 (◌̀) COMBINING GRAVE ACCENT
- However, an old pre-4.1 implementation will simply produce U+AC00 (가) HANGUL SYLLABLE GA + U+0300 (◌̀) COMBINING GRAVE ACCENT + U+0323 (◌̣) COMBINING DOT BELOW
(The examples are taken from http://www.unicode.org/review/pr-29.html, of course.)
A correct algorithm (unless I'm mistaken) would be:
1. Premap … (no change)
2. Apply the newer version of normalisation.
3. If the earlier version is before Unicode 4.1 and the later version is 4.1 or later, and the normalisation is NFC or NFKC, reorder the sequences listed in Table 10 of Section 19.5, Corrigendum 5 Sequences, as follows:
From: first_character | intervening_character(s) | last_character
To: first_character | last_character | intervening_character(s)
Then replace first_character and last_character with the equivalent composed character.
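The proposed steps 2 and 3 can be sketched for the Hangul example above. This is a simplified illustration, not the full algorithm: it handles only a leading consonant followed by combining marks and a vowel, rather than every sequence in Table 10, and the function name `emulate_pre41_nfc` is hypothetical. The constants are the usual Hangul composition values (LBase, VBase, SBase and the counts). It assumes Python's `unicodedata` implements the current, post-Corrigendum 5 normalization.

```python
import unicodedata

L_BASE, V_BASE, S_BASE = 0x1100, 0x1161, 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def emulate_pre41_nfc(text):
    """Normalize with the newer NFC, then move the vowel and compose."""
    s = unicodedata.normalize("NFC", text)  # step 2: newer normalization
    out = []
    i = 0
    while i < len(s):
        l_index = ord(s[i]) - L_BASE
        if 0 <= l_index < L_COUNT:
            # Skip the intervening combining marks (non-zero combining class).
            j = i + 1
            while j < len(s) and unicodedata.combining(s[j]) != 0:
                j += 1
            if i + 1 < j < len(s):
                v_index = ord(s[j]) - V_BASE
                if 0 <= v_index < V_COUNT:
                    # Step 3: first | last | intervening, then compose L + V.
                    lv = S_BASE + (l_index * V_COUNT + v_index) * T_COUNT
                    out.append(chr(lv))
                    out.extend(s[i + 1:j])
                    i = j + 1
                    continue
        out.append(s[i])
        i += 1
    return "".join(out)


source = "\u1100\u0300\u1161\u0323"
result = emulate_pre41_nfc(source)
print([f"U+{ord(c):04X}" for c in result])
```

For the sequence from PRI #29 this gives U+AC00 + U+0300 + U+0323, the result attributed above to an old pre-4.1 implementation, whereas the newer NFC leaves the input unchanged.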
No feedback was received via the reporting form this period.
No feedback was received via the reporting form this period.
No feedback was received via the reporting form this period.
Date/Time: Wed Dec 23 19:41:03 CST 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Error Report
Subject: Tr18 refers to superseded tr21
Several places in TR18, e.g., 2.4, refer to this superseded TR.
Date/Time: Tue Jan 19 20:54:46 CST 2010
Contact: eubene@gmail.com
Name: Eugene Motoyama
Report Type: Error Report
Subject: UAX #38
In UAX #38: Unicode Han Database (Unihan), there is an error in Section 3.4 Readings, paragraph 5. The example given for a /kun/ pronunciation is wrong: "ichi" is the /on/ (Sino-Japanese) pronunciation of 一 (U+4E00). The /kun/ pronunciation is "hitotsu", among others. I also question the usefulness of the discussion of monosyllabic versus polysyllabic readings: the limited capacity of Japanese pronunciation often causes /on/ (Sino-Japanese) readings to be polysyllabic, such as in the example above.
Date/Time: Tue Dec 22 07:49:53 CST 2009
Contact: oh.neck@gmail.com
Name: Kenedi Hielo
Report Type: Feedback on an Encoding Proposal
Subject: Unicode Identifier and Pattern Syntax
May I request moving the code for Tagalog [:script=Tglg] from Table 4. Candidate Characters for Exclusion from Identifiers to Table 5. Recommended Scripts - Limited Use? I believe that it has recently overcome its "extinct" status and has become a thriving script, since there are now newer documents (online) written in the script. Although the script may not have support at the national level (i.e. the Philippine government), it is still in use by scattered communities and individuals with the intention of reviving it. Knowledge of the script is also widespread as a form of art (tattooing), and unlike the other scripts (Buhid [:script=Buhd], Hanunoo [:script=Hano], & Tagbanwa [:script=Tagb]), which are living yet limited to regional use, Tagalog has a strong online presence, with at least five (5) available fonts and one (1) blog, with others using it as a motif. Its location in Table 4 may not truly represent its current status of being in limited yet thriving use.
Ed: Ken Whistler already responded on this issue.
Items in gray text below have been taken care of in the editorial committee.
Date/Time: Wed Nov 4 01:19:21 CST 2009
Contact: jedi787plus@aim.com
Name: Leroy Vargas
Report Type: Error Report
Subject: Missing KP Source Glyphs in Multi-Column Han Code Charts (CJK-A, CJK, CJK-B)
The multi-column Han ideographic code charts are missing the glyphs corresponding to the KP0-xxxx and KP1-xxxx sources for the following CJK blocks:
CJK Unified Ideographs Extension A (3400-4DB5) CJK Unified Ideographs (4E00-9FCB) CJK Unified Ideographs Extension B (20000-2A6D6)
Date/Time: Mon Nov 9 09:08:11 CST 2009
Contact: jamadagni@gmail.com
Name: Shriramana Sharma
Report Type: Error Report
Subject: Khmer Sign Atthacan
I've not read Unicode 5.2 chapter 11, but the following lines occur in Unicode 5.0 chapter 11:
U+17DD khmer sign atthacan is a rarely used sign that denotes that the base consonant character keeps its inherent vowel sound. In this respect it is similar to U+17D1 khmer sign viriam.
This is quite confusing. As I understand it, 17DD Atthacan denotes that the base consonant *keeps* its inherent vowel whereas 17D1 Viriam denotes that the base consonant *loses* its inherent vowel. The similarity between 17DD and 17D1 can only be that both of them are used to denote the inherent vowel status of the word-final consonant. Perhaps there is some ambiguity in Khmer regarding whether the inherent vowel of a word-final consonant is pronounced or not, as is seen in some North Indian languages, an ambiguity which the explicit use of either of these two signs would clear up. I think that this should be clearly stated in the script description quoted above.
(Note: I'm not a Khmer expert but a Sanskrit scholar who is currently being amazed, by going through the South East Asian Unicode code charts, at the extent to which Sanskrit grammatical words like Avagraha Samjnaa, Samyoga Samjnaa etc are used in South East Asian language descriptions. The above remark is from the viewpoint of an outsider and perhaps a Sanskrit expert, not from that of a Khmer expert.)
Date/Time: Mon Nov 9 09:08:56 CST 2009
Contact: jamadagni@gmail.com
Name: Shriramana Sharma
Report Type: Error Report
Subject: Some corrections for TUS ch 9
I've not read Unicode 5.2 chapter 9, but the following relates to Unicode 5.0 chapter 9 page 328:
In the shape of TUU, the "leg" (as it is called by natives) on the right hand side is detached from the main "TU" glyph. It should be ligated as seen immediately below for NUU. In the shape of LLLUU, the long vowel and short vowel signs overlap. The short vowel form should be removed.
The following relates to Unicode 5.0 chapter 9 page 311:
"Some Indic scripts—notably Tamil—lack a distinct digit for zero." -- This line should be removed, since while it is true that Tamil (and perhaps Malayalam also) formerly did not use a zero, they do use a zero currently, and all Indic blocks do have a zero.
Date/Time: Tue Nov 17 14:11:43 CST 2009
Contact: hperon@terra.com.br
Name: Henrique Peron
Report Type: Error Report
Subject: Unmatching decomposition info
Good morning,
while analyzing the decomposition info for the Vietnamese Latin letters, I found 6 chars which don't match their decomposition info:
1EB6/1EB7 (LATIN CAPITAL/SMALL LETTER A WITH BREVE AND DOT BELOW) According to their decomposition info (1EA0/1EA1 + 0306), their description should be "WITH DOT BELOW AND BREVE". On the other hand, comparing with all the other Vietnamese pairs, what should be corrected instead is their decomposition info: 0102/0103 + 0323.
The same goes for the following three pairs: 1EAC/1EAD (LATIN CAPITAL/SMALL LETTER A WITH CIRCUMFLEX AND DOT BELOW) 1EBE/1EBF (LATIN CAPITAL/SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW) 1ED0/1ED1 (LATIN CAPITAL/SMALL LETTER O WITH CIRCUMFLEX AND DOT BELOW)
Thanks for your attention,
have a nice day,
Henrique Peron
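The decomposition data discussed in this report can be inspected with Python's `unicodedata` module, as in the sketch below. The characters are looked up by the names quoted in the report rather than by code point, and the output depends on the Unicode version bundled with the Python build. The helper name `report` is hypothetical. The last lines check that the alternative decomposition suggested in the report (A WITH BREVE followed by COMBINING DOT BELOW) is canonically equivalent to the precomposed letter.

```python
import unicodedata

NAMES = [
    "LATIN CAPITAL LETTER A WITH BREVE AND DOT BELOW",
    "LATIN CAPITAL LETTER A WITH CIRCUMFLEX AND DOT BELOW",
    "LATIN CAPITAL LETTER E WITH CIRCUMFLEX AND DOT BELOW",
    "LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND DOT BELOW",
]


def report(name):
    """Return the code point and the UnicodeData decomposition field."""
    ch = unicodedata.lookup(name)
    return f"U+{ord(ch):04X}", unicodedata.decomposition(ch)


for name in NAMES:
    print(name, *report(name))

# Compare the precomposed letter with the suggested alternative sequence.
precomposed = unicodedata.lookup(NAMES[0])
alternative = "\u0102\u0323"  # A WITH BREVE + COMBINING DOT BELOW
same_nfd = (unicodedata.normalize("NFD", alternative)
            == unicodedata.normalize("NFD", precomposed))
print(same_nfd)
```

Both spellings normalize to the same string, because canonical reordering places the dot below (combining class 220) before the breve (combining class 230).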
Date/Time: Sun Nov 29 21:12:05 CST 2009
Contact: rcmuir@gmail.com
Name: Robert Muir
Report Type: Error Report
Subject: 5.17 UTF-8 in UTF-16 order
Hello,
the algorithm presented in Chapter 5, section 17 provides a "rotate" method to sort UTF-8 in UTF-16 binary order.
unfortunately the table provided (or at least how it reads) is not a rotate but a shift, and shifts 0xee and 0xef after 0xf4 (with the 0xf0-0xf4 shifting backwards two places).
this shift causes an array lookup of rotate[0xf4] to return 0xef, but rotate[0xf2] to return 0xf4, and causes rotate[0xee] to return 0xf0, and rotate[0xef] to return 0xf1
if the table is implemented as presented, this causes incorrect sort order. instead, why not present this as a table that swaps 0xee with 0xfe, and 0xef with 0xff, respectively. this will cause U+10000..U+10FFFF to correctly sort below U+E000..U+FFFF
I apologize if this is intended to be read differently and I misinterpreted it, but in any case I think swapping 0xee with 0xfe and 0xef with 0xff is simpler and more clear.
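The swap suggested above can be sketched as a byte translation table applied before comparing UTF-8 strings. This is a minimal illustration of the suggestion, not the table from section 5.17; the names `SWAP` and `utf16_order_key` are hypothetical, and the input is assumed to be well-formed text without lone surrogates. Only lead bytes can take the values 0xEE and 0xEF, and 0xFE and 0xFF never occur in well-formed UTF-8, so translating every byte is safe.

```python
# Map lead bytes 0xEE and 0xEF (U+E000..U+FFFF) above 0xF0..0xF4
# (U+10000..U+10FFFF) so that they sort last, as they do in UTF-16.
SWAP = bytes.maketrans(b"\xee\xef", b"\xfe\xff")


def utf16_order_key(text):
    """Sort key: UTF-8 bytes with 0xEE/0xEF replaced by 0xFE/0xFF."""
    return text.encode("utf-8").translate(SWAP)


samples = ["\uE000", "\U00010000", "\uD7FF", "a", "\uFFFF", "\U0010FFFF"]

by_swapped_utf8 = sorted(samples, key=utf16_order_key)
by_utf16 = sorted(samples, key=lambda s: s.encode("utf-16-be"))
by_plain_utf8 = sorted(samples, key=lambda s: s.encode("utf-8"))

print([f"U+{ord(c):04X}" for c in by_swapped_utf8])
print(by_swapped_utf8 == by_utf16, by_plain_utf8 == by_utf16)
```

With the swap, the supplementary characters sort below U+E000..U+FFFF, matching UTF-16 code unit order; plain UTF-8 byte order places them last.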
Date/Time: Fri Dec 4 15:55:03 CST 2009
Contact: dmatson@microsoft.com
Name: David Matson
Report Type: Error Report
Subject: Typos in Unicode 5.2.0 Standard
1. Chapter 3, page 81:
In particular, guidelines for rendering of combining marks in conjunction with other *characers* should be considered as appropriate for defining default rendering behavior, in the absence of more specific information about rendering.
The word "characers" should be "characters" instead.
This error also appears in version 5.0 (page 109).
2. Chapter 7, page 226:
By convention, combining marks may be exhibited in (apparent) isolation by applying them *to to* U+00A0 no-break space.
The phrase "to to" should be "to" instead.
This error also appears in version 5.0 (page 254).
3. Chapter 7, page 224:
They are also used "to to" mark stress or tone, or may simply represent their own sound.
The phrase "to to" should be "to" instead.
No feedback was received via the reporting form this period.