Comments on Public Review Issues

L2/10-112

Comments on Public Review Issues
(January 26, 2010 - May 10, 2010)

The sections below contain comments received on the open Public Review Issues and other feedback as of May 3, 2010, since the previous cumulative document was issued prior to UTC #122 (February 2010).

During this period, two other reports were received and have already been acted on by the Editorial Committee: see documents L2/10-156 and L2/10-157.

151 Proposed Update UAX #44: Unicode Character Database
152 Proposed Update UAX #15: Unicode Normalization Forms
156 Proposed Update UAX #9: Unicode Bidirectional Algorithm
157 Proposed Update UAX #11: East Asian Width
158 Proposed Update UAX #14: Unicode Line Breaking Algorithm
159 Proposed Update UAX #24: Unicode Script Property
160 Proposed Update UAX #29: Unicode Text Segmentation
161 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax
162 Proposed Update UAX #34: Unicode Named Character Sequences
163 Proposed Update UAX #38: Unicode Han Database (Unihan)
164 Proposed Update UAX #41: Common References for Unicode Standard Annexes
165 Proposed Update UAX #42: Unicode Character Database in XML
166 Proposed Update UTS #10: Unicode Collation Algorithm
167 Ideographic Variation Database Submission
168 Two New Provisional Properties for Characters in Indic Scripts
Feedback on Encoding Proposals
Feedback on TUS 5.2 and Charts
Closed Public Review Issues
Other Reports

151 Proposed Update UAX #44: Unicode Character Database

No feedback was received via the reporting form this period.

152 Proposed Update UAX #15: Unicode Normalization Forms

No feedback was received via the reporting form this period.

156 Proposed Update UAX #9: Unicode Bidirectional Algorithm

No feedback was received via the reporting form this period.

157 Proposed Update UAX #11: East Asian Width

No feedback was received via the reporting form this period.

158 Proposed Update UAX #14: Unicode Line Breaking Algorithm

No feedback was received via the reporting form this period.

159 Proposed Update UAX #24: Unicode Script Property

No feedback was received via the reporting form this period.

160 Proposed Update UAX #29: Unicode Text Segmentation

No feedback was received via the reporting form this period.

161 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax

Date/Time: Sat May 1 10:41:43 CDT 2010
Contact: gihan@uom.lk
Name: Gihan Dias
Subject: Public Review Issue
Opt Subject: response to PR 161

I propose that para 5 in section 2.3 of UAX#31 be written as follows:

Thus for such circumstances, an implementation shall allow the following Join_Control characters in the limited contexts as specified in A1, A2, and B below. Note, however, that while these restrictions limit visual confusability greatly, they do not prevent it. For example, as Tamil only uses a Join_Control character in one specific case, most of the sequences allowed in Tamil by these rules are, in fact, visually confusable. Therefore, implementations may choose, based on their knowledge of the script concerned, to implement tighter restrictions than specified below. There are also cases where the presence of a joiner preceding a virama makes a visual distinction. It is currently unclear whether retention of a joiner in this context is required in identifiers.
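As a rough illustration of the kind of script-aware checking an implementation might layer on top of the general rules, the sketch below lists each Join_Control character (U+200C ZWNJ, U+200D ZWJ) in a candidate identifier and records whether its neighbours are viramas (canonical combining class 9). It does not implement contexts A1, A2, or B of UAX #31; the helper name `join_control_contexts` and the sample string are assumptions made for this example.

```python
import unicodedata

JOIN_CONTROLS = {"\u200C": "ZWNJ", "\u200D": "ZWJ"}
VIRAMA_CCC = 9  # canonical combining class named "Virama"


def is_virama(ch):
    """True if ch has canonical combining class 9."""
    return unicodedata.combining(ch) == VIRAMA_CCC


def join_control_contexts(text):
    """Return (index, joiner, virama_before, virama_after) for each joiner.

    Hypothetical helper: it only reports context and decides nothing
    about whether the identifier should be accepted.
    """
    found = []
    for i, ch in enumerate(text):
        if ch in JOIN_CONTROLS:
            before = i > 0 and is_virama(text[i - 1])
            after = i + 1 < len(text) and is_virama(text[i + 1])
            found.append((i, JOIN_CONTROLS[ch], before, after))
    return found


# Tamil KA + VIRAMA + ZWNJ + SSA: the joiner follows a virama.
sample = "\u0B95\u0BCD\u200C\u0BB7"
print(join_control_contexts(sample))
```

A script-specific policy could then accept or reject each reported joiner based on the surrounding letters, which is where knowledge of the script concerned would come in.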

162 Proposed Update UAX #34: Unicode Named Character Sequences

No feedback was received via the reporting form this period.

163 Proposed Update UAX #38: Unicode Han Database (Unihan)

Date/Time: Mon Apr 19 05:25:42 CDT 2010
Contact: dbvic@mac.com
Name: Didier BARBAS
Subject: Other Question, Problem, or Feedback
Opt Subject: U+7DCA / kDefinition

Hello,

Considering that many kDefinition entries include the Cantonese meaning (which can be very different from Mandarin), I think this character's kDefinition entry should include a Cantonese definition: progressive aspect marker. E.g.: 你做緊乜嘢呀？What are you doing?

Regards,

--
Didier Barbas

164 Proposed Update UAX #41: Common References for Unicode Standard Annexes

No feedback was received via the reporting form this period.

165 Proposed Update UAX #42: Unicode Character Database in XML

Date/Time: Sat Mar 27 06:01:32 CST 2010
Contact: ernestvandenboogaard@hotmail.com
Name: Ernest van den Boogaard
Subject: Public Review Issue
Opt Subject: UAX #42: Unicode Character Database in XML

UAX #42 (XML) is to be updated for Unicode 6.0.0. Currently, for public review, the Age property is defined thus:

4.4.1 Age property

[age, 11] =
  code-point-properties &=
    attribute age { "1.1" | "2.0" | "2.1" | "3.0" | "3.1" | "3.2" | "4.0" | "4.1" | "5.0" | "5.1" | "5.2" | "unassigned" }?

We might expect the value "6.0" to be included.
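The omission can be checked mechanically. The sketch below extracts the quoted values from the schema fragment cited in the report and tests for "6.0"; the variable names are assumptions, and the fragment is copied from the report text rather than read from the actual UAX #42 schema file.

```python
import re

# Schema fragment as quoted in the report (RELAX NG compact syntax).
AGE_FRAGMENT = '''
code-point-properties &=
  attribute age { "1.1" | "2.0" | "2.1" | "3.0" | "3.1" | "3.2"
                | "4.0" | "4.1" | "5.0" | "5.1" | "5.2" | "unassigned" }?
'''

# Collect every double-quoted token in the enumeration.
allowed_ages = re.findall(r'"([^"]+)"', AGE_FRAGMENT)

print("6.0" in allowed_ages)  # membership test for the expected new value
print(allowed_ages)
```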

166 Proposed Update UTS #10: Unicode Collation Algorithm

Date/Time: Mon Mar 29 11:11:08 CST 2010
Contact: kent.karlsson14@comhem.se
Name: Kent Karlsson
Subject: Error Report
Opt Subject: Collation of ROMAN NUMERAL SIX LATE FORM

ROMAN NUMERAL SIX LATE FORM is collated among the decimal digit 6s. No other Roman numeral is collated as a decimal digit. This one does not even look like a 6, but more like a capital C with hook or (as noted) the Greek letter stigma. I think this character should be collated among the Roman numerals that are not quite Latin letters (like ROMAN NUMERAL ONE THOUSAND C D).
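For context, the character properties can be inspected with Python's standard `unicodedata` module. This sketch only prints properties of the two Roman numerals named in the report and of the ASCII digit 6; it says nothing about the collation weights themselves, which live in the UCA data tables, and the helper name `describe` is an assumption.

```python
import unicodedata

SIX_LATE_FORM = "\u2185"     # ROMAN NUMERAL SIX LATE FORM
ONE_THOUSAND_C_D = "\u2180"  # ROMAN NUMERAL ONE THOUSAND C D


def describe(ch):
    """Return (name, general category, decimal value or None, numeric value)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.decimal(ch, None),
        unicodedata.numeric(ch, None),
    )


for ch in (SIX_LATE_FORM, ONE_THOUSAND_C_D, "6"):
    print(describe(ch))
```

Both Roman numerals are letter numbers (Nl) with no decimal digit value, whereas the ASCII 6 is a decimal digit (Nd).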

167 Ideographic Variation Database Submission

No feedback was received via the reporting form this period.

168 Two New Provisional Properties for Characters in Indic Scripts

Date/Time: Thu Apr 15 19:23:47 CDT 2010
Contact: sisrivas@yahoo.cob
Name: Sinnathurai Srivas
Subject: Public Review Issue
Opt Subject: Indic_Syllabic_Category and Matra_Placement

Stacking Tamil Vowels
Diphthongs and elongated vowels in Tamil

Ref 1: http://www.araichchi.net/chiirmai/Stacking_Tamil_Vowels.html

Ref 2: http://www.araichchi.net/kanini/unicode/Tamil_Unicode_Fallback.jpg

The Tamil grammar Tolkappiyam defines how to stack or elongate vowels. The rule is as follows:

நீட்டம் வேண்டின் அவ் அளபுடைய
கூட்டி எழூஉதல் என்மனார் புலவர். 6

Ideally, the elongating vowel could be the dependent vowel, as in the Tamil Brahmi system and as in the system that existed before Tamil Brahmi. Due to constraints imposed during the introduction of industrial printing (dependent vowels placed illogically around consonants), the independent vowel sign is used in vowel stacking. All combinations of diphthongs and all elongations of vowels are required in Tamil. For a start, there are even discussions about splitting up the traditional Ai and Au.

For this reason, the Unicode Consortium should not define stacking rules that contradict the linear rule in Tamil grammar. It should not intervene and define new rules to stack vowels in Tamil; rather, it should allow natural, linear combining of vowels.

Though contemporary use utilises independent vowels for stacking, the Unicode Consortium should allow stacking of dependent and independent vowels without intervention, as was done during and before Tamil Brahmi.

I.e., Unicode should not introduce contradictory new grammar rules for Tamil.

See the sample linear stacking.

In addition, a visible dependent "a" should be allowed for internal computer processing purposes (such as sorting) and also for analytical purposes, such as publishing the pre-Tamil-Brahmi way of writing, where a dependent short "a" was in use too.

Also, can we NOT allow the Unicode Consortium to introduce a new diacritic, Nukta, before we formalise the diacritic system for Tamil, which will be used with pronunciation dictionaries, etc.

Sinnathurai Srivas

Subject: Feedback on PR 168
Date: Fri, 16 Apr 2010 08:23:30 +0530
From: Shriramana Sharma <samjnaa@gmail.com>

First I should say I welcome this effort by Unicode. I hope it quickly achieves its intended purpose of making sure that implementations handle Indic scripts correctly.

A few points:

1. BINDI --> BINDU

I don't think the name "bindi" which is specific to Hindi is a suitable word for a category referring to anusvara/anunasika characters. Unicode has always been standardizing on Sanskrit-based words like virama for characters which have equivalents across many Indic scripts (and avoids using script-specific names like halant, pulli etc). Similarly, an alternative to "bindi" should be found.

Is there a particular reason that "anusvara" itself cannot be used? If it is said that it cannot be used to refer to characters that are *chandrabindu* and not anusvara, then please use the word "bindu" (and not "bindi") which is Sanskrit and refers to a dot, which when used by itself (mostly) refers to just anusvara and when used with a chandra (literally a moon, crescent shaped character) denotes the anunasika or chandrabindu. Therefore I suggest "bindu" as an appropriate alternative which would be acceptable to South Indians too. "bindu" is a recognized word in both North- and South- Indian languages meaning a dot or drop (as of water), whereas "bindi" is North-Indian only. Perhaps Hindi-only.

As for the shape, a "bindu" may be just a dot or circular, comprising the South Indian rounded equivalents also. This is just like the word "pulli" in Tamil which may be just a dot or a circle.

Therefore I suggest that the word "bindu" be used.

2. TAMIL SIGN ANUSVARA

Must this character really be used? It is a totally useless and in fact *non-existent* character just like the appendix in humans. It even carries an "annotation" "Not used in Tamil" whereas the fact is that it is not used in Tamil *or anywhere else*. So either this character should be deprecated or at least ignored for serious documents like this (Indic properties).

3. KAITHI

The Kaithi script (and perhaps other minor/archaic scripts) seems to be ignored. Is this intentional?

--
Shriramana Sharma

Date/Time: Sat May 1 09:24:44 CDT 2010
Contact: gihan@icta.lk
Name: Gihan Dias
Subject: Public Review Issue
Opt Subject: response to PR 168

The ICT Agency of Sri Lanka (ICTA) has studied the proposed Indic Properties document (PR 168), and makes the following recommendations.

1. 0B83 Tamil Sign Aytham (Unicode Name TAMIL SIGN VISARGA) has been classified as "diacritic_letter". This is a special letter which has two functions:

a. to modify a short vowel (including the inherent vowel of a consonant), in which case it follows the vowel or consonant; and

b. to signify the English sound f or ph by preceding the letter ப (PA).

While the term "diacritic" may be technically correct, its most common meaning is a sign placed above a letter, which is not the way the Aytham is used.

Therefore, we propose that either

a. a more suitable name for this classification is identified or

b. this character is not classified.

2. The introduction to the section Matra Placement should clearly indicate that the placements given in the table are the "general" placements which are valid for the majority of consonants. However, there are many exceptions, where the matra for a particular consonant-vowel combination does not follow the value given. Implementers should be aware of such exceptions. (Alternatively, we could try to enumerate a list of such exceptions.)

3. We recommend the following modifications to the Matra placements.

0DDA SINHALA VOWEL SIGN DIGA KOMBUVA - Matra_Placement=Top_And_Left
0BBF TAMIL VOWEL SIGN I - Matra_Placement=Top_And_Right
0BC0 TAMIL VOWEL SIGN II - Matra_Placement=Top_And_Right

4. The Tamil vowel signs U and UU may appear on various sides of the consonant, depending on the base consonant. We recommend that a new value "Variable" be defined for these:

0BC1 TAMIL VOWEL SIGN U - Matra_Placement=Variable
0BC2 TAMIL VOWEL SIGN UU - Matra_Placement=Variable

5. We recommend that the viramas also be given the Matra Placement property, and recommend the following for Tamil and Sinhala.

0BCD TAMIL SIGN VIRAMA - Matra_Placement=Top
0DCA SINHALA SIGN AL-LAKUNA - Matra_Placement=Top

Gihan Dias
for ICT Agency of Sri Lanka
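The values recommended in items 3 to 5 above can be collected into a small lookup table. The sketch below is only a transcription of this feedback, not data from any published property file, and "Variable" is the new value proposed here rather than an existing one. The helper `mismatched_names` (a hypothetical name) checks the character names quoted in the report against Python's `unicodedata` module.

```python
import unicodedata

# Values proposed in the feedback above; not necessarily the values adopted.
PROPOSED_MATRA_PLACEMENT = {
    0x0DDA: ("SINHALA VOWEL SIGN DIGA KOMBUVA", "Top_And_Left"),
    0x0BBF: ("TAMIL VOWEL SIGN I", "Top_And_Right"),
    0x0BC0: ("TAMIL VOWEL SIGN II", "Top_And_Right"),
    0x0BC1: ("TAMIL VOWEL SIGN U", "Variable"),
    0x0BC2: ("TAMIL VOWEL SIGN UU", "Variable"),
    0x0BCD: ("TAMIL SIGN VIRAMA", "Top"),
    0x0DCA: ("SINHALA SIGN AL-LAKUNA", "Top"),
}


def mismatched_names(table):
    """Return code points whose listed name differs from the UCD name."""
    return [
        cp for cp, (name, _placement) in table.items()
        if unicodedata.name(chr(cp), "") != name
    ]


print(mismatched_names(PROPOSED_MATRA_PLACEMENT))
```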

Feedback on Encoding Proposals

No feedback was received via the reporting form this period.

Feedback on TUS 5.2 and Charts

Date/Time: Tue Apr 13 13:15:57 CDT 2010
Contact: liancu@microsoft.com
Name: Laurentiu Iancu
Subject: Error Report
Opt Subject: Typo in Section 10.10 Ol Chiki

Hello,

There is a small typo in Section 10.10 Ol Chiki, p. 331 of TUS 5.2 (http://www.unicode.org/versions/Unicode5.2.0/ch10.pdf). In paragraph "Modifier Letters," the third vowel that can be modified with gaahlaa ttuddaag should be U+1C6E instead of U+1C6F. The former is a vowel (le) whereas the latter is a consonant (ep) that cannot be thus modified. This observation concurs with the description in sections "Vowels" and "Gahla Tudag" of http://wesanthals.tripod.com/id45.html.

Regards,
Laurentiu
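The two code points in question can be looked up by name with Python's standard `unicodedata` module, as a quick way to confirm which letter each one denotes. The helper name `ol_chiki_names` is an assumption for this sketch.

```python
import unicodedata


def ol_chiki_names(code_points):
    """Map each code point to its Unicode character name."""
    return {cp: unicodedata.name(chr(cp)) for cp in code_points}


for cp, name in ol_chiki_names([0x1C6E, 0x1C6F]).items():
    print(f"U+{cp:04X} {name}")
```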

Date/Time: Sat Apr 17 10:00:28 CDT 2010
Contact: dancecile@gmail.com
Name: Dan Cecile
Subject: Error Report
Opt Subject: Typo in the Standard

Hi,

On page 186 of The Unicode Standard, Version 5.2, there is a typo in the last paragraph of the page. The text reads "indicates a change a change in the capitalization" and the fix is to write "indicates a change in the capitalization".

Thanks,
Dan Cecile

Closed Public Review Issues

Date/Time: Mon May 10 03:00:41 CDT 2010
Contact: karl-pentzlin@acssoft.de
Name: Karl Pentzlin
Report Type: Other Question, Problem, or Feedback
Opt Subject: Feedback on www.unicode.org/reports/tr36/tr36-8.html

Regarding Unicode Technical Report #36: Unicode Security Considerations, 2010-04-28 (draft 5), table 3 "Single-Script spoofing": the following two additional problems may deserve mentioning.

1. Lower case glyph identity: U+01DD vs. U+0259
2. Illegal decomposition: U+0069 vs. U+0131 U+0307

(Maybe I have overlooked something that rules out these cases anyway, but I was able to proceed with registering .com domains containing these characters up to the point where I had to actually register and pay for them, whereas this was not possible for .de domains.)
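One way to see why these pairs are a concern is that Unicode normalization alone does not merge them. The sketch below, which assumes only Python's standard `unicodedata` module, compares each pair under NFKC; the helper name `folds_together` is hypothetical, and the check says nothing about what any particular registry actually does.

```python
import unicodedata

PAIRS = [
    ("\u01DD", "\u0259"),        # turned e vs. schwa
    ("\u0069", "\u0131\u0307"),  # i vs. dotless i + combining dot above
]


def folds_together(a, b, form="NFKC"):
    """True if a and b become identical under the given normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


for a, b in PAIRS:
    left = " ".join(f"U+{ord(c):04X}" for c in a)
    right = " ".join(f"U+{ord(c):04X}" for c in b)
    print(f"{left} vs. {right}: {folds_together(a, b)}")
```

Since neither pair is unified by normalization, detecting them would need a separate confusability mapping or registration policy.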

Other Reports

Date/Time: Thu Apr 8 08:23:15 CDT 2010
Contact: dssdsmd@hotmail.com
Name: Dhafer
Subject: Error Report
Opt Subject: Arabic letter Teh Marbuta U+0629

Hello,

I'm a native Arabic speaker and I noticed a "mistake" in the Unicode system: the letter Teh Marbuta has no medial form.

The Arabic letter Teh Marbuta (U+0629) should have a medial form which is the same as the Arabic letter Teh (U+062A) in medial form (U+FE98) i.e. when Teh Marbuta is between two letters it should turn into Teh medial form.

Thank you.
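The positional forms currently encoded for each letter can be listed from the compatibility decompositions in Arabic Presentation Forms-B. The sketch below uses Python's standard `unicodedata` module; the helper name `presentation_forms` is an assumption, and the output only shows which compatibility characters exist, not how a shaping engine should behave.

```python
import unicodedata


def presentation_forms(letter):
    """Map positional form -> code point for one Arabic letter.

    Scans Arabic Presentation Forms-B (U+FE70..U+FEFF) for compatibility
    decompositions that consist of exactly the given letter.
    """
    target = f"{ord(letter):04X}"
    forms = {}
    for cp in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = cp
    return forms


for letter in ("\u0629", "\u062A"):  # TEH MARBUTA, TEH
    forms = presentation_forms(letter)
    print(unicodedata.name(letter),
          {form: f"U+{cp:04X}" for form, cp in forms.items()})
```

Teh Marbuta has only isolated and final forms encoded, while Teh has all four, including the medial form U+FE98 mentioned in the report.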
