Comments on Public Review Issues

L2/09-123

(January 29, 2009 - May 6, 2009)

The sections below contain comments received on the open Public Review Issues and other feedback since January 29, 2009, when the previous cumulative document was issued prior to UTC #118 (February 2009).

127 Proposed Update UAX #44: Unicode Character Database
128 Proposed Update UTS #37: Unicode Ideographic Variation Database
133 Proposed Draft UTS #46: Unicode IDNA Compatible Preprocessing
134 Proposed Update UAX #9: Unicode Bidirectional Algorithm
135 Proposed Update UAX #11: East Asian Width
136 Proposed Update UAX #14: Unicode Line Breaking Algorithm
137 Proposed Update UAX #24: Unicode Script Property
138 Proposed Update UAX #29: Unicode Text Segmentation
139 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax
140 Proposed Update UAX #34: Unicode Named Character Sequences
141 Proposed Update UAX #38: Unicode Han Database (Unihan)
142 Proposed Update UAX #41: Common References for Unicode Standard Annexes
143 Proposed Update UTS #10: Unicode Collation Algorithm
144 Proposed Update UAX #42: Unicode Character Database in XML
145 Proposed Update UAX #15: Unicode Normalization Forms
146 Suggested Restructuring of Text in Chapter 3
Other Reports
Feedback on Encoding Proposals
Closed Public Review Issues

127 Proposed Update UAX #44: Unicode Character Database

No feedback was received via the reporting form this period.

See also Other Reports re UnicodeData.txt below.

128 Proposed Update UTS #37: Unicode Ideographic Variation Database

No feedback was received via the reporting form this period.

133 Proposed Draft UTS #46: Unicode IDNA Compatible Preprocessing

No feedback was received via the reporting form this period.

134 Proposed Update UAX #9: Unicode Bidirectional Algorithm

No feedback was received via the reporting form this period.

135 Proposed Update UAX #11: East Asian Width

No feedback was received via the reporting form this period.

136 Proposed Update UAX #14: Unicode Line Breaking Algorithm

No feedback was received via the reporting form this period.

137 Proposed Update UAX #24: Unicode Script Property

No feedback was received via the reporting form this period.

138 Proposed Update UAX #29: Unicode Text Segmentation

Date/Time: Thu Mar 19 01:55:06 CST 2009
Contact: natta@th.ibm.com
Name: Nattapong
Subject: UAX#29 Text segmentation issue.

Section Grapheme Cluster Boundaries:

Many Thai characters fall into Prepend or Extend. In my view, only U+0E33 should fall into Extend. The other characters should be neither Prepend nor Extend.

Is there an explanation for this Thai specification? If possible, please provide contact information for the originator.

CLDR bug #2142, re UAX #29 clusters
Full_Name:Mark Davis
NOTE: This has already been submitted into the bug tracking system.

There are different ways in which to break into "characters".

Types:
1. code point boundaries
2. "spacing units" (#3, removing all but non-spacing marks from Extend, SpacingMark, Prepend, plus some tweaks)
3. Extended Grapheme Clusters (http://unicode.org/reports/tr29/)
4. Aksharas (link clusters from #3 that have Virama in the first one)
... (maybe others)

Moreover, different choices of these may be appropriate for different locales, depending on the goal:

A. Arrow movement
B. Drop-Caps
C. Backspace (and Delete?)
... (maybe others)

To address this, we could do the following:

- Define rules for #2 and #4 (maybe more over time).
- Assign all of them IDs.
- On a per-locale basis, allow the ability to associate a goal with an ID.
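
The three steps above can be pictured as a small lookup table. The following is only a minimal sketch of that idea: the ID strings, the goal names, and the per-locale bindings are hypothetical and are not taken from CLDR or UAX #29.

```python
# Hypothetical IDs for the segmentation types listed above.
SEGMENTATION_IDS = {
    "codepoint": "code point boundaries",
    "spacing-unit": "spacing units",
    "egc": "extended grapheme clusters (UAX #29)",
    "akshara": "clusters linked across a virama",
}

# Assumed default binding of each goal to a segmentation ID.
DEFAULT_GOALS = {"arrow": "egc", "drop-caps": "egc", "backspace": "codepoint"}

# Assumed per-locale overrides; the choices here are purely illustrative.
LOCALE_GOALS = {
    "hi": {"arrow": "akshara", "drop-caps": "akshara"},
}


def segmentation_for(locale, goal):
    """Return the segmentation ID bound to a goal, falling back to the default."""
    choice = LOCALE_GOALS.get(locale, {}).get(goal, DEFAULT_GOALS[goal])
    if choice not in SEGMENTATION_IDS:
        raise ValueError(f"unknown segmentation ID: {choice}")
    return choice
```

With these assumed bindings, segmentation_for("hi", "arrow") selects the akshara rules, while a locale without overrides falls back to the defaults.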

139 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax

No feedback was received via the reporting form this period.

140 Proposed Update UAX #34: Unicode Named Character Sequences

No feedback was received via the reporting form this period.

141 Proposed Update UAX #38: Unicode Han Database (Unihan)

Date/Time: Mon May 4 00:31:56 CDT 2009
Contact: mpsuzuki@hiroshima-u.ac.jp
Name: suzuki toshiya
Subject: PRI#141: Proposed Update UAX #38: Unicode Han Database (Unihan)

Dear Sirs,

Here I report 2 proposals of the updates for Unihan.html.

kIRG_JSource

I propose updating the descriptions for kIRG_JSource, to synchronize them with the descriptions in (the latest version of) ISO/IEC 10646.

# I think "Unified Japanese IT Vendors Contemporary Ideographs"
# is not widely published documentation, so I wish the reference
# to IRG N464 (Japanese submission to CJK Ext. A on 1997) is added,
# but it should be asked to ISO/IEC 10646 editors before all.

[current version]

 	   * J0 JIS X 0208:1990
 	   * J1 JIS X 0212:1990
 	   * J3 JIS X 0213:2000
 	   * J4 JIS X 0213:2000
 	   * JA Unified Japanese IT Vendors Contemporary Ideographs, 1993
 	   * J3A JIS X 0213:2004 level-3

[my proposal]

 	   * J0 JIS X 0208:1990
 	   * J1 JIS X 0212:1990
 	   * J3 JIS X 0213:2000 level-3
 	   * J4 JIS X 0213:2000 level-4
 	   * JA Unified Japanese IT Vendors Contemporary Ideographs, 1993
 	   * J3A JIS X 0213:2004 level-3

kMorohashi

There is a typographic error in kMorohashi description, "Dae" is wrong, "Dai" is correct. BTW, ISO/IEC 10646 Annex S romanizes its title as "Daikanwa Jiten". I think both of "Dai Kanwa Ziten" and "Daikanwa Jiten" are correct, but, if the synchronized notation were important, please copy the notation in ISO/IEC 10646 Annex S.

In addition, I'm not sure whether the index of Morohashi Dai Kanwa is related to the four-dictionary sorting algorithm. I think the order of the index in Morohashi Dai Kanwa is based on the KangXi radical and the number of strokes. Is it essential to mention the four-dictionary sorting algorithm?

[current version]

Description: The index of this character in the Dae Kanwa Ziten,
	aka Morohashi dictionary (Japanese) used in the
	four-dictionary sorting algorithm.

[my proposal]

Description: The index of this character in the Dai Kanwa Ziten,
	aka Morohashi dictionary (Japanese).

Regards,
suzuki toshiya

142 Proposed Update UAX #41: Common References for Unicode Standard Annexes

No feedback was received via the reporting form this period.

143 Proposed Update UTS #10: Unicode Collation Algorithm

Date/Time: Mon Apr 13 14:42:27 CDT 2009
Contact: rick@unicode.org
Name: Rick McGowan
Subject: PRI 143, UTS 10 review comments
NOTE: Already filed this in the bug tracking system.

In UTS 10, the table 7.3.1 Tertiary Weight Table has a lot of empty cells in the right-hand Samples columns of the table. Are those cells supposed to have stuff in them? Only a few of them actually appear filled when I look at the thing; the rest are nbsp. If they're not all supposed to be filled, then perhaps that fact could be commented in the text.

Date/Time: Fri May 1 19:18:35 CDT 2009
Contact: is@sun.com
Name: Ienup Sung
Subject: Some typos at PRI #143 UTS #10

In the "Example Differences" table in Section 1, Introduction, the second column of the last two rows has "Upper-first" and "Lower-First". For consistency, I think they need to be changed to "Upper-first:" and "Lower-first:", respectively.

In the section 1.5 Other Applications of Collation, the last paragraph has:

	    A sequence of characters considered to be a unit
	    in collation, such as ch in Slovak, represents a
	    tailored grapheme cluster. For applications of this,
	    see UTS #18: Unicode Regular Expression Guidelines
	    [UTS18]. For more information on grapheme clusters,
	    see UAX #29: Text Boundaries [UAX29].

And also in the References section, we have:

	    [UAX29]  UAX #29: Text Boundaries
	    [UTR30]  UTR #30: Character Foldings (draft)
	    [UTS18]  UTR #18: Unicode Regular Expression Guidelines

The "Unicode Regular Expression Guidelines", the "Character Foldings (draft)", and the "Text Boundaries" in the above need to be corrected to "Unicode Regular Expressions", "Unicode Character Foldings (withdrawn)", and "Unicode Text Segmentation", respectively.

The "UTR #18:" for [UTS18] in the References section also needs to be changed to "UTS #18:".

Date/Time: Tue May 5 19:11:44 CDT 2009
Contact: is@sun.com
Name: Ienup Sung
Subject: PRI #143 UTS #10: Variable Weighting and File Format

In the section 3.2.1 File Format, we have:

The variable-weight line has three possible values that may change the weights of collation elements in processing (see Section 3.2.2, Variable Weighting). The default is shifted.

<variable> := '@variable ' <variableChoice> <eol>
<variableChoice> := 'blanked' | 'non-ignorable' | 'shifted'

and then in the section 3.2.2 Variable Weighting, we have:

There are four possible options for variable weighted characters, with the default being Shifted:

Blanked: ...
Non-ignorable: ...
Shifted: ...
Shift-Trimmed: ...

I think that either "Shift-trimmed" should be added to Section 3.2.1 or it should be removed from Section 3.2.2. If it is added, then, for consistency, the "Shift-Trimmed" in Section 3.2.2 would need to be changed to "Shift-trimmed".
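
The mismatch can be made concrete with a minimal sketch of a reader for the variable-weight line, built only from the grammar and the option list quoted above. The function names are hypothetical, and the lowercase spelling "shift-trimmed" is an assumption about how the fourth option would appear in a file.

```python
import re

# Choices exactly as listed in the quoted Section 3.2.1 grammar.
FILE_FORMAT_CHOICES = {"blanked", "non-ignorable", "shifted"}

# Options as listed in the quoted Section 3.2.2 text, lowercased (assumption).
WEIGHTING_OPTIONS = {"blanked", "non-ignorable", "shifted", "shift-trimmed"}

VARIABLE_LINE = re.compile(r"^@variable (\S+)$")


def parse_variable_line(line):
    """Return the variable choice from a line such as '@variable shifted'."""
    match = VARIABLE_LINE.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a variable-weight line")
    choice = match.group(1)
    if choice not in FILE_FORMAT_CHOICES:
        raise ValueError(f"choice not allowed by the file format: {choice}")
    return choice


def options_without_file_syntax():
    """List the weighting options that the file format cannot express."""
    return sorted(WEIGHTING_OPTIONS - FILE_FORMAT_CHOICES)
```

Under these assumptions, options_without_file_syntax() reports the one option that Section 3.2.2 describes but a file following Section 3.2.1 cannot select.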

Date/Time: Wed May 6 01:38:16 CDT 2009
Contact: is@sun.com
Name: Ienup Sung
Subject: PRI #143 UTS #10: Some more typos

The title for the section 5.1 (at the section 5) has a typo:

5.1 Parametic Tailoring

where 'r' is missing in the first word.

The same section has a "Collation Parameters" table, in which the "alternate" row lists three options: "non-ignorable", "shifted", and "blanked". If we are to keep shift-trimmed in Sections 3.2.1 and 3.2.2, then the second column of the "alternate" row would also need to include "shift-trimmed".

In the "Data Method" description at the section 7.1.4.1 Hangul Trailing Weights, we have the following bullet text:

This means that if L1 has a primary weight of 555, and L2 has 559, then L1L1 would have to be given a weight from 556 to 558.

The "L1L1" at the above should be "L1L2". (Note: The digits are in subscript style.)
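
The arithmetic in the quoted bullet can be checked with a short sketch. It assumes only that the sequence must receive a primary weight strictly between the two given weights; the function name is hypothetical.

```python
def weights_strictly_between(lower, upper):
    """Return the integer weights that sort after `lower` and before `upper`."""
    return list(range(lower + 1, upper))


# Primary weights from the quoted bullet text.
L1_WEIGHT = 555
L2_WEIGHT = 559

available = weights_strictly_between(L1_WEIGHT, L2_WEIGHT)
```

Here available holds the three weights 556, 557, and 558, which matches the range given in the bullet.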

In the section 8 Searching and Matching, we have the following which is the second sentence at the third paragraph (excluding bullet texts) from top:

Thus users of searching and matching need to be able modify parameters ...

I believe there is "to" missing at the above and the above should be changed to:

Thus users of searching and matching need to be able to modify parameters ...

144 Proposed Update UAX #42: Unicode Character Database in XML

Date/Time: Fri Mar 27 16:48:31 CST 2009
Contact: mary.holstege@marklogic.com
Name: Mary Holstege
Subject: 144 Proposed Update UAX #42: Unicode Character Database in XML 2009.05.04

Given the citation of XSLT and XQuery for processing XML, including the proposed XML representation of the Unicode Character Database, it is very unfortunate not to have a W3C XML Schema, as RelaxNG is not supported by XSLT and XQuery. In order to do type-aware processing using those tools, folks will need to create their own XSD.

In addition, using "Y"/"N" to represent boolean values creates an unnecessary impedance mismatch with such type-aware XML processing tools: Using xsd:boolean would be preferable, as it enables simpler selection expressions based on boolean property values.

//Mary
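
To illustrate the conversion step that "Y"/"N" values impose on consumers, here is a minimal sketch using Python's standard ElementTree. The sample fragment is a simplified assumption: it omits the namespace and most attributes of the real UCD XML files, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free sample (assumption); real files carry many more attributes.
SAMPLE = (
    '<repertoire>'
    '<char cp="0028" Bidi_M="Y"/>'
    '<char cp="0041" Bidi_M="N"/>'
    '</repertoire>'
)


def yn_to_bool(value):
    """Convert a "Y"/"N" property value to a Python bool."""
    if value == "Y":
        return True
    if value == "N":
        return False
    raise ValueError(f"expected 'Y' or 'N', got {value!r}")


def mirrored_code_points(xml_text):
    """Return the cp attribute of every char element whose Bidi_M is true."""
    root = ET.fromstring(xml_text)
    return [c.get("cp") for c in root.iter("char") if yn_to_bool(c.get("Bidi_M"))]
```

Every consumer has to carry a helper like yn_to_bool; with xsd:boolean values, a schema-aware processor could test the attribute directly.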

145 Proposed Update UAX #15: Unicode Normalization Forms

No feedback was received via the reporting form this period.

146 Suggested Restructuring of Text in Chapter 3 for Clarification of Unicode Normalization

Date/Time: Wed Apr 15 10:24:01 CDT 2009
Contact: cowan@ccil.org
Name: John Cowan
Subject: PRI 146

PRI 146 involves moving part of a UAX into the Unicode book proper. UAXes are updated with every new release of Unicode, whereas the online and hardcopy versions of mainline book text are only updated with every major release. In 5.2 and any later 5.x releases, therefore, the relevant material will have been removed from the UAX but will not yet appear as part of the PDF file of Chapter 3. That's very confusing.

Since this is a purely editorial change, I suggest it be postponed to Unicode 6.0.

Other Reports

Date/Time: Tue Mar 3 17:35:41 CST 2009
Contact: rkunst@humancomp.org
Name: Richard Kunst
Subject: Typo in UTR #45

Shuō Wén Jiě Zhì => Shuō Wén Jiě Zì

Also suggest, to be more standard:

KangXi => Kangxi (or Kang Xi)

Date/Time: Wed Apr 22 01:36:34 CDT 2009
Contact: mibuhari@gmail.com
Name: Buhari
Subject: Annotation to 0767

Dear Sir,

The character at 0767 is also used by the Arwi (Arabu-Tamil) script. Is it possible to add an annotation indicating that this character is used in Arwi?

Regards Buhari.

NOTE: See also L2/09-175, Report on Problems in Unihan Charts.

Date/Time: Tue May 5 11:28:02 CDT 2009
Contact: kent.karlsson14@comhem.se
Name: Kent Karlsson
Subject: comments on UnicodeData-5.2.0d5

1. Many (all) ISO 10646 remarks have been removed. Maybe some of them (far from all) should reappear as annotations in the nameslist file.

2. Considering:

	23E8;DECIMAL EXPONENT SYMBOL;So;0;ON;;;;;N;;;;;

This is a subscript 10. I think this character should have a compatibility decomposition as "10".

3. Three Hangul Jungseong have gotten names that are not coherent with other Hangul Jungseong:

	D7B1;HANGUL JUNGSEONG O-O-I;Lo;0;L;;;;;N;;;;;		expected: O-OE
	D7B6;HANGUL JUNGSEONG U-I-I;Lo;0;L;;;;;N;;;;;		expected: WI-I
	D7C1;HANGUL JUNGSEONG I-O-I;Lo;0;L;;;;;N;;;;;		expected: I-OE

I would suggest adding annotations with the more coherent (with other Hangul Jamo) names (see notes to the right of the copied UnicodeData data lines), as it's too late to change the character names.

4. The following are bidi ON and even supposed to mirror (when RTL):

	2202;PARTIAL DIFFERENTIAL;Sm;0;ON;;;;;Y;;;;;
	1D6DB;MATHEMATICAL BOLD PARTIAL DIFFERENTIAL;Sm;0;ON;<font> 2202;;;;Y;;;;;
	1D715;MATHEMATICAL ITALIC PARTIAL DIFFERENTIAL;Sm;0;ON;<font> 2202;;;;Y;;;;;
	1D74F;MATHEMATICAL BOLD ITALIC PARTIAL DIFFERENTIAL;Sm;0;ON;<font> 2202;;;;Y;;;;;
	1D789;MATHEMATICAL SANS-SERIF BOLD PARTIAL DIFFERENTIAL;Sm;0;ON;<font> 2202;;;;Y;;;;;
	1D7C3;MATHEMATICAL SANS-SERIF BOLD ITALIC PARTIAL DIFFERENTIAL;Sm;0;ON;<font> 2202;;;;Y;;;;;

I find that strange, since these are really just variants of the Latin small letter d (which is both bidi L and mirror N, of course). Let me guess that there is no implementation that actually mirrors the partial differential signs. The "full differential sign" is an ordinary "d" (usually written in italic). Note also that Nabla, a modified delta, does not have this anomaly w.r.t. bidi.

5. These two sequences of characters do not have case mappings between them, but that might be expected, despite their g.c. of So. Note also that circled letters have a case mapping.

	1F110;PARENTHESIZED LATIN CAPITAL LETTER A;So;0;L;<compat> 0028 0041 0029;;;;N;;;;;
	...
	1F129;PARENTHESIZED LATIN CAPITAL LETTER Z;So;0;L;<compat> 0028 005A 0029;;;;N;;;;;

and

	249C;PARENTHESIZED LATIN SMALL LETTER A;So;0;L;<compat> 0028 0061 0029;;;;N;;;;;
	...
	24B5;PARENTHESIZED LATIN SMALL LETTER Z;So;0;L;<compat> 0028 007A 0029;;;;N;;;;;

6. The character

	1A6F;TAI THAM VOWEL SIGN AE;Mc;0;L;;;;;N;;;;;

is written as a doubled 1A6E (TAI THAM VOWEL SIGN E), and should thus have a canonical decomposition into <1A6E, 1A6E>.
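
The properties discussed in points 2, 4, and 5 can be read directly from the quoted data lines. The following is a minimal sketch that assumes the usual 15 semicolon-separated fields of UnicodeData.txt; the field names are descriptive labels chosen for this example.

```python
from collections import namedtuple

# Descriptive labels for the 15 semicolon-separated fields (names are assumptions).
FIELDS = [
    "code", "name", "general_category", "combining_class", "bidi_class",
    "decomposition", "decimal", "digit", "numeric", "mirrored",
    "unicode_1_name", "iso_comment", "uppercase", "lowercase", "titlecase",
]
Record = namedtuple("Record", FIELDS)


def parse_line(line):
    """Split one UnicodeData.txt line into a Record of raw string fields."""
    parts = line.strip().split(";")
    if len(parts) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(parts)}")
    return Record(*parts)


# Lines copied from the comment above.
LINES = [
    "23E8;DECIMAL EXPONENT SYMBOL;So;0;ON;;;;;N;;;;;",
    "2202;PARTIAL DIFFERENTIAL;Sm;0;ON;;;;;Y;;;;;",
    "249C;PARENTHESIZED LATIN SMALL LETTER A;So;0;L;<compat> 0028 0061 0029;;;;N;;;;;",
]
records = [parse_line(line) for line in LINES]
```

In the parsed records, U+23E8 has an empty decomposition field, U+2202 has bidi class ON with the mirrored flag set to Y, and U+249C has a compatibility decomposition but empty case-mapping fields.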

Feedback on Encoding Proposals

Date/Time: Wed Mar 4 05:24:52 CST 2009
Contact: cebrail21a@freemail.hu
Name: Cebrail
Subject: B3566 Native Hungarian

I found an error in the proposed character properties of 0892-0893. 0891 Qvad dot is '<compat> 0020' and not the "break" mark. It slipped down a row in the table!

Date/Time: Wed Mar 4 10:15:59 CST 2009
Contact: kent.karlsson14@comhem.se
Name: Kent Karlsson
Subject: Move block from plane 1 to plane 2

In http://www.unicode.org/L2/L2009/09081-n3580-amd7.pdf :

The Squared ideographs/Circled ideographs, and maybe also Squared katakanas (sic) (the entire Enclosed Ideographic Supplement block) would belong better in plane 2 than in plane 1.

Date/Time: Wed Mar 4 10:27:40 CST 2009
Contact: kent.karlsson14@comhem.se
Name: Kent Karlsson
Subject: Ophiuchus, duplicate encoding

In http://www.unicode.org/L2/L2009/09081-n3580-amd7.pdf :

Zodiacal symbol 1F320 OPHIUCHUS

The conventional symbol for the Zodiacal symbol OPHIUCHUS is
2695;STAFF OF AESCULAPIUS;So;0;ON;;;;;N;;;;;

I suggest "= Ophiuchus" is added in NamesList.txt for U+2695, and that U+1F320 OPHIUCHUS (which is just an attempt at introducing a redesigned symbol for Ophiuchus) is not encoded.

See:

http://en.wikipedia.org/wiki/Ophiuchus#Astrology

http://en.wikipedia.org/wiki/Sidereal_astrology#The_13_astronomical_constellations_of_the_ecliptic (in particular the table entry for Ophiuchus)

Date/Time: Fri May 1 18:40:36 CDT 2009
Contact: cowan@ccil.org
Name: John Cowan
Subject: L2/09-171 (aka N3643) 2FFF

The West proposal to extend IDSes to ideographic scripts other than Han, and to add a few new IDC characters, is a Good Thing, and I heartily support it.

However, I do not believe that the character 2FFF IDEOGRAPHIC DESCRIPTION CHARACTER INDEPENDENT is justified. Its intention is to create an IDS containing just one character, the following one. In effect, <2FFF,4E01> is simply a synonym for <4E01> (and likewise with any other non-decomposable ideograph). Furthermore, U+2FFF is a wholly new type of IDC, the unary, to add to the existing binary and trinary IDC types.

As far as the proposal shows, U+2FFF is proposed merely so that in the special case of an IDS table, all ideographs can uniformly be represented by IDSes, even the non-decomposable ideographs. For such a specialized purpose, any non-ideographic character would do, or a private-use character, or even a non-character codepoint. I see no reason to encode U+2FFF for public use.
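
The point that a lone ideograph already counts as an IDS can be sketched with a small recursive checker. This is an illustration only: it covers the IDCs U+2FF0..U+2FFB, treats U+2FF2 and U+2FF3 as the trinary operators, and, as a simplifying assumption, accepts only code points in U+4E00..U+9FFF as operands.

```python
# Existing IDCs considered here: U+2FF0..U+2FFB.
TRINARY_IDCS = {"\u2FF2", "\u2FF3"}
BINARY_IDCS = {chr(cp) for cp in range(0x2FF0, 0x2FFC)} - TRINARY_IDCS


def is_basic_ideograph(ch):
    """Simplified operand test (assumption): the main CJK Unified Ideographs block only."""
    return 0x4E00 <= ord(ch) <= 0x9FFF


def _consume(ids, pos):
    """Consume one IDS starting at pos and return the position just after it."""
    if pos >= len(ids):
        raise ValueError("truncated sequence")
    ch = ids[pos]
    if ch in BINARY_IDCS:
        arity = 2
    elif ch in TRINARY_IDCS:
        arity = 3
    elif is_basic_ideograph(ch):
        return pos + 1  # a single ideograph is a complete IDS on its own
    else:
        raise ValueError(f"unexpected character U+{ord(ch):04X}")
    pos += 1
    for _ in range(arity):
        pos = _consume(ids, pos)
    return pos


def is_valid_ids(ids):
    """Return True if the whole string is exactly one well-formed IDS."""
    try:
        return len(ids) > 0 and _consume(ids, 0) == len(ids)
    except ValueError:
        return False
```

Under these assumptions, is_valid_ids("\u4E01") is already true with no operator at all, which is the sense in which <2FFF,4E01> would add nothing beyond <4E01>.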

Closed Public Review Issues

No feedback was received via the reporting form this period.
