Comments on Public Review Issues

L2/09-023R

Comments on Public Review Issues
(October 28, 2008 - January 29, 2009)

The sections below contain comments received on the open Public Review Issues and other feedback as of January 29, 2009, since the previous cumulative document was issued prior to UTC #117 (October 2008).

127 Proposed Update UAX #44: Unicode Character Database
128 Proposed Update UTS #37: Unicode Ideographic Variation Database
130 Word Break Property for ZWSP
131 Han Exemplar characters
132 Code Point Name/Label Options
133 Proposed Draft UTS #46: Unicode IDNA Compatible Preprocessing
Other Reports
Feedback on Encoding Proposals
Closed Public Review Issues

127 Proposed Update UAX #44: Unicode Character Database

Date/Time: Thu Jan 29 13:03:08 CST 2009
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Public Review Issue
Subject: PRI 127 PU-UAX #44 UCD

PRI #127 Proposed Update UAX #44: Unicode Character Database

Re the discussion "UCD.html and simple titlecase" on the unicode list. I am looking at Version Unicode 5.2 draft 3 (http://www.unicode.org/reports/tr44/tr44-3.html)

The Property Table in section 5.1 contains notes like

"Note: The case foldings are omitted in the data file if they are the same as the code point itself."
and
"Note: The simple titlecase may be omitted in the data file if the titlecase is the same as the uppercase."

These notes read like instructions to someone _constructing_ the data file. What would be clearer to users of the UCD would be a note for how to _read_ the data file. For the titlecase value, the note could read "If the simple titlecase value is omitted, then the value is the same as the simple uppercase value." If this is not true, then the note should be removed.

More generally, PU-UAX #44 4.2.8 Default Values already covers omitted mapping values, maybe with the exception of the titlecase note: "For string properties, including the definition of foldings, the default value is the code point of the character itself." For the purpose of _reading_ UnicodeData.txt, this makes the notes on fields 12 & 13 (uppercase & lowercase) redundant.
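As a minimal sketch of the reading rule suggested above, the following Python function parses one line of UnicodeData.txt and fills in omitted simple case mappings. It assumes the semicolon-separated layout in which fields 12, 13, and 14 hold the simple uppercase, lowercase, and titlecase mappings. It also assumes that an omitted titlecase falls back to the uppercase value, which is the reading proposed in this comment and not something the draft text confirms. The function name is hypothetical.

```python
def simple_case_mappings(line):
    """Return (code point, upper, lower, title) for one UnicodeData.txt line.

    Assumption: an empty mapping field means "same as the code point itself",
    except titlecase, which is assumed to fall back to the uppercase value.
    """
    fields = line.rstrip("\n").split(";")
    cp = int(fields[0], 16)
    upper = int(fields[12], 16) if fields[12] else cp
    lower = int(fields[13], 16) if fields[13] else cp
    title = int(fields[14], 16) if fields[14] else upper
    return cp, upper, lower, title


if __name__ == "__main__":
    sample = "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"
    print([f"{value:04X}" for value in simple_case_mappings(sample)])
```

With a rule of this kind a reader never needs to know why a field was left empty, only how to interpret it.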

For clarity, I think it would be best to remove redundant notes, such as the notes for the simple case mapping fields, and to make sure that the values are listed in the data files in all cases that might have been confusing in previous versions of the documentation (UCD.html). This leaves the documentation of the default values to the general comment in 4.2.8 and in the first column of the table about UnicodeData.txt.

4.2.8 Default Values says "Because of the legacy format constraints for UnicodeData.txt, that file contains no specific information about default values for properties. The default values for fields in UnicodeData.txt are documented instead in the UnicodeData.txt entry in the Property Table section below."

Some of these are redundant, and can be confusing. (At a glance, if I see "<code point>" I may not realize that the default value is the code point itself; I may think that the *type* of the value is a code point.)

I suggest removing the defaults from the Property Table and instead changing the sentence in 4.2.8 to something like "Because of the legacy format constraints for UnicodeData.txt, that file contains no specific information about default values for properties. The default values for fields in UnicodeData.txt are documented in the following Default Values for Properties table if they cannot be derived from the general rules about default values."

I believe the relevant not-already-derivable default values that should be added to the Default Values for Properties table are:

General_Category (Cn)
Bidi_Class (L, AL, R)
-- The Property Table says
    The default property values depend on the code point,
    and are given in DerivedBidiClass.txt
  which should be copied to the Default Values for Properties table
  and the sentence under the table should be revised.
Numeric_Type (None)

The other UnicodeData.txt properties have default values that are either already in this Default Values for Properties table or follow the general rules about default values. (Mapping to self, empty misc-string values, N for binary.)

128 Proposed Update UTS #37: Unicode Ideographic Variation Database

Date/Time: Wed Dec 10 21:35:05 CST 2008
Contact: john.knightley@gmail.com
Name: John Knightley
Report Type: Public Review Issue
Subject: UTS #37 Proposed changes

Dear Eric Muller et al,

The proposed change to UTS #37, whilst appealing on the surface, would I think be a very damaging change to make. The new wording in particular indicates that an encoding system different from Unicode/ISO 10646 could be implemented using IVSes.

The original wording of UTS #37 indicated that the purpose of IVSes was to allow a finer degree of distinction in plain text. The new wording permits the unification of separately encoded characters using IVSes, which is at odds with the stability rules. Permitting a separately encoded character to be the IVS of a different base character is also at odds with the name of the non-CJK variation selectors. Where there is a need to treat separately encoded characters as the same, this can and should be done using some type of mapping, not IVSes.

If such a change were introduced, the security issues would be very great indeed. It would lead developers concerned with security to restrict the applications and processes that render IVSes, and even, in some processes such as copy and paste, to consider removing the selectors and retaining only the base character.

The proposed change fundamentally changes the standard; the original wording should be retained.

Yours sincerely
John Knightley

130 Word Break Property for ZWSP

Date/Time: Mon Jan 26 11:22:55 CST 2009
Contact: pedberg@apple.com
Name: Peter Edberg
Report Type: Public Review Issue
Subject: Apple feedback on PRI #130

Re: #130 Word Break Property for ZWSP

Apple strongly supports the proposed change to the Word_Break property of U+200B ZERO WIDTH SPACE (ZWSP) from Format to Other, in order to allow ZWSP to continue to be used (or to once again be used) to mark word boundaries especially in southeast Asian scripts.
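The practical effect of the proposed change can be sketched with a toy word breaker. This is not the UAX #29 algorithm: it models only two behaviours, that Format characters are skipped when looking for boundaries and that adjacent letters are kept together, and it uses `str.isalpha()` as a stand-in for the ALetter class. The function name and the `zwsp_class` parameter are hypothetical, and boundaries at the start and end of text are omitted.

```python
ZWSP = "\u200b"


def toy_word_breaks(text, zwsp_class):
    """Return indexes at which this toy model allows a word break.

    zwsp_class is the Word_Break value assumed for U+200B:
    "Format" (the value before the change) or "Other" (the proposed value).
    """
    def word_break_class(ch):
        if ch == ZWSP:
            return zwsp_class
        return "ALetter" if ch.isalpha() else "Other"

    breaks = []
    prev = None  # class of the last character that was not skipped
    for i, ch in enumerate(text):
        cls = word_break_class(ch)
        if i == 0:
            prev = cls
            continue
        if cls == "Format":
            continue  # skipped: no break before it, and prev is unchanged
        if not (prev == "ALetter" and cls == "ALetter"):
            breaks.append(i)
        prev = cls
    return breaks


if __name__ == "__main__":
    sample = "ab" + ZWSP + "cd"
    print(toy_word_breaks(sample, "Format"))
    print(toy_word_breaks(sample, "Other"))
```

In this model the sample yields no internal boundary when ZWSP is Format, and boundaries on both sides of ZWSP (indexes 2 and 3) when it is Other, which is the behaviour needed for ZWSP to mark word boundaries.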

131 Han Exemplar characters

Date/Time: Sun Nov 16 02:02:54 CST 2008
Contact: pool@utilika.org
Name: Jonathan Pool
Report Type: Public Review Issue
Subject: Han Exemplar Characters

In case it is of any use, Utilika Foundation would be happy to make available the tabulation of Han character frequencies in any particular languages from PanLex, a database containing about 11 million lemmata of lexemes in about 1250 languages. (Each lemma appears only once.) The tabulation for about 985,000 Mandarin lemmata contains about 15,000 Han characters that occur at least once, with frequencies ranging from 1 to about 47,000.

Date/Time: Mon Jan 26 11:25:31 CST 2009
Contact: pedberg@apple.com
Name: Peter Edberg
Report Type: Public Review Issue
Subject: Apple feedback on PRI #131

Re: #131 Han Exemplar characters

Apple supports the expansion of the CLDR exemplar sets to include more Han ideographs. As for which sets of characters to use, another option not mentioned in the background document is the set of Han ideographs currently in the CLDR collation data for each of the relevant languages. I think perhaps the best basis for the Han characters in the exemplar sets would be the union of the following:

a) the characters from relevant government-defined standards for commonly-used characters or characters for education;

b) the level 1 and 2 Han characters from the basic character sets;

c) the characters in the relevant CLDR collation data.

These three sets have a high degree of overlap, and I think the collation data for the most part already includes the characters from the other two sets.

Here is how these turn out numerically:

1. Japanese / Japan - Currently the CLDR main exemplar set for "ja" includes 1934 kanji.

a) Jōyō (everyday use) kanji set: 1945

b) JISX0208 level 1 kanji 2965, level 2 kanji 3384, extra kanji 6, total 6355.

c) CLDR ja-standard collation: 6355.

2. Chinese simplified / China mainland - Currently the CLDR main exemplar set for "zh" includes 2074 hanzi.

a) Primary school Changyong Hanzi 2500, middle school Cichangyong Hanzi adds 1000, standard list of Tongyong Hanzi adds 3500 for a total of 7000.

b) GB2312-1980 level 1 hanzi 3755, level 2 hanzi 3008, total 6763.

c) CLDR zh-gb2312han collation: 6769 (zh-standard collation has 20994 hanzi, but I think that includes traditional characters).

3. Chinese traditional / Taiwan - Currently the CLDR main exemplar set for "zh_Hant" includes 2106 hanzi.

a) For education the basic set is 4808 hanzi, the additional set is 6341 for a total of 11149.

b) Big5 / CNS11643 level 1 hanzi 5401, level 2 hanzi 7650 (+ 2 duplicates in Big5), total 13051.

c) CLDR zh-big5han collation 13060, zh-stroke collation 13057. For Hong Kong (zh_Hant_HK) the basic exemplar set would probably need more characters.
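The totals quoted above can be cross-checked with a few lines of Python. The figures are copied from this comment; the row labels and the function name are only illustrative.

```python
# (label, component counts, total stated in the comment)
STATED_SUMS = [
    ("JIS X 0208 kanji: level 1, level 2, extra", [2965, 3384, 6], 6355),
    ("Changyong, Cichangyong, Tongyong hanzi", [2500, 1000, 3500], 7000),
    ("GB2312-1980 hanzi: level 1, level 2", [3755, 3008], 6763),
    ("Taiwan education hanzi: basic, additional", [4808, 6341], 11149),
    ("Big5 / CNS 11643 hanzi: level 1, level 2", [5401, 7650], 13051),
]

# How many more characters the collation data holds than the basic character set.
COLLATION_SURPLUS = {
    "ja-standard": 6355 - 6355,
    "zh-gb2312han": 6769 - 6763,
    "zh-big5han": 13060 - 13051,
}


def mismatched_rows(rows):
    """Return the labels of rows whose components do not add up to the stated total."""
    return [label for label, parts, total in rows if sum(parts) != total]


if __name__ == "__main__":
    print(mismatched_rows(STATED_SUMS))
    print(COLLATION_SURPLUS)
```

All stated totals are consistent, and the collation sets exceed the level 1 and 2 character sets by only a handful of characters, which fits the observation that the three sources overlap heavily.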

132 Code Point Name/Label Options

Date/Time: Fri Nov 14 21:54:04 CST 2008
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Public Review Issue
Subject: pr-132

I favor Option C with two small changes. I see no sense in having two different properties, a long-established name property and a novel non-name property, where the non-name is the same as the name when there is a name, and otherwise the non-name is immutable when the codepoint is assigned and mutable when it isn't.

Let's just go with a single name property, and accept that names of the form RESERVED (NN)NNNN are temporary, whereas all other names are immutable. If we do that, I see no need for artificial hyphens in the names, though we are stuck with them in CJK UNIFIED IDEOGRAPH-4E00 and friends already. In any case, we wouldn't have both SURROGATE-D800 and SURROGATE D800, both because of formal rules and because it flies in the face of common sense.

Note that I write CAPS in these names because under Option C as modified, PRIVATE USE 10FFFF is a formal and immutable character name just like LATIN CAPITAL LETTER A.

The other change to Option C I propose: Use the usual names of the ISO controls where they exist. It is much clearer to use the name LINE FEED for U+000A rather than CONTROL 000A. Granted that under some ISO 2022 character set settings, U+000C may not actually be FORM FEED, but only niche character sets exercise that hypothetical freedom nowadays.
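A rough sketch of the single name property described in this comment, written in Python on top of the standard `unicodedata` module. The prefixes follow the examples given above (RESERVED, SURROGATE, PRIVATE USE, CONTROL, without hyphens); the control-name table is a hypothetical stand-in containing only the two names mentioned here, and the function name is likewise an assumption. Noncharacters fall under general category Cn and are not distinguished from reserved code points in this sketch.

```python
import unicodedata

# Hypothetical table of ISO control names; only the two cited in the comment.
CONTROL_NAMES = {0x000A: "LINE FEED", 0x000C: "FORM FEED"}

# Prefix used when a code point has no character name, keyed by General_Category.
PREFIX_BY_CATEGORY = {
    "Cc": "CONTROL",
    "Cs": "SURROGATE",
    "Co": "PRIVATE USE",
    "Cn": "RESERVED",
}


def code_point_name(cp):
    """Return a name for any code point under the modified Option C sketched above."""
    ch = chr(cp)
    name = unicodedata.name(ch, None)
    if name is not None:
        return name
    if cp in CONTROL_NAMES:
        return CONTROL_NAMES[cp]
    prefix = PREFIX_BY_CATEGORY.get(unicodedata.category(ch))
    if prefix is None:
        # The local unicodedata tables have no name for this assigned character.
        raise ValueError(f"no name available for U+{cp:04X}")
    return f"{prefix} {cp:04X}"


if __name__ == "__main__":
    for cp in (0x0041, 0x000A, 0x0001, 0xD800, 0xE000):
        print(f"U+{cp:04X}", code_point_name(cp))
```

Only the RESERVED names would change over time in this scheme, at the point where a character is assigned to the code point.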

Date/Time: Mon Jan 26 11:43:07 CST 2009
Contact: pedberg@apple.com
Name: Peter Edberg
Report Type: Public Review Issue
Subject: Apple feedback on PRI #132

Re: #132 Code Point Name/Label Options

With respect to the background document http://www.unicode.org/review/pr-132.html, Apple supports option A: Define a new derived Code Point Label property. Unicode already carefully distinguishes concepts such as character, code point, and code unit, so it should not be difficult for users to similarly distinguish Code Point Label from Name. This option clarifies and regularizes existing practice without changing the scope of applicability of the normative Name property, and without introducing issues such as non-immutability of the Name property for reserved code points (the name would change when a character is assigned to the code point).

133 Proposed Draft UTS #46: Unicode IDNA Compatible Preprocessing

No feedback was received via the reporting form this period.

Other Reports

Date/Time: Thu Nov 6 17:15:20 CST 2008
Contact: kent.karlsson14@comhem.se
Name: Kent Karlsson
Report Type: Error Report
Subject: Wrong script identification is given for Khutsuri characters

Scripts.txt says:

10A0..10C5 ; Georgian # L& [38] GEORGIAN CAPITAL LETTER AN..GEORGIAN CAPITAL LETTER HOE
2D00..2D25 ; Georgian # L& [38] GEORGIAN SMALL LETTER AN..GEORGIAN SMALL LETTER HOE

and "Georgian" here is an alias for "Geor".

However, these characters are not in the script
Geor 240 Georgian (Mkhedruli)
but instead are of the (closely related) script
Geok 241 Khutsuri (Asomtavruli and Nuskhuri)
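To check which ranges the data file assigns to a given script value, lines in the format quoted above can be parsed along the following lines. This is a sketch that assumes the layout of the two quoted lines (range, script value, then a comment holding the general category and a bracketed count); the regular expression and function name are not part of any Unicode tooling.

```python
import re

# Matches e.g. "10A0..10C5 ; Georgian # L& [38] GEORGIAN CAPITAL LETTER AN..."
SCRIPT_LINE = re.compile(
    r"^([0-9A-F]{4,6})(?:\.\.([0-9A-F]{4,6}))?\s*;\s*(\w+)\s*#\s*(\S+)\s*(?:\[(\d+)\])?"
)


def parse_script_line(line):
    """Return (first, last, script, count) or None if the line does not match."""
    m = SCRIPT_LINE.match(line)
    if m is None:
        return None
    first = int(m.group(1), 16)
    last = int(m.group(2), 16) if m.group(2) else first
    # A single code point carries no bracketed count, so it counts as 1.
    count = int(m.group(5)) if m.group(5) else 1
    return first, last, m.group(3), count


if __name__ == "__main__":
    line = "10A0..10C5 ; Georgian # L& [38] GEORGIAN CAPITAL LETTER AN..GEORGIAN CAPITAL LETTER HOE"
    first, last, script, count = parse_script_line(line)
    print(f"{first:04X}..{last:04X}", script, count, last - first + 1 == count)
```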

Date/Time: Tue Nov 11 20:11:38 CST 2008
Contact: kenw@sybase.com
Name: Ken Whistler
Report Type: Error Report
Subject: New named sequences have name error

Among the named sequences just accepted by the UTC are two with erroneous name formations.

025A 0300 LATIN SMALL LETTER HOOKED SCHWA WITH GRAVE
025A 0301 LATIN SMALL LETTER HOOKED SCHWA WITH ACUTE

U+025A is LATIN SMALL LETTER SCHWA WITH HOOK

True, it is a "hooked schwa" in some sense, but the normal rules for specifying a character name with two diacritics would lead instead to:

025A 0300 LATIN SMALL LETTER SCHWA WITH HOOK AND GRAVE
025A 0301 LATIN SMALL LETTER SCHWA WITH HOOK AND ACUTE

Please correct these named character sequences and propagate the required change into the Amd 7 ballot.
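The naming pattern described in this report can be sketched as a small Python helper. It is a deliberate simplification that covers only the case discussed here: if the base character name already contains WITH, the second diacritic is joined with AND, otherwise with WITH. The function name is hypothetical, and the short diacritic name (GRAVE, ACUTE) is supplied by the caller.

```python
def sequence_name(base_name, diacritic):
    """Build a sequence name from a base character name and a diacritic name."""
    joiner = " AND " if " WITH " in base_name else " WITH "
    return base_name + joiner + diacritic


if __name__ == "__main__":
    base = "LATIN SMALL LETTER SCHWA WITH HOOK"  # name of U+025A
    for diacritic in ("GRAVE", "ACUTE"):
        print(sequence_name(base, diacritic))
```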

Date/Time: Mon Nov 17 04:49:38 CST 2008
Contact: antemir@mail.ru
Name: Andy Popov, UniAlf project author
Report Type: Feedback on an Encoding Proposal
Subject: The Universal alphabet project

1. Now the English language becomes the real international language. It has one great advantage - it has very simple grammar.

2. But the English alphabet... requires modification!!! Benjamin Franklin even made his own project... So, among the other scripts, please don't forget the expanded English alphabet!!!

According to UniAlf, the expanded alphabet may contain 34 letters: all the Latin ones, one from Latin-B, one from the Greek charset, and maybe 4 Cyrillic. And some more...

You may laugh, but at first the Universal Alphabet (and the expanded English alphabet already) will work as a small subset of the GREAT CHARSET UNICODE!!!

Date/Time: Thu Dec 11 19:35:20 CST 2008
Contact: roozbeh@htpassport.com
Name: Roozbeh Pournader
Report Type: Error Report
Subject: Glyphs for U+075E and U+075F switched in 5.0
NOTE: An erratum was already issued for this on December 22, 2008.

It seems that the representative glyphs for the following characters, added in 4.1, were mistakenly switched in Unicode 5.0. The glyphs remain switched (and incorrect) in 5.1 charts.

The characters are:

U+075E ARABIC LETTER AIN WITH THREE DOTS POINTING DOWNWARDS ABOVE
U+075F ARABIC LETTER AIN WITH TWO DOTS VERTICALLY ABOVE

But the glyphs in the 5.0 and 5.1 charts show U+075E with two dots and U+075F with three dots.

The glyphs were all right in 4.1:
http://unicode.org/charts/PDF/Unicode-4.1/U41-0750.pdf

Date/Time: Sun Dec 21 07:38:27 CST 2008
Contact: hibernate@linuxmail.org
Name: Mattias
Report Type: Error Report
Subject: Roman Numeral Four (U+2163/U+2173)
NOTE: Ken Whistler already answered this.

In Roman numerals 4 = IIII, but if the number includes 4 but is not 4 (e.g. 14, 24) then 4 = IV. So 14 = XIV and 24 = XXIV, but 4 = IIII.

IV = U+2163,
iv = U+2173,
but IIII and iiii seem to be missing.

Date/Time: Tue Dec 23 02:10:10 CST 2008
Contact: asmus@unicode.org
Name:
Report Type: Error Report
Subject: Line Breaking Corrections

These come from a discussion with Laurentiu

1) Rule 0.2 in LineBreakTest.html provides a break opportunity at sot, seen in all test cases in LineBreakTest.txt. This is in contradiction to rule LB2. Rule LB2 states to never break at the start of text.

The test file should match the published definition of the algorithm.

2) The textual description of class PO mentions that

"Therefore the line breaking algorithm by default does not break between PO and numbers or letters on either side."

The words "or letters" should be stricken as they contradict the rules as published.

Date/Time: Fri Jan 16 15:22:49 CST 2009
Contact: roozbeh@htpassport.com
Name: Roozbeh Pournader
Report Type: Error Report
Subject: U+0616 incorrectly classified as Koranic
NOTE: Ken Whistler already answered this.

In both the 5.1 charts and the NamesList.txt file, U+0616 ARABIC SMALL HIGH LIGATURE ALEF WITH LAM WITH YEH is classified under the "Koranic annotation signs". While the character is similar in Unicode behavior to Koranic annotation marks, it's never used in Korans.

The character was instead used in early Persian orthographies, as can be confirmed from its proposal, L2/06-345R.

Date/Time: Wed Jan 21 14:12:07 CST 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Other Question, Problem, or Feedback
Subject: Shouldn't Cased and CaseIgnorable be in DerivedCoreProperties.txt?

These two properties are needed for a complete implementation of casing, so I would think they are core, and hence should be calculated by the Consortium and included in the data file of derived core properties.

Feedback on Encoding Proposals

Date/Time: Sun Nov 23 05:53:10 CST 2008
Contact: rishikesh@fedoraproject.org
Name: Rishikesh Sharma
Report Type: Feedback on an Encoding Proposal
Subject: Meitei Mayek Encoding Status

Hi Team,

I would like to know the current status or progress of the Meitei Mayek encoding in Unicode. In which version of Unicode will Meitei Mayek be supported, and what is the expected release date? I am planning a localization of Fedora Linux in Meitei Mayek for the people of Manipur. All the major Indian languages have been included already. If you need support or assistance, do let me know.

Warm Regards,
Rishikesh Sharma
Fedora Ambassador
Imphal, Manipur.
rishikesh@fedoraproject.org

Closed Public Review Issues

No feedback was received via the reporting form this period.
