L2/09-230

Comments on Public Review Issues
(May 6, 2009 - August 4, 2009)

The sections below contain comments received on the open Public Review Issues and other feedback as of August 4, 2009, since the previous cumulative document was issued prior to UTC #119 (May 2009).

Contents:

127 Proposed Update UAX #44: Unicode Character Database
     UAX #44 Text Comments
     UCD Comments
128 Proposed Update UTS #37: Unicode Ideographic Variation Database
133 Proposed Draft UTS #46: Unicode IDNA Compatible Preprocessing
134 Proposed Update UAX #9: Unicode Bidirectional Algorithm
135 Proposed Update UAX #11: East Asian Width
136 Proposed Update UAX #14: Unicode Line Breaking Algorithm
137 Proposed Update UAX #24: Unicode Script Property
138 Proposed Update UAX #29: Unicode Text Segmentation
139 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax
140 Proposed Update UAX #34: Unicode Named Character Sequences
141 Proposed Update UAX #38: Unicode Han Database (Unihan)
142 Proposed Update UAX #41: Common References for Unicode Standard Annexes
143 Proposed Update UTS #10: Unicode Collation Algorithm
144 Proposed Update UAX #42: Unicode Character Database in XML
145 Proposed Update UAX #15: Unicode Normalization Forms
146 Suggested Restructuring of Text in Chapter 3 for Clarification of Unicode Normalization
147 Proposed Deprecation of U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW
148 Unicode 5.2.0 Beta
149 Proposed Update UTS #22: Unicode Character Mapping Markup Language (CharMapML)
Other Reports
Feedback on Encoding Proposals
Closed Public Review Issues


127 Proposed Update UAX #44: Unicode Character Database

UAX #44 Text Comments

Date/Time: Wed Jun 17 12:37:21 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: TR44 comments

Section 4.2.7 says that binary properties list only those code points whose value is True, which is omitted. This is not correct. DerivedCoreProperties.txt now contains several properties, like Is_Uppercase, which list the code points whose value is False, and the value isn't omitted. It is specified as 'No'. For symmetry, shouldn't it be specified as 'N'?

Section 5.1, Table 6. The DerivedAge.txt commentary is misleading. TR18 says that the actual property isn't as given in this file, but the union of all previous 'ages'. I suggest spelling this out, or at least refer to TR18. (The term 'Age' was a poor choice of word for the property, as it indicates what the file contains, as opposed to what the property really means. Perhaps you could consider adding an alias that is more correctly descriptive of the meaning. I haven't thought of a short name, but something conveying "as old or older than")

In Section 5.2, it says that DerivedNumeric(Type|Value).txt are derived from UnicodeData.txt. This omits the fact that these two files also contain values derived from Unihan.txt (the derivations being defined in Table 6). This also implies that in case of mismatch, Unihan.txt also has precedence.

Section 5.6.1 says "Aliases for normative and informative properties defined in the Unihan data files are included in PropertyAliases.txt, beginning with Version 5.2." The beta version of this file only adds an alias for URS. I'm mostly unfamiliar with Unihan, so I'm guessing that this is the only thing in the db that is considered a property. (If that's what you were thinking, shouldn't e.g. kPrimaryNumeric be added as an alias for nt=nu in PropertyValueAliases?)

In section 5.8.1 there is a typo in "The Canonical_Combining_Class and Decomposition_Mapping of a character are immutable, because of their important to the stability of the Unicode Normalization Algorithm". 'important' should be 'importance'.

Date/Time: Wed Jun 17 13:57:43 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: TR44 comment on stabilization

I found the discussion of stabilized properties confusing. It seems to imply that, for example, no new code points which are cased will be added to the standard, and that isn't true. It says in Section 5.1 about Column 1, "Properties marked as stabilized in the first column are no longer actively maintained, nor are they extended as new characters are added." And the properties dealing with casing are mostly marked as stabilized. Together this seems to me to say that no casing rules will be promulgated for new code points.

Date/Time: Wed Jun 17 14:54:24 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: TR44 ISO Comment field

The beta version of UnicodeData.txt has no entries in the ISO comment field. Either they should be added back or TR44 Section 7 changed to indicate that they have been removed.

UCD Comments

Date/Time: Sun Jun 7 16:08:42 CDT 2009
Contact: rcmuir@gmail.com
Name: Robert Muir
Subject: U+FDF2 Script property

Arabic Presentation Forms A - Word ligatures U+FDF2 (..) ARABIC LIGATURE ALLAH ISOLATED FORM

U+FDF2 is on the Dhivehi keyboard layouts: http://www.mcst.gov.mv/News_and_Events/xpfonts.htm

The script value in Scripts.txt for this codepoint is Arabic.

Is it possible for this value to be changed to Common for this codepoint?

Or instead, perhaps there could be an ISO 15924 script code that represents the composite Thaana + Arabic, although this definitely seems overkill.

The current situation is frustrating for software development, identifying runs of scripts in multilingual text, etc.

Date/Time: Wed Jun 10 22:15:24 CDT 2009
Contact: leob@mailcom.com
Name: Leo Broukhis
Report Type: Public Review Issue
Subject: Unicode beta 5.2.0

Two issues:

1. The index file
Index-5.2.0d1.txt 16-Sep-2008 14:53 145K
has not been rebuilt

2. U+23E8 DECIMAL EXPONENT SYMBOL property should be Sm, not So.

Date/Time: Mon Jun 15 13:30:23 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: Make NaN a real alias in PropertyValueAliases

I think there should be an entry like
nv ; NaN ; Not_A_Number

I've had to special case this in my code that reads this file since it is only in an #@missing comment here and in 5.1. Perhaps you did this because there is no official full name, yet in some version's UCD.html it did say "Not a Number" without the NaN. The other numeric property, ccc, has a default of 0, and it is specified in this file. By symmetry, this should be as well. And I think there should be a full name.

Date/Time: Wed Jun 17 17:53:53 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: New binary properties in PropertyValueAliases missing aliases

There are errors with some of the new property values in PropertyValueAliases.txt. I don't believe any of the new binary properties have the T and F, True and False aliases for them; and several, such as IUC have e.g., "No; No" instead of "N; No"

Date/Time: Mon Jul 13 11:39:53 CDT 2009
Contact: carl.oehlander@sap.com
Name: Carl Oehlander
Subject: Character U+24EA Reported as Neutral

Hi,

all other circled/parenthesized characters are reported as having an 'ambiguous' width:

U+2460 to U+24E9.

According to this file: http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt

The character "24EA;N # CIRCLED DIGIT ZERO", although it has the very same properties as its predecessors, is reported to have a 'neutral' width.

This doesn't make sense to me. Must be a bug? ICU is using this implementation and it is causing headaches.

Probably hasn't been noticed since these chars are hardly used.

Thank you, Carl

Date/Time: Mon Jul 13 14:58:37 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Subject: syntax errors in PropertyValueAliases.txt

Several weeks ago I submitted some errors and omissions in this file. Unlike other submissions, I didn't get a response from you noting that they had been received (an autoresponder would be helpful to me), and that file has since been updated without the corrections. I would expect that these corrections would be non-controversial and wouldn't have to be approved by a committee; I would also expect that you would want to make such corrections as soon as possible in the beta period, so that by its end you would have something that has been vetted by everyone. So I am submitting again. This time I am submitting what I believe should be the correct lines:

ICF; N        ; No                               ; F                                ; False
ICF; Y        ; Yes                              ; T                                ; True
ILC; N        ; No                               ; F                                ; False
ILC; Y        ; Yes                              ; T                                ; True
ITC; N        ; No                               ; F                                ; False
ITC; Y        ; Yes                              ; T                                ; True
IUC; N        ; No                               ; F                                ; False
IUC; Y        ; Yes                              ; T                                ; True
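
Note: the check described in this report can be sketched in a few lines. The following is only an illustration, assuming the semicolon-separated layout shown in the lines above; the function names and the EXPECTED table are hypothetical and are not part of any UCD tooling. [ed]

```python
# Hypothetical checker: every binary property should carry the full
# alias set N/No/F/False and Y/Yes/T/True, as proposed in the report.
EXPECTED = {
    "N": ["N", "No", "F", "False"],
    "Y": ["Y", "Yes", "T", "True"],
}

def parse_alias_line(line):
    """Split one alias line into (property, [aliases]); None for blank/comment lines."""
    body = line.split("#", 1)[0].strip()
    if not body:
        return None
    fields = [field.strip() for field in body.split(";")]
    return fields[0], fields[1:]

def check_binary_aliases(lines):
    """Return the (property, aliases) entries that lack the full alias set."""
    problems = []
    for line in lines:
        parsed = parse_alias_line(line)
        if parsed is None:
            continue
        prop, aliases = parsed
        short_name = aliases[0] if aliases else None
        if aliases != EXPECTED.get(short_name):
            problems.append((prop, aliases))
    return problems

sample = [
    "IUC; N        ; No                               ; F                                ; False",
    "IUC; Y        ; Yes                              ; T                                ; True",
    "IUC; No       ; No",  # the defective shape described in the earlier report
]
problems = check_binary_aliases(sample)
print(problems)
```

Only the third sample line is reported, because its short alias is "No" rather than "N" and the T/F aliases are absent.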

Date/Time: Mon Jul 13 19:01:54 CDT 2009
Contact: asmus@unicode.org
Name: Asmus Freytag
Report Type: Public Review Issue
Subject: Defective "@missing" directive

In DerivedNormalizationProps.txt there is a "@missing" directive which is defective

The current directive is

# @missing: 0000..10FFFF; <codepoint>

what was intended is

# @missing: 0000..10FFFF; NFKC_CF; <codepoint>

Reason: the file lists multiple properties. As stated, the default value <codepoint> cannot be unambiguously related to any of these properties.

The pattern for correct "@missing" directives for files with multiple properties is

# @missing: 0000..10FFFF; Is_NFKC_Casefold; Yes

which gives both a property alias and a value.
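
Note: a parser that accepts both shapes of the directive might look like the sketch below. It assumes only the two layouts quoted in this report (range plus value, or range plus property plus value); the function name parse_missing is hypothetical. [ed]

```python
def parse_missing(line):
    """Parse a '# @missing:' line into (range, property_or_None, default_value).

    Returns None when the line is not an @missing directive.
    """
    prefix = "# @missing:"
    if not line.startswith(prefix):
        return None
    fields = [field.strip() for field in line[len(prefix):].split(";")]
    if len(fields) == 2:
        # No property named: only unambiguous in a single-property file.
        code_range, value = fields
        return code_range, None, value
    if len(fields) == 3:
        code_range, prop, value = fields
        return code_range, prop, value
    raise ValueError("unexpected @missing layout: " + line)

defective = parse_missing("# @missing: 0000..10FFFF; <codepoint>")
intended = parse_missing("# @missing: 0000..10FFFF; NFKC_CF; <codepoint>")
print(defective)
print(intended)
```

A reader of a multi-property file can then reject any directive whose property slot is None, which is the ambiguity the report describes.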

Date/Time: Mon Jul 20 13:00:39 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: Please don't create properties that begin with 'Is_'

The Perl 5 language's API uses 'Is_' prefixed to a binary property name as a synonym for that property. For example, in regular expressions, \p{Is_Uppercase} is the same thing as \p{Uppercase}, which means the same thing as \p{Uppercase=Y}. This API dates back to Perl's first support of Unicode, about a decade. Having an additional Is_Uppercase property, as proposed in 5.2, would lead to two different properties with the same name. Unicode encourages additional aliases for properties and property values, and Perl has done that.

I presume that the reason for this long-standing Perl practice is that intuitively something that is Uppercase "Is_Uppercase". (It's hard to describe the Uppercase property without using the verb "to be".) Having them mean different things is confusing because it runs contrary to the plain meaning of English, and it would be confusing even if Perl didn't have this construct.

And, the derivations of these properties indicate that, eg., 'Is_Uppercase' doesn't really mean what its name implies. A name, except for Orwellian uses, should clarify, not obfuscate.

It seems clear then, that no property or property value name should begin with 'Is_'. The name without the 'Is_' implies the 'is-ness' of the thing already.

Date/Time: Sun Jul 26 14:00:05 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: Out-of-order code points in ArabicShaping.txt

This is a small thing, but I thought I'd bring it to your attention. It appears that the convention in the UCD is that within each section of each file code points are listed in increasing ordinal order. There is one place where two code points are not in this order, and that is the final two code points in ArabicShaping.txt. 200D comes before 200C. Again, I don't know if it is your goal to keep code points in order or not, but this appears to violate it, if it is a goal.
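
Note: an ordering check of this kind is easy to automate. The sketch below assumes that each data line starts with a hexadecimal code point (or a range written with "..") followed by a semicolon; the sample lines are placeholders that carry only the two code points named in the report, not the real file contents. [ed]

```python
def out_of_order(lines):
    """Return (previous, current) code point pairs that break ascending order."""
    breaks = []
    previous = None
    for line in lines:
        body = line.split("#", 1)[0].strip()
        if not body:
            continue  # skip blank lines and comment-only lines
        first_field = body.split(";", 1)[0].strip()
        start = int(first_field.split("..")[0], 16)
        if previous is not None and start < previous:
            breaks.append((previous, start))
        previous = start
    return breaks

# Placeholder lines: the fields after the code point are deliberately elided.
sample = [
    "200D; ...",
    "200C; ...",
]
breaks = out_of_order(sample)
print([(hex(a), hex(b)) for a, b in breaks])
```

The check treats its input as a single section; a file with several sections would need to be split before calling it.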

No response is necessary.

Date/Time: Tue Jul 28 20:34:02 CDT 2009
Contact: freakrob@gmail.com
Name: Robert Abel
Report Type: Public Review Issue
Subject: Unicode 5.2.0 beta: U+1F190 description error

Note: I believe this has already been fixed in the data. [ed]

The proposed description (on p. 600) for U+1F190 reads:

<square> 0040 @ 004A J

which should obviously read

<square> 0044 D 004A J

Regards,

Robert Abel

Date/Time: Tue Jul 28 22:09:21 CDT 2009
Contact: javier@khmeros.info
Name: Javier SOLA
Report Type: Public Review Issue
Subject: ZWSP as a word boundary

For 5.2 Beta

The chart for punctuation (U2000.pdf) says of code-point 200B (ZWSP):

"This character is intended for line break control;"

It should also include its property as a word boundary, as confirmed by the UTC this year. It could be rewritten as:

"This character is intended for invisible word separation and line break control."

Date/Time: Mon Aug 3 13:40:20 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: APIs shouldn't have to derive properties

In Beta 5.2, there are still some property value tables listed in PropertyValueAliases.txt which must be derived. These are the ones which have single-character general categories, like gc=L, and also gc=LC. Even though it is trivial to derive them, I believe this should be done once, centrally by Unicode. The rule should be that any property or property value exposed in this file should have a full listing of code points matching it in some file.
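
Note: the derivation the submitter calls trivial can be sketched as follows. This uses Python's standard unicodedata module, which reflects whatever Unicode version ships with the interpreter rather than the 5.2 beta; the helper names are hypothetical. gc=L is taken as the union of Lu, Ll, Lt, Lm and Lo, and gc=LC as the union of Lu, Ll and Lt. [ed]

```python
import unicodedata

# Groupings of two-letter General_Category values.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo"}   # gc=L
CASED_LETTER_CATEGORIES = {"Lu", "Ll", "Lt"}         # gc=LC

def in_group(ch, categories):
    """True if the character's General_Category is one of the given values."""
    return unicodedata.category(ch) in categories

def collect(categories, limit):
    """List the code points below 'limit' whose category falls in the group."""
    return [cp for cp in range(limit) if in_group(chr(cp), categories)]

ascii_letters = collect(LETTER_CATEGORIES, 0x80)
print(len(ascii_letters))
print(in_group("\u05D0", LETTER_CATEGORIES), in_group("\u05D0", CASED_LETTER_CATEGORIES))
```

Every implementation that needs gc=L or gc=LC has to repeat some version of this step, which is the duplication the report objects to.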

Date/Time: Mon Aug 3 14:33:16 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: The Aliases files should list all exposed properties

PropertyAliases.txt and PropertyValueAliases.txt should list all properties and non-numeric property values exposed in the UCD. Currently missing from PropertyAliases is Name_Alias. It doesn't appear that named sequences have a separate property name. I thought that Name_Alias wasn't listed because there is apparently no short form; but Math and Hyphen are listed without short forms either.

Missing from PropertyValueAliases.txt is NaN, which I wrote about earlier, proposing Not_A_Number as a long name for that, but if the UTC doesn't want to use a long name, it could still appear in this file like: nv; n/a ; NaN

I'm not saying that the numeric-only values in the ccc and nv properties should be listed. I think it's fine the way it is now.

Date/Time: Mon Aug 3 14:38:45 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: Documentation wrong about Age property

I'm submitting this to make sure it doesn't get lost. Here is an email from Ken Whistler:

Karl Williamson noted:

>> > > Apparently that is what Asmus and others think as well,

Add me to that list.

>> > > and it certainly
>> > > is the data that comes in DerivedAge.txt,

And in the XML data derived from it, as well -- which is Eric's point, I think.

>> > > and if that were truly the
>> > > case, I wouldn't have any problem with the term "Age".

Well, then you're all set! ;-)

>> > > But let me quote
>> > > from the header of that file:
>> > > # Caution: When using the Age *property*, all assigned code points
>> > > # in each version are included, not just the newly assigned code points.
>> > > # For more information, see http://www.unicode.org/reports/tr18/
>> > >
>> > > And, if you look at tr18, it says:
>> > >
>> > > "
>> > > Caution: The DerivedAge data file in the UCD provides the deltas between
>> > > versions, for compactness. However, when using the property all
>> > > characters included in that version are included. Thus \p{age=3.0}
>> > > includes the letter a, which was included in Unicode 1.0. To get
>> > > characters that are new in a particular version, subtract off the
>> > > previous version as described in 1.3 Subtraction and Intersection. For
>> > > example: [\p{age=3.1} -- \p{age=3.0}]
>> > > "
>> > >
>> > > So either you guys are wrong, or the documentation is wrong in at least
>> > > two places.

The documentation is wrong in two places -- or at least misleading. Note that it doesn't actually say the property is *defined* thus and such, but rather that "when using the property all characters included in that version are included." That amounts to a pocket definition of a new derived property (or actually set of properties) based on the use of the Age property per se.

This is one of these cases where an insufficiently carefully documented property is trying to have it both ways.

Age is an enumerated property in the UCD. Among other things, that means that its values constitute a codespace partition. Each code point has one and only one value of the property. Both the values in DerivedAge.txt and in the XML data files reflect that interpretation.

The property defined that way is not, however, as useful as the property described the way it is used for regex matches in UTS #18, because it is far more useful for regex matches to know if a character is included in Unicode Version X (or any *earlier* version), rather than to know if it was encoded exactly in Version X. So the usage of the Age property in UTS #18 just blithely assumes that interpretation, and the caution at the top of DerivedAge.txt reflects that interpretation, even though it is in direct contradiction with the data itself.

Note that there are no character properties in the UCD actually defined the way the Caution at the top of DerivedAge.txt currently implies Age is interpreted. If you think this through, for example, interpreted that way, U+0041 would have multiple Age property values: 1.1, 2.0, 2.1, 3.0, 3.1, 3.2, 4.0, 4.1, 5.0, 5.1, and soon, 5.2, because it would match a \p{age=n.n} expression for any of those values. Every character would continue to accumulate new Age values as future versions of the standard are published.

>> > > I have to assume that the documentation is right until
>> > > shown otherwise; and if it is correct, I think that proves my point. If
>> > > experienced people who work with Unicode all the time don't understand
>> > > what this property is, then something is wrong, and at a minimum a new
>> > > alias is needed to clarify things.

There is definitely need for clarification here.

>> > > I also don't think that in these days of abundant cheap storage that the
>> > > Consortium should be worrying about compactness.

Compactness is not the primary concern driving maintenance of UCD properties (and files) by the way.

>> > > I believe every
>> > > property that is exposed in the UCD should have a fully derived version
>> > > available, probably in the extracted directory. In 5.2 Beta, the only
>> > > properties and property values that the user has to derive (except for
>> > > defaults) are Age, gc=LC, gc=C, gc=L gc=M, gc=N, gc=P, gc=S, and gc=Z.

However, none of those are actually property values per se. They are certainly not *extracted* values.

Each of those is a different kind of derived property value.

So gc=L (which I assume you meant, rather than "gc=LC") is actually not a value of General_Category proper at all, but rather the union of the set of characters with five different values:

(gc=Lu) | (gc=Ll) | (gc=Lt) | (gc=Lo) | (gc=Lm)

While it is certainly easy to derive such sets from the data, it is also perfectly reasonable to ask for pre-derived listings of such derived values in the UCD. It would be up to the UTC to decide whether the extra work to maintain additional derived values for each release is worth the benefit in such cases. Note that ICU provides a generic Unicode set notation that makes it trivial to construct such sets.

Also, regarding "Age", what you are asking in this case would be not *one* derived property, but rather a distinct derived binary property for *each* Unicode version. I.e.:

Included_In_Version_1_1 --> (Age=1.1)

Included_In_Version_2_0 --> (Age=1.1) | (Age=2.0)

Included_In_Version_2_1 --> (Age=1.1) | (Age=2.0) | (Age=2.1)

Included_In_Version_3_0 --> (Age=1.1) | (Age=2.0) | (Age=2.1) | (Age=3.0)

etc., etc., for each succeeding version.

IMO, it isn't actually worth the effort to define and maintain such a list of derived property values (or equivalently, just the sets of characters, without actually *naming* the properties they assume), when the derivations are so trivial based on the existing DerivedAge.txt file. This is especially true for that particular file, because all you have to do is delete all the entries below the Age of concern, and the entries above it define your set in question. No programming necessary. :-)

>> > > There should be files in the extracted directory that show the derived
>> > > values for all of them. There are bound to be mistakes made when
>> > > programmers re-derive them; and there is duplicated work as well. This
>> > > Age property is a case in point. I wonder how many implementations
>> > > there are out there that have it wrong.

Not too many, I would wager -- since most of them would be using one or the other of the two interpretations, and would have picked the one they wanted to accomplish what they were after. It is rather unlikely that there are many applications out there using an interpretation "all characters included in Version 3.0", but which are then blindly using Age=3.0 values from DerivedAge.txt, ignoring all the characters with Age=1.1, 2.0, or 2.1, for example.

--Ken
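
Note: the two interpretations of Age discussed in this exchange can be contrasted in a short sketch. It assumes DerivedAge-style lines of the form "range ; version"; the sample lines are simplified stand-ins rather than verbatim file contents, and the function names are hypothetical. [ed]

```python
def parse_version(text):
    """Turn '3.1' into (3, 1) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))

def parse_age_lines(lines):
    """Return (start, end, version) triples from DerivedAge-style lines."""
    entries = []
    for line in lines:
        body = line.split("#", 1)[0].strip()
        if not body:
            continue
        code_range, version = [field.strip() for field in body.split(";")]
        start, _, end = code_range.partition("..")
        entries.append((int(start, 16), int(end or start, 16), parse_version(version)))
    return entries

def assigned_exactly_in(entries, version):
    """Partition reading: ranges whose Age value equals the version."""
    wanted = parse_version(version)
    return [(s, e) for s, e, v in entries if v == wanted]

def included_in_version(entries, version):
    """Cumulative reading used for regex matching: that version or any earlier one."""
    wanted = parse_version(version)
    return [(s, e) for s, e, v in entries if v <= wanted]

# Simplified stand-in lines, not verbatim file contents.
sample = [
    "0041..005A ; 1.1",
    "20AC       ; 2.1",
    "0220       ; 3.2",
]
entries = parse_age_lines(sample)
print(assigned_exactly_in(entries, "2.1"))
print(included_in_version(entries, "3.0"))
```

With this sample, the partition reading of 2.1 yields only U+20AC, while the cumulative reading of 3.0 yields both U+0041..U+005A and U+20AC.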

Date/Time: Mon Aug 3 15:07:28 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: more on the Is_... properties

I understand that there have been a number of comments unhappy with the new property names exposed in 5.2 that begin with 'Is_'. I myself have commented previously about them. I think I proposed some possible other names, and I've seen some others also proposed. My comment now is that it would be better to not expose these properties in the 5.2 UCD, but to delay until such time as there is a consensus acceptance of better names; I'm not sure there is time to do this before the scheduled 5.2 availability date.

Date/Time: Fri Jul 31 00:51:21 CDT 2009
Contact: jamadagni@gmail.com
Name: Shriramana Sharma
Subject: Make 0903 DEVANAGARI SIGN VISARGA independent

Hello.

I wish to request that the general category of the character 0903 DEVANAGARI SIGN VISARGA be changed from Mc to Lo, because there are many sequences where it is necessary to display a separate visarga:

a. See: http://sanskritweb.net/yajurveda/tb-3-06.pdf page 2 line 5 from bottom (excluding footer) and same document page 3 line 6 from top. Here visarga follows the Vedic sign SPACING CHANDRABINDU (which is not yet encoded, IIRC).

The PDF document clearly shows that it has been impossible to render the VISARGA properly.

b. Same document, page 1, end of first Devanagari line shows: 0924 TA + 0951 + 0903. Here the typesetter of the files has used an independent visarga to typeset this properly as said by him at http://sanskritweb.net/itrans/itmanual2003.pdf page 131. Otherwise, the combination 0951 + 0903 is not rendered properly in Unicode applications like MS Word. This should actually not be a problem seeing that 0951 is category Mn, and 0903 is also of category M (Mc), and so both marks should coexist peacefully. But there is some problem here and an independent visarga should solve it.

Therefore I request you to kindly change the general category of the character 0903 DEVANAGARI SIGN VISARGA from Mc to Lo in Unicode 5.2.0 at least.

128 Proposed Update UTS #37: Unicode Ideographic Variation Database

No feedback was received via the reporting form this period.

133 Proposed Draft UTS #46: Unicode IDNA Compatible Preprocessing

No feedback was received via the reporting form this period.

134 Proposed Update UAX #9: Unicode Bidirectional Algorithm

Date/Time: Wed Jun 24 00:38:13 CDT 2009
Contact: matial@il.ibm.com
Name: Matitiahu Allouche
Report Type: Technical Report or Tech Note issues
Subject: comments about UAX#9 - tr9-20

In section 3 "Basic Display Algorithm", there is the sentence: <quote>Reordering. The text within each paragraph is reordered for display. Once the text in the paragraph is broken into lines, the resolved embedding levels are used to reorder the text of each line for display.</quote>

IMHO, it is not immediately clear that the first sentence is an introduction which is detailed by the second sentence. I suggest the following rephrasing:

<replacing text>Reordering. The text within each paragraph is reordered for display: first, the text in the paragraph is broken into lines, then the resolved embedding levels are used to reorder the text of each line for display.</replacing text>

Date/Time: Wed Jun 24 00:46:41 CDT 2009
Contact: matial@il.ibm.com
Name: Matitiahu Allouche
Report Type: Technical Report or Tech Note issues
Subject: comments about UAX#9 - tr9-20

In Table 4 "Bidirectional Character Types", the General Scope entry for BN mentions "bidi controls". Since the most important bidi controls (LRM, RLM, LRE, RLE, LRO, RLO, PDF) are not of type BN, I suggest replacing "bidi controls" with "some bidi controls".

Date/Time: Wed Jun 24 00:59:21 CDT 2009
Contact: matial@il.ibm.com
Name: Matitiahu Allouche
Subject: comments about UAX#9 - tr9-20

The file BidiTest.txt mentioned in section 4.4 "Bidi Conformance Testing" does not seem to be available anywhere on the Unicode site.

135 Proposed Update UAX #11: East Asian Width

No feedback was received via the reporting form this period.

136 Proposed Update UAX #14: Unicode Line Breaking Algorithm

See also L2/09-263, L2/09-265

Date/Time: Sun Aug 2 12:50:12 CDT 2009
Contact: asmus@unicode.org
Name: Asmus Freytag
Report Type: Public Review Issue
Subject: UAX#14 proposed 5.2.0 update

In section 1, there should be mention of the new section 10.

The wording of the new rule LB30 is ambiguous. The text of the rule refers to "closing punctuation" but the class in the rules is not CL but CP.

To fix, replace "punctuation" by "parenthesis" or make similar changes. As both the terms, when coupled with the word "closing" are now the long name of an LB class, the text needs to be very explicit about whether these terms are used as a long name for an LB class or in their customary sense.

There are possibly other instances of this in the text, so it's worth a check.

I see that a large number of characters have been removed from the listing of EX. If the intent is to make these typical specimens, then the list should include the compatibility characters. The choice to include these in EX isn't otherwise obvious "by analogy".

137 Proposed Update UAX #24: Unicode Script Property

See also L2/09-255

No (other) feedback was received via the reporting form this period.

138 Proposed Update UAX #29: Unicode Text Segmentation

No feedback was received via the reporting form this period.

139 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax

Date/Time: Tue Jun 23 15:48:48 CDT 2009
Contact: cowan@ccil.org
Name: John Cowan
Report Type: Public Review Issue
Subject: Updating UAX #31

1) References to "XML 1.1" should be augmented with references to "XML 1.0 5th Edition or later", as they have the same rules for identifiers, and 1.0 5e is expected to be more widely implemented.

2) The Inherited script code is now Zinh in ISO 15924. References to Qaai here and elsewhere should be replaced for Unicode 5.2.

Date/Time: Wed Jun 24 02:39:41 CDT 2009
Contact: gihan@icta.lk
Name: Gihan Dias
Report Type: Public Review Issue
Subject: Re: Proposed Update Unicode Standard Annex #31

I went over the above document (Revision 10 dated 2009-06-22) and have the following comments.

1. in Section 2.3 B, the name of the script has been changed as follows:

"For example, the Sinhalese word"

The accepted name of the language and the script is "Sinhala"; the use of the term "Sinhalese" is deprecated. Therefore, please revert to the original name, Sinhala:

"For example, the Sinhala word"

2. Section 2.3 Layout and Format Control Characters

Considering the importance of these characters in the correct representation of text, I consider the language of this section to be too weak.

I propose that any implementation which supports a given script (e.g. Sinhala) *must* support the use of layout and format control characters needed for that script. It should *not* otherwise claim conformance to UAX31. We then also need to specify a list of scripts for which join control characters should be supported.

My proposed text is appended below.

Regards,

Gihan Dias ICT Agency of Sri Lanka

Certain Unicode characters are known as Default_Ignorable_Code_Points. These include variation selectors and control-like characters used to control joining behavior, bidirectional ordering control, and alternative formats for display (having the General_Category value of Cf). The recommendation is to permit them in identifiers only in special cases, listed below. The use of default-ignorable characters in identifiers is problematical because the effects they represent are normally just stylistic or otherwise out of scope for identifiers. It is also possible to misapply these characters such that users can create strings that look the same but actually contain different characters, which can create security problems. In such environments, identifiers should also be limited to characters that are case-folded and normalized with NFKC. For more information, see Section 5, Normalization and Case and UTR# 36: Unicode Security Considerations [UTR36].

For these reasons these characters are normally excluded from Unicode identifiers. However, visible distinctions created by certain format characters (particularly the Join_Control characters) are necessary and make necessary distinctions in certain languages. A blanket exclusion of these characters makes it impossible to create identifiers based on certain words or phrases in those languages. Identifier systems which allow the following scripts should allow these characters, but limited to particular contexts where they are necessary.

Scripts: Sinhala [:script=Sinh:] , Malayalam [:script=Mlym:] ...

...
...

Thus for such circumstances, an implementation shall allow the following Join_Control characters, in the limited contexts as specified in A1, A2, and B below:

140 Proposed Update UAX #34: Unicode Named Character Sequences

No feedback was received via the reporting form this period.

141 Proposed Update UAX #38: Unicode Han Database (Unihan)

See also L2/09-257

Date/Time: Fri May 22 18:56:06 CDT 2009
Contact: tagbox@gmail.com
Name: Tiaan Geldenhuys
Subject: Unihan database definition entry for U+65E0

Hello,

The Unihan Database's kDefinition entry for character U+65E0 incorrectly states that it is "KangXi radical 7", while this should probably read "KangXi radical 71" instead.

Regards,
Tiaan Geldenhuys.

Date/Time: Thu Jul 16 00:01:08 CDT 2009
Contact: asmus@unicode.org
Name: Asmus Freytag
Report Type: Public Review Issue
Subject: Filename vs. comments (Unihan)

The partial Unihan files have internal comments like this:

# DictionaryIndices.txt
# Date: 2009-06-07 09:36:08 UDT [JHJ]

however, their actual filenames in the zip file are like this:

Unihan_DictionaryIndices.txt

This is inconsistent with other usage where filename and comment agree.

I personally would prefer to have this resolved such that *both* filename and comment contain the Unihan_ prefix. (My parser looks for the string "Unihan" in the comments to verify that a user is opening a file with Unihan file layout, and it would be nice if that kept working.)

The more important reason to retain the "Unihan_" is that otherwise names like "NormativeProperties.txt" could lead to misunderstandings, while Unihan_NormativeProperties.txt is perfectly clear.

PS: I would like, at this point, to request that the edcomm adopt a policy for future file naming whereby *all* UCD files use *unique* filenames that are not dependent on the directory structure within a single version of the UCD. In other words, it should be possible to "flatten" the UCD and still avoid filename collisions.
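
Note: the "flatten" test proposed in the PS could be checked mechanically. The sketch below is only an illustration; the paths in the sample are hypothetical and do not describe the actual UCD layout, and names are compared case-insensitively on the assumption that some file systems do the same. [ed]

```python
from collections import defaultdict
from pathlib import PurePosixPath

def flattened_collisions(paths):
    """Map each base filename used more than once to the paths that share it."""
    by_name = defaultdict(list)
    for path in paths:
        # Compare in lower case, assuming a case-insensitive target file system.
        by_name[PurePosixPath(path).name.lower()].append(path)
    return {name: found for name, found in by_name.items() if len(found) > 1}

# Hypothetical directory listing, for illustration only.
sample_paths = [
    "ReadMe.txt",
    "extracted/ReadMe.txt",
    "Unihan_DictionaryIndices.txt",
]
collisions = flattened_collisions(sample_paths)
print(collisions)
```

An empty result would mean the listing can be flattened into one directory without any file overwriting another.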

Date/Time: Wed Jul 29 15:11:02 CDT 2009
Contact: VYV03354@nifty.ne.jp
Name: Masatoshi Kimura
Report Type: Public Review Issue
Subject: Possible error of IRG source mappings

All IRG source mappings for compatibility characters are removed from Unihan 5.2.0 Beta although they remained in CJKC_SR.TXT.

For example, CJKC_SR.TXT contains IRG TSource for U+2F800. (from CJKC_SR.TXT attached to ISO/IEC 10646:2003 Amd.5:2008)

> > 2F800;04E3D;T6-2936;;;;;

But it is no longer present in Unihan 5.2.0 Beta. (from 5.1.0-5.2.0.unihan.changes.diffs)

> > kIRG_TSource 2F800 '6-2936' -> ''

They should be restored if it is not intentional.
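Removals of this kind can be listed automatically from the changes file. The Python sketch below assumes that every change line has the shape of the single line quoted above (field name, code point, old value and new value in single quotes, separated by "->"). The real layout of the diffs file may differ, and the function name is hypothetical.

```python
import re

# Assumed line shape, taken from the one example quoted above:
#   kIRG_TSource 2F800 '6-2936' -> ''
CHANGE_RE = re.compile(r"^(\w+)\s+([0-9A-Fa-f]+)\s+'([^']*)'\s+->\s+'([^']*)'$")


def removed_values(diff_lines, field="kIRG_TSource"):
    """Return (code point, old value) pairs whose value changed to empty."""
    removed = []
    for raw in diff_lines:
        # Drop any leading "> " quote markers and surrounding whitespace.
        line = raw.lstrip("> ").strip()
        match = CHANGE_RE.match(line)
        if not match:
            continue  # not a change line in the assumed format
        name, code_point, old, new = match.groups()
        if name == field and old and not new:
            removed.append((int(code_point, 16), old))
    return removed


sample = ["> > kIRG_TSource 2F800 '6-2936' -> ''"]
for code_point, old in removed_values(sample):
    print(f"U+{code_point:04X} lost {old!r}")
```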

142 Proposed Update UAX #41: Common References for Unicode Standard Annexes

No feedback was received via the reporting form this period.

143 Proposed Update UTS #10: Unicode Collation Algorithm

No feedback was received via the reporting form this period.

144 Proposed Update UAX #42: Unicode Character Database in XML

No feedback was received via the reporting form this period.

145 Proposed Update UAX #15: Unicode Normalization Forms

No feedback was received via the reporting form this period.

146 Suggested Restructuring of Text in Chapter 3 for Clarification of Unicode Normalization

No feedback was received via the reporting form this period.

147 Proposed Deprecation of U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW

Subject: ADD to public review feedback on PRI #147
Date: Fri, 31 Jul 2009 15:19:16 -0700
From: Deborah W. Anderson <dwanders@sonic.net>

The following email provides information on the use of U+0673 ARABIC LETTER ALEF WITH WAVY HAMZA BELOW in Balochi, which is one of the two languages listed in the annotation for U+0673. (This information is pertinent as the PRI had requested data on how widespread usage of this character is.)

Debbie Anderson

------ Forwarded Message
From: <ebashir@uchicago.edu>
Date: Sun, 21 Jun 2009 01:59:35 -0500 (CDT)
Subject: Re: Kashmiri in Unicode

I have never seen aleph with wavy hamza below used for Balochi. Balochi has a vowel system very like that of Urdu or Persian. I'm fairly certain about this.

Elena

E. Bashir, Ph.D.
Dept. of South Asian Languages and Civilizations
University of Chicago
1130 E. 59th St., #214
Chicago, IL 60637

 

148 Unicode 5.2.0 Beta

See also L2/09-254, L2/09-263, L2/09-265

Date/Time: Sat May 23 13:31:32 CDT 2009
Contact: kent.karlsson14@comhem.se
Name: Kent Karlsson
Subject: Terminology inconsistencies regarding cursive joining in TUS5/UCD.

The terminology in TUS 5 and in the UCD regarding cursive joining (shaping) is not fully consistent, and several terms are used for the same concept. On some occasions, the wrong concept is used too. See below for details.

----------

ArabicShaping.txt, DerivedJoiningType.txt, PropertyAliases.txt:
joining type (4 instances), shaping class (1 instance),
Joining_Type

TUS 5, chapter 8: joining class (20 instances), joining type
(2 instances), shaping class (1 instance)

For consistency of terminology, I think all instances of "joining class" and all instances of "shaping class" should be replaced by "joining type". Though "shaping class" might have been a better term and "joining class" is more dominant in number of instances, Joining_Type is a formalised property name and hence the term used should be "joining type". But see below for another related point concerning two tables in TUS 5.

----------

ArabicShaping.txt, DerivedJoiningGroup.txt, PropertyAliases.txt:
joining group (1 instance), Joining_Group

TUS 5, chapter 8: joining group (4 instances), shaping group (1 instance)

For consistency of terminology, I think the one instance of "shaping group" should be replaced by "joining group". Though "shaping group" might have been a better term, Joining_Group is a formalised property name and hence the term used should be "joining group".

-----------

Table 8-10 and table 8-12 of TUS 5: These say to introduce three new "joining classes" (i.e., "joining types"), but they actually seem to introduce one new joining type, "Alaph-joining". However, "Alaph-joining" is not used in ArabicShaping.txt (which may be a separate mistake, but that goes beyond this terminology note), and the Afj, Afn, and Afx seem to be "glyph types" (as in table 8-5) and should have X instead of A, and the fj, fn, and fx should be subscripts.

The text "with the addition of three extra classes that determine the behavior of final alaphs" (just before table 8-10) is thus wrong. It is one extra "class" (joining type, alaph-joining), and three extra "glyph types". Even with these corrections, the descriptions still seem a bit odd, but that is beyond this terminology note.

Date/Time: Thu Jun 18 09:08:31 CDT 2009
Contact: umavs@ca.ibm.com
Name: V.S. Umamaheswaran
Subject: Error in description of Table 5-3 EBCDIC column

In the description below Table 5-3 in Unicode 5.0 the following paragraph appears:

"Table 5-3 shows that there are two mappings of LF and NEL used by EBCDIC systems. The first EBCDIC column shows the MVS Open Edition (including Code Page 1047) mapping of these characters. That mapping arises from the use of the LF character as “New Line” in ASCII-based Unix environments and in some data transfer protocols that use the Unix assumptions in an EBCDIC environment. The second column shows the Character Data Representation Architecture (CDRA) mapping, which is based on the standard definitions—both in ASCII and in EBCDIC—of LF."

The CDRA definition (see: http://www-01.ibm.com/software/globalization/cdra/appendix_g.jsp#Header_228) is actually in the FIRST EBCDIC column (not the second). MVS Open Edition, which is based on Unix systems and on the EBCDIC C assumption of NEL as the end of line, had a customization that swaps LF and NEL, for use in that environment only.

Also, you may want to add a pointer to the URL for IBM CDRA in the references section: http://www.ibm.com/software/globalization/cdra/index.jsp

Hello Rick, Lisa:

I had posted an erratum to a para re: table 5-3 in Unicode 5.0.

I had further feedback on the topic ... "The term MVS OpenEdition was deprecated 8 years ago. Please use z/OS Unix System Services in its place."

Not sure where to send this to...

Could you please forward this to whoever is looking after errata ..

Thanks .. Best regards, Uma

149 Proposed Update UTS #22: Unicode Character Mapping Markup Language (CharMapML)

Note: This has already been fixed in the data. [ed]

Date/Time: Thu Jun 18 17:30:16 CDT 2009
Contact: me@gsnedders.com
Name: Geoffrey Sneddon
Subject: UTS #22 Charset Alias Matching causes ambiguity

A few charset aliases become ambiguous when the charset alias matching rules in UTS #22 are applied: iso-ir-9-1 and iso-ir-91, and iso-ir-9-2 and iso-ir-92.
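The collision can be reproduced with a short Python sketch. It assumes the loose matching rule reads as follows (paraphrased, and to be checked against the text of UTS #22): delete every character other than a-z, A-Z, and 0-9; map uppercase to lowercase; then, from left to right, delete each 0 that is not preceded by a digit. The function name `normalize_alias` is hypothetical.

```python
def normalize_alias(alias):
    """Reduce a charset alias to its loose-matching key.

    Sketch of the rule as paraphrased above, not an authoritative
    implementation of UTS #22.
    """
    out = []
    for ch in alias:
        # Step 1: keep only ASCII letters and digits.
        if not (ch.isascii() and ch.isalnum()):
            continue
        # Step 2: fold to lowercase.
        ch = ch.lower()
        # Step 3: drop a 0 that is not preceded by a (kept) digit.
        if ch == "0" and not (out and out[-1].isdigit()):
            continue
        out.append(ch)
    return "".join(out)


for a, b in [("iso-ir-9-1", "iso-ir-91"), ("iso-ir-9-2", "iso-ir-92")]:
    print(a, b, normalize_alias(a) == normalize_alias(b))
```

Because the hyphen between the digits is deleted in the first step, both members of each pair reduce to the same key, so the two aliases can no longer be told apart.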

Other Reports

Note: This report on Alif or Ain is an issue for SEI, not for UTC. [ed]

Date/Time: Tue Jul 14 11:04:25 CDT 2009
Contact: pcole@udel.edu
Name: Peter Cole
Report Type: Other Question, Problem, or Feedback
Subject: Alif or Ain?

"Example 2: You have a language with a glottal stop, and you're using a Latin-based orthography. Choosing Arabic "ain" as the character for glottal stop would not be advisable. Instead, use one of the various half-rings or other letters for glottal stop."

It would also not be advisable because it is "alif" and not "ain" that represents glottal stop in Arabic. It would be advisable to substitute "alif" for "ain" in your example. Your point is well taken, of course.

Date/Time: Mon Aug 3 15:00:48 CDT 2009
Contact: corporate@khwilliamson.com
Name: Karl Williamson
Report Type: Public Review Issue
Subject: Clarify Stability policy

The property and property value stability policies are misleading. They say that a property or property value will never be removed, but could in future be strongly deprecated; they don't spell out the possibility that the UTC could also render it useless by removing all code points from it. The wording does not rule out such an eventuality, but it should be spelled out. It would not occur to most readers that this could happen, and should it happen, it is a gotcha that they need to be prepared for.

I would not know of this possibility except through analysis of successive versions of Unicode and discovering it happened. It has happened with sc=hrkt, and appears to be happening in 5.2 with ISO_comment, and arguably happened with ccc=ATBL (although the last case was apparently the result of a typo in the UCD in which the numeric value didn't change but the alias did).

As a developer, I would actually prefer that these empty properties be removed, so that I would find out early that I have to change something, instead of having the code silently give incorrect results.
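Until the policy is clarified, a developer can detect this situation when upgrading. The Python sketch below compares two versions of a property, each given as a mapping from value alias to its set of code points, and reports values that had members before but have none now. The function name and the sample data are placeholders for illustration, not taken from the UCD.

```python
def emptied_values(old, new):
    """Return property values that were non-empty in `old` but are
    empty (or missing) in `new`, sorted by name."""
    return sorted(
        value for value, code_points in old.items()
        if code_points and not new.get(value)
    )


# Placeholder data: value_b loses its only code point between versions.
old_version = {"value_a": {0x41, 0x42}, "value_b": {0x43}}
new_version = {"value_a": {0x41, 0x42, 0x43}, "value_b": set()}

emptied = emptied_values(old_version, new_version)
if emptied:
    print("values that no longer match any code point:", emptied)
```

Running such a check as part of an upgrade turns the silent failure into an early, visible one.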

Feedback on Encoding Proposals

Note: This is feedback on Amd 8. [ed]

Date/Time: Wed May 27 19:21:43 CDT 2009
Contact: leob@mailcom.com
Name: Leo Broukhis
Report Type: Feedback on an Encoding Proposal
Subject: N3626 errata

On proposed character U+1F41D SNAKE, the descriptive note "zodiac 6" is missing; also, possibly, on U+1F41F BOAR, "zodiac 12" is missing.

Closed Public Review Issues

No feedback was received via the reporting form this period.