L2/08-075

Comments on Public Review Issues
(October 13, 2007 - January 30, 2008)

The sections below contain comments received on the open Public Review Issues as of January 30, 2008, since the previous cumulative document was issued prior to UTC #113 (October 2007).

Contents:

102 Proposed Update to UAX #15: Unicode Normalization Forms
103 Proposed Update to UAX #29: Text Boundaries
104 Proposed Update to UAX #31: Identifier and Pattern Syntax
105 Proposed Update to UAX #14: Line Breaking Properties
108 Ideographic Variation Database Submission
109 Proposed Draft UTR #42: An XML representation of the UCD
110 Proposed Update to UAX #24 Script Names
111 Proposed Update to UTS #18 Unicode Regular Expressions
112 Proposed Update to UAX #9 Unicode Bidirectional Algorithm
113 Proposed Update to UTS #10 Unicode Collation Algorithm
114 Proposed Update to UAX #34 Unicode Named Character Sequences
115 Proposed Update to UTR #36 Unicode Security Considerations
117 Proposed Update to UAX #38 The Unicode Han Database (Unihan)
118 Proposed Update to UAX #44 Unicode Character Database (includes data file reports)
Other Reports
Feedback on Encoding Proposals
Closed Public Review Issues


102 Proposed Update to UAX #15: Unicode Normalization Forms

Date/Time: Mon Jan 28 14:05:12 CST 2008
Contact: arthur.reutenauer@normalesup.org
Name: Arthur Reutenauer
Opt Subject: Comments about 5.1.0 drafts of UAX #15 and UAX #34

Hello,

I have comments about the two UAXes mentioned in the subject (UAX #15 Normalization and UAX #34 Named Sequences), and also one remark about UCD.html. They are mostly slight mistakes (in my opinion) or typos.

UAX #15 Unicode Normalization Forms

I reviewed the most recent version I could find, revision 28, draft 5 (www.unicode.org/reports/tr15/tr15-28.html), tagged "Unicode 5.1.0".

In section 1, there are two links to section 13 "Programming Language Identifiers", whose content has been moved to UAX #31: although the relevant document can easily be found, it may be more helpful to point readers directly to it, instead of going via the (now empty) section 13. These two links are, respectively, in the indented paragraph entitled Note ("Text exclusively containing, etc."), and the next-to-last paragraph ("Normalization Forms KC and KD, etc."), shortly before subsection 1.1 starts.

In section 4, shouldn't the last word of C5 be changed to "annex" like in C4? It reads "document" for the moment.

In section 7, the reference to the Normalization Charts should be changed from [Charts] to [Charts15].

In section 14, when quoting from DerivedNormalizationProps.txt, the annex uses an older format of that file: in Unicode 4, there were only two fields, and lines looked like "0338;NFC_MAYBE". This has now changed to "0338;NFC_QC;M" and it seems advisable to update to the current formatting (two lines would have to be changed).

In section 16, middle of paragraph 3, the end of the second sentence should read "... into an L jamo plus a V jamo." instead of "... plus a T jamo".

Finally, in subsection 21.1, the example for the state of the buffer to be processed uses a colored background (red) to highlight a particular element. This helps when reading the annex on a browser, but doesn't render very well when printed in black and white (I printed the annex to proofread it, and the background color is simply invisible); it could be advisable to find another device if the annex is to be printed, maybe using a gray shade like has been done for figure 7 in section 5.

[See section below for Arthur's feedback on UAX #34 -- Ed.]

Date/Time: Wed Jan 30 16:35:22 CST 2008
Contact: max.rabkin@gmail.com
Name: Max Rabkin
Opt Subject: Typo in UAX#15

The sentence "The LV syllables themselves decompose into an L jamo plus a T jamo" in section 16 (Hangul) of UAX#15 appears to be incorrect.

"...L jamo plus a T jamo" should be "...L jamo plus a V jamo".
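The correction reported in both comments above can be checked against the arithmetic Hangul decomposition. The following is a minimal Python sketch, assuming the usual Hangul composition constants; the function name `decompose_syllable` is illustrative and is not taken from UAX #15.

```python
import unicodedata

# Hangul composition constants (first syllable, first L/V jamo, T base).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def decompose_syllable(ch):
    """Decompose one precomposed Hangul syllable into its jamo."""
    s_index = ord(ch) - S_BASE
    if not 0 <= s_index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    l = L_BASE + s_index // (V_COUNT * T_COUNT)
    v = V_BASE + (s_index % (V_COUNT * T_COUNT)) // T_COUNT
    t = T_BASE + s_index % T_COUNT
    # An LV syllable has no trailing consonant: the result is L + V only.
    jamo = [l, v] if t == T_BASE else [l, v, t]
    return "".join(chr(cp) for cp in jamo)


# U+AC00 is an LV syllable: it yields an L jamo plus a V jamo.
lv = decompose_syllable("\uAC00")
# U+AC01 is an LVT syllable: it yields L, V and T jamo.
lvt = decompose_syllable("\uAC01")
same_as_nfd = lv == unicodedata.normalize("NFD", "\uAC00")
```

Only LVT syllables produce a T jamo, which is consistent with the wording both commenters propose.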

103 Proposed Update to UAX #29: Text Boundaries

No feedback was received via the reporting form this period.

104 Proposed Update to UAX #31: Identifier and Pattern Syntax

Date/Time: Wed Dec 5 19:20:37 CST 2007
Contact: kenw@sybase.com
Name: Ken Whistler
Opt Subject: UAX #31, Mongolian

The proposed update for UAX #31 (PRI #104) contains an inconsistency in Section 2.2 regarding the wording for special context handling for Mongolian.

This inconsistency is the result of incomplete correction of the text following the discussion of what characters were required in identifiers for Mongolian.

The introductory text (2nd bullet) calls out U+202F and U+180E MONGOLIAN VOWEL SEPARATOR for special treatment and calls them "Mongolian separators".

But the actual context provided under C. calls out U+202F and U+180B..U+180D, the Mongolian free variation selectors.

I believe the latter was the final consensus on what was to be included for Mongolian identifiers (although the discussion should be checked on this). In any case, the text needs to be corrected for the inconsistency, and if the Mongolian free variation selectors are the correct choice, then the class of characters involved should probably not be called "Mongolian separators" in the text.

Date/Time: Fri Jan 18 21:50:55 CST 2008
Contact: msd@pobox.com
Name: Michael D'Errico
Opt Subject: UAX #31

In UAX #31 Identifier and Pattern Syntax, identifiers have the restriction that they cannot begin with a number. While this is current practice in many programming languages, I don't think it is desirable. I am actually writing a new computer language where I plan to allow a leading digit -- the identifier would be ok as long as it doesn't match the number specification. For example, you could have "const double 2pi = 6.2831853;" instead of having to resort to pi_2 or similar.

One other place where the rules were relaxed to allow leading digits is in Internet domain names, e.g. 3com.com, which were originally not allowed.

Mike

105 Proposed Update to UAX #14: Line Breaking Properties

Feedback from Andy Heninger has been put into a separate document. No other feedback was received via the reporting form.

108 Ideographic Variation Database Submission

Date/Time: Thu Nov 22 23:15:08 CST 2007
Contact: y.naoi@glamour.co.jp
Name: NAOI Yasushi
Opt Subject: Comment on PRI 108

I will offer a proposal as follows.


The sequence for CID=13866 should be removed, because CID=2638 and CID=13866 are not unified by the unification rules given in Annex S, ISO/IEC 10646.

The sequence for CID=20114 should be removed, because CID=2098 and CID=20114 are not unified by the unification rules given in Annex S, ISO/IEC 10646.

The sequence for CID=12869 should be removed, because CID=12869 is a glyph for the 'ruby' tag. I guess that such a glyph is out of the scope of IVS.

The sequence for CID=14187 should be removed, because CID=6162 and CID=14187 are not unified by the unification rules given in Annex S, ISO/IEC 10646.

The sequence for CID=13747 should be removed, because CID=2360 and CID=13747 are not unified by the unification rules given in Annex S, ISO/IEC 10646.

The sequence for CID=14226 should be removed, because CID=6815 and CID=14226 are not unified by the unification rules given in Annex S, ISO/IEC 10646.

The sequence for CID=13912 should be changed to use U+9054 instead of U+9039 as the base character.

The sequence for CID=20240 should be removed, because CID=7041 and CID=20240 are not unified by the unification rules given in Annex S, ISO/IEC 10646.

The sequence for CID=20255 should be removed, because CID=1241 and CID=20255 are not unified by the unification rules given in Annex S, ISO/IEC 10646.


From: suzuki toshiya mpsuzuki@hiroshima-u.ac.jp
Date: 2007-11-25 02:57:25 -0800
Subject: Re: New Public Review Issue: #108 Ideographic Variation Database Submission

According to Adobe TechNote #5078, p. 96, clause #9, the glyph for the variation sequence VS19-12869 used for U+6CE8 (partial chart p. 11) is a glyph introduced for "ruby" typesetting, not a standard CJK ideograph. In Unicode plain text, ruby may be encoded with the interlinear annotation characters (U+FFF9 - U+FFFB), so the relationship between using VS19-12869 and using interlinear annotations should be clarified: should VS19-12869 be used with or without the interlinear annotations?

In addition, as I commented for PRI #98, many glyphs introduced for radical exemplification are assigned variation sequences as ideograph variants. I guess the variation sequences for these glyphs are NOT intended to be the unique character encoding for these glyphs; they just provide the variation sequences from the side of the Unified CJK ideographs. If my guess is right, is there any reason to ignore the ideographic glyphs in Adobe TechNote #5078, p. 160, from 16285 to 16298? Adobe TechNote #5078 does not comment explicitly, but I guess these glyphs were introduced for Kanbun (U+3190-U+319F).

Date/Time: Sun Nov 25 20:05:10 CST 2007
Contact: vunzndi@vfemail.net
Name: John Knightley
Opt Subject: Pri108

To Unicode and Adobe,

During the public review period it has become very clear that some of the suggested sequences in pri108 are incorrect, and that to continue with pri108 in its present form would lead to immediate problems.

Most acute is the problem of CID+14089, which in FPDAM5 is proposed to be encoded at U+9FC4 (see 02n3982.zip, fpdam5-all.pdf, page 5) but in the Adobe IVSes is U+6881 vs018.

Even if this problem is resolved, a further 10 Adobe characters are in www.cse.cuhk.edu.hk/~irg/irg/irg29/IRGN1380_UNC.pdf, a list of 254 characters fast-tracked by the IRG, with a verdict by the IRG expected within the next 12 months.

2 characters approved for submission to the IRG by the UTC at the request of Adobe, namely:-

  CID+13866 pri108 U+52E2 vs 18 is UTC00857 (IRGN1380 #022)
  CID+20240 pri108 U+943A vs 18 is UTC00872 (IRGN1380 #207) (also has J-source JH-JTBDE1)

And a further 8 or 9 characters submitted by Japan

  CID+13780 pri108 U+4ECA vs18 JH-004890 (IRGN1380 #008)
  CID+20114 pri108 U+5EA7 vs18 JH-IB1783 (IRGN1380 #070)
  CID+20117 pri108 U+5FA1 vs18 JH-IB0680 (IRGN1380 #074)
  CID+14064 pri108 U+687A vs18 JH-JTB314 (IRGN1380 #095)
  CID+13723/4 pri108 U+2363A vs18/9 JH-IB2148 (IRGN1380 #099)* 
  CID+15393 pri108 U+2363A vs18 JH-JTC0EB (IRGN1380 #100)*
  CID+20150 pri108 U+6A9C vs18 JH-JTB398 (IRGN1380 #102)
  CID+20201 pri108 U+83DF vs18 JH-JTB989 (IRGN1380 #171)
  CID+13651 pri108 U+885E vs18 JH-JTBAFD (IRGN1380 #181)

*a separate encoding for one of the above could make two pri108 IVSes wrong.

If the above IVSes are not changed then many of the above characters will be displayed incorrectly whenever the default ignore is used.

To leave pri108 as it is should not be an option.

The above list restricts itself to those IVSes raised at IRG #29 hosted by Adobe earlier this month.

There are other IVSes that are suspect; these have been communicated separately to Adobe.

Yours sincerely John Knightley

109 Proposed Draft UTR #42: An XML representation of the UCD

No feedback was received via the reporting form this period.

110 Proposed Update to UAX #24 Script Names

No feedback was received via the reporting form this period.

111 Proposed Update to UTS #18 Unicode Regular Expressions

No feedback was received via the reporting form this period.

112 Proposed Update to UAX #9 Unicode Bidirectional Algorithm

Date/Time: Thu Jan 24 03:36:25 CST 2008
Contact: matial@il.ibm.com
Name: Matitiahu Allouche
Opt Subject: update to UAX #9

I have a few remarks about revision 18 of UAX#9 (dated 2008-01-10).

1) In section 2.1 "Explicit Directional Embedding", the new sentence 'On web pages, these characters should be replaced by using the dir attribute with the values dir="ltr" or dir="rtl"' seems to refer only to RLE...PDF which have been mentioned right before, which makes the value dir="ltr" look strange.

2) In the last paragraph of section 3.3.4, we find: "commas are not considered part of the number because they are not surrounded on both sides". It would be clearer to write "because they are not surrounded on both sides by digits".

3) The last sentence of section 3.3.4 (before the example) says: "However, if there is an adjacent left-to-right sequence, then European numbers will adopt that direction". I think that "adjacent" here is not correct and should be replaced by "preceding".

4) In the new text for L2 in section 3.4 "Reordering Resolved Levels", "the the" => "the"

5) Same thing in the first sentence of section 4 "Bidirectional Conformance"

113 Proposed Update to UTS #10 Unicode Collation Algorithm

Date/Time: Mon Jan 14 16:49:33 CST 2008
Contact: mike@saxonica.com
Name: Michael Kay
Opt Subject: UTS #10: definition of minimal match

In UTS #10 section 8, the definition of minimal match states:

The match is minimal if for all positive i and j, there is no match at Q[s+i,e-j]. In such a case, we also say that P minimal matchs at Q[s,e].

I think it is probably intended that there should also be no match at Q[s+i, e] or at Q[s, e-j]: that is, only one of i and j need be positive, the other must be non-negative. The rule could be expressed as:

The match is minimal if for all i, j such that i>=0 and j>=0 and not(i=j=0), there is no match at Q[s+i,e-j]. In such a case, we also say that P minimal matchs at Q[s,e].

(I'll leave the grammatical question of "minimal matchs" vs "minimally matches" to editorial discretion!)
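The difference between the two wordings can be illustrated with a small Python sketch. It assumes inclusive bounds for Q[s,e] and a caller-supplied predicate `matches(a, b)`; both function names and the toy "ignorable hyphen" matcher are hypothetical and are not taken from UTS #10.

```python
def is_minimal_original(matches, s, e):
    """Wording as quoted: no match at Q[s+i, e-j] for all positive i and j."""
    return matches(s, e) and not any(
        matches(s + i, e - j)
        for i in range(1, e - s + 1)
        for j in range(1, e - s - i + 1)
    )


def is_minimal_proposed(matches, s, e):
    """Proposed wording: i >= 0, j >= 0, and not both zero."""
    return matches(s, e) and not any(
        matches(s + i, e - j)
        for i in range(0, e - s + 1)
        for j in range(0, e - s - i + 1)
        if (i, j) != (0, 0)
    )


# Toy matcher (an assumption for illustration): "-" is ignorable, P is "ab".
Q = "-ab-"


def matches(a, b):
    return Q[a:b + 1].replace("-", "") == "ab"


# Q[1,3] is "ab-": it contains the shorter match Q[1,2], found with i=0, j=1.
original_says = is_minimal_original(matches, 1, 3)
proposed_says = is_minimal_proposed(matches, 1, 3)
```

In this toy case the quoted wording accepts Q[1,3] as minimal, while the proposed wording rejects it in favor of Q[1,2].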

114 Proposed Update to UAX #34 Unicode Named Character Sequences

Date/Time: Mon Jan 28 14:05:12 CST 2008
Contact: arthur.reutenauer@normalesup.org
Name: Arthur Reutenauer
Opt Subject: Comments about 5.1.0 drafts of UAX #15 and UAX #34

Hello,

I have comments about the two UAXes mentioned in the subject (UAX #15 Normalization and UAX #34 Named Sequences), and also one remark about UCD.html. They are mostly slight mistakes (in my opinion) or typos.

[see other section above for Arthur's UAX #15 feedback -- Ed.]

UAX #34 Unicode Named Character Sequences

I used revision 6, http://www.unicode.org/reports/tr34/tr34-6.html

I have only one item of feedback about this annex: table 2 shows a few named character sequences, but uses a wrong name for numbers 4 and 5: sequence <17B6, 17C6> is called KHMER VOWEL SIGN AAM according to NamedSequences.txt, not KHMER VOWEL SIGN SRAK AM; likewise for <17BB, 17C6>. While this table should of course not be taken as a normative part, and even if there is little chance that anyone would do so, I think it would be better to correct the two names.

[See other section for Arthur's UCD.html feedback -- Ed.]

That is all I had to report; I hope you will find these comments useful, even if most of them are rather mundane.

Yours,

Arthur Reutenauer

115 Proposed Update to UTR #36 Unicode Security Considerations

Date/Time: Tue Oct 30 13:33:51 CST 2007
Contact: asmus@unicode.org
Name: Asmus Freytag (via Rick)
Opt Subject: PRI #115, UTR #36 feedback

I won't call this an objection, but I would like to point out that the bullet

" * Unicode could have avoided using ZWJ and ZWNJ with virama, but at the expense of having "cloned" virama characters with different characteristics. But even had that been done, the cases where a joiner had no visual effect would be the same cases where the clones would all look the same. Thus using cloned viramas would not have avoided the security issues. "

strikes me as unusually argumentative for a formal publication. Also, 'clones' are suddenly alluded to without having been explained - most people who are not seasoned character encoders won't be able to follow that shorthand.

If I was to make a suggestion, I'd recommend you try the text entirely without the bullet. It's fine, focuses on what people need to know and not on the "could/would/should, if only" type of alternate reality.

Cheers, A./

Date/Time: Tue Oct 30 16:11:43 CST 2007
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Opt Subject: UTR#36-6 outdated IDN data

The UTR contains outdated data about some registries (that have not registered their supported sets at IANA per language, as suggested in the IETF documentation).

For example, the Swiss registry (.ch) accepts the French oe ligature (œ) since end of 2005, but it is missing in the table.

There may be other missing characters in this table, as well as missing info about other registries that have published their own list of supported characters, without registering them at IANA using the HTML template.

Also, for verification purposes, the registered HTML data is not easy to use in an automated process: this requires parsing the HTML document, in the hope that the first <pre> element in the <body> contains the data in the documented format. This is possible, but the format does not guarantee that it will work, as the <pre> section is not explicitly tagged with a unique ID.

And anyway, other vendors (such as registrars) have recreated their own updatable IDN data using an easier format (for example with preparsed XML or binary data with minimum overhead for automated processing).

Date/Time: Wed Oct 31 00:50:10 CST 2007
Contact: duerst@it.aoyama.ac.jp
Name: Martin Dürst
Opt Subject: TR 36, UTF-8

Editorial: Section 3.1 should be split up properly into two subsections, as there are now two exploits.

Material: In what's currently Section 3.1.1, I'd like to see some advice on cases where there is more than one illegal byte, or where a second, third, or fourth byte in a byte sequence makes that byte sequence as a whole illegal.
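The cases raised here (a later byte making the whole sequence ill-formed) can be explored with a short Python sketch. It only reports where Python's own strict UTF-8 decoder gives up and does not prescribe how many replacement characters a converter should emit; the helper name `first_ill_formed_offset` is illustrative.

```python
def first_ill_formed_offset(data):
    """Return the byte offset where strict UTF-8 decoding fails, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte of the offending sequence.
        return err.start
    return None


# A lead byte E0 followed by 80: the second byte makes the sequence ill-formed.
bad_second_byte = first_ill_formed_offset(b"A\xe0\x80\xaf")
# A byte that can never appear in UTF-8.
bad_lead_byte = first_ill_formed_offset(b"ab\xff")
# A three-byte sequence cut off after two bytes.
truncated = first_ill_formed_offset(b"A\xe2\x82")
```

In each case the reported offset points at the lead byte of the failing sequence, not at the later byte that caused the failure.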

117 Proposed Update to UAX #38 The Unicode Han Database (Unihan)

No feedback was received via the reporting form this period.

118 Proposed Update to UAX #44 Unicode Character Database

Date/Time: Sat Nov 3 13:43:19 CST 2007
Contact: kent.karlsson14@comhem.se
Name: Kent Karlsson
Opt Subject: UnicodeData-5.1.0d8.txt

Looking through UnicodeData-5.1.0d8.txt, I have (so far) the following comments:

These have bidi category inconsistent with general category (cmp. 0B43 and 0D62):

0B44;ORIYA VOWEL SIGN VOCALIC RR;Mn;0;L;;;;;N;;;;;
0B62;ORIYA VOWEL SIGN VOCALIC L;Mn;0;L;;;;;N;;;;;
0B63;ORIYA VOWEL SIGN VOCALIC LL;Mn;0;L;;;;;N;;;;;
0D63;MALAYALAM VOWEL SIGN VOCALIC LL;Mn;0;L;;;;;N;;;;;

These seem more correct, given the glyph positioning for both Oriya vocalic L, LL, R, RR, and Malayalam vocalic LL:

0B43;ORIYA VOWEL SIGN VOCALIC R;Mn;0;NSM;;;;;N;;;;;
0D62;MALAYALAM VOWEL SIGN VOCALIC L;Mn;0;NSM;;;;;N;;;;;

These too have bidi category inconsistent with general category (suggested changes after the dataline, given glyph positioning):

1929;LIMBU SUBJOINED LETTER YA;Mc;0;NSM;;;;;N;;;;;		NSM → L
192A;LIMBU SUBJOINED LETTER RA;Mc;0;NSM;;;;;N;;;;;		NSM → L
192B;LIMBU SUBJOINED LETTER WA;Mc;0;NSM;;;;;N;;;;;		NSM → L
A802;SYLOTI NAGRI SIGN DVISVARA;Mc;0;NSM;;;;;N;;;;;		Mc → Mn

These have questionable g.c. and bidi category (cmp. 0D44):

0D41;MALAYALAM VOWEL SIGN U;Mn;0;NSM;;;;;N;;;;;
0D42;MALAYALAM VOWEL SIGN UU;Mn;0;NSM;;;;;N;;;;;
0D43;MALAYALAM VOWEL SIGN VOCALIC R;Mn;0;NSM;;;;;N;;;;;

This one seems more correct, given the glyph positioning for both Malayalam U, UU, vocalic R, and RR:

0D44;MALAYALAM VOWEL SIGN VOCALIC RR;Mc;0;L;;;;;N;;;;;
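A consistency check of the kind described in this comment can be sketched in Python. The pairing rule used here (Mn is expected with bidi class NSM, and Mc is not) is an assumption that mirrors the commenter's expectation, not a rule stated by the UCD; the function name `suspicious_bidi` is illustrative.

```python
# Positions of the first five fields in a UnicodeData.txt record.
CODE, NAME, GENERAL_CATEGORY, COMBINING_CLASS, BIDI_CLASS = range(5)


def suspicious_bidi(lines):
    """Flag records whose general category and bidi class look mismatched."""
    flagged = []
    for line in lines:
        fields = line.strip().split(";")
        gc, bidi = fields[GENERAL_CATEGORY], fields[BIDI_CLASS]
        # Assumed heuristic: nonspacing marks pair with NSM, spacing marks do not.
        if (gc == "Mn" and bidi != "NSM") or (gc == "Mc" and bidi == "NSM"):
            flagged.append((fields[CODE], gc, bidi))
    return flagged


# Data lines quoted in the comment above (5.1.0d8 draft values).
SAMPLE = [
    "0B43;ORIYA VOWEL SIGN VOCALIC R;Mn;0;NSM;;;;;N;;;;;",
    "0B44;ORIYA VOWEL SIGN VOCALIC RR;Mn;0;L;;;;;N;;;;;",
    "1929;LIMBU SUBJOINED LETTER YA;Mc;0;NSM;;;;;N;;;;;",
    "0D44;MALAYALAM VOWEL SIGN VOCALIC RR;Mc;0;L;;;;;N;;;;;",
]

flagged = suspicious_bidi(SAMPLE)
```

Run over the four sample records, the check flags 0B44 and 1929 and leaves 0B43 and 0D44 alone.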

Date/Time: Mon Oct 22 19:37:50 CDT 2007
Contact: chuck.caldarale@unisys.com
Name: Chuck Caldarale
Opt Subject: 5.0 and 4.1 code chart problem with 203E

Depending on the Adobe Reader 8.1 magnification setting, the OVERLINE character (203E) may not display in either the 5.0 or 4.1 chart (page 2 of U2000.pdf and U41-2000.pdf, respectively). If the magnification is 64% or less, the symbol appears; at 65% or larger (!), the symbol disappears. The same issue is in the text on page 5 of each .pdf file, but at a different magnification setting. Although this may really be an error in Adobe Reader, the disappearing symbol problem does not occur in the 4.0 chart (U40-2000.pdf) when used with the same Reader level. We have not tried other levels of Adobe Reader, nor any versions of Adobe Acrobat.

We discovered this after encountering a discrepancy in the Sun ISO 8859-8 translation tables that were reimplemented in Java SE 5, and suspect the Sun programmer may have been misled by the invisible graphic in the later Unicode tables when viewed on a large monitor.

Thanks for your consideration.

- Chuck

Date/Time: Wed Nov 21 21:00:09 CST 2007
Contact: bms061000@utdallas.edu
Name: Benjamin Scarborough
Opt Subject: Possible error in Unicode 5.1 beta files.

As of UnicodeData-5.1.0d10.txt, the general categories of U+A789 and U+A78A are given as Sk. Based on L2/06-259R, the document in which the characters were proposed, they should have a general category of Lm, as they are used orthographically as letters, not symbols, to indicate meaning and tone. They also occur in word-medial position.

From: "James Kass" thunder-bird@earthlink.net
Date: 2007-11-23 07:31:39 -0800
Subject: Ol Chiki character name typo?

Is there a typo in the names for OL CHIKI?

1C78 ᱸ OL CHIKI MU TTUDDAG
1C79 ᱹ OL CHIKI GAAHLAA TTUDDAAG
1C7A ᱺ OL CHIKI MU-GAAHLAA TTUDDAAG

Should U+1C78 be TTUDDAAG like 1C79 and 1C7A?

N2984.PDF also shows TTUDDAG for 1C78.

Searching the file N2984.PDF shows five instances of "TTUDDAG" and eleven instances of "TTUDDAAG". The name of U+1C79 is OL CHIKI GAAHLAA TTUDDAAG, but in the text it also appears as both GAAHLAA TTUDDAAG and as GAAHLAA TTUDDAG.

Quoting from N2984.PDF,

"The vowel modifier <© > GAAHLAA TTUDDAAG 1C79 (åè©ñéè© §ô†è©å) ga˘hla˘ t.ud. a˘g [g´hl´Tu∂´k’] follows ä 1C5A a, è 1C5F a¯ , and û 1C6F e. In the sources consulted, I have found all three: ä© o˘ [O] , è© a˘ [´] , and û© e˘ [E] ."

Well, copy/pasting didn't work out so well, but it appears, based on the Ol Chiki words in the text, that U+1C78 should be named OL CHIKI MU TTUDDAAG.

Best regards,

James Kass

P.S.: Re-typing the above quotation gives us:

"The vowel modifier <ᱹ> GAAHLAA TTUDDAAG 1C79 (ᱜᱟᱹᱦᱞᱟᱹ ᱴᱩᱰᱟᱹᱜ) găhlă ṭuḍăg [ɡəhlə ʈuɖəkʼ] follows ᱚ 1C5A a, ᱟ 1C5F ā, and ᱯ 1C6F e. In the sources consulted, I have found all three: ᱚᱹ ŏ [ɔ] , ᱟᱹ ă [ə] , and ᱮᱹ ĕ [ɛ] ."

(Note that 1C6F in the quoted paragraph and its associated glyph should probably be U+1C6E ᱮ OL CHIKI LETTER LE.)

Date/Time: Fri Nov 23 11:45:15 CST 2007
Contact: bms061000@utdallas.edu
Name: Benjamin Scarborough
Opt Subject: Script of combining Latin letters

As of Scripts-5.1.0d21.txt, the following ranges of characters:

U+0363..U+036F (COMBINING LATIN SMALL LETTER A..COMBINING LATIN SMALL LETTER X),
U+1DCA (COMBINING LATIN SMALL LETTER R BELOW), and
U+1DD3..U+1DE6 (COMBINING LATIN SMALL LETTER FLATTENED OPEN A ABOVE..COMBINING LATIN SMALL LETTER Z)

all have a script of "Inherited." As they are specifically combining Latin letters--something even explicitly stated in the character names; they also collate as a tertiary difference from their non-combining counterparts--they should have a script of "Latin."

It is also worth noting that the combining Cyrillic characters at U+2DE0..U+2DFF have already been given a script of "Cyrillic."

Date/Time: Thu Nov 29 22:36:31 CST 2007
Contact: bms061000@utdallas.edu
Name: Benjamin Scarborough
Opt Subject: Combining classes of U+1DCE and U+1DD0

As of UnicodeData-5.1.0d10.txt, U+1DCE (COMBINING OGONEK ABOVE) and U+1DD0 (COMBINING IS BELOW) have combining classes of 230 (Above) and 220 (Below) respectively. Based on the proposal document for the characters (and the name of U+1DCE), the combining classes of these two characters should be changed to 214 (Attached_Above) and 202 (Attached_Below).

Date/Time: Thu Nov 29 22:16:48 CST 2007
Contact: bms061000@utdallas.edu
Name: Benjamin Scarborough
Opt Subject: Script of U+1DD2 COMBINING US ABOVE

As of Scripts-5.1.0d21.txt, U+1DD2 COMBINING US ABOVE has a script of 'Inherited.' It appears, however, to be a combining version of U+A76F LATIN SMALL LETTER CON. If this is indeed the case, it should have a script of 'Latin.' (Assuming, that is, that the scripts of the other COMBINING LATIN LETTERs are also changed as previously noted.)

Date/Time: Sat Dec 8 02:32:30 CST 2007
Contact: bms061000@utdallas.edu
Name: Benjamin Scarborough
Opt Subject: Script of COMBINING GREEK YPOGEGRAMMENI

As of Scripts-5.1.0d21.txt, the script of U+0345 COMBINING GREEK YPOGEGRAMMENI is given as 'Inherited,' but it should be changed to 'Greek.' While this character is indeed used exclusively with the Greek script, the more pressing reason to change its script value is that it case folds to U+03B9 GREEK SMALL LETTER IOTA. By having different script values, the possibility exists that a case-folded string will have different script values than the original.
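The case folding mentioned here can be observed with Python's built-in `str.casefold`. This is a minimal sketch that reflects whatever Unicode data the running interpreter ships, not the 5.1.0 beta files; the helper name `casefold_report` is illustrative. The standard library does not expose the Script property, so only the folding half of the argument is shown.

```python
import unicodedata


def casefold_report(text):
    """List the code points and names produced by case-folding `text`."""
    return [(f"U+{ord(c):04X}", unicodedata.name(c)) for c in text.casefold()]


# U+0345 COMBINING GREEK YPOGEGRAMMENI folds to a spacing Greek letter.
report = casefold_report("\u0345")
```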

Date: Mon, 10 Dec 2007 11:24:44 -0800 (PST)
From: Kenneth Whistler kenw@atlantis-new.sybase.com
Subject: Re: FW: Subj: Script of COMBINING GREEK YPOGEGRAMMENI
To: bms061000@utdallas.edu

Benjamin,

This is not a new situation. U+0345 has been Script=Inherited since Scripts.txt was introduced in Unicode 3.1 in March, 2001. It isn't clear that there has been a problem from the script values in the casefolding of Greek strings -- particularly since the strings in question, Greek characters involving iota sub/adscript have long been known to require special case processing, anyway.

In any case, this is not an error for a new character in the beta review. You can always take up the issue with the UTC and request a change, but this would be a change in a longstanding property value -- not something newly introduced in Unicode 5.1. And it would need to be handled by the UTC in terms of a specific proposal for a change in an existing property value.

Regards,

--Ken Whistler

Date/Time: Mon Dec 10 19:11:42 CST 2007
Contact: henrik@theiling.de
Name: Henrik Theiling
Opt Subject: Capital sharp s

Hi!

I think I spotted a bug in the 5.1.0beta, and Kenneth Whistler just confirmed my suspicion and told me I should report it via this form for proper processing and archiving.

It is about casing of the new 1E9E LATIN CAPITAL LETTER SHARP S.

My understanding was that with existing text, nothing changes, i.e.,

toupper("ß") = "SS"
casefold("ß") = "ss"

and that the new capital sharp S would behave as follows:

tolower("<CAPITAL SHARP S>") = "ß"
casefold("<CAPITAL SHARP S>") = "ss"

However, there is no entry in the CaseFolding list nor in SpecialCasing for 1E9E in the Unicode 5.1.0beta. The following document does propose an entry in CaseFolding:

http://std.dkuug.dk/jtc1/sc2/wg2/docs/n3227.pdf

Henrik
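The expected behavior described in this comment can be compared with a current implementation. The Python sketch below uses only built-in string methods and reflects the Unicode data bundled with the running interpreter, which is an assumption and is later than the 5.1.0 beta under review.

```python
SMALL_SHARP_S = "\u00DF"    # LATIN SMALL LETTER SHARP S
CAPITAL_SHARP_S = "\u1E9E"  # LATIN CAPITAL LETTER SHARP S

# The four mappings listed in the comment, as computed by Python.
observed = {
    "toupper(small)": SMALL_SHARP_S.upper(),
    "casefold(small)": SMALL_SHARP_S.casefold(),
    "tolower(capital)": CAPITAL_SHARP_S.lower(),
    "casefold(capital)": CAPITAL_SHARP_S.casefold(),
}
```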

Date/Time: Mon Dec 17 00:39:06 CST 2007
Contact: bms061000@utdallas.edu
Name: Benjamin Scarborough
Opt Subject: Name Aliases for U+01C2 and U+01C3?

The name of U+01C2 is "LATIN LETTER ALVEOLAR CLICK," but the official IPA charts identify this character as representing a "palatoalveolar click." As the character name seems to be a significant misrepresentation of its identity, I propose giving U+01C2 the name alias "LATIN LETTER PALATOALVEOLAR CLICK."

In addition, the name of U+01C3 is "LATIN LETTER RETROFLEX CLICK," but the official IPA charts identify this character as a "(post)alveolar click." Again, as the official name seems to be a significant misrepresentation of the identity of the character, I propose giving U+01C3 the name alias "LATIN LETTER POSTALVEOLAR CLICK" since "LATIN LETTER ALVEOLAR CLICK" is already used for U+01C2.

These characters are not referenced in UTN #27.

The IPA table which identifies these characters is http://www.arts.gla.ac.uk/IPA/nonpulmonic.html

From: "N. Ganesan" naa.ganesan@gmail.com
Date: 2007-12-17 20:43:23 -0800
Subject: [indic] Re: Atomic chillus should not form conjuncts.

http://www.unicode.org/versions/Unicode5.1.0/#Significant_Character_Additions

The Tables 4a, 4b & 4c alternative encoding (shown in yellow color) imposes new properties on Malayalam Virama.

Because in Indian scripts, virama is just an inherent /a/ deleter from akshara consonants, allowing the alternative encoding will not be desirable. I hope UTC drops the alternative encoding (Yellow color columns in Tables 4a, 4b & 4c) since it radically alters virama properties.

N. Ganesan

Date/Time: Mon Jan 21 19:51:17 CST 2008
Contact: markus.icu@gmail.com
Name: Markus Scherer
Opt Subject: Unicode 5.1 beta: DerivedNumericValues.txt third field redefined

I am concerned about the changing syntax of DerivedNumericValues.txt:

This file was new in Unicode 3.2.0 where it had three fields per data row, and the third field was the numeric type. This was very handy. http://www.unicode.org/Public/3.2-Update/extracted/DerivedNumericValues-3.2.0.txt

The format was unchanged in Unicode 4.0.0.

In Unicode 4.1.0 the third field was removed, breaking parsers that used it. I believe one of the rationales was that all of the extracted/ files only used two fields (code point range and value). http://www.unicode.org/Public/4.1.0/ucd/extracted/DerivedNumericValues.txt

Unicode 5.0.0 kept it that way.

Now in Unicode 5.1.0 there is a third field again, but now it's not the numeric type as in Unicode 3.2 & 4.0 but it's the numeric *value* in a different form (integer or fraction, copied from UnicodeData.txt). http://www.unicode.org/Public/5.1.0/ucd/extracted/DerivedNumericValues-5.1.0d10.txt

This field could be useful, but adding it as the third field when older versions of this file had a third field with a different type of data seems like a mistake. If there are UCD parsers that work with both old and new data files, they will break again and will get unnecessarily complicated.

I propose that the original third field (numeric type) be reinstated and this new data be added as a new fourth field. Alternatively, the new data could be removed (leaving the current 2-field format).

markus
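The parser fragility described above can be made concrete with a Python sketch of a version-tolerant row parser. It assumes semicolon-separated fields with "#" comments and keeps anything after the second field as uninterpreted extras; the function name `parse_row` and the sample rows are illustrative shapes, not verbatim lines from any release of DerivedNumericValues.txt.

```python
def parse_row(line):
    """Parse one data row; return None for blank or comment-only lines."""
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    first, _, last = fields[0].partition("..")
    return {
        "start": int(first, 16),
        "end": int(last, 16) if last else int(first, 16),
        "value": float(fields[1]),
        # Later fields changed meaning between versions, so they are kept
        # as raw strings instead of being interpreted here.
        "extra": fields[2:],
    }


two_field = parse_row("0030 ; 0.0 # DIGIT ZERO")
three_field = parse_row("00BD ; 0.5 ; 1/2 # VULGAR FRACTION ONE HALF")
comment_only = parse_row("# just a comment")
```

A caller that needs the third field would still have to know which UCD version produced the file, which is the compatibility burden the comment points out.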

Date/Time: Fri Jan 25 16:49:09 CST 2008
Contact: petercon@microsoft.com
Name: Peter Constable
Opt Subject: Unicode 5.1 beta: chillus

This is a comment on the proposed descriptive text for TUS5.1 wrt Malayalam chillu character additions.

"Chillu characters never start a word." If a user enters a string in which (e.g.) a chillu character is preceded by a space, then that string will contain a word starting with a chillu character. Please change to say "In Malayalam-language text, chillu characters never start a word."

Date/Time: Mon Jan 28 14:05:12 CST 2008
Contact: arthur.reutenauer@normalesup.org
Name: Arthur Reutenauer
Opt Subject: Comments about 5.1.0 drafts of UAX #15 and UAX #34

[see other sections for Arthur's UAX #15 and UAX #34 feedback -- Ed.]

UCD.html

I have only read the parts of the Unicode Character Database that concern Normalization and Named Sequences, and found one very small mistake in these parts: in section "UCD Property Files", in the entry about DerivedNormalizationProps.txt, it mentions NFD_Quick_Check, NFKD_Quick_Check, etc. as property names, whereas the name used in the actual database, and in other places in UCD.html, is in fact NFD_QC, etc. as already mentioned above.

Other Reports

Date/Time: Sun Nov 25 17:21:15 CST 2007
Contact: oa223@cam.ac.uk
Name: Øistein E. Andersen
Opt Subject: Confusing formulation in Table 3-6 in Unicode 5.0

Dear Sir or Madam,

The last line of Table 3-6 on page 103 in the Unicode 5.0 Standard reads as follows:

000uuuuu zzzzyyyy yyxxxxxx 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx

I would have expected (starting from the right) 6 x's, 6 y's, 6 z's and 3 u's. This way, UTF-8 byte number 2 (from the left) would contain only z's, rather than a confusing mix of z's and u's.

The line would then read as follows:

000uuuzz zzzzyyyy yyxxxxxx 11110uuu 10zzzzzz 10yyyyyy 10xxxxxx

This would also be consistent with the table on en.wikipedia.org/wiki/UTF-8.

Yours faithfully, Øistein E. Andersen
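Both notations describe the same bit layout for four-byte UTF-8 sequences, which a short Python sketch can make explicit. The function name `encode_supplementary` is illustrative; the result is compared against Python's built-in encoder.

```python
def encode_supplementary(cp):
    """Encode a supplementary code point (U+10000..U+10FFFF) as 4 UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    return bytes([
        0xF0 | (cp >> 18),            # 11110uuu: top three bits
        0x80 | ((cp >> 12) & 0x3F),   # next six bits (uuzzzz, or zzzzzz as relabeled)
        0x80 | ((cp >> 6) & 0x3F),    # 10yyyyyy
        0x80 | (cp & 0x3F),           # 10xxxxxx
    ])


encoded = encode_supplementary(0x10000)
agrees = all(
    encode_supplementary(cp) == chr(cp).encode("utf-8")
    for cp in (0x10000, 0x1F600, 0x10FFFF)
)
```

The second byte always carries bits 17 through 12 of the code point; the two notations differ only in how those six bits are labeled.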

Date: Fri, 7 Dec 2007 18:02:09 +0530
From: "Mahesh T. Pai" paivakil@gmail.com
Subject: [indic] Atomic chillus should not form conjuncts.

'Rick McGowan' said on Thu, Dec 06, 2007 at 09:40:07AM -0800,:

> http://www.unicode.org/versions/Unicode5.1.0/#Significant_Character_Additions

Did a quick review of the above text, and here are the first thoughts.

I think the tables suggest use of atomic chillus to form conjuncts and rephas.

IMHO, this is likely to bring in ambiguity. Since chillus are a different and special representation of the dead consonant, I suggest that the standard should specify that the atomic character should be used only when the chillu representation is required (except for /nta/).

For /nta/, though a chillu form of n is shown, the underlying sequence should be n + virama + rra.

This rule will eliminate a whole lot of possible ambiguities and make life a lot easier for everybody - users and developers.

-- Mahesh T. Pai http://paivakil.blogspot.com/  It's not the software that's free; it's you.

Date/Time: Tue Dec 11 14:28:37 CST 2007
Contact: abysta@yandex.ru
Name:
Opt Subject: Hooks and descenders in Abkhaz letters

Hello!

Please read this discussion first: http://www.unicode.org/udhr/n/notes_abk.html

In modern Abkhaz, the letters "ghe" and "pe" with middle hooks (ҕ, 0495; ҧ, 04A7) are used very rarely (mainly in headlines in illustrated magazines). These letters were used in the past. See: http://www.unicode.org/udhr/n/abk/abkhaz03.png  http://www.unicode.org/udhr/n/abk/abkhaz04.png

Nowadays we consistently use "ghe" and "pe" with descenders, but the Unicode character code charts have only the letters with middle hooks.

Is it possible to submit a proposal to Unicode for inclusion of the letters "ghe" and "pe" with descenders in the Unicode Standard (alongside the same letters with middle hooks)?

Note that "ghe" with descender (ӷ, 04F7) is already encoded for Yupik.

See modern Abkhaz alphabet: http://www.unicode.org/udhr/n/abk/abkhaz06.png  http://www.heku.ru/datas/users/66-alphabet.gif

Best regards!

Date/Time: Sun Dec 16 13:54:30 CST 2007
Contact: rscook@socrates.berkeley.edu
Name: Richard Cook
Opt Subject: 20202

I think there's a glyph error in the codechart glyph for [U+20202]. By the radical assignment in the codechart itself, the top of the glyph should be [U+201a2] rather than [U+516b]. Or, the radical assignment leading to the codechart position is incorrect. Since the only mapping is K-source "4-0019", this is impossible to decide.

Date/Time: Sun Dec 16 12:12:53 CST 2007
Contact: gerald.pardoen@free.fr
Name: Gérald Pardoën
Opt Subject: CJK extension B error

I have been working on the Unicode Database for 10 years. Recently I found an error in Extension B of the CJK part (code point U+214FA). The character in CNS 11643 (第 15字面,屬戶政字, CNS : 15-674A 戶政 EUC : 8EAFE7CA) and the character in the grid index of the Unicode Character Database (U+214FA) are different. Originally, I think, they must have been the same...

I have a good method for tracking down the mistake.

Sincerely yours

G. Pardoën

From: Kenneth Whistler kenw@sybase.com
Date: 2008-01-04 16:54:36 -0800
To: rick@unicode.org
Subject: Re: Unicode 5.1, Egyptian Transliteration, and Fonts

Rick,

Could you add this to the pile of beta feedback, so we don't lose it? I'm thinking that something like #1 would be reasonable, although we shouldn't start down the road of doing #2.

--Ken

------------- Begin Forwarded Message -------------

Date: Fri, 4 Jan 2008 22:20:05 +0100
From: Karl Pentzlin karl-pentzlin@acssoft.de
Subject: Re: Unicode 5.1, Egyptian Transliteration, and Fonts

On Friday, 30 November 2007 at 19:26, Saqqara wrote:

S> ... The EGYPTOLOGICAL YOD is still an unresolved point ...

Given the outcome of the whole discussion thread so far, is it advisable to propose the following?

1. To add informative notes in the printed standard to:

U+0485 COMBINING CYRILLIC DASIA PNEUMATA
· usually displays left of the base letter when applied to a capital letter
· also for use with Latin letters

U+0486 COMBINING CYRILLIC PSILI PNEUMATA
· usually displays left of the base letter when applied to a capital letter
· also for use with Latin letters
· forms the "egyptological yod" when applied to U+0049/U+0069

2. To add the two following "named character sequences":

U+0049 U+0486; LATIN CAPITAL LETTER EGYPTOLOGICAL YOD
U+0069 U+0486; LATIN SMALL LETTER EGYPTOLOGICAL YOD
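As a rough illustration of what the two proposed sequences would look like in practice, the Python sketch below builds them from their code points. The names are the ones proposed in item 2 and are not established named sequences; the dictionary name `proposed` is an assumption made for this example.

```python
import unicodedata

PSILI = "\u0486"  # COMBINING CYRILLIC PSILI PNEUMATA

# Proposed names from item 2 mapped to their two-code-point sequences.
proposed = {
    "LATIN CAPITAL LETTER EGYPTOLOGICAL YOD": "\u0049" + PSILI,
    "LATIN SMALL LETTER EGYPTOLOGICAL YOD": "\u0069" + PSILI,
}

for name, seq in proposed.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in seq)
    # No precomposed character exists for these pairs, so NFC leaves them as is.
    assert unicodedata.normalize("NFC", seq) == seq
    print(f"{code_points}; {name}")
```

Each sequence stays two code points long under normalization, which is the situation a named character sequence is meant to label.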

- Karl Pentzlin

------------- End Forwarded Message -------------

Feedback on Encoding Proposals

Date/Time: Sun Nov 4 14:07:28 CST 2007
Contact: kent.karlsson14@comhem.se
Name: Kent Karlsson
Opt Subject: composite Hangul Jamo proposed additions

Comments regarding proposed Hangul Jamo additions (WG2/N3315, L2/07-285).

Assuming that name changes are still possible (if not, an alias of some sort should be added):

11FD HANGUL JONGSEONG KIYEOK-KHIEUK should be renamed HANGUL JONGSEONG KIYEOK-KHIEUKH (missing H)

A96E HANGUL CHOSEONG RIEUL-KHIEUK should be renamed HANGUL CHOSEONG RIEUL-KHIEUKH (missing H)

A973 HANGUL CHOSEONG PIEUP-KHIEUK should be renamed HANGUL CHOSEONG PIEUP-KHIEUKH (missing H)

A976 HANGUL CHOSEONG IEUNG-RIEUL should be renamed HANGUL CHOSEONG YESIEUNG-RIEUL (the leading letter is a yesieung, not a ieung)

A977 HANGUL CHOSEONG IEUNG-HIEUH should be renamed HANGUL CHOSEONG YESIEUNG-HIEUH (the leading letter is a yesieung, not a ieung)

Since I think that it should be possible to compose composite Hangul Jamos and Hangul Letters from the left, adding one letter (or vowel filler) at a time on the right, the following characters need to be allocated to make that possible (possible code positions given, without rearranging current characters in the pipeline [I do not oppose rearranging, if that is still possible]):

A97D  HANGUL CHOSEONG KIYEOK-SIOS   [1100, 1109]
A97E  HANGUL CHOSEONG RIEUL-THIEUTH   [1105, 1110]
A97F  HANGUL CHOSEONG RIEUL-PHIEUPH   [1105, 1111]
D7C7  HANGUL CHOSEONG NIEUN-PANSIOS   [1102, 1140]
D7C8  HANGUL CHOSEONG RIEUL-KIYEOK-SIOS   [A964, 1109] [1105, 1100, 1109]
D7C9  HANGUL CHOSEONG RIEUL-PIEUP-SIOS   [A969, 1109] [1105, 1107, 1109]
D7CA  HANGUL CHOSEONG RIEUL-PANSIOS   [1105, 1140]
D7FC  HANGUL CHOSEONG RIEUL-YEORINHIEUH   [1105, 1159]
D7FD  HANGUL CHOSEONG MIEUM-PANSIOS   [1106, 1140]
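The left-to-right composition described above can be sketched in Python as follows. The pair table is copied from the bracketed decompositions in the list; the entries for A964 and A969 are inferred from the two alternative decompositions given for D7C8 and D7C9. The code positions A97D through D7FD are only the positions suggested in this comment, and `PAIRS` and `compose_from_left` are hypothetical names.

```python
# (left jamo, added jamo) -> composite jamo, per the list above.
# A97D..D7FD are suggested positions only, not assigned characters.
PAIRS = {
    (0x1100, 0x1109): 0xA97D,  # KIYEOK-SIOS
    (0x1105, 0x1110): 0xA97E,  # RIEUL-THIEUTH
    (0x1105, 0x1111): 0xA97F,  # RIEUL-PHIEUPH
    (0x1102, 0x1140): 0xD7C7,  # NIEUN-PANSIOS
    (0x1105, 0x1100): 0xA964,  # inferred: [A964, 1109] = [1105, 1100, 1109]
    (0x1105, 0x1107): 0xA969,  # inferred: [A969, 1109] = [1105, 1107, 1109]
    (0xA964, 0x1109): 0xD7C8,  # RIEUL-KIYEOK-SIOS
    (0xA969, 0x1109): 0xD7C9,  # RIEUL-PIEUP-SIOS
    (0x1105, 0x1140): 0xD7CA,  # RIEUL-PANSIOS
    (0x1105, 0x1159): 0xD7FC,  # RIEUL-YEORINHIEUH
    (0x1106, 0x1140): 0xD7FD,  # MIEUM-PANSIOS
}


def compose_from_left(jamos):
    """Fold a list of jamo code points left to right, one letter at a time."""
    result = []
    for cp in jamos:
        if result and (result[-1], cp) in PAIRS:
            result[-1] = PAIRS[(result[-1], cp)]  # extend the composite on the right
        else:
            result.append(cp)
    return result


# RIEUL + KIYEOK + SIOS composes stepwise via A964 to the suggested D7C8.
assert compose_from_left([0x1105, 0x1100, 0x1109]) == [0xD7C8]
```

The three-letter clusters only compose this way because every two-letter prefix has a code point of its own, which is the reason given above for allocating the extra characters.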

Date/Time: Sun Jan 6 08:47:10 CST 2008
Contact: cowan@ccil.org
Name: John Cowan
Opt Subject: L2/07-413, Oriya Fraction Signs

I believe that these Oriya Fraction Signs are essentially the Bengali currency fractions we already have, with some additions that should be encoded. I have no comment on whether the graphically distinct Oriya versions should be encoded, but certainly the non-distinct versions should not.

The author says that the two cases are semantically different because the Bengali signs refer to currency only, the Oriya to general base-16 fractions. But this difference is not really significant, because the Bengali signs were used for the old Indian currency in which 1 rupee = 16 annas. What's more, I bet if someone looked they would see general fraction use in Bengali too, though this is frankly speculative.

Specifically, the author calls out two irreconcilable differences, names and numeric properties. There is no problem with the names, of course: "names are but names", and if an Oriya fraction is named BENGALI whatever, then so it is. Unicode is full of these.

However, I think it would be good to change the numeric property of the Bengali characters from the current integer values to corresponding fractions: thus NUMERATOR TWO becomes 1/8 (since it actually means 2/16).
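The arithmetic behind the suggested values is simple enough to sketch in Python. The helper name `numerator_value` is an assumption made for illustration; the denominator of 16 comes from the 16 annas to the rupee mentioned above.

```python
from fractions import Fraction

DENOMINATOR = 16  # 1 rupee = 16 annas


def numerator_value(n: int) -> Fraction:
    """Value of a NUMERATOR n sign read as n sixteenths, in lowest terms."""
    return Fraction(n, DENOMINATOR)


# NUMERATOR TWO means 2/16, which reduces to 1/8.
assert numerator_value(2) == Fraction(1, 8)
print(numerator_value(2))  # 1/8
```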

Date/Time: Tue Jan 22 15:31:47 CST 2008
Contact: cowan@ccil.org
Name: John Cowan
Opt Subject: L2/08-034

I support this proposal.

I would further support an initiative to add specific variation sequences using the variation selectors for all of these glyph variants.

Date/Time: Tue Jan 22 15:48:01 CST 2008
Contact: cowan@ccil.org
Name: John Cowan
Opt Subject: L2/08-030

I support the proposal to encode the subscript-10 for Algol 60 purposes. I believe that the Algol 60 hobbyist community (which currently uses either E/e, in violation of the lexical rules of Algol 60, or else # which is very unnatural) will be pleased to see this character encoded.

Date/Time: Tue Jan 22 15:49:41 CST 2008
Contact: cowan@ccil.org
Name: John Cowan
Opt Subject: L2/08-029

I support this proposal, except that although U+003B might be a Greek question mark, neither U+FE14 nor U+FF1B (presentation forms) could be, and so they need not be excluded. If they are excluded solely for the sake of consistency, however, I have no problem with it.

Date/Time: Tue Jan 29 14:18:55 CST 2008
Contact: cowan@ccil.org
Name: John Cowan
Opt Subject: L2/08-044

I believe that it is a grave mistake to unify the Old South Arabian digit 1 with the word separator. Not having a distinct word-separator character makes life too difficult for algorithmic word division in OSA text (for purposes of analysis rather than rendering). Although it's true that numbers are delimited from running text by the number mark, there can be no guarantee that all inscriptions discovered and undiscovered follow this rule strictly.

Please communicate this concern to the authors, and if need be to the UTC. Thanks.

Closed Public Review Issues

No feedback was received via the reporting form this period.