Public Review Issues

Accumulated Feedback on PRI #249

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Fri Feb 1 17:53:58 CST 2013
Contact: [email protected]
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: Uppercasing and Canonical Equivalence


TUS 6.2 Section 5.19 contains the untruth, 'Casing operations as 
defined in Section 3.13, Default Case Algorithms, preserve canonical 
equivalence, but are not guaranteed to preserve Normalization Forms.'

The counterexample is given by canonically equivalent NFC 
<U+1FB3 GREEK SMALL LETTER ALPHA WITH YPOGEGRAMMENI, U+0359 
COMBINING ASTERISK BELOW> and NFD <U+03B1 GREEK
SMALL LETTER ALPHA, U+0359, U+0345 COMBINING GREEK YPOGEGRAMMENI>,
which capitalise to the inequivalent <U+0391 GREEK CAPITAL LETTER 
ALPHA, U+0399 GREEK CAPITAL LETTER IOTA, U+0359> and 
<U+0391, U+0359, U+0399>.  Recall that U+0359 COMBINING ASTERISK 
BELOW was added to Unicode on the basis of its use in citing damaged Greek text.
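
The counterexample can be checked mechanically. The following is a minimal sketch, assuming a Python 3 interpreter whose str.upper() applies the full default case mappings without any extra normalization step; the helper name canonically_equivalent is illustrative and not taken from the standard.

```python
import unicodedata

# NFC spelling: U+1FB3 followed by U+0359 COMBINING ASTERISK BELOW
nfc_form = "\u1FB3\u0359"
# NFD spelling: alpha, asterisk below (ccc=220), ypogegrammeni (ccc=240)
nfd_form = "\u03B1\u0359\u0345"


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def code_points(s):
    """Render a string as a list of U+XXXX code points for display."""
    return " ".join("U+%04X" % ord(c) for c in s)


# Equivalence before and after uppercasing each spelling independently.
before = canonically_equivalent(nfc_form, nfd_form)
after = canonically_equivalent(nfc_form.upper(), nfd_form.upper())

print("equivalent before uppercasing:", before)
print("uppercased NFC spelling:", code_points(nfc_form.upper()))
print("uppercased NFD spelling:", code_points(nfd_form.upper()))
print("equivalent after uppercasing:", after)
```

The two uppercased strings differ because U+0399 is a starter (ccc=0), so U+0359 cannot be reordered across it.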

Date/Time: Tue Mar 12 04:37:39 CDT 2013
Contact: [email protected]
Name: Sean Burke
Report Type: Problems / Feedback about website
Opt Subject: UNIDATA does not include latest ucdxml


At present, there does not seem to be a method for fetching the latest 
UCD in XML data without manually determining which is the latest published 
version. Would it be possible to include the ucdxml files in the 
/Public/UNIDATA directory, or provide some other means of automatically 
determining which is the latest?

Date/Time: Mon Mar 18 14:13:33 CDT 2013
Contact: [email protected]
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: StandardizedVariants.html


The enclosing circle for <24C2, FE0F> is missing.  It has been missing
since the variant was introduced in Unicode 6.1.0 and is missing in the
current draft, StandardizedVariants-6.3.0d12.html.  I presume this is simply a
glyph error, and not an intended feature of the variant.

Date/Time: Tue Apr 30 17:38:47 CDT 2013
Contact: [email protected]
Name: Markus Scherer
Report Type: Public Review Issue
Opt Subject: Unicode 6.3 Case_Ignorable changes


DerivedCoreProperties.txt says:

# Derived Property:   Case_Ignorable (CI)
#  As defined by Unicode Standard Definition D136
#  C is defined to be case-ignorable if
#    Word_Break(C) = MidLetter or MidNumLet, or
#    General_Category(C) = Nonspacing_Mark (Mn), Enclosing_Mark (Me), Format (Cf), Modifier_Letter (Lm), or Modifier_Symbol (Sk).

In the Unicode 6.3 beta, U+0027 Apostrophe changes from WB=MB (MidNumLet) to
WB=SQ (Single_Quote) which causes it to become not Case_Ignorable any more.
This disrupts titlecasing, resulting in "Don'T" rather than "Don't", for
example. Please adjust the Case_Ignorable derivation to include the Apostrophe
again. (Probably list it explicitly, rather than WB=SQ.)

Also, note that U+003A Colon changed from WB=ML (MidLetter) to WB=XX (Other)
which makes the Colon also not Case_Ignorable any more. I assume that this is
intended, together with its titlecasing behavior change, according to the
review note in draft UAX #29 which says "Dropping colon from the middle of
words."

I reviewed the spec and implementation some more. It looks like the titlecasing 
change of "Don't" is due to our word break code not being updated yet, not due 
to the Case_Ignorable change.

It looks like Case_Ignorable is *only* used in the Final_Sigma definition 
(Unicode 6.2 chapter 3 table 3-14), so this change affects the lowercasing 
of a Greek sigma adjacent to an apostrophe.

I think we probably want to keep the apostrophe Case_Ignorable anyway.
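
To illustrate how Case_Ignorable interacts with Final_Sigma, here is a small sketch using Python's str.lower(), which applies the Final_Sigma condition when lowercasing U+03A3. The results depend on the Unicode data bundled with the interpreter; the expected values in the comments assume that the apostrophe is Case_Ignorable in that data.

```python
# Greek capital alpha and capital sigma, written as escapes for clarity.
ALPHA = "\u0391"
SIGMA = "\u03A3"

samples = [
    ALPHA + SIGMA,                # sigma at the end of the string
    ALPHA + SIGMA + "'",          # sigma followed only by an apostrophe
    ALPHA + SIGMA + "'" + ALPHA,  # apostrophe, then another cased letter
]

# Lowercase each sample; a final sigma becomes U+03C2, otherwise U+03C3.
lowered = [s.lower() for s in samples]

for original, result in zip(samples, lowered):
    print(ascii(original), "->", ascii(result))
```

In the third sample the sigma stays non-final only because the apostrophe is skipped as case-ignorable when looking ahead for a cased letter. If the apostrophe lost the property, that sigma would be treated as final.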

Feedback above this line was reviewed at the May 2013 UTC meeting.

Date/Time: Sat Jun 1 03:41:30 CDT 2013
Contact: [email protected]
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Lam-Alef ligatures are not really obligatory


The Core Specification, in Section 8.2, under Arabic Ligatures (pp. 258-259 in
6.2), says "Certain types of ligatures are obligatory in Arabic script
regardless of font design" and then continues to describe Lam-Alef ligatures.

This is actually not true: there are some common modern Arabic fonts that do
not have any Lam-Alef ligatures, and it doesn't look odd or weird at all.
Perhaps the most famous example is the Yekan font, used both on the web and
for UIs:

* UI example (Arabic language):
	http://static.iphoneruler.net/images/screenshots/normal/fonts/yekan-6.jpg

* A screenshot of a web page (Persian language):
	http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf

There are fonts of similar design that don't have the ligature either. (Note:
these are not considered non-mainstream, indeed parts of books, book titles,
ads, articles, and web pages are done in such fonts.)

The language should probably be changed to say that such ligatures are very
common, and they should work in all combinations if they work for some: for
example, if <Lam, Alef> ligates in the font/rendering engine, so should <Lam
with 3 dots above, Alef wasla>.

Date/Time: Fri Jun 14 13:32:30 CDT 2013
Contact: [email protected]
Name: Andy Heninger
Report Type: Error Report
Opt Subject: Mongolian Word Boundary


I am forwarding this report from [email protected], who submitted 
it as a bug to ICU, http://bugs.icu-project.org/trac/ticket/10212

Quote:

Please don't handle NARROW_NO_BREAK_SPACE (202F) and
MONGOLIAN_VOWEL_SEPARATOR as whitespaces! These characters are
initially thought to handle mongolian words properly. NNBS is
connector for suffixes and MVS is connector for final case of vowel A
and E. For instance: arG(MVS)a (ᠠᠷᠭ᠎ᠠ) is a one word arad(NNBS)un(ᠠᠷᠠᠳ
ᠤᠨ) is also one word. There is now also one problem with arad(NNBS)un
because the u of "un" is shown in its initial form. It should be the medial form of
u. If these words are tokenized separately, as "arG" and "a", "arad" and
"un", then there are massive problems for spellchecker software. A word
counter would also work incorrectly. You can test it with OpenOffice.
These characters should be handled as format characters.

thanks, Badral

Image file showing the problem:
http://bugs.icu-project.org/trac/attachment/ticket/10212/test.png

// Comments below from CEW are corrected to replace earlier comments.

Date/Time: Sun Jun 16 17:03:20 CDT 2013
Contact: [email protected]
Name: C. E. Whitehead
Report Type: Public Review Issue
Opt Subject: Unicode 6.3 Beta -- these are my comments but revised as the previous submission had some errors

Unicode 6.3 Beta
> these are my comments but revised as the previous submission had some errors
and the new one still had errors, so these have been revised again.

* ENTIRE ORIGINAL COMMENTS BELOW, WITH REPAIRS *

First, I have a few comments on section 8.2

First in reply to Roozbeh,

Date/Time: Sat Jun 1 03:41:30 CDT 2013
Contact: [email protected]
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Lam-Alef ligatures are not really obligatory

> > The Core Specification, in Section 8.2, under Arabic Ligatures (pp. 258-259 in
> > 6.2), says "Certain types of ligatures are
> > obligatory in Arabic script
> > regardless of font design" and then continues to describe Lam-Alef ligatures.
> > This is actually not true: there are some
> > common modern Arabic fonts that do
> > not have any Lam-Alef ligatures, and it
> > doesn't look odd or weird at all.
> > Perhaps the most famous example is the Yekan
> > font, used both on the web and
> > for UIs:
> > * UI example (Arabic language):
> > http://static.iphoneruler.net/images/screenshots/normal/fonts/yekan-6.jpg
> > * A screenshot of a web page (Persian
> > language):
> > http://www.unicode.org/versions/Unicode6.2.0/ch08.pdf
> > There are fonts of similar design that don't
> > have the ligature either. (Note:
> > these are not considered non-mainstream,
> > indeed parts of books, book titles,
> > ads, articles, and web pages are done in such fonts.)
> > The language should probably be changed to say that such ligatures
> > are very common, and they should work in all combinations if they work
> > for some: for
> > example, if <Lam, Alef> ligates in the font/rendering
> > engine, so should <Lam with 3 dots above, Alef wasla>.
> {MY COMMENT: Roozbeh's comment is now displaying correctly for me;
Roozbeh's comment is about joining alif-lam and lam-alif, mentioned in 8.2, page 259;
Rules L1, L2, and L3
> . . .
> I myself am not happy about the wording in this section; I've no idea why Unicode
> chose to say, alif on the left and lam on the right, in L2 "Any sequence with
> ALEFr on the left and LAMm" ; it would -- to me -- make more sense to discuss
> these characters in logical order, that is "lam on the right and alif on the left"
> would be better; ditto for L3; these however are just my thoughts; do what you like. }

* * *
> The following is not really a comment, but a request for a clarification:

> 8.2, "Joining" Table 8-9 p. 261
I've decided that typing a teh when teh-marbuta joins to a following possessive pronoun is probably a legacy from typing that you all may not want to change. Changing it would introduce still another possibility for confusing the two characters in IDNs, which could however be handled with bundling/folding; also with such a change it would be difficult/impossible to match words using teh-marbuta plus a pronoun (which the change would allow) to words that end in teh-marbuta but are written with teh instead at the end and followed by a pronoun (the latter is how things are now; and if the connected teh-marbuta were added, people might continue to type these words with suffixes this way); on the other hand this change would allow words ending in teh-marbuta with pronoun suffixes to be matched to the same words without pronoun suffixes. The real cut-and-paste headache is for someone like me anyway who does not have an Arabic keyboard and goes to an online character picker or otherwise retrieves characters online. But dictionaries need perhaps to be aware of the issue, in my opinion (though I don't know how Unicode can make dictionaries aware; folding these would not be a good idea, at least I don't think so; however it's best to ask Arabic speakers; one question, does teh-marbuta occur outside of Arabic?).

* * *
. Punctuation, p. 252

> "For paired punctuation
> such as parentheses, the glyphs chosen to represent U+0028 left parenthesis and
> U+0029 right parenthesis will depend on the direction of the rendered text."

> => { COMMENTS: since this is paired punctuation and not just U+0028, U+0029 in
> isolation, I inserted "for example;" also is U+0028 always the open parenthesis?
> If so, that info should be inserted -- in parentheses or something.}

> "For paired punctuation such as parentheses, the glyphs chosen to represent, for
> example, U+0028 left parenthesis (open punctuation?) and
> U+0029 right parenthesis (closed punctuation?) will depend on the direction of the
> rendered text."

> * * *

> More Arabic -- in Section 2

> 2.2 "Logical Order" page 16, top of page
I've rethought these comments also.
Perhaps instead of saying "typically" a note could be inserted to say that in Arabic script equations can run either from left-to-right or right-to-left; that would be more to the point (although I think in handwriting some people do write least significant first in a right-to-left direction, or did at one point; but having options in directionality like this is a moot issue in computing probably). Might be good to check whether native users of Arabic script think this a good idea, however.

Best,

-- C. E. Whitehead
[email protected]

Date/Time: Tue Jun 25 16:22:52 CDT 2013
Contact: [email protected]
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: Unicode 6.3 Names List

NamesList-6.3.0d8.txt includes the lines:

09F0	BENGALI LETTER RA WITH MIDDLE DIAGONAL
	= assamese letter ra
09F1	BENGALI LETTER RA WITH LOWER DIAGONAL
	= assamese letter wa

"assamese" should be capitalized in both cases.

Date/Time: Tue Jun 25 19:01:42 CDT 2013
Contact: [email protected]
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: UCD 6.3 beta XML files

In ucd.nounihan.flat.xml the "standardized-variant" elements are sorted
incorrectly by code point sequences as a text string rather than being sorted
by code point values, so that variation sequences with a supra-BMP base
character are intermixed with variation sequences with a BMP base character
(e.g. "2199 FE0F", "219C8 FE00", "21A9 FE0E").
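
The difference between the two orderings can be shown with a short sketch. The key function code_point_key is a hypothetical helper, not part of any UCD tooling; it parses each space-separated hexadecimal code point into an integer so that sequences compare numerically.

```python
def code_point_key(cps):
    """Turn a sequence such as '219C8 FE00' into a tuple of integers."""
    return tuple(int(cp, 16) for cp in cps.split())


# The three sequences quoted in the report.
sequences = ["2199 FE0F", "219C8 FE00", "21A9 FE0E"]

# Plain string comparison: '219C8' sorts between '2199' and '21A9'.
text_order = sorted(sequences)

# Numeric comparison: the supra-BMP base character U+219C8 sorts last.
numeric_order = sorted(sequences, key=code_point_key)

print("as text strings:     ", text_order)
print("as code point values:", numeric_order)
```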

Date/Time: Wed Jun 26 05:02:32 CDT 2013
Contact: [email protected]
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: Unicode 6.3 beta XML files

In ucd.nounihan.flat.xml the "emoji-source" elements are incorrectly sorted 
as text strings rather than as code point values so that 1F004 through 1F6C0 
come between 00AE and 2002, whereas 1F004 through 1F6C0 should come after 3299.

Date/Time: Wed Jun 26 05:40:22 CDT 2013
Contact: [email protected]
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: Unicode 6.3 beta XML files standardized-variant sort consistency

In ucd.nounihan.flat.xml there is some inconsistency between the sort order 
of Mongolian "standardized-variant" elements with a "feminine form" and 
the order of the same standardized variants in StandardizedVariants-6.3.0d1.txt:

      <standardized-variant cps="182D 180B" desc="feminine form" when="final"/>
      <standardized-variant cps="182D 180B" desc="second form" when="initial medial"/>

vs.

182D 180B; second form; initial medial # MONGOLIAN LETTER GA
182D 180B; feminine form; final # MONGOLIAN LETTER GA

and

      <standardized-variant cps="1874 180B" desc="feminine first final form" when="final"/>
      <standardized-variant cps="1874 180B" desc="second form" when="medial"/>
      <standardized-variant cps="1874 180C" desc="feminine second final form" when="final"/>
      <standardized-variant cps="1874 180C" desc="feminine first medial form" when="medial"/>

vs.

1874 180B; second form; medial # MONGOLIAN LETTER MANCHU KA
1874 180B; feminine first final form; final # MONGOLIAN LETTER MANCHU KA
1874 180C; feminine first medial form; medial # MONGOLIAN LETTER MANCHU KA
1874 180C; feminine second final form; final # MONGOLIAN LETTER MANCHU KA 

It would be very helpful if the data in both ucd.nounihan.flat.xml and 
StandardizedVariants.txt were sorted in exactly the same order.

Date/Time: Wed Jun 26 08:42:17 CDT 2013
Contact: [email protected]
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: Names List - obelus


NamesList-6.3.0d8.txt has:

00F7	DIVISION SIGN
	x (division slash - 2215)
	x (divides - 2223)
	x (heavy division sign - 2797)

2020	DAGGER
	= obelisk, obelus, long cross
	x (turned dagger - 2E38)

In fact "obelus" originally and more commonly refers to the symbol made of a
horizontal line with a dot above and below, i.e. the division sign (see
http://en.wikipedia.org/wiki/Obelus and
http://en.wikipedia.org/wiki/Dagger_(typography) ).  I suggest changing
NamesList to:

00F7	DIVISION SIGN
	= obelus
	x (division slash - 2215)
	x (divides - 2223)
	x (heavy division sign - 2797)

2020	DAGGER
	= obelisk, long cross
	x (turned dagger - 2E38)

Date/Time: Tue Jul 2 13:14:05 CDT 2013
Contact: [email protected]
Name: Daniel Bünzli
Report Type: Error Report
Opt Subject: UAX #42 bidi paired bracket


Hello,

In the UAX #42 the terms: 

"bidi pair bracket type"
"bidi pair bracket" 

are used. It seems the official terminology (at least the one in 
PropertyValueAliases.txt and UAX #9) is respectively:

"bidi paired bracket type"
"bidi paired bracket"

Best,
Daniel

Date/Time: Fri Jul 5 12:41:56 CDT 2013
Contact: [email protected]
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: Indic Syllabic Category errors


In IndicSyllabicCategory-6.3.0d1.txt

1. The following entries have the wrong character range counts:

0915..0939    ; Consonant # Lo  [35] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
11100..11101  ; Bindu # Mn       CHAKMA SIGN CANDRABINDU..CHAKMA SIGN ANUSVARA
11180..11181  ; Bindu # Mn       SHARADA SIGN CANDRABINDU..SHARADA SIGN ANUSVARA
11133..11134  ; Virama # Mn       CHAKMA VIRAMA..CHAKMA MAAYYAA
0955..0957    ; Vowel_Dependent # Mn       DEVANAGARI VOWEL SIGN CANDRA LONG E..DEVANAGARI VOWEL SIGN UUE
A98F..A9B2    ; Consonant # Lo  [34] JAVANESE LETTER KA..JAVANESE LETTER HA
A9BE..A9BF    ; Consonant_Medial # Mc       JAVANESE CONSONANT SIGN PENGKAL..JAVANESE CONSONANT SIGN CAKRA

They should be:

0915..0939    ; Consonant # Lo  [37] DEVANAGARI LETTER KA..DEVANAGARI LETTER HA
11100..11101  ; Bindu # Mn   [2] CHAKMA SIGN CANDRABINDU..CHAKMA SIGN ANUSVARA
11180..11181  ; Bindu # Mn   [2] SHARADA SIGN CANDRABINDU..SHARADA SIGN ANUSVARA
11133..11134  ; Virama # Mn   [2] CHAKMA VIRAMA..CHAKMA MAAYYAA
0955..0957    ; Vowel_Dependent # Mn   [3] DEVANAGARI VOWEL SIGN CANDRA LONG E..DEVANAGARI VOWEL SIGN UUE
A98F..A9B2    ; Consonant # Lo  [36] JAVANESE LETTER KA..JAVANESE LETTER HA
A9BE..A9BF    ; Consonant_Medial # Mc   [2] JAVANESE CONSONANT SIGN PENGKAL..JAVANESE CONSONANT SIGN CAKRA

2. Indic_Syllabic_Category Virama is defined as [Derivation: (ccc=9) + 0E4E + 17D1]

However, there is no entry for U+2D7F TIFINAGH CONSONANT JOINER which has ccc=9.  The following entry should be added:

2D7F          ; Virama # Mn       TIFINAGH CONSONANT JOINER
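
Both points can be verified with a few lines of code. This is a minimal sketch, assuming the bracketed count is the number of code points in the range and that the interpreter's unicodedata module covers U+2D7F (added in Unicode 6.0); range_count is an illustrative helper name.

```python
import unicodedata


def range_count(code_point_range):
    """Number of code points in 'XXXX..YYYY' (or 1 for a single 'XXXX')."""
    first, _, last = code_point_range.partition("..")
    last = last or first
    return int(last, 16) - int(first, 16) + 1


# Ranges from item 1, mapped to their computed sizes.
ranges = ["0915..0939", "11100..11101", "11180..11181", "11133..11134",
          "0955..0957", "A98F..A9B2", "A9BE..A9BF"]
counts = {r: range_count(r) for r in ranges}

for r in ranges:
    print("%-13s [%d]" % (r, counts[r]))

# Item 2: canonical combining class of U+2D7F TIFINAGH CONSONANT JOINER.
tifinagh_ccc = unicodedata.combining("\u2D7F")
print("ccc(U+2D7F) =", tifinagh_ccc)
```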

Date/Time: Sat Jul 13 18:21:59 CDT 2013
Contact: [email protected]
Name: Stephan Stiller
Report Type: Error Report
Opt Subject: "apex" as alias for acute / "sicilicus" as alias for combining right half ring above


I would suggest adding "apex" as an alias for the acute accent 
(U+00B4 and U+02CA and U+0301):
    http://en.wikipedia.org/wiki/Apex_%28diacritic%29

I have a similar suggestion for the sicilicus, for the 
combining right half ring above (U+0357).

Reference:
Apex and Sicilicus (Revilo P. Oliver), The American Journal of 
Philology, 87 (2), 1966-Apr [http://www.jstor.org/stable/292702]