Comments on Public Review Issues

L2/13-202

Comments on Public Review Issues
(July 30 - October 31, 2013)

The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of May 1, 2013, since the previous cumulative document was issued prior to UTC #135 (May 2013). This document does not include feedback on moderated Public Review Issues from the forum that have been digested by the forum moderators; those are in separate documents for each of the PRIs. Grayed-out items in the Table of Contents do not have feedback here.

Issue  Name (+ feedback links)

258    Proposed Update UTS #39, Unicode Security Mechanisms (no feedback)
257    Proposed Update UTR #36, Unicode Security Considerations (no feedback)
256    Feedback on repertoire for ISO/IEC 10646:2014 (4th Edition) (feedback)
255    Feedback on repertoire for Amendment 1 to ISO/IEC 10646:2014 (4th Edition) (feedback)
251    Proposed Change to Casing Property for Some Circled or Squared Latin Letter Symbols (feedback)
248    Proposed Update UTS #46, Unicode IDNA Compatibility Processing (feedback)

The links below go to locations in this document for feedback.

Feedback on Encoding Proposals
Error Reports
Other Reports

Error Reports

Date/Time: Tue Sep 24 17:42:39 CDT 2013
Contact: roozbeh@google.com
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Dandas need more scripts in ScriptExtensions.txt


Currently, the common Indic dandas are listed in ScriptExtensions.txt as:

0964..0965 ; Beng Deva Guru Orya Takr # Po [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA

But there are also pointers to the dandas from the following blocks in 
NamesList.txt: Gujarati, Tamil, Telugu, Kannada, Malayalam.

The text of the Core Specification says in section 9.1, under "Punctuation":

"the intent is that they be used as common punctuation for all the major 
scripts of India covered by this chapter. Danda and double danda punctuation 
marks are not separately encoded for Bengali, Gujarati, and so on."

The Gujarati, Tamil, Telugu, Kannada, and Malayalam sections in the core 
spec also clearly refer to dandas used from the Devanagari block.

Apart from that, Limbu (section 10.5) also seems to use the double danda, 
while Syloti Nagri (10.6) uses both dandas.

This means the line in ScriptExtensions.txt needs to change to:
0964 ; Beng Deva Gujr Guru Knda Mlym Orya Sylo Takr Taml Telu # Po  DEVANAGARI DANDA
0965 ; Beng Deva Gujr Guru Knda Limb Mlym Orya Sylo Takr Taml Telu # Po  DEVANAGARI DOUBLE DANDA

We could also probably use pointers in the NamesList from the Limbu and 
Syloti Nagri blocks to the dandas they use.

[The situation of Sinhala is not clear, but we can update that if we 
find more information.]
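
The proposed change can be compared against the current data mechanically. 
The following is a minimal sketch, assuming only the line format quoted in 
this report. The function name parse_script_extensions is hypothetical, and 
the data strings are copied from the report rather than read from the 
published ScriptExtensions.txt.

```python
def parse_script_extensions(lines):
    """Map each code point to the set of script codes listed for it."""
    table = {}
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        cp_field, scripts_field = (part.strip() for part in data.split(";"))
        first, _, last = cp_field.partition("..")  # "0964..0965" or "0964"
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            table[cp] = set(scripts_field.split())
    return table


# Line as currently listed, copied from the report.
CURRENT = [
    "0964..0965 ; Beng Deva Guru Orya Takr # Po [2] DEVANAGARI DANDA..DEVANAGARI DOUBLE DANDA",
]
# Lines as proposed in the report.
PROPOSED = [
    "0964 ; Beng Deva Gujr Guru Knda Mlym Orya Sylo Takr Taml Telu # Po  DEVANAGARI DANDA",
    "0965 ; Beng Deva Gujr Guru Knda Limb Mlym Orya Sylo Takr Taml Telu # Po  DEVANAGARI DOUBLE DANDA",
]

current = parse_script_extensions(CURRENT)
proposed = parse_script_extensions(PROPOSED)

# Scripts that each danda would gain under the proposal.
added = {cp: proposed[cp] - current[cp] for cp in proposed}
for cp in sorted(added):
    print(f"U+{cp:04X} gains: {' '.join(sorted(added[cp]))}")
```

Under the proposal, U+0964 gains Gujr, Knda, Mlym, Sylo, Taml and Telu, and 
U+0965 gains the same six plus Limb.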

Date/Time: Wed Oct 2 17:58:08 CDT 2013
Contact: jshin1987@gmail.com
Name: Jungshik Shin
Report Type: Error Report
Opt Subject: Classification of comma vertical variants is inconsistent for line breaking


[:Line_Break=Close_Punctuation:] has 
U+FE50 (small comma), U+FE11 (presentation form for vertical ideographic comma), 
U+FF0C (full width comma) and U+FF64 (half width ideographic comma), but 
U+FE10 (presentation form for vertical comma) and U+FE51 (small ideographic 
comma) are NOT included. 

U+FE10 (presentation form for vertical comma) is LB=Infix_Numeric and 
U+FE51 (small ideographic comma) is LB=Ideographic. 

It might make sense for U+FE10 (presentation form for vertical comma) 
to have LB=Infix_Numeric because the corresponding ASCII comma 
(non-presentation form) has that, too.  

However, treating U+FE51 (small ideographic comma) and U+FE11 (presentation 
form for vertical ideographic comma) differently (the former in LB=Ideographic 
and the latter in LB=Close_Punctuation, short alias CL) does not seem consistent. 

This issue was initially reported against CLDR ( http://unicode.org/cldr/trac/ticket/6557 ).
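
The classification described above can be laid out side by side. This is a 
small sketch in which the Line_Break values are exactly those stated in this 
report (with the short aliases CL, IS and ID), not values read from 
LineBreak.txt. The table name REPORTED_LINE_BREAK is hypothetical.

```python
import unicodedata
from collections import defaultdict

# Line_Break values as stated in the report (assumption: not verified
# against LineBreak.txt here).
REPORTED_LINE_BREAK = {
    0xFE50: "CL",  # Close_Punctuation
    0xFE11: "CL",
    0xFF0C: "CL",
    0xFF64: "CL",
    0xFE10: "IS",  # Infix_Numeric
    0xFE51: "ID",  # Ideographic
}

# Group the code points by their reported class.
by_class = defaultdict(list)
for cp, lb in sorted(REPORTED_LINE_BREAK.items()):
    by_class[lb].append(cp)

for lb, cps in sorted(by_class.items()):
    for cp in cps:
        print(f"{lb}  U+{cp:04X}  {unicodedata.name(chr(cp))}")
```

The grouping makes the odd pair visible: U+FE11 sits with the other commas 
in CL, while U+FE51 is alone in ID.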

Date/Time: Fri Oct 4 16:01:36 CDT 2013
Contact: jungshik@google.com
Name: Jungshik Shin
Report Type: Error Report
Opt Subject: Case mapping for U+0587

Hello,

This is not a bug report per se, but is just to bring to the UTC's attention
an issue about the uppercase of U+0587 that we came across at Google.

U+0587 (Armenian Small Ligature ECH YIWN) is currently
case-mapped to a sequence of U+0535 (Armenian Capital Letter ECH)
and U+0552 (Armenian Capital Letter YIWN).

There's a report from Armenian speakers in Armenia that the latest
Armenian orthography as used in the Republic of Armenia uppercases it to
a sequence of U+0535 and U+054E (Armenian Capital Letter VEW).

OTOH, Armenian diaspora and "Western Armenian" speakers follow
the current Unicode standard.

A comment from Google's Armenian speaker:

"That form was used in Armenia before "spelling reform of the
Armenian language" at the beginning of the 20th century (1922–1924 -
according to Wikipedia). There is a variation of Armenian language
currently used by Armenian diaspora, who still use the old version.
But everyone in Armenia (including official documents and media)
are using the new form."

Another comment from a linguistics professor at Yerevan:

<quote>
So I asked this guy http://www.ysu.am/science/en/4Kg4l3vuxYoueJU5nAWSsH9JAT/type/1/page/1
who is friend of mine. His comment was "Ev is a ligature, same as &, and
as such it is not a full first class citizen letter and it cannot have a capital.
In Eastern Armenian it is usually "ԵՎ" although it is logically wrong, as the
ligature is ligature of "եւ".

To cut things short - it is illogical and historically incorrect to write ԵՎ
in his opinion, but that is the way it is done, so we shall write ԵՎ in
Eastern Armenian and ԵՒ in Western.
</quote>
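
The two behaviours can be put next to each other. A minimal sketch: the 
first part uses Python's built-in str.upper(), which applies the default 
full case mapping. The function upper_eastern_armenian is hypothetical and 
only illustrates what a tailoring for the Republic of Armenia usage might 
look like. It is not an existing API or a proposed specification.

```python
ECH_YIWN_LIGATURE = "\u0587"  # ARMENIAN SMALL LIGATURE ECH YIWN

# Default (untailored) full uppercase mapping: ECH + YIWN.
default_upper = ECH_YIWN_LIGATURE.upper()
print([f"U+{ord(ch):04X}" for ch in default_upper])


def upper_eastern_armenian(text):
    """Hypothetical tailoring: uppercase the ligature as ECH + VEW."""
    return text.replace(ECH_YIWN_LIGATURE, "\u0535\u054E").upper()


tailored_upper = upper_eastern_armenian(ECH_YIWN_LIGATURE)
print([f"U+{ord(ch):04X}" for ch in tailored_upper])
```

Replacing the ligature before calling upper() keeps every other character 
on the default mapping, so only the contested case differs.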

Date/Time: Wed Oct 9 04:15:01 CDT 2013
Contact: michel.onoff@web.de
Name: Michel Onoff
Report Type: Error Report
Opt Subject: Underspecifications in UAX #44


I refer to the current UAX #44.

The annex lacks a syntax for the property types. For example, does 
Enumeration (E) resemble a conventional identifier and how about the 
underscore and case-sensitiveness? What's the syntax for Numeric (N), etc.?

Also, fields 6, 7, and 8 of the UnicodeData.txt are composed of a 
Numeric_Type (E) and a Numeric_Value (N). It is left unspecified 
how the two are separated, whether the former is optional and so on. 
The Numeric_Type never appears in the file, so I'm wondering if the 
provision for it is obsolete or is there for future extensions.

Fields 12, 13, 14 provide simple mappings to a single character. 
It is unspecified that the field shall be in the form of a hex code point.

Best regards
MO
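
For reference, the fields discussed above can be read as follows. This is 
a sketch based on one reading of the UnicodeData.txt field descriptions: 
the Numeric_Type is not written out but is implied by which of fields 6, 7 
and 8 are populated, and fields 12 to 14 hold either nothing or a single 
hexadecimal code point. The function names are hypothetical, and the sample 
lines are quoted from UnicodeData.txt.

```python
from fractions import Fraction


def split_fields(line):
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError(f"expected 15 fields, got {len(fields)}")
    return fields


def numeric_info(line):
    """Return (Numeric_Type, Numeric_Value), or None for non-numeric characters."""
    fields = split_fields(line)
    decimal, digit, numeric = fields[6], fields[7], fields[8]
    if decimal:
        numeric_type = "Decimal"   # fields 6, 7 and 8 all populated
    elif digit:
        numeric_type = "Digit"     # fields 7 and 8 populated
    elif numeric:
        numeric_type = "Numeric"   # only field 8 populated
    else:
        return None
    return numeric_type, Fraction(numeric)  # field 8 may be a fraction such as 1/2


def simple_case_mappings(line):
    """Return (upper, lower, title) code points; None where the field is empty."""
    fields = split_fields(line)
    return tuple(int(f, 16) if f else None for f in fields[12:15])


DIGIT_ONE = "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;"
SUPERSCRIPT_TWO = "00B2;SUPERSCRIPT TWO;No;0;EN;<super> 0032;;2;2;N;SUPERSCRIPT DIGIT TWO;;;;"
ONE_HALF = "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;"
SMALL_A = "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041"

for sample in (DIGIT_ONE, SUPERSCRIPT_TWO, ONE_HALF, SMALL_A):
    print(sample.split(";")[0], numeric_info(sample), simple_case_mappings(sample))
```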

Date/Time: Thu Oct 10 17:17:14 CDT 2013
Contact: roozbeh@google.com
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Thaana's use of Arabic number punctuation


According to the Core Spec, section 8.5, page 275, under Numerals, 
"Arabic numeric punctuation is used with digits [in Thaana], whether 
Arabic or European."

It's not very clear what that text means, but I take "Arabic numeric 
punctuation" to mean:

U+066A ARABIC PERCENT SIGN
U+066B ARABIC DECIMAL SEPARATOR
U+066C ARABIC THOUSANDS SEPARATOR

If that is the case and those are indeed used in Thaana, we need to 
add these three to ScriptExtensions.txt as:

066A..066C    ; Arab Thaa

If not, we need to clarify what the text means.

Date/Time: Tue Oct 15 12:45:59 CDT 2013
Contact: chris@lookout.net
Name: Chris Weber
Report Type: Error Report
Opt Subject: inconsistent confusables data


Please see this email thread for reference: http://www.unicode.org/mail-arch/unicode-ml/y2013-m10/0028.html

The confusables data leaves out certain characters based on the 
assumption that they would have been removed by way of NFKC 
normalization.  However, I argue that may be a dangerous assumption.  
Could there be cases where implementations want to detect 
confusability but cannot guarantee NFKC normalization?

In another case, implementations may wish to generate confusable 
data for testing or other purposes.  For example: 
http://unicode.org/cldr/utility/confusables.jsp?a=m&r=None 

With certain data missing from the equivalence sets, people who rely 
on the expertise of the Unicode Consortium may expose their implementations to vulnerability.

My ask with this report is that the confusables data be updated to 
include all characters which have a confusable potential even though 
they may not fit the profile described in 
http://www.unicode.org/reports/tr39/#Identifier_Characters.

Best regards,
Chris Weber
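
The concern can be illustrated with a toy example. This is a sketch under 
stated assumptions: the one-entry table ILLUSTRATIVE_CONFUSABLES is 
hypothetical and stands in for a mapping that, as described above, omits 
characters expected to be removed by NFKC. The function simple_skeleton is 
also hypothetical and is a simplification, not the skeleton algorithm of 
UTS #39.

```python
import unicodedata

# Hypothetical stand-in for confusables data:
# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A.
ILLUSTRATIVE_CONFUSABLES = {"\u0430": "a"}


def simple_skeleton(text, apply_nfkc):
    """Map each character through the toy table, optionally after NFKC."""
    if apply_nfkc:
        text = unicodedata.normalize("NFKC", text)
    return "".join(ILLUSTRATIVE_CONFUSABLES.get(ch, ch) for ch in text)


# "pay" written with U+FF41 FULLWIDTH LATIN SMALL LETTER A in the middle.
spoof = "p\uFF41y"

detected_with_nfkc = simple_skeleton(spoof, apply_nfkc=True) == "pay"
detected_without_nfkc = simple_skeleton(spoof, apply_nfkc=False) == "pay"
print(detected_with_nfkc, detected_without_nfkc)
```

An implementation that cannot guarantee the NFKC step sees no match for the 
fullwidth letter, because the table alone has no entry for it.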

Date/Time: Thu Oct 17 15:35:43 CDT 2013
Contact: duerst@it.aoyama.ac.jp
Name: Martin Dürst
Report Type: Error Report
Opt Subject: Hangul normalization tests (LV + T = LVT) missing


The NormalizationTest file provided on the website 
(http://www.unicode.org/Public/UCD/latest/ucd/NormalizationTest.txt) 
seems to be missing one specific kind of pattern for Hangul. 
There are no tests that start with a "halfway-composed" Hangul 
syllable, i.e. one that uses a LV Hangul syllable followed by a T Hangul Jamo.

In NFD, this LV + T normalizes to L + V + T, which should be 
covered by the existing test for LV -> L + V. However, in NFC, 
this should normalize to LVT. There is no test that actually 
checks this, and there is a potential for errors when working on 
non-straightforward implementations (i.e. not going to NFC via NFD).

This actually happened in an implementation I was working on, 
and I only discovered the problem through a code walkthrough.

An example entry in the test file to cover this case (without 
the comment) would be:

AC00 11A8;AC01;1100 1161 11A8;AC01;1100 1161 11A8

There may not be a need to provide tests for all such cases 
(around 10'000), but even having just a single one will catch 
some errors that haven't been caught up to now.
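
The proposed entry can be checked against an existing normalizer. A minimal 
sketch using Python's unicodedata module: the helper names parse_columns 
and check_line are hypothetical, and the column order (source, NFC, NFD, 
NFKC, NFKD) follows the layout of NormalizationTest.txt. The last line 
computes the number of LV + T combinations from the Hangul syllable 
constants.

```python
import unicodedata

# Hangul constants: 19 leading consonants, 21 vowels, and 28 trailing
# slots of which one means "no trailing consonant".
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28


def parse_columns(line):
    """Turn the first five ';'-separated columns into strings."""
    return [
        "".join(chr(int(cp, 16)) for cp in column.split())
        for column in line.split(";")[:5]
    ]


def check_line(line):
    """Check that columns 2-5 are the NFC, NFD, NFKC, NFKD forms of column 1."""
    source, nfc, nfd, nfkc, nfkd = parse_columns(line)
    expected = {"NFC": nfc, "NFD": nfd, "NFKC": nfkc, "NFKD": nfkd}
    return all(
        unicodedata.normalize(form, source) == result
        for form, result in expected.items()
    )


PROPOSED_LINE = "AC00 11A8;AC01;1100 1161 11A8;AC01;1100 1161 11A8"
line_passes = check_line(PROPOSED_LINE)

# Every LV syllable combined with every real trailing consonant.
lv_t_combinations = L_COUNT * V_COUNT * (T_COUNT - 1)
print(line_passes, lv_t_combinations)
```

The count comes to 399 LV syllables times 27 trailing consonants, that is 
10773 combinations, which matches the "around 10'000" estimate above.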

Date/Time: Fri Oct 18 17:04:27 CDT 2013
Contact: loren.brichter@gmail.com
Name: Loren Brichter
Report Type: Other Question, Problem, or Feedback
Opt Subject: UAX #9 6.3.0 Bidi Algorithm feedback


In implementing UAX #9 Bidi Algorithm (6.3.0) I encountered a few 
issues, some of which may be clarified by tweaked wording in the spec.

1. Section 5.2, X9 modifier, "assign the embedding level to each 
formatting character" and "turn it into BN".

Turning it into BN makes sense, but to what "embedding level" is 
this referring? They are already at the embedding level that they 
are at. As these BNs are ignored in subsequent steps, theoretically 
it doesn't matter what embedding level is assigned, so perhaps 
this could be removed.

2. BD16: this algorithm makes no mention of a maximum stack depth, 
which could lead to implementations diverging. I'd love to see it 
capped at max_depth to keep things simple.

2.5. (Also, I completely skipped over the word "canonical" in BD16 
originally — mentioning that would be helpful, and even just including 
the 2(?) legacy cases would have saved me a bit of time).

Thanks,
Loren
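
To make point 2 concrete, here is a simplified sketch of stack-based 
bracket pairing in the style of BD16. Several things are assumptions and 
not part of UAX #9 as published: the ASCII-only BRACKET_PAIRS table stands 
in for BidiBrackets.txt, the max_depth parameter is the cap suggested 
above, and the overflow policy (stop and keep the pairs found so far) is 
one possible choice. Canonical equivalents and the restriction to 
characters whose current Bidi class is ON are left out.

```python
# Hypothetical, ASCII-only stand-in for the data in BidiBrackets.txt.
BRACKET_PAIRS = {"(": ")", "[": "]", "{": "}"}
CLOSERS = set(BRACKET_PAIRS.values())


def identify_bracket_pairs(text, max_depth=None):
    """Return (open_pos, close_pos) pairs, sorted by opening position."""
    stack = []  # entries are (expected closing bracket, opening position)
    pairs = []
    for pos, ch in enumerate(text):
        if ch in BRACKET_PAIRS:
            if max_depth is not None and len(stack) >= max_depth:
                break  # assumed overflow policy: stop, keep pairs found so far
            stack.append((BRACKET_PAIRS[ch], pos))
        elif ch in CLOSERS:
            # Search from the top of the stack for a matching opener.
            for i in range(len(stack) - 1, -1, -1):
                if stack[i][0] == ch:
                    pairs.append((stack[i][1], pos))
                    del stack[i:]  # pop the match and everything above it
                    break
            # An unmatched closer is simply skipped.
    return sorted(pairs)


print(identify_bracket_pairs("a(b[c)d]"))
print(identify_bracket_pairs("(a[b]c)"))
print(identify_bracket_pairs("(a[b]c)", max_depth=1))
```

Without an agreed cap, two implementations with different internal stack 
limits could return different pair lists for deeply nested input, which is 
the divergence the comment is about.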

Date/Time: Mon Oct 21 10:32:03 CDT 2013
Contact: smontagu@smontagu.org
Name: Simon Montagu
Report Type: Error Report
Opt Subject: xidmodifications.txt needs update for Unicode 6.3


http://www.unicode.org/Public/security/latest/xidmodifications.txt 
is still the 6.2 version and has not been updated to include 
changes in 6.3.

There are at least two such changes that will affect xidmod. Firstly, 
U+180E MONGOLIAN VOWEL SEPARATOR should change from "restricted ; 
not-xid" to "restricted ; default-ignorable". This may not make much 
practical difference. More seriously, the new U+061C ARABIC 
LETTER MARK needs to be added to "restricted ; default-ignorable". 
(The other new Bidi control characters are already there as "reserved".)

Date/Time: Mon Oct 28 21:31:18 CDT 2013
Contact: lunde@adobe.com
Name: Ken Lunde
Report Type: Error Report
Opt Subject: U+1F12E (CIRCLED WZ) decomposition


I noticed an inconsistency between the Code Chart glyph of U+1F12E 
and its decomposition. Its decomposition is <0057 005A> ("WZ"), but 
its Code Chart glyph suggests <0057 007A> ("Wz").

Response from Ken Whistler, 2013/10/29:

I just ran an extensive back search, and this may have been an error that I made
on May 4, 2009, which was never caught during beta review of the data files.

The Amd 6 post Dublin chart (L2/09-172) had the correct decomposition to
<circle> W z, but there are various anomalies in the process here. The U.S.
ballot comments on FPDAM6, which asked for this, L2/09-082, claimed that
the decomposition was listed in L2/09-034, Karl Pentzlin's proposal document,
but in fact it wasn't. Nor was a decomposition explicitly listed in Germany's
ballot comments. That means that Michel put the decomposition in himself
in the Amd 6 data files. But there seems to be a handoff glitch for Amd 6
data for addition to the draft Unicode 5.2 data I already had lying around 
containing Amd 5 data. I can't find my copy of the FDAM 6 names list file, 
which ordinarily I would have archived. Instead I see a UnicodeData delta 
only, with a manual addition of the decomposition for U+1F12E that I did 
on May 4, 2009. I would ordinarily get the decompositions from a combination 
of examination of proposals and examination of the FDAM 6 names list annotation entries.
But 4-1/2 years later, I can't recover the exact details of what happened here.
My own handwritten UTC notes from February, 2009 are ambiguous about
whether the "z" was supposed to be uppercase or lowercase, so that might
have been the source of my original error.

At any rate, this error was totally missed in the beta review for Unicode 5.2,
and it has taken 4 years for somebody to report it as a problem. Not sure
whether that deserves a :-) or a :-(

Date/Time: Tue Oct 29 12:49:45 CDT 2013
Contact: ajithramayyan@yahoo.co.in
Name: Ajith R
Report Type: Error Report
Opt Subject: MALAYALAM CONJUNCTS NTA and TTA


I am a native Malayalam speaker and wish to point out two errors in 
the treatment of Malayalam in the Unicode Standard 6.3.

The standard directs 
1) the sequence <0D7B, 0D4D, 0D31> to be rendered as "NTA" ‍ന്‍റ
2) the sequence <0D31, 0D4D, 0D31> to be rendered as "TTA" റ്റ

While on the face of it this scheme gives the desired visual result, it is 
only as correct or wrong as using <0D7B, 0D4D, 0C67> or <0D7B, 0D4D, 
0CE7>  for "NTA" ന്‍റ or  <0C67, 0D4D, 0C67>  or <0CE7, 0D4D, 0CE7>  
to represent "TTA" റ്റ.

The "NTA" ‍ ‍ന്‍റ is actually a combination of MALAYALAM LETTER CHILLU N, 
0D7B and MALAYALAM LETTER TTTA, 0D3A, though it is written as chillu n 
combined with rra, 0D31. It is pronounced similar to the nt of ant.
Similarly, the "TTA" റ്റ is a duplication of MALAYALAM LETTER TTTA, 
though it is shown as one rra below the other. It is pronounced similar 
to the t of bat, but with more stress.

The reason for this apparent digraph, where the rra, represents its 
original sound as well as "ttt", is that MALAYALAM LETTER TTTA is never 
used singly. It occurs only in these two conjuncts "NTA" ‍ന്‍റ and "TTA" റ്റ. 
In native Malayalam words, RRA is not duplicated either. So, the same 
curved symbol has been used to represent the "TTTA" occurring in these 
conjuncts. This fact is described in the book "Samboorna Malayala Vyakaranam" 
by V Ramkumar, publisher SISO books, and in it the author quotes KeralaPanini.

My suggestion is 
	1) "NTA"  ‍ന്‍റ be defined as a precomposed character that is 
	decomposable to <0D7B, 0D4D, 0D3A> instead of the current 
	suggestion of rendering the sequence <0D7B, 0D4D, 0D31> as 
	"NTA"
	2) "TTA" റ്റ be defined as a precomposed character that is decomposable 
	to <0D3A, 0D4D, 0D3A> instead of the current suggestion of 
	rendering the sequence <0D31, 0D4D, 0D31> as "TTA"

ajith

Date/Time: Sun Nov 3 15:49:01 CST 2013
Contact: andrewcwest@gmail.com
Name: Andrew West
Report Type: Error Report
Opt Subject: Character Name for U+2B81


The character name for U+2B81, to be added in Unicode 7.0, has a typo.  
The actual name in the ISO/IEC 10646:2012 Amd.1 text and the Unicode 
7.0 beta files http://www.unicode.org/Public/7.0.0/ucd/UnicodeData-7.0.0d12.txt
is:

UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS DOWNWARDS OF TRIANGLE-HEADED ARROW

This should be:

UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF DOWNWARDS TRIANGLE-HEADED ARROW

(cf. U+2B83 "DOWNWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF UPWARDS TRIANGLE-HEADED ARROW")

As the actual name is confusing/misleading and makes it difficult for 
users to find the character in code charts etc. when searching for e.g. 
"ARROW LEFTWARDS OF", I suggest adding a named alias for U+2B81 when 
Unicode 7.0 is released.

NamesList.txt:
2B81	UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS DOWNWARDS OF TRIANGLE-HEADED ARROW
	% UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF DOWNWARDS TRIANGLE-HEADED ARROW

NameAliases.txt:
2B81;UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF DOWNWARDS TRIANGLE-HEADED ARROW;correction
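
The word-order problem can be detected mechanically by checking names 
against the pattern used by the neighbouring character U+2B83. A small 
sketch: the regular expression NAME_PATTERN and the function 
follows_pattern are hypothetical, and the name strings are copied from 
this report.

```python
import re

# Pattern shared by U+2B83 and by the corrected name for U+2B81:
# "<direction> TRIANGLE-HEADED ARROW LEFTWARDS OF <direction> TRIANGLE-HEADED ARROW"
NAME_PATTERN = re.compile(
    r"^(UPWARDS|DOWNWARDS) TRIANGLE-HEADED ARROW LEFTWARDS OF "
    r"(UPWARDS|DOWNWARDS) TRIANGLE-HEADED ARROW$"
)


def follows_pattern(name):
    return NAME_PATTERN.match(name) is not None


BETA_NAME_2B81 = "UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS DOWNWARDS OF TRIANGLE-HEADED ARROW"
CORRECTED_NAME_2B81 = "UPWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF DOWNWARDS TRIANGLE-HEADED ARROW"
NAME_2B83 = "DOWNWARDS TRIANGLE-HEADED ARROW LEFTWARDS OF UPWARDS TRIANGLE-HEADED ARROW"

for label, name in (
    ("U+2B81 (beta)", BETA_NAME_2B81),
    ("U+2B81 (corrected)", CORRECTED_NAME_2B81),
    ("U+2B83", NAME_2B83),
):
    print(label, follows_pattern(name))
```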

Other Reports

None at this time.
