The sections below contain comments received on the open Public Review Issues and other feedback as of August 3, 2010, since the previous cumulative document was issued prior to UTC #123 (May 2010).
151 Proposed Update UAX #44: Unicode Character Database
152 Proposed Update UAX #15: Unicode Normalization Forms
156 Proposed Update UAX #9: Unicode Bidirectional Algorithm
157 Proposed Update UAX #11: East Asian Width
158 Proposed Update UAX #14: Unicode Line Breaking Algorithm
159 Proposed Update UAX #24: Unicode Script Property
160 Proposed Update UAX #29: Unicode Text Segmentation
161 Proposed Update UAX #31: Unicode Identifier and Pattern Syntax
162 Proposed Update UAX #34: Unicode Named Character Sequences
163 Proposed Update UAX #38: Unicode Han Database (Unihan)
164 Proposed Update UAX #41: Common References for UAXes
165 Proposed Update UAX #42: Unicode Character Database in XML
166 Proposed Update UTS #10: Unicode Collation Algorithm
167 Ideographic Variation Database Submission
169 Glyph Variation of Double Oblique Hyphen
170 Unicode 6.0.0 Beta
171 Proposal to change properties of U+06DE ARABIC START OF RUB EL HIZB
172 Proposed Update UTS #46: Unicode IDNA Compatibility Processing
173 Invariant Tests
174 Proposed Draft UTR #49: Unicode Character Categories
Feedback on Encoding Proposals
Feedback TUS 5.2 and Charts
Closed Public Review Issues
Other Reports
Date/Time: Tue Aug 3 08:49:08 CDT 2010
Contact: emmanuel@vallois.name
Name: Emmanuel Vallois
Report Type: Public Review Issue
Opt Subject: PRI #151 Proposed Update UAX #44: Unicode Character Database
A small error, in Table 15. Canonical_Combining_Class Values, the Description for value 204 says "top right" where it should say "bottom right", the true top right is value 216
204 Marks attached at the top right
216 Attached_Above_Right Marks attached at the top right
No feedback was received via the reporting form this period.
Date/Time: Thu Jul 1 22:44:43 CDT 2010
Contact: cewcathar@hotmail.com
Name: C. E. Whitehead
Report Type: Public Review Issue
Opt Subject: tr9 proposed update
Hi, regarding
http://www.unicode.org/reports/tr9/proposed.html
My Questions and Comments are below.
In general, it's very readable. Thanks. (However I am confused about this revision since I expected some of the changes being discussed on the bidi list to be incorporated -- perhaps this is premature.)
* * *
I. QUESTIONS
Section 2. Directional Formatting Codes par 1
"All of these codes are limited to the current paragraph; thus their effects are terminated by a paragraph separator."
ALSO Section 3, par 2:
"The algorithm reorders text only within a paragraph; characters in one paragraph have no effect on characters in a different paragraph. Paragraphs are divided by the Paragraph Separator or appropriate Newline Function (for guidelines on the handling of CR, LF, and CRLF, see Section 4.4, Directionality, and Section 5.8, Newline Guidelines of [Unicode]). Paragraphs may also be determined by higher-level protocols: for example, the text in two different cells of a table will be in different paragraphs."
{ QUESTION: what happened to bidi-break (hard, soft)?
http://lists.w3.org/Archives/Public/public-i18n-bidi/2010AprJun/0074.html
"8. (Section 3.1) Add a new HTML attribute that affects the behavior of > all descendant <br> elements. > 1. Tentative syntax for the attribute: bidi-break="soft"|"hard". . . ."
It looks like you removed this from the proposed changes that you were discussing -- see: http://www.w3.org/2010/06/08-core-irc
ALSO A NOTE: Besides the reasons you mentioned for using br --
- to keep the same style that some applications have when you add additional custom text / or because an application inserted it (for display consistency -- generally hard break),
- to separate out lines of poetry within the same stanza or to separate out the end of a link or something else within a paragraph or even to separate out parts of a quotation within blockquote (generally soft break)
- I use br to increase the space between paragraphs (list elements, div's, etc.) when the default spacing is not sufficient for my purposes -- in a case where I am creating a page that will display nicely for css-illiterate browsers and thus different paragraph spacings do not work ( the end of paragraph marker is already there so no additional separation is needed)
- the w3c frequently uses br/ }
* * *
Section 2.4, last par
"That is because their effect on bidirectional ordering is exactly the same as a corresponding strong directional character; the only difference is that they do not appear in the display."
{ COMMENT/QUESTION: as an outsider I honestly do not know what "appear in the display means" Strong directional characters RLO and LRO that are part of the html code certainly do not appear in my final display of a web page! So I found this one text confusing! }
* * *
II. COMMENTS
Section 2, "Explicit Bidi Controls", second to last par and last par
"U+202A..U+202E LEFT-TO-RIGHT EMBEDDING..RIGHT-TO-LEFT OVERRIDE "On web pages, the explicit bidi controls should be replaced by using the dir attribute with the values dir="ltr" or dir="rtl". For more information, see [UTR20]." ". . . . . . The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information."
{ COMMENT: dir does not always work with boundary neutrals. I suppose you know this. But this is a little frustrating. I inserted "when possible" into the sentence beginning 'On web pages' => "On web pages, the explicit bidi controls should be replaced -- when possible -- by using the dir attribute with the values dir="ltr" or dir="rtl". For more information, see [UTR20]." ". . . . . . The algorithm is designed so that the use of explicit codes can be equivalently represented by out-of-line information, such as stylesheet information." }
* * *
Section 4.3, last p
"When text using a higher-level protocol is to be converted to Unicode plain text, for consistent appearance formatting codes should be inserted to ensure that the order matches that of the higher-level protocol." http://lists.w3.org/Archives/Public/public-i18n-bidi/2010AprJun/0077.html { COMMENT: the bidi list has been discussing preserving logical order and visual order in conversion to plain text; FOR MY PART, I support preserving logical order; also visual order when html or other formatted documents are translated to plain text; however, I think discussing preservation of formatting other than directionality / visual order is out of scope for bidi (Text of list thread: "We have discussed but not reached a conclusion for the following suggestion: When translating HTML to plain text, e.g. for copy/paste, the result should contain the appropriate existing Unicode directional formatting codes so that the text is displayed in the same visual order (by UBA-compliant software) as the HTML, while retaining the text’s logical order. This should be taken up in an e-mail thread") }
* * *
III. MORE QUESTIONS -- Instances Where The Attributes Being Discussed On The Bidi List Are Pertinent -- Not Important Unless You Want To Update This draft To Reflect These Discussed Changes -- That Is, I DO NOT NEED ANSWERS; These Are Just Instances Where The Current Draft Would Be Affected By The Proposed Changes At The BIDI LIST & Can Be Dropped.
Section 4.2
{ QUESTION: Should support for attributes (such as dir) be mentioned here? What about support for the new attributes we have been talking about? Should it be mentioned that this is necessarily forthcoming? Or not? }
* * *
Section 4.3
"HL3. Emulate directional overrides or embedding codes. A higher-level protocol can impose a directional override or embedding on a segment of structured text. The behavior must always be defined by reference to what would happen if the equivalent explicit codes as defined in the algorithm were inserted into the text. For example, a style sheet or markup can set the embedding level on a span of text.
{ QUESTION: I may have gotten this wrong but is not ubi an example of such markup? }
* * *
Section 4.3
"HL4. Apply the Bidirectional Algorithm to segments. The Bidirectional Algorithm can be applied independently to one or more segments of structured text. For example, when displaying a document consisting of textual data and visible markup in an editor, a higher-level process can handle syntactic elements in the markup separately from the textual data."
{ QUESTION: is the list-style-direction attribute an attribute which enables separate handling of syntactic elements and textual elements? }
* * *
Section 5.5 and 5.6
{ COMMENT/QUESTION: Are not 5.5 and 5.6 discussing issues addressed in section 2.2 of http://www.w3.org/TR/html-bidi/#bidi-isolation Example 1-2 ; will these eventually be handled as bidi-isolate ubi? }
* * *
5.6, "Migrating from 2.0 to 3.0," Table 6 -- { QUESTION: will you ultimately add another table for the new markup being discussed? not this revision I suppose? }
* * *
Best,
--C. E. Whitehead
cewcathar@hotmail.com
Date/Time: Fri Jul 2 12:12:25 CDT 2010
Contact: asmus@unicode.org
Name: Optional
Report Type: Public Review Issue
Opt Subject: PRI#156 - bidi - move text from 4.3 to new 5.7
This is editorial in nature
It has been noted that the last sentence in 4.3 "When text using a higher-level protocol is to be converted to Unicode plain text, for consistent appearance formatting codes should be inserted to ensure that the order matches that of the higher-level protocol" is out of context of the rest of the section.
A proposed editorial fix is to make this a new section 5.7 with a title like: Conversion from rich text to plain text.
This fix is appropriate on another level, since the sentence in question is not an explication of the clauses introduced in 4.3, but an implementation note, such as those found in section 5. Its current location seems to have been selected because it contains the phrase "higher-level protocol" but that, by itself, is not sufficient grounds to keep this information in section 4.3.
Making this change would more easily accommodate future elaboration of this important piece of implementation advice.
Date/Time: Fri Jul 2 12:26:30 CDT 2010
Contact: asmus@unicode.org
Name: Optional
Report Type: Error Report
Opt Subject: PRI#156 - bidi - wording
This is editorial
In section 4.3 there is the following text:
Clauses HL1 and HL3 are not logically necessary; they are covered by applications of clauses HL4 and HL5. However, they are included for clarity because they are more common operations.
This formulation is exceedingly awkward for a standard. Here's a suggestion for a better phrasing.
Clauses HL1 and HL3 are specializations of applying the more general clauses HL4 and HL5. They are provided here explicitly because they directly correspond to common operations.
I believe a formulation like the one suggested conveys the same information (HL1 and HL3 are subsumed by HL4 and HL5), but in a way that sounds less like making an excuse. It also avoids the fuzzy "for clarity" wording.
Date/Time: Sun Jul 18 11:11:21 CDT 2010
Contact: matial@il.ibm.com
Name: Matitiahu Allouche
Report Type: Error Report
Opt Subject: Proposed Update Unicode Standard Annex #9
In section 4.3 "Higher-Level Protocols", in the first paragraph of explanations for HL1, it would be nice to make the mention of "BD2" a link, just like is the case for the mentions of P2 and P3.
No feedback was received via the reporting form this period.
Date/Time: Mon Aug 2 09:17:51 CDT 2010
Contact: emmanuel@vallois.name
Name: Emmanuel Vallois
Report Type: Public Review Issue
Opt Subject: PRI #158 Proposed Update UAX #14: Unicode Line Breaking Algorithm
I will here submit what I think is an omission. Section 5.1 Description of Line Breaking Properties, property BA: Break After (A), Dandas These should probably be included:
A9C8 JAVANESE PADA LINGSA
A9C9 JAVANESE PADA LUNGSI
ABEB MEETEI MAYEK CHEIKHEI
11047 BRAHMI DANDA
11048 BRAHMI DOUBLE DANDA
110C0 KAITHI DANDA
110C1 KAITHI DOUBLE DANDA
Here I think the current wording is incorrect and unclear:
Section 5.1 Description of Line Breaking Properties, property ID: Ideographic (B/A) The list after "The ID line break class includes the following characters:" is incorrect. It gives the false impression that all listed code points are of ID line break class, but this is not the case. Besides, the description is incomplete, even when taking into account the sentence below the list. I will detail the problem for each incorrect line of the list.
2E80..2FFF CJK, KANGXI RADICALS, DESCRIPTION SYMBOLS
- All *assigned* characters in this range are ID, not unassigned characters
3000 IDEOGRAPHIC SPACE
- OK
3040..309F Hiragana (except small characters)
30A0..30FF Katakana (except small characters)
- Not all assigned characters in these ranges are ID, there are some CM (U+3099..U+309A), and NS (U+309B..U+309E and U+30FB..U+30FE) that are not small letters, unassigned characters are not ID
3400..4DB5 CJK UNIFIED IDEOGRAPHS EXTENSION A
- OK
4E00..9FBB CJK UNIFIED IDEOGRAPHS
- end of range is incorrect since Unicode 5.1
F900..FAD9 CJK COMPATIBILITY IDEOGRAPHS
- OK
A000..A48F YI SYLLABLES
A490..A4CF YI RADICALS
- use the true end of assigned characters, unassigned characters are not ID
FE62..FE66 SMALL PLUS SIGN to SMALL EQUALS SIGN
- why not FE5F..FE66 SMALL NUMBER SIGN to SMALL EQUALS SIGN
FF10..FF19 WIDE DIGITS
- OK
20000..2A6D6 CJK UNIFIED IDEOGRAPHS EXTENSION B
- OK, but what about extension C, extension D ? Is that necessary to list plane 2 blocks at all, or all of them, given that all code points of this plane default to ID ?
2F800..2FA1D CJK COMPATIBILITY IDEOGRAPHS SUPPLEMENT
- OK
There are also some characters with line breaking class ID in the CJK Compatibility Forms block, in the Small Form Variants block and not in the list and in the Halfwidth and Fullwidth Forms which are not LATIN LETTERS. I feel some clarification is needed.
Below, add CJK Unified Ideographs Extension D to the list of blocks and regions in which unassigned characters default to line break class ID.
Date/Time: Tue Jul 27 20:54:33 CDT 2010
Contact: cewcathar@hotmail.com
Name: CE Whitehead
Report Type: Public Review Issue
Opt Subject: minor grammar comment on new text only in tr24; sorry it got long
http://www.unicode.org/reports/tr24/proposed.html
Minor grammar comment (note it may be easiest for you to make your uses of "data" singular in all cases since this is what you seem to be doing most)
* * *
2.8 example 3
"For many common tasks, the regex expression [:script=Arab:] is too narrow, because it does not include U+0660 ( ٠ ) ARABIC-INDIC DIGIT ZERO, but the expression [[:script=Arab:][:script=Common:]] is far too broad, because it also includes thousands of symbols, plus the U+30FC ( ー ) KATAKANA-HIRAGANA PROLONGED SOUND MARK. A regex engine can instead specify that a regular expression like [:script=Arab:] matches any character with the script value Arab <em><strong>or</strong></em> whose extended script data lists . . . "
{ COMMENT: decide whether you want "data" singular or plural in this section and elsewhere -- that is "data lists" or "data list" ; however generally you've opted for singular data so it's easiest to stick with that }
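The matching rule quoted from section 2.8 can be sketched as follows. This is only an illustration: the two tables are hypothetical toy data (a real engine would load Scripts.txt and the proposed script extension data), the function name is made up, and the extension entries shown for U+0660 and U+30FC are assumptions.

```python
# Hypothetical toy tables, not real UCD data.
SCRIPT = {
    0x0627: "Arab",  # ARABIC LETTER ALEF
    0x0660: "Zyyy",  # ARABIC-INDIC DIGIT ZERO (script value Common)
    0x30FC: "Zyyy",  # KATAKANA-HIRAGANA PROLONGED SOUND MARK (Common)
}
SCRIPT_EXTENSIONS = {
    0x0660: {"Arab"},          # assumed entry, for illustration only
    0x30FC: {"Hira", "Kana"},  # assumed entry, for illustration only
}

def matches_script(cp, script):
    """True if cp has the given script value or lists it as an extension."""
    if SCRIPT.get(cp) == script:
        return True
    return script in SCRIPT_EXTENSIONS.get(cp, set())
```

With these toy tables, U+0660 matches Arab even though its script value is Common, while U+30FC does not.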
* * *
2.8 last par
"The data in this file is primarily targeted at customary modern use of characters, and do not encompass technical usage such as UPA or math. The data is based on the best available knowledge of usage, which may change over time. "
{ COMMENTS general: right, data is traditionally a Latin plural but can be used with a singular verb in some 'usage circles' at least (and I do not think datum is that good a word; sorry to the Latin purists; I waited till college to get my Latin reserving my h.s. for Spanish and French; perhaps that's what's wrong) like other 'collective' nouns such as "team," "group," etc. }
( if you care to read further on this ; not sure ; anyway if you do: http://www.childrensmercy.org/stats/ask/data_is.asp http://www.wfu.edu/biology/albatross/dataare.htm http://www.gi.alaska.edu/ScienceForum/ASF3/334.html )
Options
* 1
{COMMENTS: option 1 make both plural "is" => "are" } =>
"The data in this file is primarily targeted at customary modern use of characters, and does not encompass technical usage such as UPA or math. The data are based on the best available knowledge of usage, which may change over time. "
* 2
{COMMENTS: option 2 make both singular "do" => "does" this may be more preferred in some but not all technical circles; but yours encompasses languages and linguistics . . . }
"The data in this file is primarily targeted at customary modern use of characters, and do not encompass technical usage such as UPA or math. The data are based on the best available knowledge of usage, which may change over time. "
* 3
{COMMENTS: option 3 change "data" to "file" in a rephrasing; file is always singular; in this case you have kept data plural so also change the second "is" => "are" }
"This file is primarily targeted at customary modern use of characters, and its data do not encompass technical usage such as UPA or math. The data are based on the best available knowledge of usage, which may change over time. "
* * *
Section 4 last par
"This data is provided provisionally to supplement the data in Scripts.txt. Because this is supplemental data, not associated with a separate Unicode property, there is no default value for code points not explicitly mentioned in the data file." { COMMENTS: perhaps you want "data" plural here if you make it plural everywhere; in that case change "This" to "These" and "is" to "are" but then in the next sentence you need to say "Because this file consists of supplemental data" or some such, as I no longer feel comfortable with "this is . . . " and opted for "this file consists of . . ." } =>?
"These data are provided provisionally to supplement the data in Scripts.txt. Because this file consists of supplemental data, not associated with a separate Unicode property, there is no default value for code points not explicitly mentioned in the data file."
* * *
(NOTE: I've not gone through all of the document yet; just the highlighted text; and then I did a search on "data" throughout the document. You did seem to decide in all other parts of the document that "data" was singular; here are instances where "data" was used with an exclusively singular verb; using wordpad's "find" quickly I did not find any instance where it was used with an exclusively plural verb except in the one instance of the highlighted new text above:
* * *
1.1 "The data in the Default Unicode Collation Element Table (DUCET) is grouped by script, so that letters of different scripts have different primary sort weights. However, numbers, symbols, and punctuation are not grouped with the letters."
2.1 par 2
"As more data on the usage of individual characters is collected, the script property value assigned to a character may change."
3.2 last par
"Script values are not immutable. As more data on the usage of individual characters is collected, script values may be reassigned using the above methodology."
{ I guess in most of these instances "is" -- like you've been using -- is fine } )
Best,
C. E. Whitehead
cewcathar@hotmail.com
Date/Time: Thu Jul 29 08:12:19 CDT 2010
Contact: verdy_p@wanadoo.fr
Name: Philippe Verdy
Report Type: Error Report
Opt Subject: [Data24][ISO15924] Multiple scripts and Arabic
In the light of TR24 update
(1) Existing exception :
There's one example of a digit which has a numeric type = decimal, AND is encoded in a "scattered" way:
19DA;6618;᧚;New Tai Lue Tham Digit One;Nd;0;L;...;1;1;1;N
The other nine decimal digits for the Tham variant of the New Tai Lue digits are borrowed from another sequence of decimal digits, starting at U+19D0 (for digit zero) with the exception of U+19D1 which is replaced (for digit one). Both sets are assigned in the same "New_Tai_Lue" script property value.
So the additional stability proposal (discussed on the mailing list) will not be enforceable.
(2) Arabic digits :
Such a case was avoided for the Eastern/Extended variant of Arabo-Indic digits in U+06F0..U+06F9, without borrowing the common forms for the Standard variant in U+0660..U+0669: they were reencoded separately to create a complete sequence of 10 digits, even if most of them (all except 4 to 6) are exactly similar and belong to the same unified "script".
But what is even more "strange" is that the Standard Arabic digits are assigned to the "Common" script, when the Eastern/Extended variant is assigned to the "Arabic" script (look at the Unicode script property value, from the file "Scripts-5.2.0.txt" in the UCD).
If you just look at this property, you may think that the Extended/Eastern digits are the standard ones for the Arabic script: this is a side-effect of unification of Western and Eastern variants of the Arabic script.
(3) Unification of the Arabic script:
Ideally, there should be two additional separate ISO 15924 script codes for the Western and Eastern variants of the Arabic script (possibly [Arbs] for Standard/Western, and [Arbx] for Extended/Eastern), and the Unicode "script" property value alias for the Western and Eastern digits or letters should be segregated, using a separate Script property value (splitting the Arabic script, where it is significant, just like it occurred for Georgian and Greek/Coptic alphabets).
Nothing will be changed for the existing Arabic script, but the "Extended/Eastern Arabic" script (assigned with a new ISO 15924 code and mapped with a new property alias in Unicode), will still borrow most of its letters from the standard script without reencoding them.
No character or block will be renamed (and I DO NOT propose disunifying existing common Arabic letters, or assigning them in the "Common" script), it should just be a better sub-classification, where the characters are clearly distinguished between the two variants.
Most Arabic characters should remain in the common "Arabic" script, and those that are differentiated should be assigned in a "Standard_Arabic" or "Extended_Arabic" script. But this may cause some complication for the script inheritance in spans of texts (because the "Arabic" script property value would behave a bit like what the "Common" does for alphabetic scripts, i.e. like a group of scripts).
Such change for the assigned script property value (if it's not already stabilized) would require documentation, and changes in a few other core or derived datafiles:
- PropertyValueAliases.txt (adding two new property values for "sc"):
sc ; Arab ; Arabic # (All forms, includes "sc=Arbc", "sc=Arbs" and "sc=Arbx" in regexps)
sc ; Arbc ; Common_Arabic
sc ; Arbs ; Standard_Arabic # (also includes "sc=Arbc" in regexps)
sc ; Arbx ; Extended_Arabic # (also includes "sc=Arbc" in regexps)
- Scripts.txt (assigning the two new property values to remap existing "Arabic")
- Arabic-Shaping.txt (possibly adding comments at end of lines where this is not the Common Arabic)
- Joining-Groups.txt (same remark)
- Bidi-Mirroring.txt (same remark)
And in the description of some standard script identification and segmentation algorithms. I don't know if IDNA should continue to use "Arab" (all forms) or if it should segregate "Arbs" and "Arbx" (to avoid mixing digits that are visually confusable), as it uses such segmentation (note that these characters are canonically different, for normalization purposes).
These distinctions should also be included in the proposed ScriptExtended.txt for TR24 in Unicode 6.0 (does it define a new property name/property alias ?)
Date/Time: Mon Aug 2 09:24:00 CDT 2010
Contact: emmanuel@vallois.name
Name: Emmanuel Vallois
Report Type: Public Review Issue
Opt Subject: PRI #160 Proposed Update UAX #29: Unicode Text Segmentation
Some character additions in version 5.2 of the standard are not reflected in the following table: 3.1 Default Grapheme Cluster Boundary Specification, Table 2 For value Prepend, include all characters that have Logical_Order_Exception=True, this now includes these Tai Viet characters as well:
AAB5 TAI VIET VOWEL E
AAB6 TAI VIET VOWEL O
AAB9 TAI VIET VOWEL UEA
AABB TAI VIET VOWEL AUE
AABC TAI VIET VOWEL AY
For values L, V and T, include new characters in the Hangul Jamo block as well as in the new Hangul Jamo Extended-A and Hangul Jamo Extended-B blocks. This gives:
L Hangul_Syllable_Type=L, that is:
U+1100 (ᄀ) HANGUL CHOSEONG KIYEOK ..U+115F (ᅟ) HANGUL CHOSEONG FILLER
U+A960 (ꥠ) HANGUL CHOSEONG TIKEUT-MIEUM ..U+A97C (ꥼ) HANGUL CHOSEONG SSANGYEORINHIEUH
V Hangul_Syllable_Type=V, that is:
U+1160 (ᅠ) HANGUL JUNGSEONG FILLER ..U+11A7 (ᆧ) HANGUL JUNGSEONG O-YAE
U+D7B0 (ힰ) HANGUL JUNGSEONG O-YEO ..U+D7C6 (ퟆ) HANGUL JUNGSEONG ARAEA-E
T Hangul_Syllable_Type=T, that is:
U+11A8 (ᆨ) HANGUL JONGSEONG KIYEOK ..U+11FF (ᇿ) HANGUL JONGSEONG SSANGNIEUN
U+D7CB (ퟋ) HANGUL JONGSEONG NIEUN-RIEUL ..U+D7FB (ퟻ) HANGUL JONGSEONG PHIEUPH-THIEUTH
Date/Time: Wed Jun 16 16:19:08 CDT 2010
Contact: behdad@behdad.org
Name: Behdad Esfahbod
Report Type: Error Report
Opt Subject: Conflicting clauses in UAX#31
Under Section 1 Introduction:
a) Right after Figure 1 it reads: "The ID Nonstart set is defined as the set difference ID_Continue minus ID_Start." Simple set theory suggests that ID Nonstart and ID_Start are mutually exclusive, ie. no character can belong to both groups.
b) Then later in the section, under stability it says: "The ID_Start and ID Nonstart characters may grow over time... However, neither will ever decrease."
c) But the following Table 1 suggests that an ID Nonstart character may change into ID Start in future versions of Unicode.
From a) and c) it follows that in a future version of Unicode a character may cease to be in ID Nonstart. That's clearly in conflict with b).
Suggested fix: under stability, instead of promising non-shrinkage on ID_Start and ID Nonstart, make such promise re ID_Start and ID_Continue.
(editorial note: is there a reason that "ID Nonstart" is spelled without an underscore while all other categories use underscore instead of space in their names?)
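The conflict described in this report can be illustrated with a small set sketch. The code points and the two "versions" below are hypothetical toy data, not actual ID_Start or ID_Continue values; the only thing taken from the report is the definition of ID Nonstart as ID_Continue minus ID_Start.

```python
# Toy illustration: the sets below are hypothetical, not real property data.

def id_nonstart(id_start, id_continue):
    """ID Nonstart as quoted in the report: ID_Continue minus ID_Start."""
    return id_continue - id_start

# Hypothetical version N: 'x' may continue an identifier but not start one.
start_v1 = {"a", "b"}
continue_v1 = {"a", "b", "x", "0"}

# Hypothetical version N+1: 'x' is promoted to ID_Start (the Table 1 case).
start_v2 = start_v1 | {"x"}
continue_v2 = set(continue_v1)

nonstart_v1 = id_nonstart(start_v1, continue_v1)
nonstart_v2 = id_nonstart(start_v2, continue_v2)

# ID_Start and ID_Continue only grow between the two versions ...
start_grows = start_v1 <= start_v2
continue_grows = continue_v1 <= continue_v2
# ... but ID Nonstart loses 'x', so it is not a superset of its old value.
nonstart_grows = nonstart_v1 <= nonstart_v2
```

In this toy case a stability promise on ID_Start and ID_Continue holds, while the same promise on ID Nonstart does not, which is the point of the suggested fix.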
No feedback was received via the reporting form this period.
Date/Time: Sat May 15 04:38:17 CDT 2010
Contact: francois@edemay.com
Report Type: Error Report (Unihan)
Opt Subject:
見见
see, observe, behold; percieve (typo : perceive)
No feedback was received via the reporting form this period.
No feedback was received via the reporting form this period.
Date/Time: Tue May 25 06:53:46 CDT 2010
Contact: emmanuel@vallois.name
Name: Emmanuel Vallois
Report Type: Public Review Issue
Opt Subject: PRI166, Proposed Update UTS #10: Unicode Collation Algorithm
-One little typo
In section 5.2, in the table labeled Equivalent Tailorings, second column "Unicode Collation Element Table", the last character of the fourth line (beginning with C0) should be À, not à.
-In the References section, for [JavaCollator],
I suggest updating the link to current (6) or next (7, due Sept 2010) version of Java. Even if the API has not changed, it would give an up-to-date feeling, rather than referencing a nearly decade-old version. This would give the following URLs:
http://java.sun.com/javase/6/docs/api/java/text/Collator.html
http://java.sun.com/javase/6/docs/api/java/text/RuleBasedCollator.html
or
http://java.sun.com/javase/7/docs/api/java/text/Collator.html
http://java.sun.com/javase/7/docs/api/java/text/RuleBasedCollator.html
* NOTE: This has already been taken care of by the ed committee.
Date/Time: Fri Jun 4 03:31:55 CDT 2010
Contact: ake.persson@mimer.se
Name: Åke Persson
Report Type: Public Review Issue
Opt Subject: #166 UCA - Middle dot collation
U+00B7 MIDDLE DOT is defined to collate as punctuation, except when preceded by the LETTER L, where it collates as an accent. This special-case handling of special combinations for a particular language in DUCET is erroneous. It should be handled by a Catalan-specific tailoring. The collation key for LETTER L WITH MIDDLE DOT should be changed to collate as LETTER L + MIDDLE DOT.
Date/Time: Fri Jun 4 03:54:44 CDT 2010
Contact: ake.persson@mimer.se
Name: Åke Persson
Report Type: Public Review Issue
Opt Subject: #166 UCA - Kannada collation
U+0CF1 KANNADA SIGN JIHVAMUIYA and U+0CF2 KANNADA SIGN UPADHMANIYA are letter 19 and 20 in the Kannada alphabet. U+0C95 KANNADA LETTER KA is letter 21. DUCET should be changed accordingly.
Reference http://www.unicode.org/L2/L2007/07230-n3290-vedic.pdf, page 12, Fig 2C,2D.
Date/Time: Fri Jun 4 06:41:24 CDT 2010
Contact: ake.persson@mimer.se
Name: Åke Persson
Report Type: Public Review Issue
Opt Subject: #166 UCA - Khmer Cf characters
U+17B4 KHMER VOWEL INHERENT AQ and U+17B5 KHMER VOWEL INHERENT AA are the only Cf characters that have a primary weight. This is probably an error, because Khmer tailorings reassign them to be ignorable.
Date/Time: Tue Jun 8 06:40:07 CDT 2010
Contact: ake.persson@mimer.se
Name: Åke Persson
Report Type: Public Review Issue
Opt Subject: #166 UCA - Default Ignorable Code Points
In 5.3 Unknown and Missing Characters, section Default Ignorable Code Points, we have the following statement: "To allow a greater degree of compatibility across versions of the standard, the ranges U+2060..U+206F,U+FFF0..U+FFFB, and U+E0000..U+E0FFF are reserved for format and control characters (General Category = Cf). Unassigned code points in these ranges should be ignored in processing and display." However, unassigned reserved control format characters are given primary weight instead of being ignored in collation.
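As a minimal sketch, the three ranges quoted above can be checked as follows. The function name is hypothetical, and the sketch only tests range membership; it does not determine whether a code point is actually unassigned or what weight a given collation table gives it.

```python
# Ranges quoted in the report from section 5.3 (inclusive bounds).
RESERVED_FORMAT_RANGES = [
    (0x2060, 0x206F),
    (0xFFF0, 0xFFFB),
    (0xE0000, 0xE0FFF),
]

def in_reserved_format_range(cp):
    """True if cp lies in one of the ranges reserved for format characters."""
    return any(lo <= cp <= hi for lo, hi in RESERVED_FORMAT_RANGES)
```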
Date/Time: Tue Jun 8 08:06:20 CDT 2010
Contact: ake.persson@mimer.se
Name: Åke Persson
Report Type: Public Review Issue
Opt Subject: #166 UCA - Latin title case letter collation
The following Latin title case letters have incorrect tertiary level weights:
U+01C5 LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON
U+01C8 LATIN CAPITAL LETTER L WITH SMALL LETTER J
U+01CB LATIN CAPITAL LETTER N WITH SMALL LETTER J
U+01F2 LATIN CAPITAL LETTER D WITH SMALL LETTER Z
They are expansions and the <compat> weight is assigned to both keys instead of only the last key. Example:
dz U+0064,U+007A tertiary weight 02 02
dz U+01F3 tertiary weight 04 04
Dz U+0044,U+007A tertiary weight 08 02
DZ U+0044,U+005A tertiary weight 08 08
Dz U+01F2 tertiary weight 0A 04
DZ U+01F1 tertiary weight 0A 0A
Date/Time: Sun Jul 25 04:14:27 CDT 2010
Contact: SADAHIRO@cpan.org
Name: SADAHIRO Tomoyuki
Report Type: Error Report
Opt Subject: Collation Test and CJK ExtC (5.2.0)
In Unicode 5.2.0, although 2A700..2B734 are Unified_Ideograph, the Collation Conformance Tests (wrongly) use FBC0 (not FB80) as BASE and place them after unassigned code points.
Please see 25 lines in CollationTest_NON_IGNORABLE.txt
FFF0 0062; # ...[FBC1 FFF0 1225..
2A700 0021; # ...CJK UNIFIED IDEOGRAPH-2A700
2A704 0062; # ...CJK UNIFIED IDEOGRAPH-2A704
C0000 0021; # ...<reserved-C0000> [FBD8 8000 026E | 0020 0020 | 0002 0002 |]
and corresponding parts in CollationTest_SHIFTED.txt
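The effect of the BASE value can be sketched as below, assuming the implicit weight derivation of UTS #10 (AAAA = BASE + (CP >> 15), BBBB = (CP & 0x7FFF) | 0x8000). The function and constant names are hypothetical; the BASE values FB80 and FBC0 are the ones named in the report.

```python
def implicit_weights(cp, base):
    """Return the (AAAA, BBBB) pair of implicit primary weights for cp."""
    aaaa = base + (cp >> 15)
    bbbb = (cp & 0x7FFF) | 0x8000
    return aaaa, bbbb

BASE_OTHER_IDEOGRAPH = 0xFB80  # BASE the report expects for 2A700..2B734
BASE_UNASSIGNED = 0xFBC0       # BASE the report says the test data uses

expected = implicit_weights(0x2A700, BASE_OTHER_IDEOGRAPH)
observed = implicit_weights(0x2A700, BASE_UNASSIGNED)
reserved = implicit_weights(0xC0000, BASE_UNASSIGNED)

for label, (aaaa, bbbb) in [("expected", expected),
                            ("observed", observed),
                            ("reserved", reserved)]:
    print(f"{label}: {aaaa:04X} {bbbb:04X}")
```

Under this assumption, the pair computed for C0000 with BASE FBC0 is FBD8 8000, which agrees with the test file line quoted above, and U+2A700 moves from FB85 to FBC5 when the wrong BASE is used.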
Date/Time: Mon Jul 26 04:45:31 CDT 2010
Contact: ake.persson@mimer.se
Name: Åke Persson
Report Type: Public Review Issue
Opt Subject: #166 UCA - Malayalam collation
U+0D29 MALAYALAM LETTER NNNA is better collated between U+0D31 MALAYALAM LETTER RRA and U+0D3A MALAYALAM LETTER TTTA.
Reference http://transliteration.eki.ee/pdf/Malayalam.pdf
Date/Time: Thu Jul 29 03:27:42 CDT 2010
Contact: ake.persson@mimer.se
Name: Åke Persson
Report Type: Public Review Issue
Opt Subject: #166 UCA - Malayalam AU-sign
The modern AU-sign (U+0D57) is primary different from the archaic AU-sign (U+0D4C). The difference should only be at tertiary level.
Date/Time: Thu Jun 24 04:10:04 CDT 2010
Contact: y.naoi@glamour.co.jp
Name: NAOI Yasushi
Report Type: Public Review Issue
Opt Subject: Comment on PRI167
1) I suppose that the following glyphs are already in CJK Unified Ideographs Extension B/D.
KS001760 (U+4E55 variant-2) : U+200B0
JTAD7A (U+50B3 variant-3) : U+2B74A
JTB546 (U+51DE variant-3) : U+20611
JTB24ES (U+6577 variant-2): U+22FBE
FT1786S (U+7A3D variant-2): U+25874
JTB7A2 (U+7B0B variant-2) : U+25B01
IB0861 (U+82E5 variant-2): U+20C25
JTBA87 (U+8613 variant-2) : U+27068
JTC095 (U+9F21 variant-2) : U+21FF3
http://d.hatena.ne.jp/NAOI/20100416/1271405196
http://d.hatena.ne.jp/NAOI/20100421/1271838927
2) I think that the base character of the following glyph should change.
JTBE25 (U+93AD): U+93AE
http://d.hatena.ne.jp/NAOI/20100422/1271930072
No feedback was received via the reporting form this period.
Date/Time: Thu May 27 19:14:46 CDT 2010
Contact: roozbeh@gmail.com
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: The missing Mongolian technical report
The Unicode book, in Section 13.2, page 414 of Unicode 5.2, says that: "[...] only the essential features of Mongolian shaping behavior are presented here; the precise details are to be presented in a separate technical report".
This is quite misleading, as people will then go to the technical reports section of the website, search for Mongolian, and arrive at the extremely superseded UTR#2.
I would go for either removing the part after the semicolon (safer, as we may never get to write a Mongolian spec, and if we do write it, it would be better placed in the book), or replacing the phrase "technical report" with something like "document". Either way, making this explicit helps a lot.
Date/Time: Tue Jun 8 22:57:07 CDT 2010
Contact: lekktu@gmail.com
Name: Juanma Barranquero
Report Type: Public Review Issue
Opt Subject: Public Review Issue #170
In the UnicodeData file at http://www.unicode.org/Public/6.0.0/ucd/UnicodeData-6.0.0d5.txt, the following code point:
1F521;INPUT SYMBOL FOR LATIN SMALL LETTERS;So;0;On;;;;;N;;;;;
has bidi class "On" instead of "ON".
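A case-sensitive check of the Bidi_Class field would catch this kind of slip mechanically. The sketch below assumes the semicolon-separated UnicodeData.txt layout, in which the bidirectional class is the fifth field. The set of class names is the one in use as of Unicode 6.0, and the function name is hypothetical.

```python
# Sketch of a validity check for the Bidi_Class field of a UnicodeData.txt line.
# Assumption: field index 4 holds the bidi class; the set lists the classes
# defined as of Unicode 6.0.
BIDI_CLASSES = {
    "L", "LRE", "LRO", "R", "AL", "RLE", "RLO", "PDF", "EN", "ES",
    "ET", "AN", "CS", "NSM", "BN", "B", "S", "WS", "ON",
}


def check_bidi_class(line):
    """Return an error message for an unknown bidi class, or None if it is valid."""
    fields = line.rstrip("\n").split(";")
    code, name, bidi = fields[0], fields[1], fields[4]
    if bidi in BIDI_CLASSES:
        return None
    return f"{code} ({name}): unknown Bidi_Class {bidi!r}"


if __name__ == "__main__":
    beta_line = "1F521;INPUT SYMBOL FOR LATIN SMALL LETTERS;So;0;On;;;;;N;;;;;"
    print(check_bidi_class(beta_line))
```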
Date/Time: Thu Jun 10 04:12:55 CDT 2010
Contact: andrewcwest@gmail.com
Name: Andrew West
Report Type: Public Review Issue
Opt Subject: Unicode 6.0 Beta Data Files
UnicodeData-6.0.0d5.txt
1F521;INPUT SYMBOL FOR LATIN SMALL LETTERS;So;0;On;;;;;N;;;;;
should be
1F521;INPUT SYMBOL FOR LATIN SMALL LETTERS;So;0;ON;;;;;N;;;;;
Scripts-6.0.0d6.txt
Missing script assignments for 1,732 out of the 2,087 new characters. It would be helpful for beta testing to have a version of Scripts.txt that covers all of Unicode 6.0 at the earliest opportunity.
Date/Time: Sun Jun 13 20:11:19 CDT 2010
Contact: andrewcwest@gmail.com
Name: Andrew West
Report Type: Public Review Issue
Opt Subject:
ScriptExtensions-6.0.0d7.txt
# Script_Extensions=Bopo Hang Hani Hira Kana Phag Tibt Yiii
3001..3002 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Po [2] IDEOGRAPHIC COMMA..IDEOGRAPHIC FULL STOP
3008 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps LEFT ANGLE BRACKET
3009 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe RIGHT ANGLE BRACKET
300A ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps LEFT DOUBLE ANGLE BRACKET
300B ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe RIGHT DOUBLE ANGLE BRACKET
300C ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps LEFT CORNER BRACKET
300D ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe RIGHT CORNER BRACKET
300E ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps LEFT WHITE CORNER BRACKET
300F ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe RIGHT WHITE CORNER BRACKET
3010 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps LEFT BLACK LENTICULAR BRACKET
3011 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe RIGHT BLACK LENTICULAR BRACKET
3014 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps LEFT TORTOISE SHELL BRACKET
3015 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe RIGHT TORTOISE SHELL BRACKET
3016 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps LEFT WHITE LENTICULAR BRACKET
3017 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe RIGHT WHITE LENTICULAR BRACKET
3018 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps LEFT WHITE TORTOISE SHELL BRACKET
3019 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe RIGHT WHITE TORTOISE SHELL BRACKET
301A ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps LEFT WHITE SQUARE BRACKET
301B ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe RIGHT WHITE SQUARE BRACKET
30FB ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Po KATAKANA MIDDLE DOT
FF61 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Po HALFWIDTH IDEOGRAPHIC FULL STOP
FF62 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Ps HALFWIDTH LEFT CORNER BRACKET
FF63 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Pe HALFWIDTH RIGHT CORNER BRACKET
FF64..FF65 ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Po [2] HALFWIDTH IDEOGRAPHIC COMMA..HALFWIDTH KATAKANA MIDDLE DOT
This script set is problematic:
1. Although I have seen Tibetan texts published during the Cultural Revolution period that use Chinese comma and period marks, Tibetan text does not normally use any of the non-bracket characters listed above, and I think it is a mistake to include Tibetan in this set.
2. Phags-pa is an historic script (or in modern Tibetan usage a decorative script), so never normally uses modern punctuation marks (it normally does not use any punctuation marks at all). It does sometimes use the ideographic full stop (as that is an ancient punctuation mark) but does not normally use any of the other characters listed above. (Of course, if one wanted, any of the bracket characters could be used for Phags-pa, but that is true for any script ... which is why such characters are sensibly left simply as "common".)
3. Any of the bracket characters could reasonably be used (and probably have been used) for any script used within China, which would make this a lot longer set of scripts than it currently is. I see no advantage in trying to enumerate all or any of the open-ended set of scripts that could use characters that have the "common" script property.
In my opinion any character that is used or potentially used in more than two or three scripts should be simply left as common as any enumeration of scripts that use such characters will inevitably be inaccurate and subject to change and fragmentation as new scripts are encoded. I would strongly recommend removing the "Bopo Hang Hani Hira Kana Phag Tibt Yiii" set from this document as it is more harmful than helpful. Personally I would rethink the whole exercise, and remove the file entirely from Unicode 6.0 for further study.
Date/Time: Wed Jul 21 14:41:08 CDT 2010
Contact: markus.icu@gmail.com
Name: Markus Scherer
Report Type: Public Review Issue
Opt Subject: Unicode 6 Script_Extensions
I see the new Script_Extensions property defined in the new ScriptExtensions.txt file, but not listed in PropertyAliases.txt.
Problem: I will need to implement this property and want to use an abbreviation. Please list the property in PropertyAliases.txt and define an abbreviation (I propose "scx" as in [:scx=Arab:]) so that implementers do not have to invent their own, incompatible names for this property.
For example:
# ================================================
# Miscellaneous Properties
# ================================================
scx ; Script_Extensions
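To make the request concrete, here is a rough sketch of how an implementation might read an alias line of the proposed form together with a ScriptExtensions.txt data line. It assumes the usual UCD conventions (fields separated by semicolons, "#" starting a comment, ".." marking a code point range). Both function names are hypothetical, and "scx" is the abbreviation proposed above, not an established alias.

```python
# Sketch of parsing UCD-style lines; function names are illustrative assumptions.

def parse_alias_line(line):
    """Split 'scx ; Script_Extensions' into (short, long); None for comment or blank lines."""
    content = line.split("#", 1)[0].strip()
    if not content:
        return None
    fields = [f.strip() for f in content.split(";")]
    return fields[0], fields[1]


def parse_scx_line(line):
    """Return (range of code points, frozenset of script codes); None for comment or blank lines."""
    content = line.split("#", 1)[0].strip()
    if not content:
        return None
    cp_field, scripts_field = [f.strip() for f in content.split(";")]
    if ".." in cp_field:
        first, last = cp_field.split("..")
    else:
        first = last = cp_field
    code_points = range(int(first, 16), int(last, 16) + 1)
    return code_points, frozenset(scripts_field.split())


if __name__ == "__main__":
    print(parse_alias_line("scx ; Script_Extensions"))
    cps, scripts = parse_scx_line("30FB ; Bopo Hang Hani Hira Kana Phag Tibt Yiii # Po KATAKANA MIDDLE DOT")
    print([f"{cp:04X}" for cp in cps], sorted(scripts))
```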
Date/Time: Wed Jul 28 03:45:50 CDT 2010
Contact: kent.karlsson14@telia.com
Name: Kent Karlsson
Report Type: Error Report
Opt Subject: confusing example in "nameslist"
0970 DEVANAGARI ABBREVIATION SIGN
* intended for Devanagari-specific abbreviations, such as the rupee sign
Surely that does not apply to U+20A8 RUPEE SIGN, nor to the new rupee symbol that is to be encoded. Either use another example, or spell out that example more clearly (like "... such as when abbreviating rupee as <whatever is used as that abbreviation>").
Date/Time: Thu Jul 29 17:57:46 CDT 2010
Contact: asmus@unicode.org
Name: optional
Report Type: Public Review Issue
Opt Subject: Improvement for Section 4.6
Problem:
Description of much of the numeric use of characters is scattered across the standard.
Solution:
Create one place where links to existing descriptions are collected.
I suggest that a rather complete summary of all exceptional cases of numerical character use be maintained in a central place (section 4.6 comes to mind). This would provide implementers with the information they need to track down unusual behavior.
Currently, the information is not easily locatable, other than scanning all chapters or searching for keywords (which may not be selective enough or not applied with enough consistency). By its nature, the information is useful to implementers of cross-script (or script neutral) implementations.
A table of links (references) is sufficient for this purpose - there's no need to duplicate any description or add a lot of explanatory text.
The table could be broken down into two main sections.
The first section would link to descriptions of unusual behavior for characters used as decimal-radix digits.
Examples of this are:
a) Arabic using two complete series of digits
b) New Thai Lue using an extra digit 1
c) Han digits being scattered and used in two different types of numeric expressions
d) ASCII digits being used for some scripts as preferred decimal-radix digits, because their native number system is not, or not exclusively, decimal-radix (Han, Ethiopic, ...)
e) Compatibility characters based on decimal digits
A separate section of the table should pull together links to all the descriptions of non-decimal radix number systems that are discussed in the Standard, such as
a) Roman numerals
b) Greek characters used as numerals
c) Ethiopic numerals
d) ....
Date/Time: Mon Aug 2 14:23:45 CDT 2010
Contact: karl-pentzlin@acssoft.de
Name: Karl Pentzlin
Report Type: Public Review Issue
Opt Subject: Unicode 6.0 beta: some nitpicking about Named Character Sequences
Unicode 6.0 beta: some nitpicking about Named Character Sequences:
1.) The term "Named Character Sequence" is not used consistently. In the Unicode 5.2 Standard text, the term "Named Sequence" is used instead in some places. This also applies to the Pipeline table.
2.) On p.149 of the Unicode 5.2 standard text, the 6th paragraph reads: "The names for named character sequences are also immutable. Once assigned, they will never be changed in subsequent versions of the Unicode standard." I propose adding that the Named Character Sequences themselves (i.e. the sequences of characters that constitute them) will likewise never be changed or deleted (once approved beyond provisional status).
Date/Time: Thu Aug 5 09:29:58 CDT 2010
Contact: as@signographie.de
Name: Andreas Stötzner
Report Type: Public Review Issue
Opt Subject: 26FC
The representative glyph is specifically Japanese, whereas the name and description are general. In cultures other than Japan, however, a cemetery sign needs a totally different rendering, e.g. in Europe most likely crosses instead of turned T-shapes.
Suggested solution:
a) adding JAPANESE to the name;
or
b) adding an annotation: "glyph may show cruciform or other shapes instead"
Date/Time: Thu Aug 5 01:30:29 CDT 2010
Contact: satai@akauri.com
Name: Alex Ostrovski
Report Type: Error Report
Opt Subject: Georgian and Armenian punctuation error
Greetings, gentlemen!
There is an error in the comments in the Armenian code chart (U0530) and in the Georgian section of The Unicode Standard (Chapter 7. European Alphabetic Scripts), both in the published Unicode 5.2.0 and in the Unicode 6.0.0 beta.
The error concerns Georgian punctuation, so it is necessary to note that Georgian uses exactly the same punctuation system as "modern" common punctuation systems in English, French or Russian.
ISSUE
1) In the Armenian block (U0530), U+0589 ARMENIAN FULL STOP (։) has the comment «may also be used for Georgian». This is not correct: Georgian uses the "standard" dot to indicate a full stop, and the Armenian character (which is visually identical to a colon) cannot be used in Georgian for this purpose. Moreover, Georgian uses the colon in exactly the same cases as English or Russian.
2) Chapter 7.7 of The Unicode Standard 5.2 (European Alphabetic Scripts, Georgian, pg.222) states: «Other Punctuation. For the Georgian full stop, use U+0589 ARMENIAN FULL STOP or U+002E FULL STOP». This misleads readers: Georgian uses ASCII Punctuation and General Punctuation characters, but not Armenian ones (those form a separate system of punctuation with its own logic and rules, all of which can also be found in the Armenian section, chapter 7.6, pg.220-221).
Same issues are present in corresponding Unicode 6.0 documents.
I can provide all the required references and proof materials.
SOLUTION
To resolve this issue, the following should be done:
1) Remove the comment «may also be used for Georgian» from U+0589 ARMENIAN FULL STOP in the Armenian block (U0530).
2) Replace the phrase «Other Punctuation. For the Georgian full stop, use U+0589 ARMENIAN FULL STOP or U+002E FULL STOP» with «Other Punctuation. For the Georgian full stop, use U+002E FULL STOP», or remove the «Other Punctuation» section altogether (since Georgian uses "common" punctuation, there will be no confusion).
Best regards,
Alex.
No feedback was received via the reporting form this period.
Date/Time: Sat Jul 3 05:21:17 CDT 2010
Contact: ezio.melotti@gmail.com
Name: Ezio Melotti
Report Type: Other Question, Problem, or Feedback
Opt Subject: Best practices for using U+FFFD
Hi, at page 95 of http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf there's a section about the best practices for using U+FFFD. There is however a corner case that is not considered. In some cases some of the continuation bytes are considered invalid depending on the value of the start byte, as described in table 3.7 at page 93 of the same document. For example, \x80 is a valid continuation byte, but not if it's the second byte and the start byte is \xE0 or \xF0.
So assume the UTF-8 sequence \xE0\x80\x81\x61, how should it be converted? The possible cases I can think of are: 1) replace the start byte and all the continuation bytes in range 80..BF with U+FFFD, i.e. U+FFFD U+0061; 2) replace the start byte and all the continuation bytes valid for that byte, i.e. U+FFFD U+FFFD U+FFFD U+0061, because \x80 is not valid when the start byte is \xE0.
What is the best option?
A point in favor of the first case is that bytes in range 80..BF are invalid when they are not used as continuation bytes, so there's no reason not to include them with the start byte and replace them all with a single U+FFFD. This also makes the code simpler. OTOH that document says: "A sequence of code units will be processed up to the point where the sequence either can be unambiguously interpreted as a particular Unicode code point or where the converter recognizes that the code units collected so far constitute an ill-formed subsequence. At that point, the converter can emit a single U+FFFD". So following this text it should stop at \x80 because the sequence is known to be invalid. I think that this corner case has not been considered, and that the section should be updated with another example.
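The two options can be compared side by side with a small hand-written decoder. This is only a sketch for illustration: the function names and the `greedy` flag are hypothetical, the second-byte ranges are the restricted ones from the table of well-formed UTF-8 byte sequences cited above, and option 1 is assumed to swallow no more continuation bytes than the start byte calls for.

```python
# Sketch comparing two U+FFFD replacement policies; names are illustrative.
REPLACEMENT = "\ufffd"


def expected_trail(lead):
    """Number of continuation bytes a lead byte calls for, or None if it is not a lead byte."""
    if 0xC2 <= lead <= 0xDF:
        return 1
    if 0xE0 <= lead <= 0xEF:
        return 2
    if 0xF0 <= lead <= 0xF4:
        return 3
    return None


def second_byte_range(lead):
    """Allowed range for the byte right after `lead` (restricted for E0, ED, F0, F4)."""
    return {0xE0: (0xA0, 0xBF), 0xED: (0x80, 0x9F),
            0xF0: (0x90, 0xBF), 0xF4: (0x80, 0x8F)}.get(lead, (0x80, 0xBF))


def decode(data, greedy):
    """Decode UTF-8 bytes; greedy=True is option 1, greedy=False is option 2."""
    out, i, n = [], 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:                      # ASCII byte
            out.append(chr(lead))
            i += 1
            continue
        need = expected_trail(lead)
        if need is None:                     # stray continuation or invalid byte
            out.append(REPLACEMENT)
            i += 1
            continue
        j, ok = i + 1, True
        for k in range(need):
            lo, hi = second_byte_range(lead) if k == 0 else (0x80, 0xBF)
            if j < n and lo <= data[j] <= hi:
                j += 1
            else:
                ok = False
                break
        if ok:                               # well-formed sequence
            out.append(data[i:j].decode("utf-8"))
        else:
            if greedy:                       # option 1: also swallow generic 80..BF bytes
                j = i + 1
                while j < n and j - i <= need and 0x80 <= data[j] <= 0xBF:
                    j += 1
            out.append(REPLACEMENT)          # option 2 stops at the first offending byte
        i = j
    return "".join(out)


if __name__ == "__main__":
    sample = b"\xe0\x80\x81\x61"
    print([f"U+{ord(c):04X}" for c in decode(sample, greedy=True)])
    print([f"U+{ord(c):04X}" for c in decode(sample, greedy=False)])
```

For the sequence \xE0\x80\x81\x61, the greedy policy yields U+FFFD U+0061, while stopping at the first offending byte yields U+FFFD U+FFFD U+FFFD U+0061.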
This problem came out while fixing a Python issue: http://bugs.python.org/issue8271#msg109155
Best Regards,
Ezio Melotti
No feedback was received via the reporting form this period.
Date/Time: Fri Jul 16 18:28:49 CDT 2010
Contact: seanparry@ireland.com
Name: Sean Parry
Report Type: Error Report
Opt Subject: The Mercian Thorn is incorrectly omitted from the Unicode system
Dear Sir / Madam,
If I am correct I note that you do not currently have the Mercian Thorn in either upper or lower case represented within the Unicode system.
The Mercian Thorn is distinctive and unique in the Anglo-Saxon period in having the down stroke on the curved back of the barred upper case D rather than the standard form of the upper case Thorn with the cross bar on the front of the D. As such, the Mercian Thorn in both upper and lower case should certainly be added to the Unicode system.
I very much look forward to hearing from you further over this matter.
Sean Parry
* NOTE: Rick is already in correspondence with the author and has requested examples, etc.
Date/Time: Sat Jul 17 07:52:37 CDT 2010
Contact: manirajbaruah@yahoo.com
Name: Maniraj Baruah
Report Type: Other Question, Problem, or Feedback
Dear Sirs,
The Assamese character ক্ষ is not included in the Unicode charts. At the moment we are using it as a conjunct or ligature, with the sequence ক ্ ষ.
Will this not create sorting problems?
As per the Assamese charts ক্ষ should appear before ড় and after হ.
* NOTE: This has already been taken care of by the ed committee, and acknowledged.
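The sorting concern in the comment above can be illustrated with a small sketch. Plain code point order places the conjunct sequence (U+0995 U+09CD U+09B7) right after KA, whereas a tailored comparison can treat the three code points as one unit ordered after HA (U+09B9). The key function below is a hypothetical illustration of that idea, not the UCA algorithm or any library's tailoring API, and it assumes the precomposed RRA (U+09DC).

```python
# Sketch of a contraction-style sort key; names and approach are illustrative.
KA = "\u0995"
HA = "\u09B9"
RRA = "\u09DC"
KSSA = "\u0995\u09CD\u09B7"  # KA + VIRAMA + SSA, the conjunct sequence


def sort_key(text):
    """Build a key in which the KSSA sequence sorts immediately after HA."""
    key = []
    i = 0
    while i < len(text):
        if text.startswith(KSSA, i):
            key.append((ord(HA), 1))      # one unit, placed just after HA
            i += len(KSSA)
        else:
            key.append((ord(text[i]), 0))  # every other character: code point order
            i += 1
    return tuple(key)


if __name__ == "__main__":
    words = [RRA, KSSA, HA, KA]
    print(sorted(words))                 # code point order
    print(sorted(words, key=sort_key))   # tailored order
```

A real implementation would express the same thing as a collation tailoring (a contraction with its own primary weight) rather than a hand-built key.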