The sections below contain links to permanent feedback documents for the open Public Review Issues as well as other public feedback as of January 24, 2018, since the previous cumulative document was issued prior to UTC #153 (October 2017). Some items in the Table of Contents do not have feedback here.
The links below go directly to open PRIs and to feedback documents for them, as of January 24, 2018.
The links below go to locations in this document for feedback.
Feedback to UTC / Encoding Proposals
Feedback on UTRs / UAXes
Error Reports
Other Reports
Note: The section of Feedback on Encoding Proposals this time includes:
L2/98-045
L2/02-388
L2/11-175R
L2/14-153
L2/15-004R
L2/17-077
L2/17-106R
L2/17-340
L2/17-345
L2/18-004
L2/18-010
L2/18-015
L2/18-017
L2/18-018
L2/18-019
L2/18-020
L2/18-025
L2/18-041
Date/Time: Sat Nov 11 19:06:19 CST 2017
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal (L2/17-345)
Opt Subject: MIAO SIGN CONSONANT MODIFIER BAR name
The name of this sign does not indicate its function clearly, and the fact that this sign is a combining character is not expressed. A better name would be MIAO SIGN COMBINING CONTRAST OF ARTICULATION. If the committee does not believe that a spacing version of this character will be proposed, then one could drop the term "combining" to get MIAO SIGN CONTRAST OF ARTICULATION.
Date/Time: Thu Dec 14 08:17:43 CST 2017
Contact: srinidhi.pinkpetals24@gmail.com
Name: Srinidhi A
Report Type: Feedback on an Encoding Proposal
Opt Subject: Feedback on L2/17-340
The document makes four requests. Requests 2, 3 and 4 are acceptable, but I feel that request 1 is not necessary.
Request 1: The chart for Common Indic Number Forms needs to add following to U+A830, U+A830, and U+A830: ● Used in Malayalam also
Indic fractions are used in more than 15 Indic scripts. Since they are used in many scripts, annotating them only for Malayalam may not be appropriate. Currently, the characters are named NORTH INDIC; however, they are not limited to Northern India. At the start of the code chart, below 'Number forms', it may be mentioned that 'Fractions are also used in several scripts of South India', as indicated on page 802 of the Core Specification.
Date/Time: Sun Jan 7 21:41:59 CST 2018
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal (L2/18-010)
Opt Subject: On the digits for the Khwarezmian script
It is argued in the proposal L2/18-010 that separate encoding of the numbers one, two, three and four is merited because other similar Middle Eastern scripts use them. However, the case can be made that those scripts needed the separate encoding because the individual instances of the digit one are so close together that one would need kerning to get the right glyph in digital contexts. In cases like Old Sogdian, the glyphs fuse even though it is not a joining script. That is not to say that I would necessarily be against encoding the digits two, three and four separately. If it's true that the numbers 5-9 are represented in groups of 2, 3 and 4, then it would be convenient for implementers to have them separate, because they wouldn't have to deal with a somewhat inconsistent set of rules for introducing the space. Then again, the author only represents such numbers using repetitions of one and spaces.
Date/Time: Wed Jan 10 17:54:00 CST 2018
Name: David Corbett
Report Type: Feedback on an Encoding Proposal
Opt Subject: Inconsistency in L2/18-010
L2/18-010 “Proposal to encode the Khwarezmian script in Unicode” contradicts itself about how to encode numbers: section 4.2 uses dedicated code points for 2 through 4 and section 5.3 uses repeated instances of the number 1.
Date/Time: Thu Jan 11 08:30:48 CST 2018
Name: Ken Lunde
Report Type: Feedback on an Encoding Proposal
Opt Subject: Feedback for L2/18-004 & L2/18-017
This report is feedback for L2/18-004 and L2/18-017.

With regard to the six emphasized modern hangul syllables in the "Emphazised Hangul syllables" section, one work-around in lieu of encoding is to implement them via OpenType 'ccmp' GSUB feature using chaining contextual substitutions, as demonstrated using the code below that is in AFDKO "feature" file syntax:

lookup DPRK_SUPREME_LEADERS {
  substitute uniAE40 by uniAE40.emphasis;
  substitute uniC131 by uniC131.emphasis;
  substitute uniC740 by uniC740.emphasis;
  substitute uniC77C by uniC77C.emphasis;
  substitute uniC815 by uniC815.emphasis;
} DPRK_SUPREME_LEADERS;

feature ccmp {
  substitute uniAE40' lookup DPRK_SUPREME_LEADERS uniC77C' lookup DPRK_SUPREME_LEADERS uniC131' lookup DPRK_SUPREME_LEADERS;
  substitute uniAE40' lookup DPRK_SUPREME_LEADERS uniC815' lookup DPRK_SUPREME_LEADERS uniC77C' lookup DPRK_SUPREME_LEADERS;
  substitute uniAE40' lookup DPRK_SUPREME_LEADERS uniC815' lookup DPRK_SUPREME_LEADERS uniC740' lookup DPRK_SUPREME_LEADERS;
} ccmp;

The 'ccmp' GSUB feature is broadly implemented, is on by default, and cannot be toggled off.

With regard to the character in the "Enclosed postal mark symbol" section, it is present in Supplement 4 of Adobe's Adobe-Japan1-6 character collection (aka Japanese glyph set) as CID+12180. See the "Adobe-Japan1.6.pdf" PDF file here: https://github.com/adobe-type-tools/Adobe-Japan1/ Its source is Morisawa's glyph sets, MOR-CODE and MOR-CODE 2. Morisawa is Japan's leading type foundry.
Date/Time: Thu Jan 11 12:54:23 CST 2018
Name: David Corbett
Report Type: Feedback on an Encoding Proposal
Opt Subject: Feedback on Bopomofo i variants
According to L2/18-020 “Proposal to define Standardized Variation Sequences for BOPOMOFO LETTER I”, “The vertical bar form was used for the initial(聲母), and the horizontal stroke form was used for the final(韻母) at the early usage [...] Therefore, these two forms are different and should be distinguished for the early usage or the study on the reform of the Chinese Hanzi.” That distinction should not be encoded by a variation sequence. A variation sequence would be appropriate for just the modern usage, but not for the early usage where the two glyphs contrasted.
Date/Time: Fri Jan 12 21:40:11 CST 2018
Contact: nobody_uses@outlook.com
Name: Eduardo Marín Silva
Report Type: Feedback on an Encoding Proposal
Opt Subject: Name of newly proposed Malayalam character (L2/18-015)
From the attestations provided and the description given, the MALAYALAM END OF TEXT MARK does not appear to be used consistently at the end of documents like its name would imply, but rather at the end of sections (including the final one) so a more fitting name would be MALAYALAM SECTION MARK.
Date/Time: Tue Jan 16 05:38:37 CST 2018
Name: Christoph Päper
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/18-018 Chess Emoji
The proposal [L2/18-018] is asking for a single emoji to represent the game of chess, but is rather unclear about exactly how this should look like (e.g. a checkered board) and why. Most other games and sports are represented as emojis by one or two characteristic utensils or, if that didn't seem viable, by a human-like figure exercising this activity with the tools required (e.g. Handball Player because a prototypical Handball is not sufficiently distinct from a Football or Volleyball). Some sports have two or more applicable emojis, e.g. Golf. In particular, the comparable game of Mahjong is represented by a single piece U+1F004 🀄 and all card games are represented by either a card back U+1F3B4 🎴, by a representative card face U+1F0CF 🃏 or by symbols of the four standard card suits ♣️♠️♥️♦️. The Joker Card and the Red Dragon Mahjong Piece are parts of larger sets of characters in their respective blocks which can be used for game notation and diagrams. Chess pieces have also long been available in Unicode for movement notation U+2654..F ♔♕♖♗♘♙♚♛♜♝♞♟. On Samsung devices, Unicode chess figure characters are rendered with emoji glyphs (i.e. centered within a square), which are more appropriate for board diagrams (in combination with emojis U+2B1B/C ⬛⬜ for empty board tiles) than for game notation. According to <http://cgi.wap2.jp/emoji/ezweb/?act=new_pict>, some (Sanyo/Kyocera etc.) phones distributed by KDDI in Japan from 2008 had the 12 standard chess pieces as emojis #577..588 (perhaps PUA U+F11F..F12B), but apparently not directly accessible for user input. It seems as if they have not been considered at all during the original "Emoji 4 Unicode" process. It is unclear whether that has been a deliberate choice or mere oversight. Given the Mahjong precedent in particular, someone at UTC might suggest to assign the `Emoji` property to one of the twelve existing chess figures (not counting upcoming fairy chess additions) and make this the general emoji for Chess. 
I must strongly advise against this option in foresight. This would surely disrupt the normal use of chess piece characters in some cases because variation selectors are often inserted by input methods in a way opaque to the user, or are ignored by output systems altogether (e.g. on Twitter), forcing emoji display if at all possible.

If all standard chess pieces were emojified, however, users could pick one (or a larger selection) of them to represent the game itself, much like they can do with card suit emojis for arbitrary card games. They could also use them to draw board diagrams almost as desired and described by Michael Everson in [L2/17-077] but with VS-16 instead of VS-1/2 and without explicitly encoding the color of the underlying board tile. A basic diagram fits within a tweet on Twitter and people have been doing this with other emojis as stand-ins. Physical and virtual chess boards also frequently feature alternative designs for the pieces. @emojichess is a bot that generates bogus positions, but consider [Ocean Chess], [Plant Chess] or [Food Chess]. People use emojis to picture or even play related board games as well, e.g. [Checkers]. I believe this complete emojification would ultimately be the best option for chess players and other emoji users.

The other pair of valid options is either accepting or rejecting the proposal for a dedicated chess emoji separate from other chess-related characters in Unicode.

[Ocean Chess]: https://twitter.com/i/moments/929378210776797184
[Plant Chess]: https://twitter.com/IHStreet/status/929421079998758912
[Food Chess]: https://twitter.com/queeryeverythng/status/929941452146331649
[Checkers]: https://twitter.com/MissLilyRowan/status/928765637714989056
[L2/17-077]: http://www.unicode.org/L2/L2017/17077r-n4793r-chessboard.pdf
[L2/18-018]: http://www.unicode.org/L2/L2018/18018-chess-emoji.pdf
Date/Time: Fri Jan 19 12:39:54 CST 2018
Name: David Corbett
Report Type: Feedback on an Encoding Proposal
Opt Subject: Comparison of L2/17-106R and L2/18-041
L2/18-041 “Request to Add Thai Characters to ISO/IEC 10646/Unicode” proposes some characters for Tai Noi in the Thai block, but they should be unified with Lao. The glyphs look more like Lao than Thai, which makes sense as Lao is a simplification of Tai Noi, with obsolete characters removed. However, if L2/17-106R “Revised Proposal to Encode Lao Characters for Pali” is accepted, the Lao block will contain most of the characters needed by ISO 20674-1. Therefore, Tai Noi should be proposed as an extension to the Lao block. Since the Lao characters proposed in L2/17-106R are apparently not used only for Pali and Sanskrit, their character names should not include “PALI” and “SANSKRIT”.
Date/Time: Tue Jan 23 01:52:00 CST 2018
Name: John Cowan
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/18-019 Poison Emoji
I believe this should be an emojification of U+2620 SKULL AND CROSSBONES. It narrows the semantics (excludes pirates, e.g.), but that is a minor matter.
Date/Time: Tue Jan 23 17:32:33 CST 2018
Name: Richard Wordingham
Report Type: Feedback on an Encoding Proposal
Opt Subject: L2/18-041 (Encoding Tai Noi in Thai Script)
L2/18-041 "Request to Add Thai Characters (WG2 N4927)" is a proposal to encode Tai Noi (= Lao Buhan), the old form of the Lao script, as part of the Thai script. It should be read in conjunction with L2/18-042 to see the complete encoding. Most Tai Noi characters are identified with the cognate Thai character.

0. Non-Linguistic Background Notes

The transliteration scheme is an extension of ISO 11940. I think it is worth noting that under ISO 11940, ฯ U+0E2F THAI CHARACTER PAIYANNOI is transliterated differently according to whether it is serving as an abbreviation mark (which is what the Unicode name designates) or as a minor section break ("angkhandeaw") in contrast to the major section break ๚ U+0E5A THAI CHARACTER ANGKHANKHU. It is therefore not essential for the transliteration scheme that Unicode make the distinction that the transliteration does.

1. Which Script?

If it is appropriate to encode Tai Noi as part of any existing script, and I think it is, it is appropriate to encode it as part of the Lao script, and revivalist/popular antiquarian enthusiasts I was aware of were using the Lao script. If it is appropriate to encode it as part of the Thai script, then it is also appropriate to encode it as part of the Lao script. If Unicode decides to encode it as part of the Thai script, I believe we should also encode it as part of the Lao script. There might be good *political* reasons to encode it twice.

2. Character Identity

The transliteration scheme makes three simple character distinctions without justifying them. Referring to Table 6, the distinctions are:

14 v. 69, transliterated as t̄h v. t̄h′. The former is identified with Thai 'ถ', while a new character *U+0E63 THAI NOI CHARACTER THA5 is proposed for the latter. We should ask for evidence that these are different characters.

28 v. 73, transliterated as l v. l′. The former is identified with Thai 'ล', though it is the latter that is closer to modern Thai ล and Lao ລ. A new character *U+0E67 THAI NOI CHARACTER LA3 is proposed for the latter.

32 v. 74, transliterated as s̄ v. s̄′. The former is identified with Thai 'ส', though it is the latter that is closer to modern Thai ส and Lao ສ. A new character *U+0E68 THAI NOI CHARACTER SA6 is proposed for the latter.

I do not believe we should encode the characters with '+' in the suggested names. They are ligatures, and would better be encoded as, for Thai:

<[U+0E2B THAI CHARACTER HO HIP / U+0E02 THAI CHARACTER KHO KHAI / U+0E04 THAI CHARACTER KHO KHWAI / U+0E16 THAI CHARACTER THO THUNG / U+0E2A THAI CHARACTER SO SUA], U+200D ZERO WIDTH JOINER, [U+0E19 THAI CHARACTER NO NU / U+0E21 THAI CHARACTER MO MA]>

This was the conclusion of the analysis at https://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html . There was some discussion of the encoding in the general list thread starting at https://unicode.org/mail-arch/unicode-ml/y2014-m03/0202.html . Similar conclusions apply for Lao, except that U+0EDC LAO HO NO and U+0EDD LAO HO MO are already encoded. However, while new characters are not merited, we should make these ligatures named sequences.

Glyph 78, transliterated as x′y, is proposed as the basis of *U+0E6C THAI NOI CHARACTER O+YA. It corresponds to Lao ຢ U+0EA2 LAO LETTER YO and functionally to the syllable-initial Thai sequence อย. If Tai Noi is to be encoded in the Thai script, this character should be encoded.

Glyph 79, transliterated as xy′, is proposed as the basis of *U+0E6D THAI NOI CHARACTER O+YA. It corresponds to Lao ຽ in its obsolete rôle as both vowel and final consonant, and is functionally equivalent to the Thai rime อย. The transliteration standard refers to them as allographs because they have the same Thai transliteration, but as characters they are as different as 'c' and 'g'.

We need evidence that glyph 48, the proposed *U+0E3D THAI NOI CHARACTER YA2, transliterated as ỵ, and glyph 71, the proposed *U+0E65 THAI NOI CHARACTER YA3, transliterated as ỵ′, are distinct characters. The former corresponds to one glyph of modern Lao ຽ. Note that both glyphs 48 and 79 correspond to the same encoded Lao character; we need evidence that the glyphs correspond to different characters in Tai Noi. If it is forthcoming, we may be faced with the unpleasant desirability of disunifying the two glyphs of U+0EBD LAO SEMIVOWEL SIGN NYO. We also have to consider the relationship with the sequence <U+1A60 TAI THAM SIGN SAKOT, U+1A3F TAI THAM LETTER LOW YA>.

The requested *U+0E6E THAI NOI SARA A2 appears to be a glyph variant of U+0E30 THAI CHARACTER SARA A / U+0EB0 LAO VOWEL SIGN A.

3. Other Subscript Consonants

Proposed *U+0E69 THAI NOI CHARACTER SA7 is the sequence <U+1A60 TAI THAM SIGN SAKOT, U+1A47 TAI THAM LETTER HIGH SSA>. Khun Theppitak's analysis mentioned above records many more. How we encode them is debatable. Perhaps we should even create a set of final consonant marks to be shared between Thai, Lao and Tai Tham that will enable the USE to render Tai Tham happily! (I, for one, don't like that solution.) A countervailing principle is the separation of scripts.
Date/Time: Wed Jan 24 12:00:08 CST 2018
Name: Theppitak Karoonboonyanan
Report Type: Feedback on an Encoding Proposal
Opt Subject: Comments on L2/18-041 "Request to Add Thai Characters"
As Tai Noi (aka. Lao Buhan) is an old Lao script which has evolved into contemporary Lao script, I think it is more appropriate to add the characters to Lao block than to Thai. Doing so could save many code points which are already encoded there, such as:

- U+0E3B THAI NOI MAI KONG ~ U+0EBB LAO VOWEL SIGN MAI KON
- U+0E3C THAI NOI CHARACTER LA2 ~ U+0EBC LAO SEMIVOWEL SIGN LO
- U+0E3D THAI NOI CHARACTER YA2 ~ U+0EBD LAO SEMIVOWEL SIGN NYO
- U+0E5C THAI NOI CHARACTER HA1+NA ~ U+0EDC LAO HO NO
- U+0E5D THAI NOI CHARACTER HA1+MA ~ U+0EDD LAO HO MO
- U+0E63 THAI NOI CHARACTER THA5 ~ U+0E96 LAO LETTER THO SUNG
- U+0E65 THAI NOI CHARACTER YA3 ~ U+0EBD LAO SEMIVOWEL SIGN NYO
- U+0E67 THAI NOI CHARACTER LA3 ~ U+0EA5 LAO LETTER LO LOOT
- U+0E68 THAI NOI CHARACTER SA6 ~ U+0EAA LAO LETTER SO SUNG
- U+0E6C THAI NOI CHARACTER O+YA ~ U+0EA2 LAO LETTER YO
- U+0E6E THAI NOI SARA A2 ~ U+0EB0 LAO VOWEL SIGN A

The "~" sign above means the former character is already encoded as the latter, while some character pairs may have different shapes due to the evolution. Note that the proposed U+0E3D THAI NOI CHARACTER YA2 and U+0E65 THAI NOI CHARACTER YA3 are considered the same character with different styles.

With this number of duplications, it should be obvious that Tai Noi should belong in Lao block rather than in Thai.

It is worth noting that U+0E6D THAI NOI CHARACTER O+YA is analogous to U+1A6D TAI THAM VOWEL SIGN OY. Probably, it deserves a more sensible name like "THAI NOI VOWEL SIGN OY" or "THAI NOI SARA OY".

I have collected more issues found while reading Tai Noi manuscripts here: https://linux.thai.net/~thep/esaan-scripts/tn-issues/tn-encoding.html

Yours sincerely, Theppitak Karoonboonyanan.
Date/Time: Tue Nov 28 13:12:38 CST 2017
Name: Solra Bizna
Report Type: Error Report
Opt Subject: UAX #9, rule X8 could be more clear
> X8. All explicit directional embeddings, overrides and isolates are
> completely terminated at the end of each paragraph. Paragraph separators are
> not included in any embedding, override or isolate, and are thus assigned
> the paragraph embedding level.

This rule specifies an important, but easy to miss, behavior. In every other usage of "end of paragraph" in the document, straightforwardly including paragraph separators as part of the preceding paragraph results in the correct behavior. However, if this definition is used when applying X8, the paragraph separator might end up being assigned the current embedding level rather than the paragraph embedding level, or not being assigned any embedding level at all.

The second sentence of the rule does in fact explicitly state that paragraph separators are assigned the paragraph embedding level. However, it's phrased in a way that makes it seem like it's clarifying consequences of existing rules, rather than specifying a new rule. It also does not refer to paragraph separators as `B`, which means that someone who has just read rule X6 (which excludes `B`) and is now searching the document for a rule that assigns an embedding level to `B` may very well not find rule X8.

Perhaps the following wording would be better:

> X8. All explicit directional embeddings, overrides and isolates are
> completely terminated upon encountering a paragraph separator (B) or the end
> of the paragraph. Since this prevents paragraph separators from being
> included in any embedding, override, or isolate, they are thus always
> assigned the paragraph embedding level.
Date/Time: Tue Oct 10 06:46:20 CDT 2017
Name: Srinidhi A, Sridatta A
Report Type: Error Report
Opt Subject: Indic syllabic category of Kharoshthi virama
In Indic_Syllabic_Category:
* Virama is assigned only to characters that can act both as visible killer viramas and as consonant stackers.
* Pure_Killer is assigned to characters that can only act as pure killers (killing the inherent vowel in a consonant sequence, with no consonant stacking behavior).
* Invisible_Stacker is assigned to characters that can only act as consonant stackers.
KHAROSHTHI VIRAMA is currently assigned the property value Invisible_Stacker. Unlike other Indic scripts, Kharoshthi does not have any visible form of virama that acts as a halanta. When not followed by a consonant, the virama causes the preceding consonant to be written as subscript to the left of the letter preceding it. If followed by another consonant, the virama will trigger a combined form consisting of two or more consonants (from the Kharoshthi section of the Core Specification, pages 564-565). Since the virama in Kharoshthi can both act as a halanta (killing the inherent vowel) and form consonant conjuncts, the current property value is incorrect. What is the appropriate property value? Should it be changed to Indic_Syllabic_Category=Virama?
Date/Time: Mon Oct 30 09:45:21 CDT 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Indic categories of U+20F0 COMBINING ASTERISK ABOVE
U+20F0 COMBINING ASTERISK ABOVE has scx={Deva Gran Latn} because it is used as a svara marker. It should also have Indic_Syllabic_Category=Cantillation_Mark and Indic_Positional_Category=Top.
Date/Time: Sun Nov 12 16:22:17 CST 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Underspecified Soyombo vowel signs
A Soyombo consonant may take multiple vowel signs, all of which have ccc=0 and Indic_Syllabic_Category=Vowel_Dependent. The Unicode Standard does not specify the order. The proposal (L2/15-004R) recommends the order V_sign M_length V_diphthong. The standard should specify this order explicitly, instead of the vague wording in the “Vowels and Diphthongs” section. This goes for Zanabazar Square too.
Date/Time: Mon Nov 13 13:42:08 CST 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Typo in the section on Kayah Li
In the Kayah Li section, two of the vowels are written ⟨o’⟩ and ⟨u’⟩. They should be ⟨ơ⟩ and ⟨ư⟩.
Date/Time: Sun Nov 19 11:52:38 CST 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Indic_Syllabic_Category of U+0C80
According to L2/14-153, U+0C80 KANNADA SIGN SPACING CANDRABINDU can be followed by U+0C82 KANNADA SIGN ANUSVARA, so its Indic_Syllabic_Category should be Bindu.
Date/Time: Sun Nov 19 12:30:42 CST 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Script_Extensions of U+1CF2 VEDIC SIGN ARDHAVISARGA
According to L2/11-175R, U+1CF2 VEDIC SIGN ARDHAVISARGA is used in Tirhuta, so its Script_Extensions should include Tirhuta.
Date/Time: Sun Nov 19 12:45:32 CST 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Typo in the Meetei Mayek Extensions chart
The header for Meetei Mayek Extensions includes the word “Manupuri”, which should be “Manipuri”.
Date/Time: Sun Nov 19 13:47:08 CST 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Unspecified order of Meetei Mayek vowel signs
The Unicode Standard 10.0, chapter 13, page 541 says that in a Meetei Mayek abbreviation, a consonant may have multiple vowel signs. The order of the vowel signs is not specified. Is it pronunciation order or visual left–top–bottom–right order? That section also says that “[i]n such cases, the vowel matra may occur at the end of a word”, implying that in other cases, the vowel matra may not occur at the end of a word, which is not true.
Date/Time: Tue Nov 21 13:47:38 CST 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Indic_Syllabic_Category of U+A8B4 SAURASHTRA CONSONANT SIGN HAARU
U+A8B4 SAURASHTRA CONSONANT SIGN HAARU is a modifier letter, kind of like a spacing nukta. It can be followed by a vowel sign or virama. Therefore, its Indic_Syllabic_Category should not be Consonant_Final. I suggest Consonant_Medial, although that may not be the best choice as it is not a consonant per se.
Date/Time: Wed Nov 22 13:50:41 CST 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Syloti Nagri dvisvara and anusvara
The standard’s introduction to Syloti Nagri says the script has 27 consonants, 5 independent vowels, 5 dependent vowels, and 2 “proper diacritics” (anusvara and hasanta), but nowhere does it mention U+A802 SYLOTI NAGRI SIGN DVISVARA. U+A802 is, in general, underspecified. U+A802 is a dependent vowel that can cooccur with other dependent vowels. It has no Indic_Syllabic_Category or Indic_Positional_Category. It seems appropriate for it to be in the categories Vowel_Dependent and Top. L2/02-388 says U+A802 and U+A80B SYLOTI NAGRI SIGN ANUSVARA can each appear before (rarely) or after (usually) the vowel sign a. This is unusual for an Indic script in Unicode. It complicates the usual model, where a top vowel precedes a post-base vowel and a post-base vowel precedes a bindu, and any other order is an error. The standard should therefore explicitly explain these exceptions. Alternatively, if it is determined that the behavior in the proposal is not to be promoted, the standard should say that dvisvara and anusvara may each appear in two positions relative to the vowel sign a, but that either way the *svara is encoded after the vowel sign.
Date/Time: Tue Nov 28 07:24:38 CST 2017
Name: Songchyuan Liou
Report Type: Error Report
Opt Subject: Tai Viet(AA80–AADF)'s character names
I request consideration of a change to the character names. These errors can be obviously discovered if you compare the Tai Viet letters to the related Thai and Lao letters. ("LOW" and "HIGH" are reversed.)
TAI VIET LETTER LOW KO → TAI VIET LETTER HIGH KO
TAI VIET LETTER HIGH KO → TAI VIET LETTER LOW KO
TAI VIET LETTER LOW KHO → TAI VIET LETTER HIGH KHO
TAI VIET LETTER HIGH KHO → TAI VIET LETTER LOW KHO
TAI VIET LETTER LOW KHHO → TAI VIET LETTER HIGH KHHO
TAI VIET LETTER HIGH KHHO → TAI VIET LETTER LOW KHHO
TAI VIET LETTER LOW GO → TAI VIET LETTER HIGH GO
TAI VIET LETTER HIGH GO → TAI VIET LETTER LOW GO
TAI VIET LETTER LOW NGO → TAI VIET LETTER HIGH NGO
TAI VIET LETTER HIGH NGO → TAI VIET LETTER LOW NGO
TAI VIET LETTER LOW CO → TAI VIET LETTER HIGH CO
TAI VIET LETTER HIGH CO → TAI VIET LETTER LOW CO
TAI VIET LETTER LOW CHO → TAI VIET LETTER HIGH CHO
TAI VIET LETTER HIGH CHO → TAI VIET LETTER LOW CHO
TAI VIET LETTER LOW SO → TAI VIET LETTER HIGH SO
TAI VIET LETTER HIGH SO → TAI VIET LETTER LOW SO
TAI VIET LETTER LOW NYO → TAI VIET LETTER HIGH NYO
TAI VIET LETTER HIGH NYO → TAI VIET LETTER LOW NYO
TAI VIET LETTER LOW DO → TAI VIET LETTER HIGH DO
TAI VIET LETTER HIGH DO → TAI VIET LETTER LOW DO
TAI VIET LETTER LOW TO → TAI VIET LETTER HIGH TO
TAI VIET LETTER HIGH TO → TAI VIET LETTER LOW TO
TAI VIET LETTER LOW THO → TAI VIET LETTER HIGH THO
TAI VIET LETTER HIGH THO → TAI VIET LETTER LOW THO
TAI VIET LETTER LOW NO → TAI VIET LETTER HIGH NO
TAI VIET LETTER HIGH NO → TAI VIET LETTER LOW NO
TAI VIET LETTER LOW BO → TAI VIET LETTER HIGH BO
TAI VIET LETTER HIGH BO → TAI VIET LETTER LOW BO
TAI VIET LETTER LOW PO → TAI VIET LETTER HIGH PO
TAI VIET LETTER HIGH PO → TAI VIET LETTER LOW PO
TAI VIET LETTER LOW PHO → TAI VIET LETTER HIGH PHO
TAI VIET LETTER HIGH PHO → TAI VIET LETTER LOW PHO
TAI VIET LETTER LOW FO → TAI VIET LETTER HIGH FO
TAI VIET LETTER HIGH FO → TAI VIET LETTER LOW FO
TAI VIET LETTER LOW MO → TAI VIET LETTER HIGH MO
TAI VIET LETTER HIGH MO → TAI VIET LETTER LOW MO
TAI VIET LETTER LOW YO → TAI VIET LETTER HIGH YO
TAI VIET LETTER HIGH YO → TAI VIET LETTER LOW YO
TAI VIET LETTER LOW RO → TAI VIET LETTER HIGH RO
TAI VIET LETTER HIGH RO → TAI VIET LETTER LOW RO
TAI VIET LETTER LOW LO → TAI VIET LETTER HIGH LO
TAI VIET LETTER HIGH LO → TAI VIET LETTER LOW LO
TAI VIET LETTER LOW VO → TAI VIET LETTER HIGH VO
TAI VIET LETTER HIGH VO → TAI VIET LETTER LOW VO
TAI VIET LETTER LOW HO → TAI VIET LETTER HIGH HO
TAI VIET LETTER HIGH HO → TAI VIET LETTER LOW HO
TAI VIET LETTER LOW O → TAI VIET LETTER HIGH O
TAI VIET LETTER HIGH O → TAI VIET LETTER LOW O
Date/Time: Fri Dec 1 08:14:41 CST 2017
Name: David Corbett
Report Type: Error Report
Opt Subject: Default ignorability of U+1D159 MUSICAL SYMBOL NULL NOTEHEAD
Should U+1D159 MUSICAL SYMBOL NULL NOTEHEAD be default ignorable? This question came up last year on the mailing list. The response (http://unicode.org/pipermail/unicode/2016-September/003953.html) was that it shouldn’t, because “it is essentially just a base for applying the various combining stems and flags for a display without showing a particular notehead, analogous to applying a generic combining mark to a NBSP to show that combining mark in isolation.” That sounds reasonable, but it is not what the standard says. The standard only mentions it when discussing the musical format characters: “In some exceptional cases, beams are left unclosed on one end. This status can be indicated with a U+1D159 MUSICAL SYMBOL NULL NOTEHEAD character if no stem is to appear at the end of the beam.” The proposal (L2/98-045) says the same: “In some exceptional cases, beams are left-unclosed on one end. This can be indicated with a "null note" (0001 xx92 WESTERN MUSICAL SYMBOL NULL NOTEHEAD) character if no stem is to appear at the end of the beam.” If U+1D159 is only meant to be used with the control characters, which are default ignorable, then it too should be default ignorable. If it is meant to be a spacing invisible notehead that can take combining marks, the standard or the code chart should say so. As it is, it is not clear whether it should be zero-width or not.
Date/Time: Sun Dec 31 01:15:34 CST 2017
Name: Manish Goregaokar
Report Type: Error Report
Opt Subject: Mentioning the alternative name of U+06A9
U+06A9 ARABIC LETTER KEHEH currently says "Persian, Arabic, ..." in its description. This character is also used in Sindhi, and it's worth mentioning this, especially since the name "keheh" comes from the Sindhi name. This letter is typically called the "Kaf Mashkula" (https://en.wikipedia.org/wiki/Kaph#Arabic_k%C4%81f); this information should be in the description as well so that it can be found by that name.
Date/Time: Wed Jan 3 07:45:32 CST 2018
Name: Huáng Jùnliàng
Report Type: Error Report
Opt Subject: Unicode Space Characters Table 6-2 should not contain U+180E
Table 6-2, Unicode Space Characters, on page 268 of the Unicode 10.0.0 specification lists U+180E as one of the space characters. As is stated, “The space characters in the Unicode Standard can be identified by their General Category, [gc=Zs]”. However, U+180E Mongolian Vowel Separator has had its General Category changed from Zs to Cf since Unicode 6.3.0 [1][2], so we should respect this change and remove U+180E from the table.
[1] http://www.unicode.org/L2/L2013/13004-vowel-sep-change.pdf
[2] http://www.unicode.org/reports/tr44/tr44-14.html
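The General Category claim in this report can be checked with a short script. The following is only a sketch: it assumes a Python interpreter whose bundled Unicode database is version 6.3 or later, and the helper name `space_separators` is invented for illustration.

```python
import unicodedata

# U+180E MONGOLIAN VOWEL SEPARATOR, the character discussed in the report.
MVS = "\u180e"


def space_separators():
    """Return the BMP code points whose General_Category is Zs.

    Hypothetical helper: it relies on the Unicode database bundled with the
    running interpreter, which is assumed to be version 6.3 or later.
    """
    return [cp for cp in range(0x10000) if unicodedata.category(chr(cp)) == "Zs"]


if __name__ == "__main__":
    print("Unicode database version:", unicodedata.unidata_version)
    print("gc of U+180E:", unicodedata.category(MVS))
    print("U+180E listed under gc=Zs:", 0x180E in space_separators())
```

If the last line prints `False`, the character is no longer a space separator in that database, which is the situation the report says Table 6-2 should reflect.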
Date/Time: Mon Jan 15 17:18:05 CST 2018
Name: Behnam Esfahbod
Report Type: Error Report
Opt Subject: Error in TUS Table 18-1. Blocks Containing Han Ideographs, 7th Row
Unicode Blocks are defined as "a uniquely named, continuous, non-overlapping range of code points, containing a multiple of 16 code points". [https://www.unicode.org/glossary/#block]

"CJK Unified Ideographs Extension F" is defined as `2CEB0..2EBEF`. [http://ftp.unicode.org/Public/10.0.0/ucd/Blocks.txt]

In The Unicode Standard, Table 18-1. Blocks Containing Han Ideographs has the "Range" column which appears to be showing the ranges used to define Unicode Blocks, named in the first column, "Block". The "Range" column has the correct value for all the Blocks listed, except "CJK Unified Ideographs Extension F", which is shown as "2CEB0–2EBE0", instead of "2CEB0–2EBEF" (notice the "F" instead of "0" as LSD of the range's end codepoint).

It's true that U+2EBE0 is the last *assigned* codepoint of this Block, but from the context, it doesn't look like that's what the column represents. Similarly, Block "CJK Unified Ideographs Extension A" has U+4DB5 as the last *assigned* codepoint, but the correct range end value (`4DBF`) is listed in the table.

Suggested Correction: Update the 7th row of the table to:
```
| CJK Unified Ideographs Extension F | 2CEB0–2EBEF | Rare, historic |
```
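The arithmetic behind this report can be sketched as follows. The function names are invented for illustration; the range values are the ones quoted in the report, plus the Extension A start (3400) as given in Blocks.txt.

```python
def block_size(start: int, end: int) -> int:
    """Number of code points in the inclusive range start..end."""
    return end - start + 1


def is_whole_block(start: int, end: int) -> bool:
    """True if the range contains a multiple of 16 code points,
    as the glossary definition of a block requires."""
    return block_size(start, end) % 16 == 0


# Extension F as defined in Blocks.txt, and as printed in Table 18-1.
EXT_F_BLOCKS_TXT = (0x2CEB0, 0x2EBEF)
EXT_F_TABLE_18_1 = (0x2CEB0, 0x2EBE0)
# Extension A, whose table entry already uses the block end (4DBF).
EXT_A = (0x3400, 0x4DBF)

if __name__ == "__main__":
    for label, rng in [
        ("Ext F (Blocks.txt)", EXT_F_BLOCKS_TXT),
        ("Ext F (Table 18-1)", EXT_F_TABLE_18_1),
        ("Ext A", EXT_A),
    ]:
        print(label, block_size(*rng), is_whole_block(*rng))
```

The range ending at 2EBEF holds 7488 code points (16 × 468), while the range ending at 2EBE0 holds 7473, which is not a multiple of 16 and so cannot be a block range.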
Date/Time: Sat Jan 20 12:50:17 CST 2018
Name: Andrew West
Report Type: Error Report
Opt Subject: Underspecified Zanabazar Square vowel signs
Whilst designing and implementing a font for the Zanabazar Square script encoded in Unicode 10.0 I have encountered the following issue (see also report by David Corbett on 12 November 2017). Zanabazar Square consonants may take multiple vowel signs and a vowel length mark (U+11A0A), but all have ccc=0, so different (but visually identical) sequences of consonant plus multiple vowel signs and length mark do not normalize to a canonical order. The Unicode Standard does not specify a correct order of vowel signs, and Anshuman Pandey's proposal to encode Zanabazar Square script (http://www.unicode.org/L2/L2015/15337-zanabazar-square.pdf) is inconsistent on the order of vowel signs and the placement of the vowel length mark, e.g. on pages 5-6 the vowel length mark is placed after the vowel signs I, UE, U, E, OE, O, and Reversed I, but before the vowel signs AI and AU, and between two vowel signs in some sequences. I find it odd to put the vowel length mark after a vowel sign (and even odder to put it between two vowel signs) because the mark corresponds to Tibetan a-chung (U+0F71), which is placed between consonant and vowel signs. Furthermore, the Zanabazar length mark attaches (ligates) to the preceding consonant, so I feel intuitively that it belongs after the consonant and before any vowel signs. From a font implementation point of view it is much easier to deal with the length mark using OpenType substitutions if it is not separated from the preceding consonant by one or more vowel signs. I would like the Unicode Standard to specify the encoding order of vowel signs and vowel length mark so that end users and implementers can have a common understanding of how to write the Zanabazar Square script. In particular I would like to define the vowel length mark as coming before any vowel sign, i.e. <consonant> [<subjoiner> <consonant>]* [<cluster final letter>] [<length mark>] [<vowel sign>]*.
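The normalization problem described above can be reproduced in a few lines. A minimal sketch using Python's standard `unicodedata` module, assuming an interpreter whose Unicode data covers version 10.0; the consonant (U+11A0B) and vowel sign (U+11A01) are picked from the block for illustration only:

```python
import unicodedata

CONSONANT = "\U00011A0B"    # a Zanabazar Square consonant letter (illustrative choice)
VOWEL_SIGN = "\U00011A01"   # a Zanabazar Square vowel sign (illustrative choice)
LENGTH_MARK = "\U00011A0A"  # the vowel length mark named in the report

# Two encodings of what would render as the same syllable.
mark_first = CONSONANT + LENGTH_MARK + VOWEL_SIGN
vowel_first = CONSONANT + VOWEL_SIGN + LENGTH_MARK

# Canonical combining classes of the two marks.
classes = [unicodedata.combining(c) for c in (VOWEL_SIGN, LENGTH_MARK)]
print(classes)

# Because both classes are 0, canonical reordering never swaps the marks,
# so the two sequences stay distinct after normalization.
same_after_nfc = (unicodedata.normalize("NFC", mark_first)
                  == unicodedata.normalize("NFC", vowel_first))
print(same_after_nfc)
```

Neither NFC nor NFD can unify the two spellings, which is why an explicitly documented encoding order is the only way to get a single representation.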
Date/Time: Sat Nov 4 04:59:34 CDT 2017
Name: Marlen
Report Type: Other Question, Problem, or Feedback
Opt Subject: I with a dot, I without a dot
Please consider the introduction of additional symbols associated with the variants of "I i" with a dot and without a dot. In some Turkic languages, the symbols "I i" are used in two variants: "I ı" (I without a dot) and "İ i" (I with a dot). At the abstract level, there are 3 pairs of different symbols: standard I (I i), I with a dot (İ i), and I without a dot (I ı). But in Unicode there are additional characters only for "ı" (lowercase I without a dot) and "İ" (uppercase I with a dot). With software that ignores case, this can lead to serious problems and misunderstandings. This problem could be solved as follows: introduce an additional "I" (uppercase I without a dot) as the uppercase of "ı" (lowercase I without a dot), with a different code than the standard "I"; and introduce an additional "i" (lowercase I with a dot) as the lowercase of "İ", with a different code than the standard "i".
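The kind of problem described here shows up with default, locale-independent case mappings. A short sketch using Python's built-in `str.upper` and `str.lower`, which apply those default mappings; the variable names are illustrative:

```python
DOTLESS_LOWER = "\u0131"  # ı LATIN SMALL LETTER DOTLESS I
DOTTED_UPPER = "\u0130"   # İ LATIN CAPITAL LETTER I WITH DOT ABOVE

# Uppercasing dotless ı yields the ordinary ASCII capital I ...
upper_of_dotless = DOTLESS_LOWER.upper()
# ... so a round trip through uppercase does not return to ı.
round_trip = upper_of_dotless.lower()

# Lowercasing İ yields "i" followed by U+0307 COMBINING DOT ABOVE.
lower_of_dotted = DOTTED_UPPER.lower()

print(upper_of_dotless == "I")
print(round_trip == DOTLESS_LOWER)
print([f"U+{ord(c):04X}" for c in lower_of_dotted])
```

Under these default mappings the dotless and dotted pairs collapse onto the ASCII letters, so correct results for Turkic text depend on language-specific tailoring outside the default mappings.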
Date/Time: Fri Nov 10 17:16:01 CST 2017
Report Type: Error Report
Opt Subject: Minor error in UAX #14
Section 5 of "UAX #14: Unicode Line Breaking Algorithm" refers to LineBreak.txt as tab-delimited. LineBreak.txt is, in fact, semicolon-delimited. (I did say it was minor. :) )
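For implementers, the practical consequence is the choice of field separator when reading the file. A minimal parsing sketch, assuming the usual UCD layout of `code point or range ; value # comment`; the helper name is hypothetical and the sample lines are illustrative entries in that format with abbreviated comments, not a verbatim excerpt:

```python
def parse_line(line: str):
    """Parse one line into (first_cp, last_cp, line_break_class), or None if it holds no data."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment, if any
    if not data:
        return None
    code_points, lb_class = (field.strip() for field in data.split(";"))
    first, _, last = code_points.partition("..")
    return int(first, 16), int(last or first, 16), lb_class


SAMPLE = """\
# illustrative lines in the file's format
0000..0008;CM # controls
0009;BA # tab
000A;LF # line feed
0020;SP # space
"""

entries = [e for e in map(parse_line, SAMPLE.splitlines()) if e is not None]
for first, last, lb_class in entries:
    print(f"{first:04X}..{last:04X} -> {lb_class}")
```

Splitting the same lines on tabs would leave each record as a single unsplit field, which is how the wording in Section 5 could mislead a reader writing a parser from the text alone.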
Date/Time: Mon Dec 18 15:30:34 CST 2017
Name: Umihotaru Sasea
Report Type: Feedback on an Encoding Proposal
Opt Subject: MAYAN vs MAYA
The term "Mayan Numerals" should be changed to "Maya Numerals" (to harmonize with the term "Maya Hieroglyphs" found in the roadmap table), as scholars use MAYA rather than MAYAN for an adjective of Maya. See: http://www.osea-cite.org/program/maya_or_mayans.php https://www.thoughtco.com/ancient-maya-mayans-most-accepted-term-171569 https://www.belize.com/maya-or-mayan
Date/Time: Wed Dec 20 05:35:36 CST 2017
Name: Marcel Schneider
Report Type: Error Report
Opt Subject: French typesetting and Unicode
Hello, below is a piece of formal personal feedback, replacing my previous post, which was an information request but would have been handled as general feedback for the next UTC meeting. Thank you for bringing this to the attention of the UTC so that it can hopefully be settled in time for when French keyboard layouts must be defined for public release. Best regards, Marcel
_______________________________________________________________
• French typesetting not natively supported
U+202F NARROW NO-BREAK SPACE has been more or less recommended to us for typesetting a set of French punctuation marks only since 2014 and version 7.0 of the Standard, when the section on spaces in chapter 6 was edited. How was French supposed to be typeset before? Unicode specifies almost all space characters to be breakable. That specification is disruptive, as the em, two-per-em and four-per-em spaces were non-breakable in phototypesetting. Thus, the four-per-em space was to be used to surround punctuation marks on the no-break side. This practice has been disrupted by Unicode, as nobody can input interoperable text any longer. Note, however, that word processing software continues to handle these spaces as non-breakable, to conform to user expectations inherited from pre-Unicode practice.
Further, the Unicode narrow no-break space U+202F was encoded for Mongolian and is used in Phags-pa too, as being close in typesetting practice and requirements. That took place no sooner than version 3.0, published in September 1999, seven years after v1.1. Supposing that French implementers are eager to spread best practices, how were we supposed to typeset French text before being able to use the only fixed-width no-break space in Unicode, which seems to be U+202F? However, pre-Unicode typesetting seems to have been unusable already, as one needs a justifying no-break space as well, not only a bunch of fixed-width ones. Unicode specifies U+00A0 as being justifying. That in turn makes it unfit for French punctuation typesetting.
Word processors work around this by providing that space as fixed-width, while publishing software provides two no-break spaces of the same default width, one justifying, the other fixed-width. Unicode seems not to have encoded the latter, so that the goal of empowering people to produce interoperable text is not reached, be it by design or by mistake.
• Interoperable abbreviation typesetting unsupported
As far as space characters are involved, Unicode applies the universality design principle by allowing re-use of U+202F for French, where its width is doubled to fit its new purpose. On the other hand, Unicode prohibits re-use of superscript Latin letters for interoperable French abbreviation typesetting, urging people not to use characters otherwise than intended. Other languages like English, Italian, Portuguese and Spanish are concerned as well, however not to the same extent. Simultaneously, another Unicode design principle, stipulating that significant differences in appearance or behavior must be handled by different characters, not different fonts, is contradicted by endorsing that an approx. nine..twelve-per-em space, as is U+202F in Mongolian, is used in other scripts as an equivalent of a four..six-per-em space. On the other hand, re-use of a subset of the spacing modifier letters encoded partly for phonetic transcription, partly for medievalist usage, is outlawed, even though the new purpose of robust abbreviation typesetting is entirely coherent with the originally intended use, in practice (use of superscript in abbreviations goes back to medieval Latin handwriting), in glyph shapes (superscript modifier letters must be evenly shaped by specification, as more than the whole Latin base alphabet may be used in phonetics), and in character properties. Now, should the narrow no-break space encoded for Mongolian be used for French as well, while the superscript Latin letters encoded for phonetics should not?
Unicode is known to be eager to support legacy practice and to provide for round-trip conversion between the new and the old standards. All and any legacy characters presented during its first years made their way into the Standard.
1) How could it happen that spaces like the four-per-em space were encoded as breakable? Making them non-breakable means surrounding them with word joiners, so that we get three characters instead of one single character. In the same spirit, Unicode could have encoded all diacriticized letters as combining sequences only.
2) Why did Unicode pick only one of the two no-break spaces used in desktop publishing practice, and leave the fixed-width one to private use? As a consequence, interoperable plain text cannot be exported from that software, as both spaces are merged into one single character. That contradicts the Unicode design principle of allowing all basic semantics to be represented in plain text.
3) What made Unicode overlook that users must be enabled to robustly typeset not only Italian, Spanish and Portuguese ordinal indicators, but also English and French ones, as well as other abbreviations? Whenever the best plain-text representation is an ugly and non-conformant fallback, Unicode is not supporting plain text representation of that language.
Things would be different today if Unicode had supported French from the beginning, specifying what space to use for typesetting French punctuation, as it specified peculiar features for many languages, and encoding what is needed for robust representation of the language in plain text, as it encoded many special letters and format characters to correctly represent in plain text every single language it set out to support. Several important pieces of information are missing for a streamlined Unicode education. Knowing about the whys and hows would make documentation much more straightforward.
Making Unicode more conformant to its design principles would make the Standard more usable. Thatʼs what we expect. It basically requires correcting one biased policy so that non-conformant fonts can be rejected, and acknowledging that the industry needed to hack the repertoire to get French correctly supported. It is still a lot of trouble to tell people, a quarter of a century after Unicode was brought to us, that we are only now coming up with the right space for our punctuation. French people could assume that it was in Unicode from the beginning. In fact, it wasnʼt.
Further reading: http://www.unicode.org/mail-arch/unicode-ml/y2017-m01/0119.html
Please complete with: http://www.unicode.org/mail-arch/unicode-ml/y2017-m04/0278.html
_______________________________________________________________
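The "three characters instead of one" point in question 1 above can be made concrete. A small sketch, assuming a sample word and punctuation mark chosen purely for illustration, that compares a single U+202F with a four-per-em space wrapped in word joiners:

```python
NNBSP = "\u202F"        # NARROW NO-BREAK SPACE
FOUR_PER_EM = "\u2005"  # FOUR-PER-EM SPACE
WORD_JOINER = "\u2060"  # WORD JOINER

# A breakable fixed-width space made non-breaking by wrapping it in word joiners.
protected_space = WORD_JOINER + FOUR_PER_EM + WORD_JOINER

# The same (illustrative) French phrase with each kind of spacing.
with_nnbsp = "Bonjour" + NNBSP + "!"
with_wrapped = "Bonjour" + protected_space + "!"

print(len(NNBSP), len(protected_space))
print(len(with_nnbsp), len(with_wrapped))
```

The wrapped form costs two extra code points at every punctuation mark, which is the overhead the letter objects to.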
Date/Time: Sat Jan 20 06:22:41 CST 2018
Name: Marcel Schneider
Report Type: Other Question, Problem, or Feedback
Opt Subject: Angle brackets
Reviewing L2/18-025 drew my attention to angle brackets. Iʼm unable to retrieve the rationale for the canonical equivalence of non-CJK angle brackets U+2329 U+232A with CJK angle brackets U+3008 U+3009. TUS describes this canonical equivalence in the past tense, which implies that it no longer holds. However, due to the stability guarantee for canonical equivalence, non-CJK angle brackets have been subject to duplicate encoding to recover the use of angle brackets in non-CJK contexts (U+27E8 U+27E9). We need to document the point of using “mathematical” angle brackets in ordinary text. That is not done by pointing to a canonical equivalence without documenting that canonical equivalence itself. Presumably, making these characters canonically equivalent was an encoding error, and should be declared as such for transparency. Moreover, TUS uses the term “angle brackets” when referring to the ASCII chevrons LESS-THAN and GREATER-THAN. That is confusing. According to Wikipedia, these are mainly called “pointy brackets”: https://en.wikipedia.org/wiki/Bracket
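The effect of this canonical equivalence on ordinary text is easy to observe. A minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative:

```python
import unicodedata

# Non-CJK angle brackets and the CJK brackets they are canonically equivalent to.
EQUIVALENTS = {"\u2329": "\u3008", "\u232A": "\u3009"}

# Normalization replaces each non-CJK bracket with its CJK counterpart.
nfc_results = {src: unicodedata.normalize("NFC", src) for src in EQUIVALENTS}
for src, normalized in nfc_results.items():
    print(f"U+{ord(src):04X} -> U+{ord(normalized):04X}")

# The later mathematical angle brackets survive normalization unchanged.
math_stable = all(unicodedata.normalize("NFC", c) == c for c in "\u27E8\u27E9")
print(math_stable)
```

Any text containing U+2329 or U+232A therefore comes out of normalization with the CJK brackets, which is the practical reason U+27E8 and U+27E9 are used instead in non-CJK contexts.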
Date/Time: Sat Jan 20 12:24:10 CST 2018
Name: Marcel Schneider
Report Type: Other Question, Problem, or Feedback
Opt Subject: Typographic dashes disambiguation
[[NOTE: This feedback supersedes the one previously sent with the same subject.]] Reviewing L2/18-025 drew my attention to U+2015 and U+2012.
U+2015
Properly designed fonts like Cambria give U+2015 a four-per-three-em width. The confusion among users is due to improperly designed fonts that give it the same length as the em-dash, making it useless in practice. But that is just a flaw in fonts, fueled by the confusing names in the Unicode Standard, which suggest that there was no means to give U+2015 a name like those of U+2013 and U+2014, as if the difference were only in semantics, not in typography. For consistency, U+2015 should have been given a name based on advance width, or length, like U+2013 and U+2014. This has already been taken into account in the draft French translation, where on 2017-12-26 it was renamed to TIRET TROIS QUARTS DE CADRATIN, which translates to FOUR-PER-THREE-EM DASH, according to the English-style naming. The short form would be 4/3M, or more arithmetically, 3/4 M, or ¾M. HORIZONTAL BAR is also confusing in that it suggests a tie to U+007C VERTICAL BAR. The HORIZONTAL BAR label is likely to have been a last-resort choice, made while the actual length of this dash was still flawed by fonts. Hence, Unicode added the informative alias “quotation dash.” An annotation should be added to U+2015 to prevent further confusion. When an em-dash is too long, and a half-em dash (en dash) is too short, the three-quarter-em dash is right.
U+2012 FIGURE DASH seems to be part of a set taken over from legacy typesetting of figure tables: U+2007 FIGURE SPACE, U+2008 PUNCTUATION SPACE, and U+2012 FIGURE DASH. These three allow for round-trip compatibility with older standards from the time when figure (function) tables were already computed while typesetting was still done in hot metal, and would have been useful when data was output for Linotype. Anyhow, the Unicode Standard recommends the use of U+2013 EN DASH to denote intervals.
An alternate recommendation is found on technicalauthoring.com: http://www.technicalauthoring.com/wiki/index.php/Figure_dash Wikipedia indicates the reason why the figure dash is preferred for intervals: https://en.wikipedia.org/wiki/Dash#Similar_Unicode_characters It is the same rationale as for hyphen vs minus sign: the latter is centered on uppercase digits, the former on lowercase letters. That leads me to reconsider the Unicode recommendation of U+2013 for intervals in ordinary text (as opposed to technical notation using two dots). Based on the Unicode Standard, Iʼve taken U+2012 off the keyboard layout, but this new evidence might lead me to remap it in place of the en dash on the numbers level. What is the current Unicode policy on the proper representation of intervals?