The sections below contain links to permanent feedback documents for the open Public Review Issues, as well as other public feedback received as of July 29, 2014, since the previous cumulative document was issued prior to UTC #139 (May 2014). This document does not include feedback on moderated Public Review Issues from the forum that have been digested by the forum moderators; those are in separate documents for each of the PRIs. Grayed-out items in the Table of Contents do not have feedback here.
The links below go directly to open PRIs and to feedback documents for them, as of July 29, 2014. Gray rows have no feedback to date.
Issue  Name  Feedback Link
278  Proposed Update UTR #50, Unicode Vertical Text Layout  (feedback)
277  Reconciling Script and Script_Extensions Character Properties  (feedback)
276  Feedback on repertoire for ISO/IEC 10646:2014 (4th Edition, Amendment 2)  (feedback)
273  Proposed Update UTS #39, Unicode Security Mechanisms  (feedback)
272  Proposed Update UTR #36, Unicode Security Considerations  (feedback)
The links below go to locations in this document for feedback.
Feedback on Encoding Proposals
Feedback on UTRs / UAXes
Error Reports
Other Reports
None at this time.
Date/Time: Thu May 22 08:01:18 CDT 2014
Name: Anne van Kesteren
Report Type: Public Review Issue
UTS#46
Opt Subject: Domain syntax
I just wanted to clarify something with regards to the review note at the end of http://www.unicode.org/reports/tr46/proposed.html#Implementation_Notes What I'd really like to see is a syntax description. E.g. a domain consists of domain labels separated from each other by domain label separators, optionally with a trailing domain label separator. A domain label is a sequence of one or more code points which are one of X, Y, and Z. A domain label separator is one of X, Y, and Z. Alternatively you could express this using ABNF or some kind of grammar. That is the kind of thing people writing validators or authoring tools are often looking for. And often web developers as well. They don't want to have to put some input they made up through a series of functions before they know whether the input is valid. I guess another way of saying this would be having a declarative description of a domain. (This is an open issue https://www.w3.org/Bugs/Public/show_bug.cgi?id=25334 for the URL Standard.)
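The kind of declarative description requested above can be sketched in Python. This is only a minimal illustration: the report leaves the actual code point sets as "X, Y, and Z", so LABEL_CHARS and SEPARATORS below are hypothetical placeholders (lowercase ASCII letters, digits, and hyphen for labels; four full-stop characters for separators), and the names split_labels and is_domain are invented for this sketch.

```python
# Hypothetical placeholder sets; the report does not define "X, Y, and Z".
LABEL_CHARS = frozenset("abcdefghijklmnopqrstuvwxyz0123456789-")
SEPARATORS = frozenset("\u002E\u3002\uFF0E\uFF61")


def split_labels(text, separators=SEPARATORS):
    """Split text into labels at every domain label separator."""
    labels, current = [], []
    for ch in text:
        if ch in separators:
            labels.append("".join(current))
            current = []
        else:
            current.append(ch)
    labels.append("".join(current))
    return labels


def is_domain(text, label_chars=LABEL_CHARS, separators=SEPARATORS):
    """domain = label *(separator label) [separator]; label = 1*label-char."""
    labels = split_labels(text, separators)
    # A single trailing separator is optional, so drop the empty final label.
    if len(labels) > 1 and labels[-1] == "":
        labels.pop()
    return all(
        label and all(ch in label_chars for ch in label) for label in labels
    )


if __name__ == "__main__":
    for candidate in ("example.com", "example.com.", "example..com", ""):
        print(repr(candidate), is_domain(candidate))
```

The check is purely syntactic: it accepts or rejects the input as written, without first passing it through any mapping or processing steps.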
Date/Time: Wed May 28 15:54:40 CDT 2014
Name: Richard Wordingham
Report Type: Error Report
UTS #18
Opt Subject: Definition of Unicode Set in Unicode Regular Expressions
Unicode Technical Standard #18 'Unicode Regular Expressions' Revision 17 refers to Unicode sets, but does not define them. I have been told that the definition is meant to be taken from UTS#35, the LDML specification, and that there ought to be a cross-reference to that definition. Section 1.3 of UTS#18 contains two examples, "[\p{L}--QW]" and "[\p{Assigned}--\p{Decimal Digit Number}--a-fA-Fa-fA-F]", which appear not to conform to the LDML syntax. Further details are given at http://unicode.org/cldr/trac/ticket/7507 .
Date/Time: Sat Jun 7 14:23:13 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Possible typo in UTR #31
Hello, In http://www.unicode.org/reports/tr31/ clause R7 says: "R7 Filtered Case-Insensitive Identifiers To meet this requirement, an implementation shall specify either simple or full case folding, and adhere to the Unicode specification for that folding. Except for identifiers containing excluded characters, allowed identifiers must be in the specified Normalization Form." Is a Normalization Form truly meant here or is it a case-folding form? Thanks, Dmitry S.
Date/Time: Wed Jun 11 18:50:32 CDT 2014
Name: Norbert Lindenberg
Report Type: Error Report
Opt Subject: Inconsistency wrt/ variation selectors in UAX 31
Unicode Standard Annex 31, UNICODE IDENTIFIER AND PATTERN SYNTAX, is inconsistent in its description of variation selectors:
- Section 2.3 describes the risks associated with variation selectors (and other default-ignorable characters), and says “Variation selectors ... are not included in the default identifier syntax”, and “default-ignorable characters are normally excluded from Unicode identifiers”.
- Section 2, however, includes all nonspacing marks in ID_Continue, and does nothing to exclude variation selectors, which are nonspacing marks. And indeed, DerivedCoreProperties.txt does have the entries
180B..180D ; ID_Continue # Mn [3] MONGOLIAN FREE VARIATION SELECTOR ONE..MONGOLIAN FREE VARIATION SELECTOR THREE
FE00..FE0F ; ID_Continue # Mn [16] VARIATION SELECTOR-1..VARIATION SELECTOR-16
E0100..E01EF ; ID_Continue # Mn [240] VARIATION SELECTOR-17..VARIATION SELECTOR-256
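The observation that these variation selectors are nonspacing marks can be checked with a short Python sketch. It uses only the standard unicodedata module and the three ranges quoted above; the names VARIATION_SELECTOR_RANGES and variation_selectors are made up for this illustration, and the result reflects whatever Unicode version the local Python build ships.

```python
import unicodedata

# The three ranges quoted from DerivedCoreProperties.txt in the report.
VARIATION_SELECTOR_RANGES = [
    (0x180B, 0x180D),    # MONGOLIAN FREE VARIATION SELECTOR ONE..THREE
    (0xFE00, 0xFE0F),    # VARIATION SELECTOR-1..16
    (0xE0100, 0xE01EF),  # VARIATION SELECTOR-17..256
]


def variation_selectors():
    """Yield every character in the quoted ranges."""
    for first, last in VARIATION_SELECTOR_RANGES:
        for code_point in range(first, last + 1):
            yield chr(code_point)


if __name__ == "__main__":
    categories = {unicodedata.category(ch) for ch in variation_selectors()}
    print("General_Category values found:", sorted(categories))
    # Python's own identifier rule is based on XID_Start/XID_Continue, so this
    # shows how one real implementation treats a trailing variation selector.
    print("identifier with U+FE00:", ("a" + "\uFE00").isidentifier())
```

Because all of these characters have General_Category Mn, any identifier definition that admits every nonspacing mark admits them too, unless it excludes them explicitly.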
Date/Time: Fri Jun 13 22:36:38 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Typo in paragraph 3.6 of UTS #18 Unicode Regular Expressions
Hello, In section "3.6 Context Matching" http://www.unicode.org/reports/tr18/#Context_Matching there is a typo in the table with examples: the last column of the last two rows contains a string "ca not" which should be corrected to "cannot". Thanks, Dmitry S.
Date/Time: Tue Jun 17 14:46:22 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Typo in UTS #10 Unicode Collation Algorithm
Hello, There is a typo in section "3.8.1 Default Values" of UTS #10 Unicode Collation Algorithm (both 6.3.0 and 7.0.0): in the last sentence of the first paragraph it is written as follows: "The unmarked characters will a3) equal to MIN3." It seems that this should be corrected to the following: "The unmarked characters will have a3 equal to MIN3." Thanks, Dmitry S.
Date/Time: Wed Jun 18 15:40:40 CDT 2014
Name: Dmitry S.
Report Type: Error Report
Opt Subject: Possible error in UTS #10 Unicode Collation Algorithm
Hello, in UTS #10 Unicode Collation Algorithm version 7.0.0 clause S2.1.2 (http://www.unicode.org/reports/tr10/#S2.1.2) there seems to be an error in a note below the clause: "Note: A non-starter in a string is called blocked if there is another non-starter of the same canonical combining class or zero between it and the last character of canonical combining class 0." The "... non-starter of the same canonical combining class OR ZERO..." part seems erroneous to me because of the following: 1) UAX #15 http://www.unicode.org/reports/tr15/#Description_Norm defines non-starter as follows: "Most characters (including all non-combining marks) have a Canonical_Combining_Class value of zero, and are unaffected by the Canonical Ordering Algorithm. Such characters are referred to by a special term, starter. Only the subset of combining marks which have non-zero Canonical_Combining_Class property values are subject to potential reordering by the Canonical Ordering Algorithm. Those characters are called non-starters." 2) D107 Starter definition in the Unicode Standard: "D107 Starter: Any code point (assigned or not) with combining class of zero (ccc=0)." These excerpts imply that a non-starter cannot have a Canonical_Combining_Class value of zero (ccc=0), yet the note mentioned above states otherwise. Thanks, Dmitry S.
Analysis of the above report by Ken Whistler, 2014/06/18:
O.k., yes, this *is* a problem in wording, and it is non-trivial to fix. The note in question goes at least back to Version 4.0 of UTS #10, although its position in the text migrated a bit later on. In the UTS #10 4.0 version, it is: Note: A combining mark in a string is called blocked if there is another combining mark of the same canonical combining class or zero between it and the last character of canonical combining class 0. right below Step 2 in Section 4.2. It logically refers to Step 2.1.2, which is where the note was later moved. Then a comedy of errors ensues. In later versions of the text, the note was updated by replacing "combining mark" with "non-starter", without adjusting the text "or zero" correctly. But wait! It gets worse. This text, which was derived from the 4.0 version of UAX #15, where it defined starter for normalization, was not then adjusted for Corrigendum #5 (from February, 2005!), which inserted the wording "or higher" in the definition of blocked in UAX #15. And disconnected as it was, it then certainly did not follow the later move of all the definitions related to normalization *out* of UAX #15 and into Chapter 3 of the core spec (as of Version 5.2.0). And when they went into Chapter 3, the wording for "starter" was essentially unchanged, but the wording for "blocked" got a complete overhaul. So my conclusion is that all of the wording about starter and blocked in UTS #10 needs a serious update, to make correct references to the *current* definitions in Chapter 3, rather than using ad hoc, out-of-date definitions from 2005 derived from a long-superseded version of UAX #15. Doing *that* will require some significant work on this section of the text. --Ken
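For orientation, the notion of "blocked" discussed above can be sketched in Python. The sketch assumes the Chapter 3 definition reads roughly as follows: C is blocked from A if ccc(A) = 0 and some character B between A and C has ccc(B) = 0 or ccc(B) >= ccc(C). That wording is an assumption to be checked against the current core specification, and the function name is_blocked is invented here.

```python
import unicodedata


def is_blocked(text, a_index, c_index):
    """Return True if text[c_index] is blocked from text[a_index].

    Assumed rule: ccc(A) == 0 and some B strictly between A and C has
    ccc(B) == 0 or ccc(B) >= ccc(C).
    """
    if not 0 <= a_index < c_index < len(text):
        raise ValueError("need 0 <= a_index < c_index < len(text)")
    ccc = unicodedata.combining  # Canonical_Combining_Class as an int
    if ccc(text[a_index]) != 0:
        return False
    ccc_c = ccc(text[c_index])
    between = text[a_index + 1:c_index]
    return any(ccc(b) == 0 or ccc(b) >= ccc_c for b in between)


if __name__ == "__main__":
    samples = {
        "a + acute + acute": "a\u0301\u0301",
        "a + dot below + acute": "a\u0323\u0301",
        "a + acute + dot below": "a\u0301\u0323",
    }
    for name, text in samples.items():
        print(name, is_blocked(text, 0, len(text) - 1))
```

Under this reading, an intervening mark of a higher combining class also blocks, which is the "or higher" case that the analysis says the UTS #10 note never picked up.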
Date/Time: Thu Jun 19 11:18:19 CDT 2014
Name: Addison Phillips
Report Type: Error Report
Opt Subject: Bad example in Figure 2, UAX#15
Figure 2 in UAX#15 (Normalization Forms) contains examples of different types of "compatibility equivalence". The second line in this table is for "breaking differences" and shows the hyphen-minus character as the example. However, the only example I can find in TUS or the UCD of a "breaking difference" that is a case of compatibility decomposition (in fact, it is cited in Chapter 2 of TUS) is between U+00A0 (non-breaking space) and regular space. While it's really difficult to illustrate different kinds of space characters in a table, perhaps using a placeholder ("NBSP", "(non-breaking space)", etc.) might work? Or maybe add some attendant prose to explain the table? Note: The term "breaking difference" appears nowhere else that I can find in UAX15 or in the relevant sections of TUS related to compatibility decomposition.
Date/Time: Sat Jun 21 19:05:39 CDT 2014
Name: Samuel Bronson
Report Type: Error Report
Opt Subject: UAX #11: refers to biwidth fonts as "legacy"
In UAX#11, you say: >> An important class of fixed-width legacy fonts contains glyphs of just two widths, with the wider glyphs twice as wide as the narrower glyphs. I don't think it's correct to think of all such fonts as "legacy": such fonts tend to be popular with programmers, and I get the impression that, say, Japanese people usually like text to be typeset on a grid, too. (Granted, the ones that make characters fullwidth *just* because they are encoded using two bytes in some encoding or other are a bit silly.) If we could only get sensible wcwidth() values even for latin/punctuation/math characters and make the fonts to match, we'd *really* have something ... say, making EM DASH perceptibly wider than HYPHEN-MINUS?
Date/Time: Sat Jun 28 07:52:44 CDT 2014
Name: Diego Perini
Report Type: Other Question, Problem, or Feedback
Opt Subject: Correction for #Validity_Criteria UTS #46
There is a small syntax error in: http://www.unicode.org/reports/tr46/#Validity_Criteria the text: "2 - The label must not contain a U+002D HYPHEN-MINUS character in both the third position and fourth positions." Should be changed to: "2 - The label must not contain a U+002D HYPHEN-MINUS character in both the third and fourth positions."
Date/Time: Mon Jul 14 00:05:39 CDT 2014
Name: Karl Williamson
Report Type: Error Report
Opt Subject: UTS18 typo
The final line in Section 1.2 should be \p{Script_Extensions=Katakana} NOT \p{Script_Extensions=Hiragana}
Date/Time: Mon Jul 14 15:29:43 CDT 2014
Name: Markus Scherer
Report Type: Error Report
Opt Subject: UAX #38 kDefaultSortKey should distinguish traditional vs. simplified radicals
UAX #38 says: 2.1 Database design kDefaultSortKey "Bits 23-30 are the character’s KangXi radical number used [...] The difference between simplified and traditional radical is ignored."
This appears to be incorrect: The Han code chart (http://www.unicode.org/charts/PDF/U4E00.pdf) shows that the forms of the radicals are distinguished. For example, the characters with radical 120 (silk) are grouped together, and followed by the group of those with radical 120' (silk/C-simplified). See the chart at U+7CF8 and U+7E9F.
I expect that most if not all of the main Unihan block (4E00..9FFF) should follow the kDefaultSortKey order. If this expectation is not intended to be true, it should be documented for kDefaultSortKey. (I assume that possible exceptions would be due to corrections of the Unihan data since the original allocation.)
I suggest either restating the default sort key as something other than int bit fields (with the added distinction), or else using unsigned int (32-bit) or long (64-bit) bit fields, adding one bit for traditional (0) vs. simplified (1).
Given the existing action items for kDefaultSortKey ([139-A19a], [139-A21], see http://www.unicode.org/review/pri266/feedback.html) I suggest simplifying it as follows: Use a 64-bit integer with a less dense and therefore less error-prone encoding:
Bits 20.. 0 code point (avoids complications re [139-A19a])
Bit 23 set to 0 if the code point is U+4E00..U+FFFF, else set to 1 ([139-A21], UCA implicit weights BASE FB40 vs. FB80)
Bits 29..24 residual stroke count (0..63)
Bit 30 set to 0 if traditional radical form (e.g., 120), set to 1 if simplified (120')
Bits 39..32 radical number (1..214)
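The 64-bit layout suggested above can be sketched as a small packing function. This is only an illustration of the proposed bit positions, not an existing Unihan definition; the function name default_sort_key and its parameters are hypothetical, and the radical and stroke values passed in the example are illustrative inputs for the two characters named in the report.

```python
def default_sort_key(code_point, radical, residual_strokes, simplified):
    """Pack the fields into one integer using the layout proposed in the report.

    bits 39..32  radical number (1..214)
    bit  30      0 = traditional radical form, 1 = simplified
    bits 29..24  residual stroke count (0..63)
    bit  23      0 if the code point is U+4E00..U+FFFF, else 1
    bits 20..0   code point
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of range")
    if not 1 <= radical <= 214:
        raise ValueError("radical number out of range")
    if not 0 <= residual_strokes <= 63:
        raise ValueError("residual stroke count out of range")
    outside_main_range = 0 if 0x4E00 <= code_point <= 0xFFFF else 1
    return (
        (radical << 32)
        | (int(bool(simplified)) << 30)
        | (residual_strokes << 24)
        | (outside_main_range << 23)
        | code_point
    )


if __name__ == "__main__":
    # Illustrative inputs: radical 120 vs. 120', both with no residual strokes.
    traditional = default_sort_key(0x7CF8, 120, 0, simplified=False)
    simplified = default_sort_key(0x7E9F, 120, 0, simplified=True)
    print(hex(traditional), hex(simplified), traditional < simplified)
```

Because the simplified bit sits above the stroke-count field, every character filed under the traditional form of a radical sorts before every character filed under the simplified form, matching the grouping described for the code chart.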
Date/Time: Thu Jul 31 22:00:08 CDT 2014
Name: Markus Scherer
Report Type: Public Review Issue
Opt Subject: WD UTR #51 Unicode Emoji
The <title> says "UTS #51". It's not a UTS. Please change to "Working Draft UTR #51". Section 1 Introduction is good, but I feel strongly that the section on Longer Term Solutions should follow right after, rather than late in the document. The document points to at least one doc in unicode.org/~scherer/ -- we should copy that into a permanent location, for example reports/tr51/. I suggest deleting 1.2 Goals. It duplicates some of the ToC; it says that the material is subject to change (as usual); and the last sentence "This document does not discuss..." should be merged into the Summary at the top which partially contradicts it. 5 Sorting -- I am personally a bit skeptical about the need for sophisticated sorting *among* symbols, including Emoji. 6 Searching -- this is useful information, but very different from "search" as in UTS #10, for example, and it covers a variety of methods. This makes the heading misleading. Please rename to "Input Methods" or "Selection Methods" or similar. Data charts: It would be useful to repeat the column headings once in a while, at least in long, multi-column tables as in full-emoji-list.
Date/Time: Thu May 15 17:29:05 CDT 2014
Name: Richard Wordingham
Report Type: Error Report
Opt Subject: TUS: Special Cases with Malayalam RA
NOTE: The editorial committee has already looked at this feedback. Some of the items are complete, and the committee is dealing with other issues in the Malayalam block intro.
TUS 6.2/6.3 Section 9.9 ‘Special Cases Involving ra’ has a number of problems and errors. 1) The title should say ‘rra’, not ‘ra’. 2) The following paragraph leaves the impression that <0D31, 0D31> might be treated as a unit in rendering. The paragraph following it needs to dispel that impression. “Repetition of the letter, written either റ്റ or ററ, is also used for the sound /tt/. The sequence of two റ letters fundamentally behaves as a digraph in this instance. The digraph can bear a vowel sign in which case the digraph as a whole acts graphically as an atom: a left vowel part goes to the left of the digraph and a right vowel part goes to the right of the digraph. Historically, the side-by-side form was used until around 1960 when the stacked form began appearing and supplanted the side-by-side form. As a consequence the graphical sequence ററ in text is ambiguous in reading. The reader must generally use the context to understand if this is read /rr/ or /tt/. It is only when a vowel part appears between the two റ that the reading is unambiguously /rr/. Note that similar situations are common in many other orthographies. For example, th in English can be a digraph (cathode) or two separate letters (cathouse); gn in French can be a digraph (oignon) or two separate letters (gnome).” 3) The following paragraph is false. For example, <0D31, 0D31, 0D46> is rendered as ററെ. “The sequence <0D31, 0D31> is rendered as ററ, regardless of the reading of that text. The sequence <0D31, 0D4D, 0D31> is rendered as റ്റ. In both cases, vowels signs can be used as appropriate, as shown in Table 9-31.” To address this and the previous problem, I suggest replacing it by: “The sequence <0D31, 0D31> is rendered as ററ, possibly with the incorporation of vowel signs between, regardless of the reading of that text. A vowel appearing on the left must be encoded after the first occurrence of 0D31, and a vowel appearing on the right must be encoded after the second occurrence of 0D31. 
Two-part vowel characters may not be used with the side-by-side digraph. The sequence <0D31, 0D4D, 0D31> is rendered as റ്റ, and vowel signs are encoded after it. Examples are shown in Table 9-31.”
Date/Time: Sun Jun 29 06:33:12 CDT 2014
Name: Claus Faerber
Report Type: Error Report
Opt Subject: Inconsistency between IdnaMappingTable.txt and IdnaTest.txt
Hi, I'm the author of the perl module Net::IDN::Encode (available on CPAN), which uses automated testing based on the IdnaTest.txt data file provided with Unicode. After updating to Unicode 7.0.0 (module version 2.200), some of the tests fail on a Unicode-enabled perl (v5.21.1). This seems to be caused by inconsistencies in the data files provided with Unicode: For example, consider lines 4827 and 4828 in IdnaTest.txt:
B; 🌱.𐋱₂; [P1 V6]; [P1 V6]
B; 🌱.𐋱2; [P1 V6]; [P1 V6]
These strings contain '🌱' (U+1F331) and '𐋱' (U+102F2). The latter is new in Unicode 7.0.0. The first string also contains '₂' (U+2082), the second '2' (U+0032), both of which output as '2' (U+0032). The tests indicate that processing should throw error P1 or V6, which would indicate that the strings contain invalid characters. However, according to the IdnaMappingTable.txt, all of the characters in these strings are 'valid' (although they would not be valid under IDNA 2008):
2082 ; mapped ; 0032 # 1.1 SUBSCRIPT TWO
102E1..102FB ; valid ; ; NV8 # 7.0 COPTIC EPACT DIGIT ONE..COPTIC EPACT NUMBER NINE HUNDRED
1F330..1F335 ; valid ; ; NV8 # 6.0 CHESTNUT..CACTUS
Only characters new in Unicode 7.0 seem to be affected. If I change the module to treat all characters added in Unicode 7.0 as 'invalid', all tests are successful. I think the error is in IdnaTest.txt but I'm not completely sure.
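The lookup described in this report can be illustrated with a short Python sketch that parses only the three IdnaMappingTable.txt rows quoted above and reports the status of the code points the report names. The helper names parse_rows and status are invented for this sketch, and a real implementation would read the complete data file instead of an embedded excerpt.

```python
# The three rows quoted in the report; not the full data file.
MAPPING_ROWS = """\
2082 ; mapped ; 0032 # 1.1 SUBSCRIPT TWO
102E1..102FB ; valid ; ; NV8 # 7.0 COPTIC EPACT DIGIT ONE..COPTIC EPACT NUMBER NINE HUNDRED
1F330..1F335 ; valid ; ; NV8 # 6.0 CHESTNUT..CACTUS
"""


def parse_rows(text):
    """Return (first, last, status) tuples from semicolon-separated rows."""
    rows = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        first, _, last = fields[0].partition("..")
        rows.append((int(first, 16), int(last or first, 16), fields[1]))
    return rows


def status(code_point, rows):
    """Return the status for code_point, or None if no quoted row covers it."""
    for first, last, row_status in rows:
        if first <= code_point <= last:
            return row_status
    return None


if __name__ == "__main__":
    rows = parse_rows(MAPPING_ROWS)
    for code_point in (0x1F331, 0x102F2, 0x2082):
        print(f"U+{code_point:04X}", status(code_point, rows))
```

None of the three code points comes back as disallowed from these rows, which is the mismatch with the expected [P1 V6] errors that the report describes.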
Date/Time: Fri Jun 20 13:12:37 CDT 2014
Name: Roozbeh Pournader
Report Type: Error Report
Opt Subject: Glyph for U+1F44E THUMBS DOWN SIGN potentially wrong
The glyph for U+1F44E THUMBS DOWN SIGN may better show the back of the hand, as it's actually very hard to make such a gesture as shown. Looking at the source glyphs at L2/09-027R2 (http://www.unicode.org/L2/L2009/09027r2-emoji-backgrnd.pdf), it appears that the SoftBank glyph shows the back of the hand for this character, while KDDI shows the front. (From https://code.google.com/p/android/issues/detail?id=71948)
Date/Time: Tue Jun 24 09:22:05 CDT 2014
Name: Daniel Klein
Report Type: Other Question, Problem, or Feedback
Opt Subject: Normalisation of Indic scripts
Hi! I was normalising some text into Form D with mixed Latin and Sinhala characters and I was surprised that the Sinhala mark for "o" was decomposed into "e" and "aa" (which is how it's typed on a Sinhala typewriter). I realise that the character looks exactly like the other two combined but they don't render the same as two characters (the combining ring is present) and have a very different phonological meaning. e.g. කොළ (ක + ො + ළ) "kola" (green) & කොළ (ක + ෙ + ා + ළ) an impossible spelling (and probably pronunciation) of "keaala" (no such word in Sinhala). I checked on http://www.unicode.org/charts/normalization/chart_Sinhala.html and noticed three other characters, too. It seems to me the same as decomposing "d" into "cl" because if you combine them they look the same. Also, "℅" does not become "c/o" in Form D, only in Form KC, as well as other related symbols. I'm not sure that these Sinhala characters should ever be decomposed, even in Form KD as it changes the spelling, meaning, appearance and pronunciation of the words they appear in. I had a quick look at Tamil and noticed the same thing. I would imagine that this is the case for most Indic scripts in Unicode (almost all write "o" as a combination of a preceding "e" and a following "aa"). Even more problematic is ෝ "oo" as ා + ් never combine except with ෙ. කෝ (ක + ෝ) vs කෝ (ක + ෙ + ා + ්). If, however, you think I am wrong (there must have been a reason for doing it this way) I would love to know the rationale. The only thing I can think of is to maintain compatibility with proprietary encodings that don't have a separate character for "o" but render all characters as they appear visually but this seems like a bad idea to me as the text should be converted to Unicode correctly in the first place. 
Regards, Daniel // Addendum, July 20: Hi Rick, I happened to find the following in NamesList.txt: @ Two-part dependent vowel signs @+ These vowel signs have glyph pieces which stand on both sides of the consonant; they follow the consonant in logical order, and should be handled as a unit for most processing. 0DDC SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA = sinhala vowel sign o : 0DD9 0DCF 0DDD SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA = sinhala vowel sign oo : 0DDC 0DCA 0DDE SINHALA VOWEL SIGN KOMBUVA HAA GAYANUKITTA = sinhala vowel sign au : 0DD9 0DDF The important bit is "should be handled as a unit for most processing". I believe that the current behaviour of normalising these characters into their lookalikes goes against this statement. Cheers, Daniel
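The behaviour described in this report can be reproduced with Python's standard unicodedata module. The sketch below is a minimal check, using the decompositions quoted from NamesList.txt above; the results depend on the Unicode version bundled with the local Python build, and the helper name code_points is invented here.

```python
import unicodedata


def code_points(text):
    """Format a string as a list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


if __name__ == "__main__":
    vowel_o = "\u0DDC"   # SINHALA VOWEL SIGN KOMBUVA HAA AELA-PILLA
    vowel_oo = "\u0DDD"  # SINHALA VOWEL SIGN KOMBUVA HAA DIGA AELA-PILLA
    care_of = "\u2105"   # CARE OF

    for name, text in (("o", vowel_o), ("oo", vowel_oo), ("care of", care_of)):
        nfd = unicodedata.normalize("NFD", text)
        print(name, code_points(text), "-> NFD", code_points(nfd))

    # CARE OF only changes under the compatibility forms.
    print("care of NFKC:", unicodedata.normalize("NFKC", care_of))

    # The canonical decomposition is reversible: NFC recomposes the sequence.
    recomposed = unicodedata.normalize("NFC", "\u0DD9\u0DCF")
    print("NFC of U+0DD9 U+0DCF:", code_points(recomposed))
```

The Sinhala vowel signs carry canonical decompositions, so they split under NFD and recompose under NFC, whereas CARE OF carries a compatibility decomposition and is untouched by NFD.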
(Note: This came through the Unicode mail list:)
From: Benjamin Riefenstahl
Subject: Problem with Mandaic shaping, IT and IN switched
Date: Mon, 30 Jun 2014 22:47:39 +0200
Hi everybody, I am currently in the process of designing a simple OpenType font for Mandaic. As some of you are probably aware, shaping in OpenType as it is recommended by the OpenType standard requires that the application (i.e. the text rendering engine) knows the joining behaviour of the characters. It seems that there is an error in the joining data for Mandaic as defined by the Unicode standard (table 14-5 and 14-6, chapter 14.12 in version 6.3) and by the file ArabicShaping.txt at http://www.unicode.org/Public/UNIDATA/ArabicShaping.txt. The tables list the character IT as dual-joining and the character IN as right-joining. These two seem to be switched. In the table columns with the actual characters (columns Xn, Xr, Xm, Xl) the correct characters are given (compare the code chart at http://www.unicode.org/charts/PDF/U0840.pdf), but the names (and the relative positions in the tables) are wrong and that error is then taken over into the file ArabicShaping.txt:
0847; MANDAIC IT; D; No_Joining_Group
[...]
084F; MANDAIC IN; R; No_Joining_Group
The correct characters in the table should be (in this order)
* Dual-Joining: ATT, AK, AL, AM, AS, IN, AP, ASZ, AQ, AR, AT
* Right-Joining: HALQA, AZ, IT, AKSA, ASH
And the correct data in ArabicShaping.txt:
0847; MANDAIC IT; R; No_Joining_Group
084F; MANDAIC IN; D; No_Joining_Group
Please advise what I can do to help correct this in some future version of the Unicode standard. Regards, Benjamin Riefenstahl
--------------
Curiously I am having a hard time finding clear references. There are some Mandaic texts online where we can find examples, but I cannot find a reliable theoretical discussion of the script at the level of detail that I would wish for.
There is the "Mandäische Grammatik" by Theodor Nöldeke, from 1875 (see https://archive.org/details/mandischegramma01nlgoog), which has a note to his table of the characters, but that note seems incomplete, it reads: <zain>, <het>, <yod>, <shin> werden nicht nach links verbunden. The note quotes the characters in Hebrew letters. It leaves out the aleph (halqa) which also belongs into this group. Regards, Benjamin Riefenstahl
Some info from Rick McGowan:
The tables in the latest Core spec draft show: Table 9-19 contains "IN" as a dual joining letter. Table 9-20 contains "IN" as a right joining letter. So, the English gloss "IN" appears in two different type tables. That's one problem. To help unravel, see Roozbeh's doc for Mandaic here: http://www.unicode.org/L2/L2010/10413-mandaic-joining-type.pdf and the original proposal here: http://www.unicode.org/L2/L2008/08270r-n3485r-mandaic.pdf The row of table 9-19 which is *labelled* "IN" should actually be "IT" -- at least according to the proposal. The shape looks like it, to me.
Date/Time: Thu Jul 10 12:09:22 CDT 2014
Name: Christian Lerch
Report Type: Error Report
Opt Subject: Coding error for age property in UCD
At least in versions 6.3.0 and 7.0.0 (haven't checked others) there is an inconsistent coding of the age property value of "Unassigned" in either the UCD file PropertyValueAliases.txt or in the ucdxml XML files. In the former the abbreviated name (2nd field) for value "Unassigned" is given as "NA". In the latter, however, instead of having age="NA" you find age="unassigned", which has no entry in PropertyValueAliases.txt.
Date/Time: Tue Jul 22 10:18:15 CDT 2014
Name: Andrew West
Report Type: Error Report
Opt Subject: U+2220 ANGLE and U+299F ACUTE ANGLE
Note: This has already been done by the editorial committee.
Suggest adding a cross-reference between the following pair of characters with similar meanings and very similar glyphs: U+2220 ANGLE U+299F ACUTE ANGLE Also may be a good idea to add to confusables.txt.
Date/Time: Mon Jul 28 08:53:40 CDT 2014
Name: William Overington
Report Type: Other Question, Problem, or Feedback
Opt Subject: Regarding the working draft version of Unicode Technical Report #51 dated 2014-07-24, Section 6.
Regarding the working draft version of Unicode Technical Report #51 dated 2014-07-24, Section 6. I suggest that the following text be substituted by the text that follows it. quote There is one further kind of annotation, called a TTS name, for text-to-speech processing. For accessibility when reading text, it is useful to have a short, descriptive name for an emoji character. A Unicode character name can often serve as a basis for this, but its requirements for name uniqueness often ends up with names that are overly long, such as black right-pointing double triangle with vertical bar for ⏯. TTS names are also outside the current scope of this document. end quote The following is the text that I suggest be substituted in place of the above text, based upon the text from document L2/14-093 and from the draft dated 2014-07-24, though also including some of my own thoughts. new text starts There is one further kind of label, called a Localization Label. A Localization Label could be used for producing a text-to-speech facility or for expressing the meaning of a symbol in natural language, which could be helpful for an abstract symbol such as "Do not tumble dry". For accessibility when reading text, it is useful to have a short, descriptive name for an emoji character. A Unicode character name can often serve as a basis for this, but its requirements for name uniqueness often ends up with names that are overly long, such as black right-pointing double triangle with vertical bar for ⏯. Please note that Localization Labels need to be in each user’s language to be useful. They cannot simply be a translation of an English label, since different words, or even different categorizations, may be what is expected in different languages. The terms given in the data files here have been collected from different sources. They are only initial suggestions, not expected to be complete, and only in English. 
Apart from mentioning the concept here, Localization Labels are outside of the scope of this document. new text ends It may be that you will choose to refine that text further: I feel that it is important that reference to localization is conserved. Unicode can be used to typeset many languages and so reference to localization seems very relevant. I declare an interest in that I have been for some years researching communication through the language barrier using encoded localizable sentences and as part of my research I have, experimentally, designed symbols for various sentences. The symbols are mostly abstract rather than pictographic, though there are a few pictographic elements within some of the symbols, such as, for example, a stylized snowflake in some of the sentences that are about the weather. So Localization Labels becoming part of Unicode would help my research. Certainly, Localization Labels would help my research, however there are also many abstract symbols, such as "Do not tumble dry" and "Do not dry clean" where the facility of a Localization Label could be of advantage to a person who has not met the symbol previously. Perhaps I should mention that in England, where the weather is very changeable, often from day to day, talking about the weather is part of the culture: it is topical, sociable and not controversial. Here are just a few sentences for which I have produced symbols. Yes. No. Good day. The following question has been asked. My answer is as follows. I need more information in order to be able to answer. It is snowing. It is summer. Where is a pharmacy please? Where can I buy a vegan meal with no gluten-containing ingredients in it please? Information Desk Sculpture Gallery Is there any information about the following person please? The enquirer is the brother of the first person that was named. The person is safe. 
The last three sentences in the above list are from a collection of sentences designed to help find information about a relative or friend after a disaster. At present my research is by using a markup sequence to encode each sentence, thereby increasing interoperability by avoiding using a Private Use Area encoding: a symbol can be displayed using a special OpenType font. Certainly I would like, indeed prefer, to be able to decode automatically directly from markup to natural language, yet that will require new software to be written. Decoding to a symbol using an OpenType font is something that I can do now as I am able to make an OpenType font for the purpose using an existing fontmaking program and then use the font in an existing desktop publishing package. Yet as emoji are developed, maybe sentences will become encoded in emoji sets, as abstract symbols, whereupon having the feature of Localization Labels already established in Unicode would be of advantage for interoperability. William Overington 28 July 2014
Date/Time: Sun Aug 3 20:03:37 CDT 2014
Name: John Cowan
Report Type: Public Review Issue
Opt Subject:
This is a comment on L2/14-187, "Cherokee casing decision may break identifier syntax" I think we have to take into account that Cherokee may not be the last script that becomes problematic. History shows that when unicameral scripts become bicameral, the older forms tend to become the upper case. This is true of Latin, Greek, and Cyrillic at least, even if many modern Cyrillic lowercase forms tend to resemble their uppercase prototypes. My personal view is that using casing distinctions in this way is a Bad Thing, because unicameral scripts cannot be accommodated. But it's already used in Haskell and Go, and maybe elsewhere.