L2/13-039
To: UTC
From: Mark Davis
Re: Script and Script Extension property principles
This is in response to action 133-A044 “Provide a proposal for the January 2013 UTC meeting for a principle for how to allow characters to have explicit script values and multiple script extension values; and suggested changes to the text and properties to accord with that.” (referenced doc)
At the bottom of this document is a comparison of the current (U6.2) Script property and Script Extensions property values, where they are not identical, followed by a list of the affected characters.
Here are suggested changes to the text, following Ken’s suggestion of making this an exception rather than a “principle”.
...
OLD | NEW |
If a character is only regularly used with a single script, then it is given that specific Script property value (as opposed to Common or Inherited). This facilitates the use of the script property for common tasks such as regular expressions, but it also means that some characters that are definite members of a given script, based on their forms and history, nevertheless are assigned one of the generic values. As more data on the usage of individual characters is collected, the Script property value assigned to a character may change. Rarely would a character change from one specific script to another. However, if it becomes established that a character is regularly used with more than one script, it will be assigned the Common or Inherited Script property value. Similarly, if it becomes established that a character is regularly used with only a single, specific script, it will be assigned a specific Script property value. The occasional use of character from one script in the context of another script, as for instance the citation of a Greek letter used as a mathematical constant in the midst of Latin text, or the use of a Latin letter in the midst of Han text, is not considered sufficient evidence of "regular use" requiring a designation of Common Script property value. It is also possible for a character, once given a Common or Inherited Script property value, upon further research, to be changed to a specific script, instead. | If a character is only regularly used with a single script, then it is given that specific Script property value (as opposed to Common or Inherited). In a few instances, characters known to be used with more than one script, but which are overwhelmingly associated with and used with a single script, also take the Script property value of that script. The assignment of a single script facilitates the use of the script property for common tasks such as regular expressions, but it also means that some characters that are definite members of a given script, based on their forms and history, nevertheless are assigned one of the generic values. As more data on the usage of individual characters is collected, the Script property value assigned to a character may change. Rarely would a character change from one specific script to another. However, if it becomes established that a character is regularly used with more than one script, it may be assigned the Common or Inherited Script property value. Similarly, if it becomes established that a character is regularly used with only a single, specific script, it may be assigned a specific Script property value. The occasional use of character from one script in the context of another script, as for instance the citation of a Greek letter used as a mathematical constant in the midst of Latin text, or the use of a Latin letter in the midst of Han text, is not considered sufficient evidence of "regular use" requiring a designation of Common Script property value. It is also possible for a character, once given a Common or Inherited Script property value, upon further research, to be changed to a specific script, instead. |
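To make the regular-expression point concrete, here is a minimal Python sketch of how a match on Script (sc) differs from a match on Script_Extensions (scx). The three entries are copied by hand from the U6.2 comparison at the bottom of this document. The names ScriptInfo, SAMPLE, matches_sc, and matches_scx are hypothetical and belong to no library.

```python
from typing import Dict, FrozenSet, NamedTuple


class ScriptInfo(NamedTuple):
    sc: str                # Script property value
    scx: FrozenSet[str]    # Script_Extensions property values


# Hand-copied sample of U6.2 values (assumption: transcribed correctly
# from the comparison table in this document; not a complete data set).
SAMPLE: Dict[int, ScriptInfo] = {
    # U+0640 ARABIC TATWEEL
    0x0640: ScriptInfo("Common", frozenset({"Arabic", "Mandaic", "Syriac"})),
    # U+060C ARABIC COMMA
    0x060C: ScriptInfo("Common", frozenset({"Arabic", "Syriac", "Thaana"})),
    # U+FDF2 ARABIC LIGATURE ALLAH ISOLATED FORM
    0xFDF2: ScriptInfo("Arabic", frozenset({"Arabic", "Thaana"})),
}


def matches_sc(cp: int, script: str) -> bool:
    """True if the character's single Script value equals `script`."""
    return SAMPLE[cp].sc == script


def matches_scx(cp: int, script: str) -> bool:
    """True if `script` is among the character's Script_Extensions values."""
    return script in SAMPLE[cp].scx


if __name__ == "__main__":
    print(matches_sc(0x0640, "Arabic"))   # False: sc is Common
    print(matches_scx(0x0640, "Arabic"))  # True: Arabic is in scx
    print(matches_sc(0xFDF2, "Arabic"))   # True: explicit sc value
```

A character with a generic Script value fails a test on sc even when it is used almost entirely with one script, which is the case the added sentence in the NEW text is meant to cover.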
(add just before “The Script_Extensions property values are given in the file ScriptExtensions.txt in the Unicode Character Database [UCD].”)
However, there are some invariants that can be depended on:
A character could have any of the following combinations of properties:
While reviewing the text, I also noticed an inconsistency. We define “explicit” in the sentence “All other Script property values are referred to as explicit script values, because they each refer to one specific script.”, but we do not use the term consistently. We should search for “specific” and change it to “explicit” where necessary.
In accordance with this change, we’d make the following property changes:
SC SCX Chars
Arabic Arabic Thaana [ﷲ]
Common Arabic Mandaic Syriac [ـ]
Common Arabic Syriac Thaana [،؛؟]
Common Arabic Thaana [٠-٩﷽]
Inherited Arabic Syriac [ً-ٰٕ]
=> Script=Arabic
Script=Inherited, SCX=Inherited
U+0363 ( ͣ ) COMBINING LATIN SMALL LETTER A
…
U+036F ( ͯ ) COMBINING LATIN SMALL LETTER X
U+1DD4 ( ᷔ ) COMBINING LATIN SMALL LETTER AE
...
U+1DE6 ( ᷦ ) COMBINING LATIN SMALL LETTER Z
=> Script=Latn
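The property changes listed above can be summarized as a small lookup, sketched here in Python. This is only an illustration under stated assumptions. It reads each "=>" line as applying to every row listed before it. The code points for the Arabic rows are transcribed from the bracketed sets and should be checked against ScriptExtensions.txt. The second Latin range is assumed to be contiguous between the two endpoints shown. The names PROPOSED and proposed_script are hypothetical, and the long value name Latin is used for Latn.

```python
from typing import List, Optional, Tuple

# (first, last, proposed Script value); ranges are inclusive.
# Assumption: transcribed by hand from the rows listed in this document.
PROPOSED: List[Tuple[int, int, str]] = [
    (0xFDF2, 0xFDF2, "Arabic"),  # already sc=Arabic in U6.2
    (0x0640, 0x0640, "Arabic"),  # was Common; scx Arabic Mandaic Syriac
    (0x060C, 0x060C, "Arabic"),  # was Common; scx Arabic Syriac Thaana
    (0x061B, 0x061B, "Arabic"),
    (0x061F, 0x061F, "Arabic"),
    (0x0660, 0x0669, "Arabic"),  # was Common; scx Arabic Thaana
    (0xFDFD, 0xFDFD, "Arabic"),
    (0x064B, 0x0655, "Arabic"),  # was Inherited; scx Arabic Syriac
    (0x0670, 0x0670, "Arabic"),
    (0x0363, 0x036F, "Latin"),   # COMBINING LATIN SMALL LETTER A .. X
    (0x1DD4, 0x1DE6, "Latin"),   # COMBINING LATIN SMALL LETTER AE .. Z
]


def proposed_script(cp: int) -> Optional[str]:
    """Return the proposed Script value for cp, or None if cp is not listed."""
    for first, last, script in PROPOSED:
        if first <= cp <= last:
            return script
    return None
```

For example, proposed_script(0x0363) yields "Latin", while a code point outside the listed rows, such as 0x0041, yields None and keeps its current value.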
SC SCX Chars
Arabic Arabic Thaana [ﷲ]
Common Arabic Mandaic Syriac [ـ]
Common Arabic Syriac Thaana [،؛؟]
Common Arabic Thaana [٠-٩﷽]
Inherited Arabic Syriac [ً-ٰٕ]
Common Armenian Georgian [։]
Inherited Greek [͂᷀᷁ͅ]
Inherited Latin [ͣ-ͯ]
Inherited Cyrillic Latin [҅҆]
Inherited Devanagari Latin [॒॑]
Common Devanagari [᳡ᳲᳳ]
Inherited Devanagari [᳐-᳔᳒-᳢᳠-᳨᳭᳴]
Common Bengali Devanagari Gurmukhi Oriya Takri [।॥]
Common Devanagari Gujarati Gurmukhi Kaithi Takri [꠰-꠹]
Inherited Bopomofo Han [〪-〭]
Common Bopomofo Hangul Han Hiragana Katakana [〃〓〜-〟〰〷〾〿㇀-㇣㈠-㉃㊀-㊰㋀-㋋㍘-㍰㍻-㍿㏠-㏾﹅﹆]
Common Bopomofo Hangul Han Hiragana Katakana Yi [、。〈-】〔-〛・。-・]
Common Han Hiragana Katakana [〆〼〽㆐-㆟]
Common Hiragana Katakana [〱-〵゛゜゠ーー゙゚]
Inherited Hiragana Katakana [゙゚]
Common Mongolian Phags_Pa [᠂᠃᠅]
Common Cypriot Linear_B [𐄀-𐄂𐄇-𐄳𐄷-𐄿]
Common Buhid Hanunoo Tagbanwa Tagalog [᜵᜶]
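For anyone who wants to process the comparison above mechanically, the following Python sketch parses one row into its three columns. It assumes each row sits on a single line, that the first token is the SC value, that the final bracketed token is a simple set made only of literal characters and a-b ranges, and that everything in between is the SCX list. The names ROW, expand_set, and parse_row are hypothetical. The sketch does not implement full UnicodeSet syntax.

```python
import re
from typing import List, Set, Tuple

# SC value, then one or more SCX values, then a bracketed character set.
ROW = re.compile(r"^(\S+)\s+(.+?)\s+(\[.*\])$")


def expand_set(pattern: str) -> Set[int]:
    """Expand a simple set such as '[a-cx]' into a set of code points."""
    body = pattern[1:-1]  # drop the surrounding brackets
    cps: Set[int] = set()
    i = 0
    while i < len(body):
        first = ord(body[i])
        if i + 2 < len(body) and body[i + 1] == "-":
            # 'a-b' denotes an inclusive range of code points
            last = ord(body[i + 2])
            cps.update(range(first, last + 1))
            i += 3
        else:
            cps.add(first)
            i += 1
    return cps


def parse_row(line: str) -> Tuple[str, List[str], Set[int]]:
    """Split a row into (SC, [SCX values], set of code points)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError("unrecognized row: %r" % line)
    sc, scx, chars = m.groups()
    return sc, scx.split(), expand_set(chars)


if __name__ == "__main__":
    # The "Common Arabic Thaana" row, written with escapes for clarity.
    sc, scx, cps = parse_row("Common Arabic Thaana [\u0660-\u0669\uFDFD]")
    print(sc, scx, len(cps))
```

Rows that contain combining marks may have been reordered by normalization in transit, so range endpoints parsed from such rows should be verified against ScriptExtensions.txt before being relied upon.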