L2/11-406

Re: Script Extensions as a Unicode Property
To: UTC
From: Mark Davis
Date: 2011-11-25

Script extensions currently exist only as a data file, not as a formal property. That makes them clumsier to cite and use; see, for example, the feedback from Karl Williamson on UTS #18, April 30.

We already have many multivalued Unicode character properties in Unihan, so there is no formal difficulty in adding Script_Extensions as a provisional property. Here is a proposed description.

The Script_Extensions (scx) property takes as its value a set of one or more Script property values. The Script_Extensions value for a given character C is defined from the UCD data files as follows:

  1. If C is in field 0 of an entry in ScriptExtensions.txt, then the scx value is the set of script codes in field 1 of that entry.
     For example:
       1. The scx value for U+064B is {Arab, Syrc} because of the line "064B..0655 ; Arab Syrc".
  2. Otherwise, the scx value for C is a set consisting of a single element, the Script property value of C.
     For example:
       1. The scx value for U+0600 is {Arab}, and
       2. The scx value for U+0710 is {Syrc}.
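The two rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference implementation: the function names `parse_script_extensions` and `scx` are hypothetical, and the data is a toy excerpt built from the examples in this document rather than the full UCD files.

```python
# Toy data taken from the examples in this document. A real implementation
# would read ScriptExtensions.txt and Scripts.txt from the UCD instead.
SCRIPT_EXTENSIONS_LINES = ["064B..0655 ; Arab Syrc"]
SCRIPT = {0x0600: "Arab", 0x0710: "Syrc"}  # hypothetical excerpt of the Script property


def parse_script_extensions(lines):
    """Parse 'field 0 ; field 1' entries into (start, end, script codes) tuples."""
    entries = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        field0, field1 = (f.strip() for f in line.split(";", 1))
        first, _, last = field0.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point or range
        entries.append((start, end, frozenset(field1.split())))
    return entries


def scx(cp, entries, script):
    """Return the scx value (a set of script codes) for code point cp."""
    # Rule 1: cp is covered by field 0 of an entry, so use that entry's field 1.
    for start, end, codes in entries:
        if start <= cp <= end:
            return codes
    # Rule 2: otherwise the value is the singleton set of cp's Script value.
    return frozenset({script[cp]})


ENTRIES = parse_script_extensions(SCRIPT_EXTENSIONS_LINES)
for cp in (0x064B, 0x0600, 0x0710):
    print(f"U+{cp:04X}", sorted(scx(cp, ENTRIES, SCRIPT)))
```

With this toy data the sketch reproduces the three example values given above. The linear scan over entries is chosen for brevity; a production implementation would more likely use a sorted range table or a trie.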

When used in an expression to denote a set of characters, such as in the regular expression \p{scx=Arab}, the value of that expression is the set of all code points whose Script_Extensions value contains the given script. Thus:

  1. \p{scx=Arab} includes both U+064B and U+0600, but not U+0710.
  2. \p{scx=Syrc} includes both U+064B and U+0710, but not U+0600.
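The containment semantics can be illustrated with a small Python sketch. The table `SCX_VALUES` and the function `chars_with_scx` are hypothetical names, and the table holds only the three code points discussed in this document, with their scx values as stated in the examples.

```python
# scx values for the three code points used in this document's examples.
# This is illustrative data only, not the full property.
SCX_VALUES = {
    0x064B: frozenset({"Arab", "Syrc"}),
    0x0600: frozenset({"Arab"}),
    0x0710: frozenset({"Syrc"}),
}


def chars_with_scx(script_code, scx_values):
    """Model \\p{scx=<script_code>}: code points whose scx value contains the script."""
    # The test is set membership, not equality: a code point matches if the
    # requested script is one element of its (possibly larger) scx value.
    return {cp for cp, value in scx_values.items() if script_code in value}


for code in ("Arab", "Syrc"):
    members = sorted(chars_with_scx(code, SCX_VALUES))
    print(code, [f"U+{cp:04X}" for cp in members])
```

The point of the sketch is that U+064B appears in both result sets, because matching tests whether the script is a member of the value rather than whether the value equals {script}.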

The PropertyAliases.txt line would be:

scx ; Script_Extensions

We would also add the following line to ScriptExtensions.txt:

# @missing: 0000..10FFFF; Script_Extensions; <script>