Technical Reports |
Version | 4.0.0 |
Authors | Mark Davis (mark.davis@us.ibm.com) |
Date | 2003-04-17 |
This Version | http://www.unicode.org/reports/tr24/tr24-5.html |
Previous Version | http://www.unicode.org/reports/tr24/tr24-4.html |
Latest Version | http://www.unicode.org/reports/tr24/tr24 |
Tracking Number |
This document provides an assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.
This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Standard Annex. This is a stable document and may be used as reference material or cited as a normative reference by other specifications.
A Unicode Standard Annex (UAX) forms an integral part of the Unicode Standard, but is published as a separate document. The Unicode Standard may require conformance to normative content in a Unicode Standard Annex, if so specified in the Conformance chapter of that version of the Unicode Standard. The version number of a UAX document corresponds to the version number of the Unicode Standard at the last point that the UAX document was updated.
Please submit corrigenda and other comments with the online reporting form [Feedback]. Related information that is useful in understanding this document is found in References. For the latest version of the Unicode Standard see [Unicode]. For a list of current Unicode Technical Reports see [Reports]. For more information about versions of the Unicode Standard, see [Versions].
The Unicode Character Database (UCD) provides data for a mapping from Unicode characters to script names. This information is useful for mechanisms such as regular expressions, where it produces much better results than simple matches on block names. There are quite a number of problems with using block names to distinguish characters:
For more information, see Character Blocks in UTR #18: Unicode Regular Expression Guidelines [UTR18].
Although script names are generally much more useful than simple block names, they cannot be applied blindly. The script assignment is particularly oriented towards mechanisms such as regular expressions, and is not intended for unrelated purposes such as graphology or history. The definition of script names in the data file does not preclude the assignment of scripts in appropriate ways for these other purposes.
The script name data provides a mapping from each Unicode code point to either a specific script such as Cyrillic, or to one of two special values, COMMON or INHERITED.
Note: A non-spacing mark (a character with general category value of Mn or Me) normally inherits its properties from its base character, so that it is not separated from it in normal processing. The non-spacing marks generally have the property INHERITED to reflect this. However, in cases where the best interpretation of a non-spacing mark in isolation would be a specific script, then its script property value may be different from INHERITED.
Even so, implementations that determine the boundaries between characters of given scripts should never break between a non-spacing mark and its base character. Thus for boundary determinations and similar sorts of processing, a non-spacing mark — whatever its script value — should inherit the script value of its base character.
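The rule above can be sketched as follows. This is a minimal illustration, not a reference implementation: `TOY_SCRIPTS` is a hand-written, hypothetical subset of script assignments (it is not the real Scripts.txt data), and the function names are assumptions made for this example. Only the general category lookup comes from Python's standard `unicodedata` module.

```python
import unicodedata

# Hypothetical toy table of (first, last, script) ranges.
# An illustrative subset only, NOT the real Scripts.txt data.
TOY_SCRIPTS = [
    (0x0041, 0x005A, "LATIN"),
    (0x0061, 0x007A, "LATIN"),
    (0x0300, 0x0341, "INHERITED"),
    (0x03B1, 0x03C9, "GREEK"),
]

def toy_script(ch):
    """Look up the toy script value; unlisted code points are COMMON."""
    cp = ord(ch)
    for first, last, script in TOY_SCRIPTS:
        if first <= cp <= last:
            return script
    return "COMMON"

def scripts_for_boundaries(text):
    """Return one script value per character for boundary determination.

    A non-spacing mark (general category Mn or Me) takes the value already
    resolved for the character before it, whatever its own script value is.
    A mark with nothing before it keeps its own value.
    """
    resolved = []
    for ch in text:
        script = toy_script(ch)
        if unicodedata.category(ch) in ("Mn", "Me") and resolved:
            script = resolved[-1]
        resolved.append(script)
    return resolved
```

With this sketch, a Latin letter followed by U+0301 COMBINING ACUTE ACCENT yields two LATIN values, so no script boundary falls between the base character and its mark.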
The script names form a full partition of the code space: every code point is assigned a single script name. As new scripts are added to the standard, additional script names will be added.
In many cases, programs will override the script name based upon the context of the surrounding characters, especially for the case of Common. A simple heuristic is to use the script of the preceding character, which works well in many cases. However, this may not always produce optimal results: for example, in the text "... gamma (γ) is ..." this heuristic would cause matching parentheses to be in different scripts. Thus more sophisticated programs may use more complex heuristics.
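The preceding-character heuristic, and the parenthesis problem it causes, can be seen in a short sketch. The table `TOY_SCRIPTS` is a hypothetical, hand-written subset of script assignments (not the real data file), and `resolve_by_preceding` is a name invented for this example. Only Common is overridden here; other policies are possible.

```python
# Hypothetical toy table of (first, last, script) ranges.
# An illustrative subset only, NOT the real Scripts.txt data.
TOY_SCRIPTS = [
    (0x0041, 0x005A, "LATIN"),
    (0x0061, 0x007A, "LATIN"),
    (0x03B1, 0x03C9, "GREEK"),
]

def toy_script(ch):
    """Look up the toy script value; unlisted code points are COMMON."""
    cp = ord(ch)
    for first, last, script in TOY_SCRIPTS:
        if first <= cp <= last:
            return script
    return "COMMON"

def resolve_by_preceding(text):
    """Replace each COMMON value with the resolved value of the preceding character.

    Leading COMMON characters have nothing to copy from and stay COMMON.
    """
    resolved = []
    previous = "COMMON"
    for ch in text:
        script = toy_script(ch)
        if script == "COMMON":
            script = previous
        resolved.append(script)
        previous = script
    return resolved

text = "gamma (\u03b3) is"          # "gamma (γ) is"
resolved = resolve_by_preceding(text)
opening = resolved[text.index("(")]  # follows Latin text
closing = resolved[text.index(")")]  # follows the Greek letter
```

In this run the opening parenthesis resolves to LATIN while the closing one resolves to GREEK, which is the mismatch described above.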
In general, programs should only use specific script values in conjunction with both Common and Inherited. That is, to distinguish a sequence of characters appropriate for Greek, one would use:
((Greek | Common) (Inherited | Me | Mn)?)*
That is, characters that are either in Greek or in Common, optionally followed by those in Inherited. Specific languages may commonly use multiple scripts, so for Japanese one might use:
((Hiragana | Katakana | Han | Latin | Common) (Inherited | Me | Mn)?)*
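One way to try out patterns of this shape without a regular expression engine that supports script properties is to classify each character first and then match the resulting token string. The sketch below does that with Python's standard `re` module. It rests on assumptions: `TOY_SCRIPTS` is a hypothetical, hand-written subset of script assignments (not the real data file), and `is_script_run` is a name invented for this example.

```python
import re
import unicodedata

# Hypothetical toy table of (first, last, script) ranges.
# An illustrative subset only, NOT the real Scripts.txt data.
TOY_SCRIPTS = [
    (0x0041, 0x005A, "LATIN"),
    (0x0061, 0x007A, "LATIN"),
    (0x0300, 0x0341, "INHERITED"),
    (0x03B1, 0x03C9, "GREEK"),
]

def toy_script(ch):
    """Look up the toy script value; unlisted code points are COMMON."""
    cp = ord(ch)
    for first, last, script in TOY_SCRIPTS:
        if first <= cp <= last:
            return script
    return "COMMON"

def is_script_run(text, scripts):
    """Test text against ((S1 | S2 | ... | Common) (Inherited | Me | Mn)?)*.

    Each character becomes one token: "B" if its script is one of the
    requested scripts or COMMON, "M" if it is INHERITED or has general
    category Me or Mn, and "x" otherwise. The token string is then matched.
    """
    allowed = {name.upper() for name in scripts} | {"COMMON"}
    tokens = []
    for ch in text:
        script = toy_script(ch)
        if script in allowed:
            tokens.append("B")
        elif script == "INHERITED" or unicodedata.category(ch) in ("Mn", "Me"):
            tokens.append("M")
        else:
            tokens.append("x")
    return re.fullmatch(r"(?:BM?)*", "".join(tokens)) is not None
```

For example, `is_script_run("\u03b1\u0301 (\u03b2)", ["Greek"])` accepts Greek letters, a combining accent, a space, and parentheses, while `is_script_run("\u03b1a", ["Greek"])` rejects the Latin letter. Script names are compared without regard to case.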
Given this usage model, the current data is weighted on inclusiveness: a character is in a specific script (rather than Common or Inherited) only when it is clearly not used within other scripts. As more data on individual characters is collected, characters may move from the Common group to a more specific script (including Inherited).
The script property is useful in regular expression syntax for easy specification of spans of text which consist of a single script (or mixture of scripts). However, users should be very careful to not misapply it. The script values form a full partition of the Unicode code space, but that partition does not exhaust the possibilities for useful and relevant script-like subsets of Unicode characters.
For example, a user might wish to define a regular expression to span typical mathematical expressions, but the subset of Unicode characters used in mathematics does not correspond to any particular script. Instead, it requires use of the Math property, other character properties, and particular subsets of Latin, Greek, and Cyrillic letters. For information on other character properties, see the UCD.
The script property values may also be useful in providing user feedback to help signal possible spoofing, where visually-similar characters (confusable characters) are substituted in an attempt to mislead a user. For example, a domain name such as macchiato.com could be spoofed with macchiatο.com (with some Greek characters) or maссhiato.com (with some Cyrillic characters). The user can be alerted to odd cases by displaying mixed scripts with different color, highlighting, or boundary marks, such as macchiatο.com or maссhiato.com.
Possible spoofing is not limited to mixtures of scripts. Even in ASCII, there are confusable characters such as 0 and O, or 1 and l. Thus the use of script values would need to be augmented with other information such as general category values, plus exception lists of individual characters that are not distinguished by other Unicode properties.
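A mixed-script check of the kind described above might look like the following sketch. It is only an illustration under stated assumptions: `TOY_SCRIPTS` is a hypothetical, hand-written subset of script assignments (not the real data file), and `specific_scripts` and `looks_mixed` are names invented for this example.

```python
# Hypothetical toy table of (first, last, script) ranges.
# An illustrative subset only, NOT the real Scripts.txt data.
TOY_SCRIPTS = [
    (0x0041, 0x005A, "LATIN"),
    (0x0061, 0x007A, "LATIN"),
    (0x03B1, 0x03C9, "GREEK"),
    (0x0430, 0x044F, "CYRILLIC"),
]

def toy_script(ch):
    """Look up the toy script value; unlisted code points are COMMON."""
    cp = ord(ch)
    for first, last, script in TOY_SCRIPTS:
        if first <= cp <= last:
            return script
    return "COMMON"

def specific_scripts(text):
    """Return the set of specific scripts in text, ignoring COMMON and INHERITED."""
    return {toy_script(ch) for ch in text} - {"COMMON", "INHERITED"}

def looks_mixed(text):
    """Flag text that draws on more than one specific script."""
    return len(specific_scripts(text)) > 1

plain = "macchiato.com"
greek_spoof = "macchiat\u03bf.com"        # GREEK SMALL LETTER OMICRON for "o"
cyrillic_spoof = "ma\u0441\u0441hiato.com"  # CYRILLIC SMALL LETTER ES for "c"
```

Here `looks_mixed(plain)` is false and both spoofed names are flagged. A name such as "paypa1.com" is not flagged, since the digit and the letters it imitates do not differ in script; that case needs the additional information mentioned above.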
For illustration, the following table lists some of the Script Name values used in the data file. For a complete list of values, see [Scripts]. The names are not case-sensitive, and the order in which the scripts are listed here or in the data file is not significant.
Although Braille is not a script in the same sense that Latin or Greek is, it is given a script name in [Scripts]. This is useful because of the nature of the application of these script names, as in matching spans of similar characters in regular expressions.
In the Property Value Aliases file [PropValue], corresponding codes from ISO 15924: Code for the Representation of Names of Scripts [ISO15924] are provided as short names for the scripts.
Script Name | ISO 15924 |
---|---|
COMMON | Zyyy |
INHERITED | Qaai |
LATIN | Latn (Latf, Latg) |
CYRILLIC | Cyrl (Cyrs) |
ARMENIAN | Armn |
HEBREW | Hebr |
ARABIC | Arab |
SYRIAC | Syrc (Syrj, Syrn, Syre) |
GEORGIAN | Geor (Geon, Geoa) |
... | ... |
Note: ISO 15924 provides an enumeration of four-letter script codes. In some cases the match between these script names and the ISO 15924 codes is not precise, since the goals are somewhat different. ISO 15924 is aimed primarily at the bibliographic identification of scripts; because of that it occasionally identifies varieties of scripts that may be useful for book cataloging, but which are not considered distinct as scripts in the Unicode Standard. For example, ISO 15924 has separate script codes for the Fraktur and Gaelic varieties of the Latin script.
Where there are no corresponding ISO 15924 codes, the "private use" ones starting with Q are used. Such values are likely to change in the future. In such a case, the Q-names will be retained as aliases in the UCD for backwards compatibility.
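As an illustrative sketch, the long names and the primary ISO 15924 short names from the table above can be held in a small lookup that accepts either form without regard to case. The function name `canonical_script_name` is hypothetical, and only the rows shown in the table are included; a real implementation would read the Property Value Aliases file instead.

```python
# Long script names and their primary ISO 15924 short names, copied from
# the illustrative table in this document (an excerpt, not the full list).
SHORT_NAMES = {
    "COMMON": "Zyyy",
    "INHERITED": "Qaai",
    "LATIN": "Latn",
    "CYRILLIC": "Cyrl",
    "ARMENIAN": "Armn",
    "HEBREW": "Hebr",
    "ARABIC": "Arab",
    "SYRIAC": "Syrc",
    "GEORGIAN": "Geor",
}

# Reverse index from upper-cased short name to long name.
_BY_SHORT = {short.upper(): long_name for long_name, short in SHORT_NAMES.items()}

def canonical_script_name(name):
    """Map a long or short script name, in any letter case, to the long name.

    Raises KeyError for names outside this excerpt.
    """
    key = name.strip().upper()
    if key in SHORT_NAMES:
        return key
    if key in _BY_SHORT:
        return _BY_SHORT[key]
    raise KeyError(name)
```

For example, `canonical_script_name("latin")` and `canonical_script_name("Latn")` both return "LATIN".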
The Scripts.txt data file is available at [Scripts]. The format of the file is similar to that of Blocks.txt [Blocks]. The fields are separated by semicolons. The first field contains either a single code point, or the first and last code points in a range separated by "..". The second field provides the script name for that range. The comment (after a #) indicates the general category, and the character name. For a range, it adds the count in square brackets and uses the names for the first and last characters in the range. For example:
0B01; ORIYA # Mn ORIYA SIGN CANDRABINDU
0B02..0B03; ORIYA # Mc [2] ORIYA SIGN ANUSVARA..ORIYA SIGN VISARGA
The value COMMON is the default value, given to all code points that are not explicitly mentioned in the data file.
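The file format described above is simple enough to read with a few lines of code. The following is a minimal sketch based only on that description (semicolon-separated fields, ".." for ranges, "#" comments, COMMON as the default); the function names `parse_scripts` and `script_of` are assumptions made for this example, and the sample lines are the two Oriya entries quoted above.

```python
def parse_scripts(lines):
    """Parse lines in the Scripts.txt format into (first, last, script) tuples."""
    ranges = []
    for line in lines:
        # Drop the comment, which starts at "#", and skip blank lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        # The first field is a single code point or "first..last".
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        ranges.append((start, end, fields[1]))
    return ranges

def script_of(code_point, ranges):
    """Return the script for a code point; unlisted code points are COMMON."""
    for start, end, script in ranges:
        if start <= code_point <= end:
            return script
    return "COMMON"

sample = [
    "0B01; ORIYA # Mn ORIYA SIGN CANDRABINDU",
    "0B02..0B03; ORIYA # Mc [2] ORIYA SIGN ANUSVARA..ORIYA SIGN VISARGA",
]
ranges = parse_scripts(sample)
```

A linear scan keeps the sketch short; for the full data file a sorted list with binary search would be a more practical choice.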
There is an additional set of Script Charts [Charts] that can be used to see the assignment of scripts. These charts show the entire range of Unicode characters broken down by script name (and general category where the script is Common or Inherited). If your browser is not set up for Unicode, see Display Problems.
[Blocks] | Blocks.txt For the latest version, see: http://www.unicode.org/Public/UNIDATA/Blocks.txt For other versions, see: http://www.unicode.org/standard/versions/ |
[Charts] | Script Charts http://www.unicode.org/reports/tr24/charts/ |
[Feedback] | Reporting Errors and Requesting Information Online http://www.unicode.org/reporting.html |
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/faq/ For answers to common questions on technical issues. |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[ISO15924] | ISO 15924: Code for the Representation of Names of Scripts http://www.evertype.com/standards/iso15924/ |
[PropValue] | Property Value Aliases data file For the latest version, see: http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt For other versions, see: http://www.unicode.org/standard/versions/ |
[Reports] | Unicode Technical Reports http://www.unicode.org/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[Scripts] | Scripts data file For the latest version, see: http://www.unicode.org/Public/UNIDATA/Scripts.txt For other versions, see: http://www.unicode.org/standard/versions/ |
[UCD] | Unicode Character Database http://www.unicode.org/ucd For an overview of the Unicode Character Database and a list of its associated files. |
[Unicode] | The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA, Addison-Wesley, 2003. 0-321-18578-1. |
[UTR18] | UTR #18: Unicode Regular Expression Guidelines http://www.unicode.org/reports/tr18/ |
[Versions] | Versions of the Unicode Standard http://www.unicode.org/standard/versions For information on version numbering, and citing and referencing the Unicode Standard, the Unicode Character Database, and Unicode Technical Reports. |
Copyright © 1999-2003 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report. The Unicode Terms of Use apply.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.