Technical Reports |
Version | 4 |
Authors | Mark Davis (mark.davis@us.ibm.com) |
Date | 2002-04-01 |
This Version | http://www.unicode.org/unicode/reports/tr24/tr24-4.html |
Previous Version | http://www.unicode.org/unicode/reports/tr24/tr24-3.html |
Latest Version | http://www.unicode.org/unicode/reports/tr24/tr24 |
Base Unicode Version | Unicode 3.1 |
This document provides an assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.
This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Technical Report. It is a stable document and may be used as reference material or cited as a normative reference from another document.
A Unicode Technical Report (UTR) may contain either informative material or normative specifications, or both. Each UTR may specify a base version of the Unicode Standard. In that case, conformance to the UTR requires conformance to that version or higher.
The References provide related information that is useful in understanding this document. Please mail corrigenda and other comments to the author(s).
The Unicode Character Database (UCD) provides data for a mapping from Unicode characters to script names. This information is useful for mechanisms such as regular expressions, where it produces much better results than simple matches on block names. There are quite a number of problems with using block names to distinguish characters:
For more information, see Character Blocks in UTR #18: Unicode Regular Expression Guidelines [UTR18].
Although script names are generally much more useful than simple block names, they cannot be applied blindly. The script assignment is oriented specifically towards mechanisms such as regular expressions, and is not intended for unrelated purposes such as graphology or history. The definition of script names in the data file does not preclude assigning scripts in other ways that are appropriate for those purposes.
The script name data provides a mapping from each Unicode code point to either a specific script such as Cyrillic, or to one of two special values:
The script names form a full partition of the code space: every code point is assigned a single script name. As new scripts are added to the standard, additional script names will be added.
In many cases, programs will override the script name based upon the context of the surrounding characters, especially for the case of Common. A simple heuristic is to use the script of the preceding character, which works well in many cases. However, this may not always produce optimal results: for example, in the text "... gamma (γ) is ..." this heuristic would cause matching parentheses to be in different scripts. Thus more sophisticated programs may use more complex heuristics.
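The preceding-character heuristic can be sketched as follows. This is only an illustration: `toy_script` is a hypothetical, hand-written classifier that covers just the sample text, where a real program would consult the script data file, and `resolve_common` is likewise a name chosen for this sketch.

```python
# Hypothetical toy classifier covering only the sample text below;
# a real program would look scripts up in the Scripts.txt data.
def toy_script(ch):
    if "\u0370" <= ch <= "\u03FF":        # rough stand-in for Greek letters
        return "GREEK"
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    return "COMMON"


def resolve_common(text, script_of=toy_script):
    """Replace each COMMON value with the script of the preceding character."""
    resolved = []
    previous = "COMMON"                    # nothing precedes the first character
    for ch in text:
        script = script_of(ch)
        if script == "COMMON":
            script = previous
        resolved.append(script)
        previous = script
    return resolved


text = "gamma (\u03B3) is"                 # "gamma (γ) is"
scripts = resolve_common(text)
opening = scripts[text.index("(")]
closing = scripts[text.index(")")]
print(opening, closing)
```

With this toy classifier the opening parenthesis takes the script of the Latin text before it and the closing parenthesis takes the script of the Greek letter, which is the mismatch described above.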
In general, programs should only use specific script values in conjunction with both Common and Inherited. That is, to distinguish characters appropriate for Greek, one would use:
((Greek | Common) Inherited?)*
That is, characters that are either in Greek or in Common, optionally followed by those in Inherited. Specific languages may commonly use multiple scripts, so for Japanese one might use:
((Hiragana | Katakana | Han | Latin | Common) Inherited?)*
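As a rough sketch of how such an expression might be evaluated without an engine that supports script properties, each character can be mapped to a one-letter code for its script and the expression applied to the resulting string. Everything named here is an assumption made for illustration: `toy_script` is a hypothetical classifier, the one-letter codes are arbitrary, and treating every combining mark as Inherited (via `unicodedata.combining`) is only an approximation of the real data.

```python
import re
import unicodedata


# Hypothetical toy classifier; a real program would use the Scripts.txt data.
def toy_script(ch):
    if unicodedata.combining(ch):          # approximation: combining marks
        return "INHERITED"
    if "\u0391" <= ch <= "\u03C9":         # rough stand-in for Greek letters
        return "GREEK"
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    return "COMMON"


# Arbitrary one-letter codes; any script not listed becomes "X".
CODES = {"GREEK": "G", "COMMON": "C", "INHERITED": "I", "LATIN": "L"}

# ((Greek | Common) Inherited?)* expressed over the one-letter codes.
GREEK_SPAN = re.compile(r"(?:[GC]I?)*")


def is_greek_text(text, script_of=toy_script):
    """True if the whole text matches ((Greek | Common) Inherited?)*."""
    codes = "".join(CODES.get(script_of(ch), "X") for ch in text)
    return GREEK_SPAN.fullmatch(codes) is not None


print(is_greek_text("\u03B1\u03B2\u03B3, \u03B4!"))   # Greek letters and punctuation
print(is_greek_text("\u03B1\u0301\u03B2"))            # Greek letter plus combining accent
print(is_greek_text("abc"))                           # Latin letters only
```

The Japanese expression would follow the same pattern by widening the character class to include codes for Hiragana, Katakana, Han, and Latin.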
Given this usage model, the current data is weighted on inclusiveness: a character is in a specific script (rather than Common or Inherited) only when it is clearly not used within other scripts. As more data on individual characters is collected, characters may move from the Common group to a more specific script (including Inherited).
Note: The script name property is useful in regular expression syntax for easy specification of spans of text which consist of a single script (or mixture of scripts). However, users should be very careful to not misapply it. The script names form a full partition of the Unicode code space, but that partition does not exhaust the possibilities for useful and relevant script-like subsets of Unicode characters. For example, a user might wish to define a regular expression to span typical mathematical expressions, but the subset of Unicode characters used in mathematics does not correspond to any particular script. Instead, it requires use of the Math property, other character properties, and particular subsets of Latin, Greek, and Cyrillic letters. For information on other character properties, see the UCD.
For illustration, the following table lists some of the Script Name values used in the data file. For a complete list of values, see [Scripts]. The names are not case-sensitive, and the order in which the scripts are listed here or in the data file is not significant.
In the Property Value Aliases file [PropValue], corresponding codes from the forthcoming ISO 15924: Code for the Representation of Names of Scripts [ISO15924] are provided as short names for the scripts.
Script Name | ISO 15924 |
---|---|
COMMON | Zyyy |
INHERITED | Qaai |
LATIN | Latn (Latf, Latg) |
GREEK | Grek |
COPTIC | Qaac |
CYRILLIC | Cyrl (Cyrs) |
ARMENIAN | Armn |
HEBREW | Hebr |
ARABIC | Arab |
SYRIAC | Syrc (Syrj, Syrn, Syre) |
THAANA | Thaa |
DEVANAGARI | Deva |
BENGALI | Beng |
GURMUKHI | Guru |
GUJARATI | Gujr |
ORIYA | Orya |
TAMIL | Taml |
TELUGU | Telu |
KANNADA | Knda |
MALAYALAM | Mlym |
SINHALA | Sinh |
THAI | Thai |
LAO | Laoo |
TIBETAN | Tibt |
MYANMAR | Mymr |
GEORGIAN | Geor (Geon, Geoa) |
HANGUL | Hang |
ETHIOPIC | Ethi |
CHEROKEE | Cher |
UCAS | Cans |
OGHAM | Ogam |
RUNIC | Runr |
KHMER | Khmr |
MONGOLIAN | Mong |
HIRAGANA | Hira |
KATAKANA | Kana |
BOPOMOFO | Bopo |
HAN | Hani |
YI | Yiii |
OLD_ITALIC | Ital |
GOTHIC | Goth |
DESERET | Dsrt |
TAGALOG | Tglg |
HANUNOO | Hano |
BUHID | Buhd |
TAGBANWA | Tagb |
Note: The forthcoming ISO 15924 provides an enumeration of four-letter script codes. In some cases the match between these script names and the ISO 15924 codes is not precise, since the goals are somewhat different. ISO 15924 is aimed primarily at the bibliographic identification of scripts; because of that it occasionally identifies varieties of scripts that may be useful for book cataloging, but which are not considered distinct as scripts in the Unicode Standard. For example, ISO 15924 has separate script codes for the Fraktur and Gaelic varieties of the Latin script. Where there are no corresponding ISO 15924 codes, the "private use" ones starting with Q are used. These are in italics in the table above.
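Because the names are not case-sensitive, a lookup from script name to short code needs to normalize its input. The following is a minimal sketch that uses only a few rows copied from the table above; the names `ISO_15924`, `iso_code`, and `is_private_use` are hypothetical, and a real program would load the complete list from [PropValue].

```python
# A few rows copied from the table above; not the complete list.
ISO_15924 = {
    "COMMON": "Zyyy",
    "INHERITED": "Qaai",
    "LATIN": "Latn",
    "GREEK": "Grek",
    "COPTIC": "Qaac",
    "CYRILLIC": "Cyrl",
    "OLD_ITALIC": "Ital",
}


def iso_code(script_name):
    """Case-insensitive lookup; returns None for names missing from this excerpt."""
    return ISO_15924.get(script_name.strip().upper())


def is_private_use(code):
    """Codes starting with Q stand in where ISO 15924 has no corresponding code."""
    return code.startswith("Q")


print(iso_code("latin"), iso_code("Old_Italic"))
print(is_private_use(iso_code("Inherited")))
```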
The Scripts.txt data file is available at [Scripts]. The format of the file is similar to that of Blocks.txt [Blocks]. The fields are separated by semicolons. The first field contains either a single code point, or the first and last code points in a range separated by "..". The second field provides the script name for that range. The comment (after a #) indicates the general category, and the character name. For a range, it adds the count in square brackets and uses the names for the first and last characters in the range. For example:
0B01; ORIYA # Mn ORIYA SIGN CANDRABINDU
0B02..0B03; ORIYA # Mc [2] ORIYA SIGN ANUSVARA..ORIYA SIGN VISARGA
The value COMMON is the default value, given to all code points that are not explicitly mentioned in the data file.
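The format described above can be read with a few lines of code. The following is a minimal sketch, not a reference implementation: the function names `parse_scripts` and `script_of` are hypothetical, the lookup is a simple linear scan, and the input is just the two sample lines shown above rather than the full data file.

```python
import re

# One data line: "XXXX; NAME" or "XXXX..YYYY; NAME", after the comment is removed.
LINE_RE = re.compile(r"^([0-9A-Fa-f]+)(?:\.\.([0-9A-Fa-f]+))?\s*;\s*(\w+)")


def parse_scripts(lines):
    """Return a sorted list of (first, last, script) code point ranges."""
    ranges = []
    for line in lines:
        line = line.split("#", 1)[0].strip()   # drop the trailing comment
        if not line:
            continue                           # blank or comment-only line
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError(f"unrecognized line: {line!r}")
        first = int(m.group(1), 16)
        last = int(m.group(2), 16) if m.group(2) else first
        ranges.append((first, last, m.group(3).upper()))
    ranges.sort()
    return ranges


def script_of(ranges, code_point):
    """Look up a code point; code points not listed default to COMMON."""
    for first, last, script in ranges:
        if first <= code_point <= last:
            return script
    return "COMMON"


sample = [
    "0B01; ORIYA # Mn ORIYA SIGN CANDRABINDU",
    "0B02..0B03; ORIYA # Mc [2] ORIYA SIGN ANUSVARA..ORIYA SIGN VISARGA",
]
ranges = parse_scripts(sample)
print(script_of(ranges, 0x0B03))   # listed in the sample
print(script_of(ranges, 0x0020))   # not listed, so the default applies
```

For the full data file, a binary search over the sorted ranges (for example with the standard `bisect` module) would be a natural replacement for the linear scan.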
There is an additional set of Script Charts [Charts] that can be used to see the assignment of scripts. These charts show the entire range of Unicode characters broken down by script name (and general category where the script is Common or Inherited). If your browser is not set up for Unicode, see Display Problems.
[Blocks] | Blocks.txt http://www.unicode.org/Public/UNIDATA/Blocks.txt |
[Charts] | Script Charts http://www.unicode.org/unicode/reports/tr24/charts/ |
[Scripts] | Scripts data file For the latest version, see: http://www.unicode.org/Public/UNIDATA/Scripts.txt For other versions, see: http://www.unicode.org/unicode/standard/versions/ |
[FAQ] | Unicode Frequently Asked Questions http://www.unicode.org/unicode/faq/ For answers to common questions on technical issues. |
[Glossary] | Unicode Glossary http://www.unicode.org/glossary/ For explanations of terminology used in this and other documents. |
[ISO15924] | ISO 15924: Code for the Representation of Names of Scripts http://www.evertype.com/standards/iso15924/ |
[PropValue] | Property Value Aliases data file For the latest version, see: http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt For other versions, see: http://www.unicode.org/unicode/standard/versions/ |
[Reports] | Unicode Technical Reports http://www.unicode.org/unicode/reports/ For information on the status and development process for technical reports, and for a list of technical reports. |
[UTR18] | UTR #18: Unicode Regular Expression Guidelines http://www.unicode.org/unicode/reports/tr18/ |
The following summarizes modifications from the previous version of this document.
4 |
|
3 |
|
Copyright © 1999-2001 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.