Version | 1.2 |
Authors | Mark Davis (mark.davis@us.ibm.com) |
Date | 2000-10-27 |
This Version | http://www.unicode.org/unicode/reports/tr24/tr24-1.2.html |
Previous Version | http://www.unicode.org/unicode/reports/tr24/tr24-1.1.html |
Latest Version | http://www.unicode.org/unicode/reports/tr24/tr24 |
This document provides an assignment of script names to all Unicode code points. This information is useful in mechanisms such as regular expressions, where it produces much better results than simple matches on block names.
This document has been approved by the Unicode Technical Committee for public review as a Draft Unicode Technical Report. Publication does not imply endorsement by the Unicode Consortium. This is a draft document which may be updated, replaced, or superseded by other documents at any time. This is not a stable document; it is inappropriate to cite this document as other than a work in progress.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
Scripts.txt provides a mapping from Unicode characters to script names. This information is useful for mechanisms such as regular expressions, where it produces much better results than simple matches on block names. (See the discussion of the deficiencies of Character Blocks in UTR #18: Unicode Regular Expression Guidelines.)
Script values cannot simply be extracted from the block ranges in Blocks.txt. In some cases a block contains characters from more than one script; in other cases a single script is split over several blocks.
Although script names are generally much more useful than simple block names, one cannot make too many assumptions: in some cases a language may use characters from more than one script. This is especially true for non-letters. For that reason, generally only characters of General Category Letter are given distinct script names; all others are given the script name Common, indicating an undetermined script.
In many cases, programs will override the script name based upon the context of the surrounding characters, especially for the case of Common. A simple heuristic is to use the script of the preceding character, which works well in many cases. However, this may not always produce optimal results: for example, in the text "... gamma (γ) is ..." this heuristic would cause matching parentheses to be in different scripts. Thus more sophisticated programs may use more complex heuristics.
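The preceding-character heuristic, and the way it fails on the parenthesis example, can be illustrated with a minimal Python sketch. The names `SAMPLE_RANGES`, `raw_script`, and `resolve_by_preceding` are hypothetical, the three ranges are hand-written stand-ins for the real data, and treating every character outside them as Common is an assumption made only to keep the example small.

```python
# Hypothetical, hand-written script data for a few characters; a real
# implementation would load the ranges from Scripts.txt instead.
SAMPLE_RANGES = [
    (0x0041, 0x005A, "LATIN"),  # A..Z
    (0x0061, 0x007A, "LATIN"),  # a..z
    (0x03B1, 0x03C9, "GREEK"),  # Greek small letters alpha..omega
]


def raw_script(ch):
    """Return the script listed for ch, or COMMON if no sample range matches."""
    cp = ord(ch)
    for first, last, name in SAMPLE_RANGES:
        if first <= cp <= last:
            return name
    return "COMMON"  # assumption of this sketch: everything else is Common


def resolve_by_preceding(text):
    """Give each COMMON character the (resolved) script of the character before it."""
    resolved = []
    previous = "COMMON"  # nothing to inherit at the start of the text
    for ch in text:
        script = raw_script(ch)
        if script == "COMMON":
            script = previous
        else:
            previous = script
        resolved.append((ch, script))
    return resolved


pairs = resolve_by_preceding("gamma (γ) is")
opening = next(script for ch, script in pairs if ch == "(")
closing = next(script for ch, script in pairs if ch == ")")
print(opening, closing)  # LATIN GREEK
```

The opening parenthesis inherits Latin from "gamma", while the closing one inherits Greek from "γ", which is the mismatch described above.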
The format of the file is similar to that of Blocks.txt. The fields are separated by semicolons. The first two fields provide the first and last code points in a range. The third field provides the script name for that range. The comment (after a #) provides the names for the first and last characters in the range. On the basis of this file, script values for any character in a string are derived as follows:
The script names form a full partition of the code space: every code point is assigned a single script name. As new scripts are added to the standard, additional script names will be added. In some cases, characters may change script names in the future.
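As a rough illustration of the file layout described above, the following Python sketch parses semicolon-separated range lines and looks up the script of a character. The sample lines are written for this example and are not copied from the data file; the exact spacing, the helper names `parse_ranges` and `script_of`, and the choice to return `None` for a code point that no range covers are all assumptions of the sketch.

```python
# Illustrative lines in the layout described above (first code point; last
# code point; script name # comment). They are not copied from Scripts.txt.
SAMPLE = """\
# sample data
0041; 005A; LATIN # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061; 007A; LATIN # LATIN SMALL LETTER A..LATIN SMALL LETTER Z
03B1; 03C9; GREEK # GREEK SMALL LETTER ALPHA..GREEK SMALL LETTER OMEGA
"""


def parse_ranges(text):
    """Parse range lines into (first, last, script_name) tuples."""
    ranges = []
    for line in text.splitlines():
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank line or comment-only line
        first, last, name = (field.strip() for field in data.split(";"))
        # Code points are hexadecimal; names are upper-cased because they
        # are not case-sensitive.
        ranges.append((int(first, 16), int(last, 16), name.upper()))
    return ranges


def script_of(ranges, ch):
    """Return the script name of ch, or None if no parsed range covers it."""
    cp = ord(ch)
    for first, last, name in ranges:
        if first <= cp <= last:
            return name
    return None


ranges = parse_ranges(SAMPLE)
print(script_of(ranges, "A"), script_of(ranges, "γ"), script_of(ranges, "("))
```

A linear scan is used for brevity; with the full file, sorting the ranges and using a binary search would be the more usual choice.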
Note: The assignments of scripts in this report are preliminary, and may change at any time.
Scripts.txt is currently available as Scripts-1d3.txt. The contents are preliminary, and may change in the future. There is an additional set of charts that can be used to see the assignment of scripts. These charts show the entire range of Unicode characters broken down by script name (for letters) and general category (for others). To properly view these charts, you should install a Unicode font for use by your browser.
The following table lists the Script Name values used in the file, and the corresponding DIS 15924 code (where possible). The names are not case-sensitive.
Note: DIS 15924 (http://www.egt.ie/standards/iso15924/) provides an enumeration of four-letter script codes. In some cases the match between these script names and the DIS 15924 codes is not precise, since the goals are somewhat different. DIS 15924 is aimed primarily at the bibliographic identification of scripts; because of that it occasionally identifies varieties of scripts that are of significance for book cataloging, but which are not considered distinct as scripts in the Unicode Standard. For example, DIS 15924 has separate script codes for Fraktur and Gaelic varieties of the Latin script. Where there are no corresponding DIS 15924 codes, the "private use" ones starting with Q are used.
Script Name | Draft ISO 15924 code |
---|---|
UNKNOWN | Zyyy |
COMMON | Qaaa |
LATIN | Latn (Latf, Latg) |
GREEK | Grek |
COPTIC | Qaab |
CYRILLIC | Cyrl (Cyrs) |
ARMENIAN | Armn |
HEBREW | Hebr |
ARABIC | Arab |
SYRIAC | Syrc (Syrj, Syrn, Syre) |
THAANA | Thaa |
DEVANAGARI | Deva |
BENGALI | Beng |
GURMUKHI | Guru |
GUJARATI | Gujr |
ORIYA | Orya |
TAMIL | Taml |
TELUGU | Telu |
KANNADA | Knda |
MALAYALAM | Mlym |
SINHALA | Sinh |
THAI | Thai |
LAO | Laoo |
TIBETAN | Tibt |
MYANMAR | Mymr |
GEORGIAN | Geor (Geon, Geoa) |
JAMO | Qjam |
HANGUL | Hang |
ETHIOPIC | Ethi |
CHEROKEE | Cher |
UCAS | Cans |
OGHAM | Ogam |
RUNIC | Runr |
KHMER | Khmr |
MONGOLIAN | Mong |
HIRAGANA | Hira |
KATAKANA | Kana |
BOPOMOFO | Bopo |
HAN | Hani |
YI | Yiii |
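
Because the script names are not case-sensitive, a program that maps them to the codes in the table above should normalize the name before the lookup. A minimal Python sketch follows; `SCRIPT_CODES` and `script_code` are hypothetical names, and only a few rows of the table are reproduced.

```python
# A few rows copied from the table above (primary code only); the remaining
# rows would be added the same way.
SCRIPT_CODES = {
    "LATIN": "Latn",
    "GREEK": "Grek",
    "COPTIC": "Qaab",
    "CYRILLIC": "Cyrl",
    "HAN": "Hani",
}


def script_code(name):
    """Return the four-letter code for a script name, ignoring case."""
    return SCRIPT_CODES.get(name.strip().upper())


print(script_code("greek"), script_code("Han"))  # Grek Hani
```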
Copyright © 1999-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.