Revision | 1.0 |
Authors | Mark Davis (mark.davis@us.ibm.com, home) |
Date | |
This Version | http://www.unicode.org/unicode/reports/tr24/tr24-1 |
Previous Version | N/A |
Latest Version | http://www.unicode.org/unicode/reports/tr24/tr24 |
This document provides an assignment of script names to all Unicode codepoints.
This document contains material which has been considered and approved by the Unicode Technical Committee for publication as a Proposed Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative material or as normative specification. Please mail corrigenda and other comments to the author.
The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions for more information.
Scripts.txt provides a mapping from Unicode characters to script names. This information is useful in applications such as regular expressions (see UTR #18: Unicode Regular Expression Guidelines), where it produces better results than simple matches on block names.
Note: it is expected that the Scripts.txt file will eventually be part of the Unicode Character Database. Until such time, the assignment of scripts is preliminary and may change without notice.
Script values cannot simply be extracted from the block ranges in Blocks.txt. In some cases a block contains more than one script; in other cases a single script is split over several blocks. Note, however, that although script names are often more useful than simple block names, one cannot make too many assumptions: a language may use characters from more than one script. This is especially the case for non-letters. For that reason, only characters of general category Letter are given distinct script names; all others are given the script name Common, for undetermined script. The script names form a full partition of the code space: every codepoint is assigned exactly one script name.
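The partition property can be illustrated with a minimal Python sketch. The range table below is hypothetical and covers only a few letters; the names RANGES, STARTS, and script_of, and the spellings "Latin" and "Greek", are assumptions made for illustration and are not taken from Scripts.txt. Any codepoint not covered by a listed range falls back to Common, so the lookup always returns exactly one script name.

```python
from bisect import bisect_right

# Hypothetical sample table of (first, last, script) ranges.
# It must be sorted by first codepoint and must not overlap.
RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0391, 0x03A1, "Greek"),
]
STARTS = [first for first, _last, _script in RANGES]


def script_of(cp: int) -> str:
    """Return the single script name for a codepoint, defaulting to Common."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode codepoint: {cp:#x}")
    # Find the last range whose first codepoint is <= cp.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0:
        _first, last, script = RANGES[i]
        if cp <= last:
            return script
    # Everything not listed (non-letters, unassigned codepoints) is Common.
    return "Common"


def scripts_in(text: str) -> set:
    """Collect the distinct script names used by the characters of a string."""
    return {script_of(ord(ch)) for ch in text}
```

With this sample table, scripts_in("A1") yields both Latin and Common, which shows why text in a single language cannot be assumed to map to a single script value.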
The format of the file is similar to that of Blocks.txt. The fields are separated by semicolons. The first two fields provide the first and last characters in a range. The third field provides the script name for that range. The comment (after a #) provides the names for the first and last characters in the range. Any unassigned or illegal codepoints that fall within a range must be ignored; they are given the script value Common instead.
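A reader for this format might look like the following Python sketch. It assumes that the first two fields are hexadecimal codepoints, as in Blocks.txt, and that whitespace around fields is insignificant. The sample lines are hypothetical and are not copied from the actual file, and the names ScriptRange and parse_scripts are illustrative. The sketch does not filter out unassigned or illegal codepoints inside a range; doing so would require general category data from elsewhere.

```python
from typing import List, NamedTuple


class ScriptRange(NamedTuple):
    first: int   # first codepoint in the range
    last: int    # last codepoint in the range (inclusive)
    script: str  # script name for the range


def parse_scripts(text: str) -> List[ScriptRange]:
    """Parse semicolon-separated 'first; last; script # comment' lines."""
    ranges = []
    for raw in text.splitlines():
        # Drop the trailing comment, then skip blank or comment-only lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {raw!r}")
        first, last, script = fields
        ranges.append(ScriptRange(int(first, 16), int(last, 16), script))
    return ranges


# Hypothetical sample input, written to match the format described above.
SAMPLE = """\
# sample data for illustration only
0041; 005A; LATIN # LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0391; 03A1; GREEK # GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO
"""

ranges = parse_scripts(SAMPLE)
```

Because the assignment of scripts is preliminary, it is probably safer to parse the file at load time than to hard-code its ranges.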
Note: There is a draft standard, ISO 15924 (http://www.egt.ie/standards/iso15924/), that supplies script codes. These can be used to represent abbreviations for the script names.
The Scripts.txt file is currently available as Scripts-1d3.txt. The contents are preliminary and may change in the future. There is an additional set of charts that can be used to see the assignment of scripts. These charts show the entire range of Unicode characters broken down by script name (for letters) and general category (for others). To view these charts properly, you should install a Unicode font for use by your browser.
Copyright © 1999-2000 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.