Revision | 3.0 |
Authors | Ken Whistler (ken@unicode.org); Glenn Adams (glenn@unicode.org) |
Date | 1999-11-12 |
This Version | http://www.unicode.org/unicode/reports/tr7/tr7-3 |
Previous Version | http://www.unicode.org/unicode/reports/tr7/tr7-2 |
Latest Version | http://www.unicode.org/unicode/reports/tr7 |
The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions for more information.
One tag identification character and one cancel tag character are also proposed. In particular, a language tag identification character is proposed to identify a language tag string specifically; the language tag itself makes use of RFC 1766 language tag strings spelled out using the Plane 14 tag characters. Provision of a specific, low-overhead mechanism for embedding language tags in plain text is aimed at meeting the need of Internet protocols such as ACAP, which require a standard mechanism for marking language in UTF-8 strings.
Tagging - The association of attributes of text with a point or range of the primary text. (The value of a particular tag is not generally considered to be a part of the "content" of the text. Typical examples of tagging are marking the language or the font of a portion of text.)
Annotation - The association of secondary textual content with a point or range of the primary text. (The value of a particular annotation is considered to be a part of the "content" of the text. Typical examples include glossing, citations, exemplification, Japanese yomi, etc.)
Out-of-band - An out-of-band channel conveys a tag in such a way that the textual content, as encoded, is completely untouched and unmodified. This is typically done by metadata or hyperstructure of some sort.
In-band - An in-band channel conveys a tag along with the textual content, using the same basic encoding mechanism as the text itself. This is done by various means, but an obvious example is SGML markup, where the tags are encoded in the same character set as the text and are interspersed with and carried along with the text data.
However, there has been a great deal of controversy regarding the appropriate placement of language tags. Some have held that the only appropriate placement of language tags (or other kinds of tags) is out-of-band, making use of attributed text structures or metadata. Others have argued that there are requirements for lower-complexity in-band mechanisms for language tags (or other tags) in plain text.
The controversy has been muddied by the existence and widespread use of a number of in-band text markup mechanisms (HTML, text/enriched, etc.) which enable language tagging, but which imply the use of general parsing mechanisms which are deemed too "heavyweight" for protocol developers and a number of other applications. The difficulty of using general in-band text markup for simple protocols derives from the fact that some characters are used both for textual content and for the text markup; this makes it more difficult to write simple, fast algorithms to find only the textual content and ignore the tags, or vice versa. (Think of this as the algorithmic equivalent of the difficulty the human reader has attempting to read just the content of raw HTML source text without a browser interpreting all the markup tags.)
The Plane 14 technical report addresses the recurrent and persistent call for a mechanism for text tagging in Unicode that is lighter-weight than typical text markup mechanisms. It proposes a special set of characters used only for tagging. These tag characters can be embedded into plain text and can be identified and/or ignored with trivial algorithms, since there is no overloading of usage for these tag characters--they can only express tag values and never textual content itself.
Tag characters are not intended for general annotation of text.
This report is the result of an intense email discussion regarding language tagging and related issues, occasioned by the review of draft-ietf-acap-mlsf-01.txt and of draft-ietf-acap-langtag-00.txt, which proposed different mechanisms for language tagging in plain text. The Plane 14 technical report represents the consensus of a meeting of the UTC Working Group on Tagging and Annotation and of IETF representatives which took place on June 24, 1997.
These tag characters are to be used to spell out any ASCII-based tagging scheme which needs to be embedded in Unicode plain text. In particular, they can be used to spell out language tags in order to meet the expressed requirements of the ACAP protocol and the likely requirements of other new protocols following the guidelines of the IAB character workshop (RFC 2130).
The suggested range in Plane 14 for the block reserved for tag characters is as follows, expressed in each of the three most generally used encoding schemes for ISO/IEC 10646:
UCS-4
U-000E0000 .. U-000E007F
UTF-16
U+DB40 U+DC00 .. U+DB40 U+DC7F
UTF-8
0xF3 0xA0 0x80 0x80 .. 0xF3 0xA0 0x81 0xBF
Of this range, U-000E0020 .. U-000E007E is the suggested range for the ASCII clone tag characters themselves.
In addition, there is one tag identification character and a CANCEL TAG character. The use and syntax of these characters are described in detail below.
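The UTF-16 and UTF-8 ranges listed above follow mechanically from the UCS-4 range. The following is a minimal sketch of the standard surrogate-pair and four-byte UTF-8 computations; the function names `encode_utf16_pair` and `encode_utf8_4` are illustrative and not part of this report.

```c
#include <assert.h>
#include <stdint.h>

/* Encode a supplementary-plane code point (0x10000..0x10FFFF) as a
   UTF-16 surrogate pair. Hypothetical helper, for illustration only. */
void encode_utf16_pair(uint32_t cp, uint16_t out[2])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                  /* 20-bit offset */
    out[0] = (uint16_t)(0xD800 + (v >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (v & 0x3FF));  /* low surrogate */
}

/* Encode the same code point as a four-byte UTF-8 sequence. */
void encode_utf8_4(uint32_t cp, uint8_t out[4])
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    out[0] = (uint8_t)(0xF0 | (cp >> 18));
    out[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (uint8_t)(0x80 | (cp & 0x3F));
}
```

Applied to the two ends of the block, U-000E0000 and U-000E007F, these functions yield the surrogate pairs and byte sequences given in the range list.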
The entire encoding for the proposed Plane 14 tag characters and names of those characters can be derived from the following list. (The encoded values here and throughout this technical report are listed in UCS-4 form, which is easiest to interpret. It is assumed that most Unicode applications will, however, be making use either of UTF-16 or UTF-8 encoding forms for actual implementation.)
U-000E0000 <reserved>
U-000E0001 LANGUAGE TAG
U-000E0002 <reserved>
...
U-000E001F <reserved>
U-000E0020 TAG SPACE
U-000E0021 TAG EXCLAMATION MARK
...
U-000E0041 TAG LATIN CAPITAL LETTER A
...
U-000E007A TAG LATIN SMALL LETTER Z
...
U-000E007E TAG TILDE
U-000E007F CANCEL TAG
Range check expressed in UCS-4:
if ( ( *s >= 0xE0000 ) && ( *s <= 0xE007F ) )
Range check expressed in UTF-16 (Unicode):
if ( ( *s == 0xDB40 ) && ( *(s+1) >= 0xDC00 ) && ( *(s+1) <= 0xDC7F ) )
Expressed in UTF-8 (with s pointing to unsigned bytes):
if ( ( *s == 0xF3 ) && ( *(s+1) == 0xA0 ) && ( ( *(s+2) & 0xFE ) == 0x80 ) )
Because of the choice of the range for the tag characters, it would also be possible to express the range check for UCS-4 or UTF-16 in terms of bitmask operations, as well.
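The bitmask form of the UCS-4 and UTF-16 checks might look like the following sketch. It relies only on the fact that the tag block is 128 code points wide and aligned on a 128-value boundary; the function names are illustrative, and the UTF-16 version assumes the caller has already ensured that two code units are available.

```c
#include <stdbool.h>
#include <stdint.h>

/* UCS-4: every value in U-000E0000 .. U-000E007F has the same bits
   above the low seven, so one mask and one compare suffice. */
bool is_tag_ucs4(uint32_t c)
{
    return (c & 0xFFFFFF80u) == 0x000E0000u;
}

/* UTF-16: the high surrogate is always 0xDB40; the low surrogate
   lies in 0xDC00 .. 0xDC7F. Assumes s[0] and s[1] are readable. */
bool is_tag_utf16(const uint16_t *s)
{
    return s[0] == 0xDB40 && (s[1] & 0xFF80) == 0xDC00;
}
```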
The tag identification character is used as a mechanism for identifying tags of different types. This enables multiple types of tags to coexist amicably embedded in plain text and solves the problem of delimitation if a tag is concatenated directly onto another tag. Although only one type of tag is currently specified, namely the language tag, the encoding of other tag identification characters in the future would allow for distinct tag types to be used.
No termination character is required for a tag. A tag terminates either when the first non-tag character (i.e. any Unicode value outside the Plane 14 tag character block) is encountered, or when the next tag identification character is encountered.
All tag arguments must be encoded only with the tag characters U-000E0020 .. U-000E007E. No other characters are valid for expressing the tag argument.
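The termination rule can be illustrated with a short scanner over UCS-4 values. This is a sketch under stated assumptions: the name `language_tag_length` and the convention of returning 0 for "no tag here" are illustrative, and a return value of 1 (an introducer with no argument) indicates a malformed tag.

```c
#include <stddef.h>
#include <stdint.h>

#define LANGUAGE_TAG 0xE0001u
#define CANCEL_TAG   0xE007Fu

/* True for the tag characters that may appear in a tag argument. */
static int is_tag_argument_char(uint32_t c)
{
    return c >= 0xE0020u && c <= 0xE007Eu;
}

/* Returns the number of code points occupied by the language tag that
   starts at s[0], or 0 if s[0] is not LANGUAGE TAG. The tag ends at the
   first value that is not a tag argument character, which covers both
   ordinary text and a following tag identification character. */
size_t language_tag_length(const uint32_t *s, size_t n)
{
    size_t i = 1;
    if (n == 0 || s[0] != LANGUAGE_TAG)
        return 0;
    /* LANGUAGE TAG followed directly by CANCEL TAG is a complete tag. */
    if (i < n && s[i] == CANCEL_TAG)
        return 2;
    while (i < n && is_tag_argument_char(s[i]))
        i++;
    return i;
}
```

Because tag characters are never used for textual content, the scanner needs no escaping or lookahead beyond a single value.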
A detailed BNF syntax for tags is listed below.
For example, to embed a language tag for Japanese, the Plane 14 characters would be used as follows. The Japanese tag from RFC 1766 is "ja" (composed of ISO 639 language id) or, alternatively, "ja-JP" (composed of ISO 639 language id plus ISO 3166 country id). Since RFC 1766 specifies that language tags are not case significant, it is recommended that for language tags, the entire tag be lowercased before conversion to Plane 14 tag characters. (This would not be required for Unicode conformance, but should be followed as general practice by protocols making use of RFC 1766 language tags, to simplify and speed up the processing for operations which need to identify or ignore language tags embedded in text.) Lowercasing, rather than uppercasing, is recommended because it follows the majority practice of expressing language tag values in lowercase letters.
Thus the entire language tag (in its longer form) would be converted to Plane 14 tag characters as follows:
U-000E0001 U-000E006A U-000E0061 U-000E002D U-000E006A U-000E0070
The language tag (in its shorter, "ja" form) could be expressed as follows:
U-000E0001 U-000E006A U-000E0061
The value of this string is then expressed in whichever encoding form (UCS-4, UTF-16, UTF-8) is required and embedded in text at the relevant point.
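The conversion described above (lowercase the RFC 1766 tag, then map each ASCII character to its Plane 14 clone) can be sketched as follows. The function name `make_language_tag` and its error convention of returning 0 are assumptions made for this illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Converts an ASCII language tag such as "ja-JP" into LANGUAGE TAG
   followed by lowercased tag characters, written as UCS-4 values.
   Returns the number of values written, or 0 if the input is empty,
   contains a character outside 0x20..0x7E, or does not fit in cap. */
size_t make_language_tag(const char *ascii, uint32_t *out, size_t cap)
{
    size_t n = 0;
    if (cap == 0)
        return 0;
    out[n++] = 0xE0001u;                       /* LANGUAGE TAG */
    for (; *ascii != '\0'; ascii++) {
        unsigned char ch = (unsigned char)*ascii;
        if (ch < 0x20 || ch > 0x7E || n == cap)
            return 0;
        if (ch >= 'A' && ch <= 'Z')            /* recommended lowercasing */
            ch = (unsigned char)(ch + ('a' - 'A'));
        out[n++] = 0xE0000u + ch;              /* ASCII clone in Plane 14 */
    }
    return n > 1 ? n : 0;
}
```

Called with "ja-JP" and a sufficiently large buffer, this produces the six values shown for the longer form of the Japanese tag.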
In each case, when a specific tag identification character is encoded, a corresponding reference standard for the values of the tags associated with the identifier should be designated, so that interoperating parties which make use of the tags will know how to interpret the values the tags may take.
A. The text itself goes out of scope, as defined by the application. (E.g. for line-oriented protocols, when reaching the end-of-line or end-of-string; for text streams, when reaching the end-of-stream; etc.)
or
B. The tag is explicitly cancelled by the CANCEL TAG character.
Tags of the same type cannot be nested in any way. The appearance of a new embedded language tag, for example, after text which was already language tagged, simply changes the tagged value for subsequent text to that specified in the new tag.
Tags of different type can have interdigitating scope, but not hierarchical scope. In effect, tags of different type completely ignore each other, so that the use of language tags can be completely asynchronous with the use of character set source tags (or any other tag type) in the same text in the future.
U-000E0001 U-000E007F
The value of the relevant tag type returns to the default state for that tag type, namely: no tag value specified, the same as untagged text.
The use of CANCEL TAG without a prefixed tag identification character cancels any Plane 14 tag values which may be defined. Since only language tags are currently provided with an explicit tag identification character, only language tags are currently affected.
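The scoping and cancellation rules can be sketched as a single pass that tracks the language value in effect. This is an illustration only: the name `language_in_effect`, the fixed `LANG_MAX` buffer size, and the choice to silently truncate over-long tags are assumptions, not part of this report.

```c
#include <stddef.h>
#include <stdint.h>

#define LANG_MAX 32

/* Scans n UCS-4 values and leaves in lang[] the ASCII form of the
   language tag in effect at the end of the text, or "" if none. */
void language_in_effect(const uint32_t *s, size_t n, char lang[LANG_MAX])
{
    size_t i = 0;
    lang[0] = '\0';
    while (i < n) {
        if (s[i] == 0xE007Fu) {                 /* bare CANCEL TAG */
            lang[0] = '\0';
            i++;
        } else if (s[i] == 0xE0001u) {          /* LANGUAGE TAG */
            size_t len = 0;
            i++;
            if (i < n && s[i] == 0xE007Fu) {    /* cancels language only */
                i++;
            } else {
                /* A new tag simply replaces the previous value. */
                while (i < n && s[i] >= 0xE0020u && s[i] <= 0xE007Eu) {
                    if (len < LANG_MAX - 1)
                        lang[len++] = (char)(s[i] - 0xE0000u);
                    i++;
                }
            }
            lang[len] = '\0';
        } else {
            i++;    /* ordinary text, or a tag character not handled here */
        }
    }
}
```

Since only language tags are currently defined, the bare CANCEL TAG and the LANGUAGE TAG plus CANCEL TAG sequence have the same effect in this sketch; they would differ once other tag types existed.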
The main function of CANCEL TAG is to make possible such operations as blind concatenation of strings in a tagged context without the propagation of inappropriate tag values across the string boundaries. For example, a string tagged with a Japanese language tag can have its tag value "sealed off" with a terminating CANCEL TAG before another string of unknown language value is concatenated to it. This would prevent the string of unknown language from being erroneously marked as being Japanese simply because of a concatenation to a Japanese string.
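A blind concatenation that seals off the first string might be sketched as follows. The function name `concat_sealed` and its buffer convention are assumptions for illustration; an implementation could equally append CANCEL TAG only when the first string is known to carry a tag.

```c
#include <stddef.h>
#include <stdint.h>

/* Copies head, then CANCEL TAG, then tail into dst. Returns the total
   number of UCS-4 values written, or 0 if dst (capacity cap) is too
   small. The CANCEL TAG keeps any tag value in head from applying to
   the text in tail. */
size_t concat_sealed(uint32_t *dst, size_t cap,
                     const uint32_t *head, size_t head_len,
                     const uint32_t *tail, size_t tail_len)
{
    size_t i, n = 0;
    if (head_len + 1 + tail_len > cap)
        return 0;
    for (i = 0; i < head_len; i++)
        dst[n++] = head[i];
    dst[n++] = 0xE007Fu;                        /* CANCEL TAG */
    for (i = 0; i < tail_len; i++)
        dst[n++] = tail[i];
    return n;
}
```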
For debugging or other operations which must render the tags themselves visible, it is advisable that the tag characters be rendered using the corresponding ASCII character glyphs (perhaps modified systematically to differentiate them from normal ASCII characters). But, as noted below, the tag character values are chosen so that even without display support, the tag characters will be interpretable in most debuggers.
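Because each ASCII clone tag character sits at a fixed offset of 0xE0000 from its ASCII counterpart, a debugging display can recover the ASCII character by simple subtraction. A minimal sketch, with the illustrative name `tag_to_ascii`:

```c
#include <stdint.h>

/* Returns the ASCII character corresponding to a tag character in
   U-000E0020 .. U-000E007E, or '\0' for any other value (including
   LANGUAGE TAG and CANCEL TAG, which have no ASCII counterpart). */
char tag_to_ascii(uint32_t c)
{
    if (c >= 0xE0020u && c <= 0xE007Eu)
        return (char)(c - 0xE0000u);
    return '\0';
}
```

The same relationship is what makes the values readable in a debugger: the low seven bits of the UCS-4 value are the ASCII code.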
So for a non-TagAware Unicode application, any language tag characters (or any other kind of tag expressed with Plane 14 tag characters) encountered would be handled exactly as for uninterpreted Tibetan from the BMP, uninterpreted Linear B from Plane 1, or uninterpreted Egyptian hieroglyphics from private use space in Plane 15.
A TagAware but TagPhobic Unicode application can recognize the tag character range in Plane 14 and choose to deliberately strip them out completely to produce plain text with no tags.
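Stripping of this kind needs nothing more than the range check. The following sketch filters a UCS-4 buffer in place; the function name `strip_tags` is illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* Removes, in place, every value in the Plane 14 tag block
   (U-000E0000 .. U-000E007F) from s[0..n-1]. Returns the new length. */
size_t strip_tags(uint32_t *s, size_t n)
{
    size_t r, w = 0;
    for (r = 0; r < n; r++) {
        if (s[r] < 0xE0000u || s[r] > 0xE007Fu)
            s[w++] = s[r];          /* keep ordinary text */
    }
    return w;
}
```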
The presence of a correctly formed tag cannot be taken as an absolute guarantee that the data so tagged is actually correctly tagged. For example, nothing prevents an application from erroneously labelling French data as Spanish, or from labelling JIS-derived data as Japanese, even if it contains Greek or Cyrillic characters.
Unicode encodes scripts, not languages. This is still true of the Unicode encoding (and ISO/IEC 10646), even in the presence of a mechanism for specifying language tags in plain text.
Language tagging in no way impacts current encoded characters or the encoding of future scripts.
It is fully anticipated that implementations of Unicode which already make use of out-of-band mechanisms for language tagging or "heavy-weight" in-band mechanisms such as HTML will continue to do exactly what they are doing and will ignore Plane 14 tag characters completely.
There is nothing obligatory about the use of Plane 14 tags, whether for language tags or any other kind of tags. This technical report for Plane 14 tags is, instead, aimed at removing a significant barrier to the universal adoption of Unicode in such arenas as Internet protocol development.
1. Semantic constraints are specified by rules in the form of an assertion specified between double braces; the variable $$ denotes the string consisting of all terminal symbols matched by this non-terminal.
Example: {{ Assert ( $$[0] == '?' ); }}
Meaning: The first character of the string matched by this non-terminal must be '?'
2. A number of predicate functions are employed in semantic constraint rules which are not otherwise defined; their names are sufficient for determining their predication.
Example: IsRFC1766LanguageIdentifier ( tag-argument )
Meaning: tag-argument is a valid RFC1766 language identifier
3. A lexical expander function, TAG, is employed to denote the tag form of an ASCII character; the argument to this function is either a character or a character set specified by a range or enumeration expression.
Example: TAG('-')
Meaning: TAG HYPHEN-MINUS
Example: TAG([A-Z])
Meaning: TAG LATIN CAPITAL LETTER A ... TAG LATIN CAPITAL LETTER Z
4. A macro is employed to denote terminal symbols that are character literals which can't be directly represented in ASCII. The argument to the macro is the UNICODE (ISO/IEC 10646) character name.
Example: '${TAG CANCEL}'
Meaning: character literal whose code value is U-000E007F
5. Occurrence indicators used are '+' (one or more) and '*' (zero or more); optional occurrence is indicated by enclosure in '[' and ']'.
tag : language-tag | cancel-all-tag ;
language-tag : language-tag-introducer language-tag-argument ;
language-tag-argument : tag-argument {{ Assert ( IsRFC1766LanguageIdentifier ( $$ ) ); }} | tag-cancel ;
cancel-all-tag : tag-cancel ;
tag-argument : tag-character+ ;
tag-character : { c : c in TAG( { a : a in printable ASCII characters or SPACE } ) } ;
language-tag-introducer : '${TAG LANGUAGE}' ;
tag-cancel : '${TAG CANCEL}' ;
ISO/IEC 10646-1:1993 International Organization for Standardization. "Information Technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane", Geneva, 1993.
[RFC1766]
Alvestrand, H., "Tags for the Identification of Languages", RFC 1766.
[RFC2070]
F. Yergeau, G. Nicol, G. Adams, and M. Duerst, "Internationalization of the Hypertext Markup Language", RFC 2070, January 1997.
[RFC2119]
S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", RFC 2119, March 1997.
[RFC2130]
C. Weider, C. Preston, K. Simonsen, H. Alvestrand, R. Atkinson, M. Crispin, and P. Svanberg, "The Report of the IAB Character Set Workshop held 29 February - 1 March, 1996", RFC 2130, April 1997.
[UNICODE]
The Unicode Standard, Version 2.0, The Unicode Consortium, Addison-Wesley, July 1996.
Glenn Adams
Gemstar International Group Limited
209 Burlington Rd,
Bedford, MA 01730
Phone: +1 781-276-8644
Fax: +1 781-276-8878
Email: glenn@unicode.org
if ( ( *s == 0xF3 ) && ( *(s+1) == 0xA0 ) && ( *(s+2) & 0xE0 == 0x80 )
in the section RANGE CHECKING FOR TAG CHARACTERS was updated to:
if ( ( *s == 0xF3 ) && ( *(s+1) == 0xA0 ) && ( *(s+2) & 0xFE == 0x80 ) )
Copyright © 1998-1999 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.