ISO - International Organization for Standardization
ISO/IEC JTC1/SC2/WG2
Universal Multiple-Octet Coded Character Set
(UCS)
ISO/IEC JTC1/SC2/WG2 N1882
Date: Sep 23, 1998
Title | Support for Implementing Interlinear Annotations |
Source | US (Ansi) |
References | N1727, N1861 |
Action | To be considered by SC2/WG2 |
Distribution | ISO/IEC JTC1/SC2/WG2 |
Summary
This report presents the specifications of three new characters for implementing interlinear annotations. It is an update of document N1727 and takes into account the feedback provided by the Japanese Member Body in document N1861.
This is a complete proposal to encode three characters for use with implementations of Interlinear Annotations in the Specials block, preceding the OBJECT REPLACEMENT CHARACTER. This location is proposed because of the similarity of intended usage to the OBJECT REPLACEMENT CHARACTER.
Annotation: text that is part of the content, but for all or some text processing does not form part of the main text stream.
Annotation base character: base characters are those characters from the main text stream to which the annotation applies. In all regular editing and text processing algorithms these characters are treated as part of the text stream.
Interlinear Annotation Object: An interlinear annotation object is an object in the text stream that is like the objects supported via the object replacement character, except for two important differences.
A conformant implementation that supports these three new characters interprets the base characters as if they occurred in an un-annotated text stream. It interprets annotation characters as annotation characters, that is, subject to additional information about the annotation object that is stored out of band. If such an implementation chooses to remove these three characters, it should remove all of them, as well as the annotation characters. However, the annotation base characters should be preserved, as they form an integral part of the text stream.
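The removal rule above can be illustrated with a minimal Python sketch. The function name strip_annotations and the decision to silently drop stray delimiters are assumptions made for illustration; they are not part of this proposal.

```python
ANCHOR = "\uFFF9"      # proposed INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\uFFFA"   # proposed INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\uFFFB"  # proposed INTERLINEAR ANNOTATION TERMINATOR


def strip_annotations(text: str) -> str:
    """Remove the three annotation characters and all annotation text,
    keeping the annotation base characters in the main text stream.
    (Hypothetical helper, not defined by the proposal.)"""
    out = []
    # One entry per open annotation object: True once its first
    # separator has been seen, i.e. we are inside annotation text.
    stack = []
    for ch in text:
        if ch == ANCHOR:
            stack.append(False)
        elif ch == SEPARATOR and stack:
            stack[-1] = True
        elif ch == TERMINATOR and stack:
            stack.pop()
        elif ch in (SEPARATOR, TERMINATOR):
            continue  # stray delimiter outside any object: drop it (assumption)
        elif not any(stack):
            out.append(ch)  # regular text or annotation base character
    return "".join(out)
```

For an input such as "A" + ANCHOR + "base" + SEPARATOR + "anno" + TERMINATOR + "Z", the sketch keeps "A", "base" and "Z" and discards the rest.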
The entities in question are 'objects' from the perspective of the line layout algorithm. For the same reason that FFFC was added as an 'object replacement' character, implementation for these objects can be regularized by the presence of a dummy character. Unlike image or audio objects, these inline or interlinear objects carry textual information. Therefore the line layout algorithm is applied recursively to these sublines. Implementations are simplified with the presence of separator and terminator characters. For implementations that purport to support any Unicode character, including the Private Use Area for use as EUDC (end user defined characters), it is important to have reserved character codes to support interlinear annotation objects.
Recall that the U+FFFC OBJECT REPLACEMENT CHARACTER was intended to act as an anchor point for non-textual formatting information. The same would be true for the proposed INTERLINEAR ANNOTATION ANCHOR and the formatting information related to the interlinear annotation object itself (as opposed to the formatting of the sublines).
Three characters are proposed to satisfy the implementation and interchange concerns below.
x+FFF9 INTERLINEAR ANNOTATION ANCHOR
x+FFFA INTERLINEAR ANNOTATION SEPARATOR
x+FFFB INTERLINEAR ANNOTATION TERMINATOR
x+FFF9 is intended to be used as an anchor character, preceding the interlinear annotation object. The exact nature and formatting of the annotation is dependent on additional information that is not part of the plain text stream. This is analogous to U+FFFC OBJECT REPLACEMENT CHARACTER.
x+FFFA is intended to separate the base characters in the text stream from the annotation characters that follow. The exact interpretation of this character depends on the nature of the annotation. More than one separator may be present. If none are present, the interlinear annotation object contains no base annotation characters.
x+FFFB is used to terminate the annotation object (and returns to the regular text stream).
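As an illustration of how the three characters delimit an annotation object, the following Python sketch assembles one from a base string and any number of annotation strings. The helper name annotate and the sample Kanji and furigana strings are hypothetical and chosen only for this example.

```python
ANCHOR = "\uFFF9"      # proposed INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\uFFFA"   # proposed INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\uFFFB"  # proposed INTERLINEAR ANNOTATION TERMINATOR


def annotate(base: str, *annotations: str) -> str:
    """Wrap base text and its annotation strings in the three
    proposed characters (hypothetical helper).

    The anchor precedes the object, one separator precedes each
    annotation string, and the terminator closes the object.
    """
    return ANCHOR + SEPARATOR.join((base, *annotations)) + TERMINATOR


# A ruby-style object: Kanji base text with a furigana annotation.
ruby = annotate("漢字", "かんじ")
```

All formatting of the resulting object (placement above or beside the base, font size, and so on) remains out of band; the sketch only produces the plain text sequence.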
A dashed box with the inscribed letters AA, AS and AT respectively.
Example: This example uses the interlinear annotation characters to implement a Ruby interlinear annotation object. [Figure not reproduced: a phrase containing a Ruby, followed by its encoding with the three proposed characters.]
A mathematical equation containing an integral could be implemented as an annotation as follows:
x+FFF9 U+222B INTEGRAL <integrand> x+FFFA <lower limit> x+FFFA <upper limit> x+FFFB
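The sequence above can be taken apart again with a few lines of Python. This is a sketch only: the function name split_annotation_object is hypothetical, the strings "f(x)dx", "0" and "1" are placeholder values standing in for the integrand and the limits, and nested objects are not handled.

```python
ANCHOR = "\uFFF9"      # proposed INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\uFFFA"   # proposed INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\uFFFB"  # proposed INTERLINEAR ANNOTATION TERMINATOR
INTEGRAL = "\u222B"    # U+222B INTEGRAL


def split_annotation_object(obj: str) -> list[str]:
    """Split one non-nested annotation object into its parts.

    The first part is the base; each later part is one annotation.
    (Hypothetical helper; nested objects are out of scope here.)
    """
    if not (obj.startswith(ANCHOR) and obj.endswith(TERMINATOR)):
        raise ValueError("not a complete annotation object")
    return obj[1:-1].split(SEPARATOR)


# Integral sign and integrand as base, then lower and upper limit.
integral = (ANCHOR + INTEGRAL + "f(x)dx"
            + SEPARATOR + "0" + SEPARATOR + "1" + TERMINATOR)
base, lower, upper = split_annotation_object(integral)
```

How the two annotation parts are positioned relative to the integral sign is decided by the out-of-band formatting information, not by the character sequence.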
Bibliographic records often contain additional information that is not part of the searchable string. ISO TC46 standards provide for a BEGIN ANNOTATION and an END ANNOTATION control. These are trivially implemented via x+FFF9 and x+FFFB.
Like the use of FFFC, the use of these characters does not by itself make it possible to interchange final form documents. That is, all the object-specific and text-specific formatting is lost. Unlike non-text objects, the legible information in the annotation is retained and the content relation between characters (base and annotation) can be maintained.
Existing implementations have internal limits for the number of base characters or the number of annotation characters that they can support. These limits tend to be large enough for practical use. Therefore, there is no requirement for an implementation to handle any specific length of annotation or base.
Clustering characters
It may be desirable to create separate character clusters for formatting purposes. In that case, the following character may be used:
U+200B ZERO WIDTH SPACE
For example, this is useful for creating ruby groups where the furigana string is associated with a long Kanji expression, in order to express the association of one or several Kanji characters with a specific subgroup of furigana characters. The grouping is created by inserting ZERO WIDTH SPACE characters at the appropriate positions in the annotation base expression and/or in the annotation itself.
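A minimal Python sketch of such grouping follows. The function name ruby_groups is hypothetical, and so is the fallback used when the base and the annotation do not contain the same number of clusters; the proposal does not prescribe any behavior for that case.

```python
ZWSP = "\u200B"  # U+200B ZERO WIDTH SPACE


def ruby_groups(base: str, annotation: str) -> list[tuple[str, str]]:
    """Pair base clusters with annotation clusters (hypothetical helper).

    Both strings are split at ZERO WIDTH SPACE. If the cluster counts
    differ, this sketch falls back to one group covering everything,
    which is an assumption and not a rule from the proposal.
    """
    base_groups = base.split(ZWSP)
    anno_groups = annotation.split(ZWSP)
    if len(base_groups) != len(anno_groups):
        return [(base.replace(ZWSP, ""), annotation.replace(ZWSP, ""))]
    return list(zip(base_groups, anno_groups))


# Two Kanji, each associated with its own furigana subgroup.
groups = ruby_groups("東" + ZWSP + "京", "とう" + ZWSP + "きょう")
```

A formatter with access to these pairs can then place each furigana subgroup over its own Kanji instead of centering the whole annotation over the whole base.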
Disallowed characters
The following characters may not occur in either base or annotation:
U+2028 LINE SEPARATOR
U+2029 PARAGRAPH SEPARATOR
U+0000..U+001F C0 Controls
U+0080..U+009F C1 Controls
If an implementation encounters any of these characters between an ANNOTATION ANCHOR and its corresponding ANNOTATION TERMINATOR, it may terminate or disregard any open annotations at this point.
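A check for the disallowed characters listed above might look like the following Python sketch. The function names are hypothetical, and returning -1 when nothing is found is simply a convention borrowed from str.find.

```python
ANCHOR = "\uFFF9"      # proposed INTERLINEAR ANNOTATION ANCHOR
TERMINATOR = "\uFFFB"  # proposed INTERLINEAR ANNOTATION TERMINATOR


def is_disallowed(ch: str) -> bool:
    """True for LINE SEPARATOR, PARAGRAPH SEPARATOR, C0 and C1 controls."""
    cp = ord(ch)
    return ch in ("\u2028", "\u2029") or cp <= 0x1F or 0x80 <= cp <= 0x9F


def first_disallowed_in_annotation(text: str) -> int:
    """Return the index of the first disallowed character that occurs
    inside an open annotation object, or -1 if there is none.
    (Hypothetical helper.)"""
    depth = 0  # number of currently open annotation objects
    for i, ch in enumerate(text):
        if ch == ANCHOR:
            depth += 1
        elif ch == TERMINATOR and depth:
            depth -= 1
        elif depth and is_disallowed(ch):
            return i
    return -1
```

An implementation that finds such an index may, for example, treat all open annotations as terminated at that point; what it does next is left to the implementation.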
Not recommended characters
Interlinear objects are meant to create an association between a base textual expression and a textual annotation. Although possible, they should not be used to set a text emphasis, using symbol characters containing dots or lines (like the East Asian emphasis mark, called ..). Instead, an out of band formatting process should be used.
Unpaired Delimiters
ANNOTATION ANCHOR characters must precede their corresponding ANNOTATION TERMINATOR characters. Unpaired ANCHORS or TERMINATORS may be ignored.
Mis-scoped Separators
ANNOTATION SEPARATOR characters occurring outside a pair of delimiters are ignored.
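The two rules on unpaired delimiters and mis-scoped separators can be sketched in Python as a single clean-up pass. Here "ignored" is interpreted as "removed from the text", which is only one possible reading; the function name drop_stray_delimiters is hypothetical.

```python
ANCHOR = "\uFFF9"      # proposed INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\uFFFA"   # proposed INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\uFFFB"  # proposed INTERLINEAR ANNOTATION TERMINATOR


def drop_stray_delimiters(text: str) -> str:
    """Remove unpaired anchors and terminators, and separators that
    are not inside a properly paired object (hypothetical helper)."""
    drop = set()    # indices of characters to remove
    open_objs = []  # per open anchor: [anchor index, separator indices...]
    for i, ch in enumerate(text):
        if ch == ANCHOR:
            open_objs.append([i])
        elif ch == TERMINATOR:
            if open_objs:
                open_objs.pop()          # properly paired object
            else:
                drop.add(i)              # terminator without anchor
        elif ch == SEPARATOR:
            if open_objs:
                open_objs[-1].append(i)  # in scope, pending a terminator
            else:
                drop.add(i)              # separator outside any object
    for indices in open_objs:
        drop.update(indices)             # anchor never terminated
    return "".join(ch for i, ch in enumerate(text) if i not in drop)
```

Separators that belong to an anchor which is never terminated are removed together with that anchor, since they end up outside any pair of delimiters.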
Recursion
Annotations can be recursive.
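To illustrate recursion, the following Python sketch parses a text stream into a nested structure. The representation is an assumption made for this example: an annotation object becomes a list of parts (base first, then annotations), and each part is a list whose items are either strings or further annotation objects. The function name parse is hypothetical, and stray delimiters are simply skipped.

```python
ANCHOR = "\uFFF9"      # proposed INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\uFFFA"   # proposed INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\uFFFB"  # proposed INTERLINEAR ANNOTATION TERMINATOR


def parse(text: str) -> list:
    """Parse text into strings and nested annotation objects.

    Each annotation object is a list of parts; each part is a list
    of items (strings or nested objects). Hypothetical helper.
    """
    root = []
    stack = []    # (enclosing item list, parts of the open object)
    items = root  # list that currently receives content
    for ch in text:
        if ch == ANCHOR:
            parts = [[]]             # start with an empty base part
            items.append(parts)
            stack.append((items, parts))
            items = parts[0]
        elif ch == SEPARATOR and stack:
            parts = stack[-1][1]
            parts.append([])         # start the next annotation part
            items = parts[-1]
        elif ch == TERMINATOR and stack:
            items = stack.pop()[0]   # resume the enclosing list
        elif ch not in (SEPARATOR, TERMINATOR):
            if items and isinstance(items[-1], str):
                items[-1] += ch      # extend the current text run
            else:
                items.append(ch)
    return root
```

For the stream a, ANCHOR, b, SEPARATOR, c, ANCHOR, d, SEPARATOR, e, TERMINATOR, TERMINATOR, f the sketch yields an outer object whose annotation part itself contains an inner object with base d and annotation e.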
None of the formatting information for an annotation is encoded in the plain text stream. Therefore, annotation markers serve as placeholders for an implementation that has access to that information from another source. They are not formatting commands. As mentioned above, the ZERO WIDTH SPACE may be used to create annotation subgroups to facilitate selective formatting.
Annotations are typically ignored for collation, or optionally preprocessed to act as tie breakers only. However, importantly, annotation base characters are not ignored, but treated like regular text.
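The collation behavior described above can be sketched in Python as a two-level sort key. This is an illustration only: the function name collation_key is hypothetical, and plain code point comparison stands in for a real collation algorithm.

```python
ANCHOR = "\uFFF9"      # proposed INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\uFFFA"   # proposed INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\uFFFB"  # proposed INTERLINEAR ANNOTATION TERMINATOR


def collation_key(text: str) -> tuple[str, str]:
    """Return (main text including base characters, annotation text).

    Sorting on this tuple compares the main text first and uses the
    annotation text only as a tie breaker. Code point order is an
    assumption standing in for a real collation algorithm.
    """
    main, notes = [], []
    stack = []  # per open object: True once inside annotation text
    for ch in text:
        if ch == ANCHOR:
            stack.append(False)
        elif ch == SEPARATOR:
            if stack:
                stack[-1] = True
        elif ch == TERMINATOR:
            if stack:
                stack.pop()
        elif any(stack):
            notes.append(ch)  # annotation character: tie breaker only
        else:
            main.append(ch)   # regular text or annotation base character
    return "".join(main), "".join(notes)


entries = [
    ANCHOR + "東京" + SEPARATOR + "とうきょう" + TERMINATOR + "都",
    "東京",
    ANCHOR + "東京" + SEPARATOR + "トウキョウ" + TERMINATOR,
]
ordered = sorted(entries, key=collation_key)
```

An implementation that ignores annotations entirely for collation would compare only the first element of the key.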