Revision | 1 |
Authors | Martin Dürst (mduerst@w3.org), Mark Davis (mark@unicode.org), Hideki Hiura (hideki.hiura@eng.sun.com) |
Date | 1999-06-08 |
This Version | most probably: http://www.unicode.org/unicode/reports/tr20/tr20-1.html |
Previous Version | none |
Latest Version | http://www.unicode.org/unicode/reports/tr20 |
This document contains guidelines for avoiding the use of certain characters in markup.
This proposed draft is published for review purposes. This draft has not yet been considered by the Unicode Technical Committee. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document.
The content of technical reports must be understood in the context of the latest version of the Unicode Standard. See http://www.unicode.org/unicode/standard/versions/ for more information.
This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to unicore@unicode.org.
The Unicode Standard contains a large number of characters in order to cover the scripts of the world. It also contains characters for compatibility with older character encodings, and characters with control-like functions included for various reasons.
For document and data interchange, the Internet and the World Wide Web are increasingly making use of marked-up text. The principles of marked-up text can interfere with some control-like characters in various undesirable ways.
[a more extensive overview of Unicode and markup will be added to level out the background of various audiences]
This report uses XML as a prominent and general example of markup. The XML namespace notation [Namespace] is used to indicate that a certain element is taken from a specific markup language. As an example, the prefix 'html:' indicates that this element is taken from [XHTML]. This means that the examples containing the namespace prefix 'html:' are assumed to include a namespace declaration of xmlns:html="..." (insert the appropriate URI for XHTML later).
Characters are denoted using the notation used in the Unicode Standard, i.e. U+ followed by their hexadecimal number. [Should this be replaced by the XML convention? Probably not, because we don't want to see these in XML :-)]
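As a small illustration of this notation, the following Python sketch formats a single character as U+ followed by its hexadecimal number, padded to at least four uppercase digits. The function name `u_notation` is hypothetical and is used only for this example.

```python
def u_notation(ch: str) -> str:
    """Return the U+XXXX notation for a single character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    # Uppercase hexadecimal, zero-padded to at least four digits.
    return f"U+{ord(ch):04X}"


print(u_notation("\uFFFC"))  # object replacement character
print(u_notation("\u2028"))  # line separator
```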
This chapter will contain general considerations regarding control-like characters in markup. In particular, it is planned to address the following points:
The following table contains the characters currently considered not suitable for use with markup. Each category is further discussed below.
Codepoints | Names/Description | Short Comment |
---|---|---|
U+202A .. U+202E | BIDI embedding controls (LRE, RLE, LRO, RLO, PDF) | Strongly discouraged in HTML 4.0; RLM and LRM are allowed |
U+2028 .. U+2029 | Line and paragraph separator (under discussion) | Use <html:br />, <html:p></html:p>, or equivalent |
U+206A .. U+206B | Symmetric swapping | Strongly discouraged in Unicode 2.0 |
U+206C .. U+206D | Arabic form shaping | Strongly discouraged in Unicode 2.0 |
U+206E .. U+206F | National digit shapes | Strongly discouraged in Unicode 2.0 |
U+FFF9 .. U+FFFB | Interlinear annotation controls | Use ruby markup |
U+FFFC | Object replacement character (under discussion) | Use markup, e.g. HTML <object> |
U+1xxxx???? | Language Tag codepoints (if and when they will be encoded) | Use html:lang or xml:lang |
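
The table above can be turned into a simple check. The following Python sketch scans a string for the listed code points and reports where they occur. It is an illustration only: the names `UNSUITABLE_RANGES` and `find_unsuitable` are hypothetical, and the language tag code points are left out because the table does not yet give their values.

```python
# Code point ranges copied from the table above (inclusive on both ends).
# The language tag code points are omitted because they are not yet assigned
# in the table.
UNSUITABLE_RANGES = [
    (0x202A, 0x202E, "BIDI embedding controls"),
    (0x2028, 0x2029, "Line and paragraph separator"),
    (0x206A, 0x206B, "Symmetric swapping"),
    (0x206C, 0x206D, "Arabic form shaping"),
    (0x206E, 0x206F, "National digit shapes"),
    (0xFFF9, 0xFFFB, "Interlinear annotation controls"),
    (0xFFFC, 0xFFFC, "Object replacement character"),
]


def find_unsuitable(text):
    """Return (index, 'U+XXXX', category) for each listed character in text."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        for first, last, category in UNSUITABLE_RANGES:
            if first <= cp <= last:
                hits.append((index, f"U+{cp:04X}", category))
                break
    return hits


# Hypothetical sample containing an RLO and an object replacement character.
sample = "abc\u202Edef\uFFFC"
for hit in find_unsuitable(sample):
    print(hit)
```

Characters that the table explicitly allows, such as LRM and RLM, fall outside these ranges and are not reported.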
A later version of this document will discuss each of the character categories. For each of the categories/characters, the following points may be mentioned/discussed:
The following subsection gives an example:
Short description: The object replacement character is used to stand in place of an object (e.g. an image) included in a text.
Reason for inclusion: The object replacement character was included in Unicode only in order to reserve a codepoint for a very frequent application-internal use. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains objects such as images, it is extremely helpful for implementations to have a sentinel in the text itself; any additional information is kept separately.
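The application-internal arrangement described above might look roughly like the following Python sketch. It assumes a very simple model: the text is one string, and a side table maps the position of each U+FFFC sentinel to information about the object. The class `Document`, its methods, and the file name `logo.png` are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass, field

OBJECT_REPLACEMENT = "\uFFFC"


@dataclass
class Document:
    # The text is one linear string; each embedded object is represented
    # in it by a U+FFFC sentinel.
    text: str = ""
    # Side table: text position of a sentinel -> object information.
    objects: dict = field(default_factory=dict)

    def append_text(self, s: str) -> None:
        self.text += s

    def append_object(self, info: str) -> None:
        # Record the object at the position where its sentinel is stored,
        # so that the two structures stay in sync.
        self.objects[len(self.text)] = info
        self.text += OBJECT_REPLACEMENT


doc = Document()
doc.append_text("See ")
doc.append_object("image: logo.png")
doc.append_text(" for details.")

# Walk the text and look up the object behind each sentinel.
for pos, ch in enumerate(doc.text):
    if ch == OBJECT_REPLACEMENT:
        print(pos, doc.objects[pos])
```

In this model the string alone carries only the sentinel; everything that identifies the object lives in the side table.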
Problems when used in markup: Including an object replacement character in markup text does not work because the additional information (what object to include, ...) is not available.
Problems with other uses: The object replacement character is also problematic when used in plain text.
Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <html:img src='...' />, <html:object ...>, or <html:applet ...>. These constructs make it possible to provide all additional information needed to identify and use the object in question.
What to do if detected: In a proxy context or browser context, ignore. When received in an editing context, remove, possibly with a warning to the user.
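A minimal Python sketch of this handling follows. The function name `handle_object_replacement` and the context labels are hypothetical. The sketch also assumes that "ignore" means passing the text on unchanged, which is only one possible reading of the guideline.

```python
import warnings

OBJECT_REPLACEMENT = "\uFFFC"


def handle_object_replacement(text: str, context: str) -> str:
    """Apply the guideline above to U+FFFC for the given context.

    context is a hypothetical label: "editing", "proxy", or "browser".
    """
    if context == "editing":
        count = text.count(OBJECT_REPLACEMENT)
        if count:
            # The optional warning to the user is issued via the warnings module.
            warnings.warn(f"removed {count} U+FFFC character(s)")
        return text.replace(OBJECT_REPLACEMENT, "")
    if context in ("proxy", "browser"):
        # Assumption: "ignore" is read as passing the text on unchanged.
        return text
    raise ValueError(f"unknown context: {context!r}")


print(repr(handle_object_replacement("a\uFFFCb", "editing")))
```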
When this report is finalized, it will treat all relevant characters in the then current version of the Unicode Standard, and it may include some others whose addition is anticipated/planned/feared.
As the Unicode Standard is updated and new characters get added, new characters that are not suitable for markup may also be added. However, it is hoped that this report will help to reduce such additions as much as possible. These characters will be flagged as such in the appropriate data file. This file should always be checked to have the most up-to-date information. This report itself may be updated periodically to give additional background information.
For more information, see:
[add a pointer to the latest data file once we have one]
This document does not specify any kind of conformance clause. However, other documents may specify conformance including references to this document.
(to be completed)
[Charmod]
[Charreq]
[Namespace]
[HTML]
[Unicode]
[XHTML]
[XML]
Copyright © 1998-1998 Unicode, Inc. All Rights Reserved.
The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.
Unicode Home Page: http://www.unicode.org
Unicode Technical Reports: http://www.unicode.org/unicode/reports/