Part of L2/99-181

Proposed DRAFT Unicode Technical Report #20

Characters not Suitable for Markup

Revision	1
Authors	Martin Dürst (mduerst@w3.org), Mark Davis (mark@unicode.org), Hideki Hiura (hideki.hiura@eng.sun.com)
Date	1999-06-08
This Version	most probably: http://www.unicode.org/unicode/reports/tr20/tr20-1.html
Previous Version	none
Latest Version	http://www.unicode.org/unicode/reports/tr20

Summary

This document contains guidelines for avoiding the use of certain characters in markup.

Status of this document

This proposed draft is published for review purposes. This draft has not yet been considered by the Unicode Technical Committee. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document.

The content of technical reports must be understood in the context of the latest version of the Unicode Standard. See http://www.unicode.org/unicode/standard/versions/ for more information.

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to unicore@unicode.org.

Introduction
- Notation
General Considerations
List of Characters
Versioning
Conformance

1. Introduction

The Unicode Standard contains a large number of characters in order to cover the scripts of the world. It also contains characters for compatibility with older character encodings, and characters with control-like functions included for various reasons.

For document and data interchange, the Internet and the World Wide Web increasingly make use of marked-up text. The principles of marked-up text can interfere with some control-like characters in various undesirable ways.

[a more extensive overview of Unicode and markup will be added to level out the background of various audiences]

Notation

This report uses XML as a prominent and general example of markup. The XML namespace notation [Namespace] is used to indicate that a certain element is taken from a specific markup language. As an example, the prefix 'html:' indicates that this element is taken from [XHTML]. This means that the examples containing the namespace prefix 'html:' are assumed to include a namespace declaration of xmlns:html="..." (insert the appropriate URI for XHTML later).

Characters are denoted using the notation used in the Unicode Standard, i.e. U+ followed by their hexadecimal number. [Should this be replaced by the XML convention? Probably not, because we don't want to see these in XML :-)]
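
As a small illustration of the two notations mentioned here, the following Python sketch converts between a character, its U+ notation, and the corresponding XML numeric character reference. The function names are hypothetical and chosen for this example only; they are not part of any specification.

```python
def to_u_notation(ch):
    """Return the U+ notation (at least four hex digits) for one character."""
    return "U+{:04X}".format(ord(ch))


def from_u_notation(notation):
    """Return the character denoted by a string such as 'U+FFFC'."""
    if not notation.upper().startswith("U+"):
        raise ValueError("expected U+ followed by hexadecimal digits")
    return chr(int(notation[2:], 16))


def to_xml_reference(ch):
    """Return the XML numeric character reference for one character."""
    return "&#x{:X};".format(ord(ch))
```

For example, `to_u_notation("\uFFFC")` gives `U+FFFC`, while `to_xml_reference("\uFFFC")` gives `&#xFFFC;`.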

2. General Considerations

This chapter will contain general considerations regarding control-like characters in markup. In particular, it is planned to address the following points:

Linearity of text vs. hierarchy of markup structure
Coincidence (in most cases) of semantic markup and functions of control characters (e.g. <html:q> for insertions of fragments from another language,...)
Extensibility of markup
Problems with structural alignment between markup and control characters
Ambiguity or interference of control characters in markup source

3. List of Characters

The following table contains the characters currently considered not suitable for use with markup. Each category is further discussed below.

Codepoints	Names/Description	Short Comment
U+202A .. U+202E	BIDI embedding controls (LRE, RLE, LRO, RLO, PDF)	Strongly discouraged in HTML 4.0; RLM and LRM are allowed
U+2028 .. U+2029	Line and paragraph separator (under discussion)	Use <html:br />, <html:p></html:p>, or equivalent
U+206A .. U+206B	Symmetric swapping	Strongly discouraged in Unicode 2.0
U+206C .. U+206D	Arabic form shaping	Strongly discouraged in Unicode 2.0
U+206E .. U+206F	National digit shapes	Strongly discouraged in Unicode 2.0
U+FFF9 .. U+FFFB	Interlinear annotation controls	Use ruby markup
U+FFFC	Object replacement character (under discussion)	Use markup, e.g. HTML <object>
U+1xxxx????	Language Tag codepoints (if and when they will be encoded)	Use html:lang or xml:lang
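
The table can be turned into a simple check. The following Python sketch scans a string for the listed codepoints; the names `UNSUITABLE_RANGES` and `find_unsuitable` are hypothetical. The language tag codepoints are left out because the table does not give their values, and the entries marked "under discussion" are included only for illustration.

```python
# Inclusive ranges copied from the table above. The language tag
# codepoints are omitted because the table does not specify them.
UNSUITABLE_RANGES = [
    (0x202A, 0x202E, "BIDI embedding controls"),
    (0x2028, 0x2029, "Line and paragraph separator (under discussion)"),
    (0x206A, 0x206B, "Symmetric swapping"),
    (0x206C, 0x206D, "Arabic form shaping"),
    (0x206E, 0x206F, "National digit shapes"),
    (0xFFF9, 0xFFFB, "Interlinear annotation controls"),
    (0xFFFC, 0xFFFC, "Object replacement character (under discussion)"),
]


def find_unsuitable(text):
    """Return (index, 'U+XXXX', description) for each listed character in text."""
    hits = []
    for index, ch in enumerate(text):
        code = ord(ch)
        for first, last, description in UNSUITABLE_RANGES:
            if first <= code <= last:
                hits.append((index, "U+{:04X}".format(code), description))
                break
    return hits
```

Characters outside the listed ranges, such as LRM (U+200E) and RLM (U+200F), are not reported by this check.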

A later version of this document will discuss each of the character categories. For each of the categories/characters, the following points may be mentioned or discussed:

Short description of semantics
Reason for inclusion in Unicode
Specific problems when used with markup
Other areas where problems may occur (e.g. plain text)
What kind of markup to use instead
What to do if detected (remove/ignore/replace/complain,...)

The following subsection gives an example:

Object Replacement Character, U+FFFC

Short description: The object replacement character is used to stand in place of an object (e.g. an image) included in a text.

Reason for inclusion: The object replacement character was included in Unicode only in order to reserve a codepoint for a very frequent application-internal use. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains objects such as images, it is extremely helpful for implementations to have a sentinel in the text itself; any additional information is kept separately.
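
The arrangement described above might look roughly like the following Python sketch. The class `Document` and its methods are hypothetical and do not describe any particular application; they only illustrate linear text with a sentinel plus a separate table keyed by text position.

```python
OBJECT_REPLACEMENT = "\uFFFC"


class Document:
    """Linear text plus a separate table of embedded objects keyed by position."""

    def __init__(self):
        self.text = ""
        self.objects = {}  # text position -> description of the object

    def append_text(self, s):
        self.text += s

    def append_object(self, description):
        # The sentinel marks where the object sits; the details live elsewhere.
        self.objects[len(self.text)] = description
        self.text += OBJECT_REPLACEMENT

    def in_sync(self):
        """Check that every sentinel has an entry and every entry a sentinel."""
        positions = {
            i for i, ch in enumerate(self.text) if ch == OBJECT_REPLACEMENT
        }
        return positions == set(self.objects)
```

Interchanging only `text` would carry the sentinel but not the matching entry in `objects`, so the receiver could not tell which object was meant.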

Problems when used in markup: Including an object replacement character in markup text does not work because the additional information (what object to include,...) is not available.

Problems with other uses: The object replacement character is also problematic when used in plain text.

Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <html:img src='...' />, <html:object ...>, or <html:applet ...>. These constructs make it possible to provide all additional information needed to identify and use the object in question.

What to do if detected: In a proxy context or browser context, ignore. When received in an editing context, remove, maybe with a warning to the user.
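
A minimal Python sketch of this handling follows. The function name and the context labels 'proxy', 'browser', and 'editor' are hypothetical. The sketch also assumes that "ignore" means passing the character through unchanged; the text could equally be read as not rendering it.

```python
import warnings

OBJECT_REPLACEMENT = "\uFFFC"


def handle_object_replacement(text, context):
    """Apply the suggested handling of U+FFFC for the given context.

    context is a hypothetical label: 'proxy', 'browser', or 'editor'.
    """
    if context in ("proxy", "browser"):
        # Assumed reading of "ignore": leave the text untouched.
        return text
    if context == "editor":
        count = text.count(OBJECT_REPLACEMENT)
        if count:
            # Remove the character and tell the user about it.
            warnings.warn("removed {} occurrence(s) of U+FFFC".format(count))
        return text.replace(OBJECT_REPLACEMENT, "")
    raise ValueError("unknown context: {!r}".format(context))
```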

4. Versioning

When this report is finalized, it will treat all relevant characters in the then current version of the Unicode Standard, and it may include some others whose addition is anticipated/planned/feared.

As the Unicode Standard is updated and new characters get added, new characters that are not suitable for markup may also be added. However, it is hoped that this report will help to reduce such additions as much as possible. These characters will be flagged as such in the appropriate data file. This file should always be checked for the most up-to-date information. This report itself may be updated periodically to give additional background information.

For more information, see:

Versions of the Unicode Standard (http://www.unicode.org/unicode/standard/versions)
Unicode Character Database Format (ftp://ftp.unicode.org/Public/2.1-Update3/ReadMe-2.1.8.txt)
Unicode Character Database (ftp://ftp.unicode.org/Public/2.1-Update3/UnicodeData-2.1.8.txt)

[add a pointer to the latest data file once we have one]

5. Conformance

This document does not specify any kind of conformance clause. However, other documents may specify conformance including references to this document.

References

(to be completed)

[Charmod]

[Charreq]

[Namespace]

[HTML]

[Unicode]

[XHTML]

[XML]

Copyright

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.

Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/reports/