Part of L2/99-181

Proposed DRAFT Unicode Technical Report #20

Characters not Suitable for Markup

Revision 1
Authors Martin Dürst (mduerst@w3.org), Mark Davis (mark@unicode.org), Hideki Hiura (hideki.hiura@eng.sun.com)
Date 1999-06-08
This Version most probably: http://www.unicode.org/unicode/reports/tr20/tr20-1.html
Previous Version none
Latest Version http://www.unicode.org/unicode/reports/tr20

Summary

This document contains guidelines for avoiding the use of certain characters in markup.

Status of this document

This proposed draft is published for review purposes. This draft has not yet been considered by the Unicode Technical Committee. At its next meeting, the Unicode Technical Committee may approve, reject, or further amend this document.

The content of technical reports must be understood in the context of the latest version of the Unicode Standard. See http://www.unicode.org/unicode/standard/versions/ for more information.

This document does not, at this time, imply any endorsement by the Consortium's staff or member organizations. Please mail comments to unicore@unicode.org.

Table of Contents

  1. Introduction
  2. General Considerations
  3. List of Characters
  4. Versioning
  5. Conformance

1. Introduction

The Unicode Standard contains a large number of characters in order to cover the scripts of the world. It also contains characters for compatibility with older character encodings, and characters with control-like functions included for various reasons.

For document and data interchange, the Internet and the World Wide Web are increasingly making use of marked-up text. The principles of marked-up text can interfere with some control-like characters in various undesirable ways.

[a more extensive overview of Unicode and markup will be added to level out the background of various audiences]

Notation

This report uses XML as a prominent and general example of markup. The XML namespace notation [Namespace] is used to indicate that a certain element is taken from a specific markup language. As an example, the prefix 'html:' indicates that this element is taken from [XHTML]. This means that the examples containing the namespace prefix 'html:' are assumed to include a namespace declaration of xmlns:html="..." (insert the appropriate URI for XHTML later).

Characters are denoted using the notation used in the Unicode Standard, i.e. U+ followed by their hexadecimal number. [Should this be replaced by the XML convention? Probably not, because we don't want to see these in XML :-)]
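As an illustration of this notation, the following minimal Python sketch converts between a character and its U+ form. The function names `u_notation` and `from_u_notation` are hypothetical and chosen only for this example; padding to at least four hexadecimal digits follows the way code points are written in this report.

```python
def u_notation(ch: str) -> str:
    """Return the U+ notation for a single character, padded to at least four hex digits."""
    return f"U+{ord(ch):04X}"


def from_u_notation(label: str) -> str:
    """Return the character named by a label such as 'U+FFFC'."""
    if not label.upper().startswith("U+"):
        raise ValueError(f"not in U+ notation: {label!r}")
    # Everything after the 'U+' prefix is the hexadecimal code point number.
    return chr(int(label[2:], 16))
```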

2. General Considerations

This chapter will contain general considerations regarding control-like characters in markup. In particular, it is planned to address the following points:

3. List of Characters

The following table contains the characters currently considered not suitable for use with markup. Each category is further discussed below.

Codepoints        Names/Description                                           Short Comment

U+202A .. U+202E  BIDI embedding controls (LRE, RLE, LRO, RLO, PDF)           Strongly discouraged in HTML 4.0; RLM and LRM are allowed
U+2028 .. U+2029  Line and paragraph separator (under discussion)             Use <html:br />, <html:p></html:p>, or equivalent
U+206A .. U+206B  Symmetric swapping                                          Strongly discouraged in Unicode 2.0
U+206C .. U+206D  Arabic form shaping                                         Strongly discouraged in Unicode 2.0
U+206E .. U+206F  National digit shapes                                       Strongly discouraged in Unicode 2.0
U+FFF9 .. U+FFFB  Interlinear annotation controls                             Use ruby markup
U+FFFC            Object replacement character (under discussion)             Use markup, e.g. HTML <object>
U+1xxxx????       Language Tag codepoints (if and when they will be encoded)  Use html:lang or xml:lang
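The ranges in the table can be checked mechanically. The following Python sketch is one possible way to scan a string for the listed characters; it is not part of this report's recommendations. The names `UNSUITABLE_RANGES` and `find_unsuitable` are hypothetical, and the language tag row is left out because its code points are not assigned in this draft.

```python
# Inclusive code point ranges copied from the table in this section.
# The language tag row is omitted: its code points are not yet assigned.
UNSUITABLE_RANGES = [
    (0x202A, 0x202E, "BIDI embedding controls"),
    (0x2028, 0x2029, "line and paragraph separator"),
    (0x206A, 0x206B, "symmetric swapping"),
    (0x206C, 0x206D, "Arabic form shaping"),
    (0x206E, 0x206F, "national digit shapes"),
    (0xFFF9, 0xFFFB, "interlinear annotation controls"),
    (0xFFFC, 0xFFFC, "object replacement character"),
]


def find_unsuitable(text):
    """Return (index, U+ notation, category) for each listed character found in text."""
    hits = []
    for index, ch in enumerate(text):
        cp = ord(ch)
        for first, last, category in UNSUITABLE_RANGES:
            if first <= cp <= last:
                hits.append((index, f"U+{cp:04X}", category))
                break
    return hits
```

For example, `find_unsuitable("a\u202Eb")` reports the RLO at index 1, while a string containing only LRM (U+200E) or RLM (U+200F) produces no hits, in line with the comment in the first table row.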

A later version of this document will discuss each of the character categories. For each of the categories/characters, the following points may be mentioned/discussed:

The following subsection gives an example:

Object Replacement Character, U+FFFC

Short description: The object replacement character is used to stand in place of an object (e.g. an image) included in a text.

Reason for inclusion: The object replacement character was included in Unicode only in order to reserve a codepoint for a very frequent application-internal use. Many text-processing applications store the text and the associated markup (or in some cases styling information) of a document in separate structures. The actual text is kept in a single linear structure; additional information is kept separately with pointers to the appropriate text positions. The overall implementation makes sure that these two structures are kept in sync. If the text contains objects such as images, it is extremely helpful for implementations to have a sentinel in the text itself; any additional information is kept separately.
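The application-internal arrangement described above can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the class `Document`, its method names, and the dictionary used as the side structure are hypothetical and do not describe any particular application.

```python
from dataclasses import dataclass, field

OBJECT_REPLACEMENT = "\uFFFC"


@dataclass
class Document:
    text: str = ""                               # single linear text structure
    objects: dict = field(default_factory=dict)  # text position -> object information

    def append_text(self, s):
        self.text += s

    def append_object(self, info):
        # The sentinel marks where the object sits in the text;
        # everything else about the object is stored in the side structure.
        self.objects[len(self.text)] = info
        self.text += OBJECT_REPLACEMENT

    def in_sync(self):
        """True if every sentinel has an entry in the side structure and vice versa."""
        positions = {i for i, ch in enumerate(self.text) if ch == OBJECT_REPLACEMENT}
        return positions == set(self.objects)
```

A sentinel that arrives without a matching entry, for instance through `append_text("\uFFFC")`, makes `in_sync()` return `False`. This is the situation that arises when the character is interchanged without the separately kept information.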

Problems when used in markup: Including an object replacement character in markup text does not work because the additional information (what object to include, ...) is not available.

Problems with other uses: The object replacement character is also problematic when used in plain text.

Replacement markup: The markup to be used in place of the Object Replacement Character depends on the object in question and the markup context it is used in. Typical cases are <html:img src='...' />, <html:object ...>, or <html:applet ...>. These constructs make it possible to provide all additional information needed to identify and use the object in question.

What to do if detected: In a proxy context or browser context, ignore the character. In an editing context, remove it, possibly with a warning to the user.
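The handling described above might be sketched in Python as follows. The function name and the context labels 'proxy', 'browser', and 'editor' are hypothetical. Reading "ignore" as "pass the text through unchanged" is an assumption of this sketch, not something this report specifies.

```python
import warnings

OBJECT_REPLACEMENT = "\uFFFC"


def handle_object_replacement(text, context):
    """Apply the suggested handling of U+FFFC for a given context.

    context is one of 'proxy', 'browser' or 'editor' (hypothetical labels).
    """
    if context not in ("proxy", "browser", "editor"):
        raise ValueError(f"unknown context: {context!r}")
    if OBJECT_REPLACEMENT not in text:
        return text
    if context in ("proxy", "browser"):
        # Assumption: "ignore" means the text is passed through unchanged.
        return text
    # Editing context: remove the character and warn the user.
    warnings.warn("U+FFFC (object replacement character) removed from text")
    return text.replace(OBJECT_REPLACEMENT, "")
```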

4. Versioning

When this report is finalized, it will treat all relevant characters in the then current version of the Unicode Standard, and it may include some others whose addition is anticipated/planned/feared.

As the Unicode Standard is updated and new characters get added, new characters that are not suitable for markup may also be added. However, it is hoped that this report will help to reduce such additions as much as possible. These characters will be flagged as such in the appropriate data file. This file should always be checked to have the most up-to-date information. This report itself may be updated periodically to give additional background information.

For more information, see:

[add a pointer to the latest data file once we have one]

5. Conformance

This document does not specify any kind of conformance clause. However, other documents may specify conformance including references to this document.

References

(to be completed)

[Charmod]

[Charreq]

[Namespace]

[HTML]

[Unicode]

[XHTML]

[XML]

Copyright

Copyright © 1998-1998 Unicode, Inc. All Rights Reserved.

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained in or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.


Unicode Home Page: http://www.unicode.org

Unicode Technical Reports: http://www.unicode.org/unicode/reports/