NCITS/L2/01- 319R

Date: September 10, 2001


Title:	White-space processing
Source:	Michel Suignard (Microsoft)
Action:	FYI
Distribution:	NCITS/L2 and UTC

Following are extracts from two documents.

The first one comes from XHTML Modularization http://www.w3.org/TR/xhtml-modularization/ and expresses the conformance requirements that apply to white-space processing. It has been augmented with an additional note (Note 1) which is not yet part of XHTML but has been worked on by the W3C I18n Group. There are still some issues about the meaning of the ‘removal’ of white space characters. Are they removed from the infoset, or are they just removed from a rendering point of view?

The second one comes from a CSS3 Text module (link not yet public) which describes a proposal for white-space processing in the context of future versions of CSS.

These two extracts hint at further specification work on classifying Unicode characters according to their white-space processing behavior. The author and the relevant W3C WG would like to get preliminary feedback on the current text before engaging in further development in these matters.

Michel Suignard

XHTML Modularization

3.5. XHTML Family User Agent Conformance

1…

9. A conforming user agent must meet all of the following criteria (as defined in [XHTML1]):

In order to be consistent with the XML 1.0 Recommendation [XML], the user agent must parse and evaluate an XHTML document for well-formedness. If the user agent claims to be a validating user agent, it must also validate documents against their referenced DTDs according to [XML].
When the user agent claims to support facilities defined within this specification or required by this specification through normative reference, it must do so in ways consistent with the facilities' definition.
When a user agent processes an XHTML document as generic [XML], it shall only recognize attributes of type ID (e.g., the id attribute on most XHTML elements) as fragment identifiers.
If a user agent encounters an element it does not recognize, it must continue to process the children of that element. If the content is text, the text must be presented to the user.
If a user agent encounters an attribute it does not recognize, it must ignore the entire attribute specification (i.e., the attribute and its value).
If a user agent encounters an attribute value it doesn't recognize, it must use the default attribute value.
If it encounters an entity reference (other than one of the predefined entities) for which the user agent has processed no declaration (which could happen if the declaration is in the external subset which the user agent hasn't read), the entity reference should be rendered as the characters (starting with the ampersand and ending with the semi-colon) that make up the entity reference.
When rendering content, user agents that encounter characters or character entity references that are recognized but not renderable should display the document in such a way that it is obvious to the user that normal rendering has not taken place.
White space is handled according to the following rules. The following characters are defined in [XML] as white space characters:

SPACE (U+0020)
HORIZONTAL TABULATION (U+0009)
CARRIAGE RETURN (U+000D)
LINE FEED (U+000A)

The XML processor normalizes different systems' line end codes into a single LINE FEED character, which is passed up to the application.
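The line-end normalization described above can be sketched as follows. This is a minimal illustration, assuming the input is already a decoded string and that the line end codes in question are CR LF, a lone CR, and LF; the function name `normalize_line_ends` is hypothetical and not part of any specification.

```python
import re


def normalize_line_ends(text: str) -> str:
    """Replace CR LF pairs and lone CRs with a single LINE FEED (U+000A)."""
    # "\r\n?" matches a CARRIAGE RETURN optionally followed by a LINE FEED,
    # so a CR LF pair becomes one LINE FEED rather than two.
    return re.sub(r"\r\n?", "\n", text)
```

For example, `normalize_line_ends("a\r\nb\rc")` yields a string in which both line ends are plain LINE FEED characters.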

The user agent must process white space characters in the data received from the XML processor as follows:

All white space surrounding block elements should be removed.
Comments are removed entirely and do not affect white space handling. One white space character on either side of a comment is treated as two white space characters.
When the 'xml:space' attribute is set to 'preserve', white space characters must be preserved and consequently LINE FEED characters within a block must not be converted.
When the 'xml:space' attribute is not set to 'preserve', then:

Leading and trailing white space inside a block element must be removed.
LINE FEED characters must be converted into one of the following characters: a SPACE character, a ZERO WIDTH SPACE character (U+200B), or no character (i.e. removed). The choice of the resulting character is user agent dependent and is conditioned by the script property of the characters preceding and following the LINE FEED character.
A sequence of white space characters without any LINE FEED characters must be reduced to a single SPACE character.
A sequence of white space characters with one or more LINE FEED characters must be reduced in the same way as a single LINE FEED character.
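The rules for content that is not under 'xml:space' set to 'preserve' can be sketched as below. This is only an illustration under stated assumptions: the text of one block element is available as a single string, line ends have already been normalized to LINE FEED, and the later conversion of each remaining LINE FEED is handled separately. The name `normalize_block_text` is hypothetical.

```python
import re

XML_WHITE_SPACE = " \t\r\n"  # the four characters [XML] defines as white space


def normalize_block_text(text: str) -> str:
    """Strip leading/trailing white space and reduce each white space run."""
    # Leading and trailing white space inside the block is removed.
    text = text.strip(XML_WHITE_SPACE)

    def reduce_run(match: re.Match) -> str:
        # A run containing a LINE FEED is reduced to one LINE FEED, which a
        # later step converts; any other run becomes a single SPACE.
        return "\n" if "\n" in match.group() else " "

    return re.sub(r"[ \t\r\n]+", reduce_run, text)
```

Each LINE FEED left in the result still has to be converted to a SPACE, a ZERO WIDTH SPACE, or nothing, depending on the scripts on either side.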

White space in attribute values is processed according to [XML].

In determining how to convert a LINE FEED character, a user agent must meet the following rules, whereby the script of characters on either side of the LINE FEED determines the choice of the replacement. The assignment of script names to all characters is done in accordance with the Unicode [UNICODE] technical report TR#24 (Script Names).

1. COMMON script characters (such as punctuation) are treated as if they belong to the same script as the character on the other side of the LINE FEED character.

2. INHERITED script characters (such as combining characters) are treated as if they belong to the same script as the previous character if they precede the LINE FEED character, or to the same script as the character on the preceding side if they follow the LINE FEED character.

3. If the characters preceding and following the LINE FEED character belong to HAN, HIRAGANA or KATAKANA script, the LINE FEED must be converted into no character.

4. If the characters preceding and following the LINE FEED character belong to KHMER, LAO, MYANMAR or THAI script, the LINE FEED must be converted into a ZERO WIDTH SPACE character (U+200B) or no character. (This rule may be extended in the future to additional scripts, not yet encoded, that do not separate their words by space characters.)

5. If neither condition (3) nor condition (4) is true, the LINE FEED character must be converted into a SPACE character. This covers the case of many scripts like LATIN, CYRILLIC, GREEK, etc. This also covers the case where only the COMMON script has been detected on both sides, and the case where the scripts belong to different categories.
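Rules 1 through 5 can be sketched as follows. This is a rough illustration, not a conforming implementation: Python's standard library does not expose the Unicode script property, so the hypothetical helper `script_of` approximates it with a few code point block ranges and general categories (every non-listed letter is lumped into "OTHER", every non-letter into "COMMON"). For rule 4, the sketch always chooses ZERO WIDTH SPACE, although the text also permits no character.

```python
import unicodedata

# Approximate block ranges standing in for the real script property (assumption).
_RANGES = [
    (0x3040, 0x309F, "HIRAGANA"), (0x30A0, 0x30FF, "KATAKANA"),
    (0x4E00, 0x9FFF, "HAN"), (0x0E00, 0x0E7F, "THAI"),
    (0x0E80, 0x0EFF, "LAO"), (0x1000, 0x109F, "MYANMAR"),
    (0x1780, 0x17FF, "KHMER"),
]
_NO_CHARACTER = {"HAN", "HIRAGANA", "KATAKANA"}      # rule 3
_ZERO_WIDTH = {"KHMER", "LAO", "MYANMAR", "THAI"}    # rule 4


def script_of(ch: str) -> str:
    """Very rough script guess for one character (illustrative only)."""
    cp = ord(ch)
    for low, high, name in _RANGES:
        if low <= cp <= high:
            return name
    category = unicodedata.category(ch)
    if category.startswith("M"):
        return "INHERITED"   # combining marks outside the listed blocks
    if category.startswith("L"):
        return "OTHER"       # LATIN, CYRILLIC, GREEK, and so on
    return "COMMON"          # punctuation, digits, symbols


def linefeed_replacement(before: str, after: str) -> str:
    """Return the replacement for a LINE FEED between `before` and `after`."""
    # Rule 2: skip INHERITED characters to reach the base character.
    left = "COMMON"
    for ch in reversed(before):
        left = script_of(ch)
        if left != "INHERITED":
            break
    else:
        left = "COMMON"  # empty text or only INHERITED characters
    right = script_of(after[0]) if after else "COMMON"
    if right == "INHERITED":
        right = left
    # Rule 1: a COMMON character takes the script of the other side.
    if left == "COMMON":
        left = right
    if right == "COMMON":
        right = left
    if left in _NO_CHARACTER and right in _NO_CHARACTER:
        return ""            # rule 3
    if left in _ZERO_WIDTH and right in _ZERO_WIDTH:
        return "\u200b"      # rule 4
    return " "               # rule 5
```

A production implementation would need a real script lookup table derived from the Unicode data files rather than these block ranges.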

Note (informative): Some scripts, such as HAN, HIRAGANA, KATAKANA, KHMER, LAO, MYANMAR and THAI, do not use space characters for word boundary delimitation, but may still use these space characters for delimitation of sentences or sentence fragments. If such a character occurs as the last character before a LINE FEED character, or as the first character following a LINE FEED character, it may be eliminated by the white space processing described above. Several solutions are possible:

Make sure that these delimitation space characters do not occur next to a LINE FEED character and that authoring tools either do not reflow text in these scripts, or only reflow such text while preserving the white space in a way appropriate for these scripts.
Ensure that a ZERO WIDTH SPACE is inserted after each delimitation space and that authoring tools preserve both SPACE and ZERO WIDTH SPACE together at the end of a line during reflow.

CSS3

8.1. Text wrapping: the 'wrap-option' property

Name:	'wrap-option'
Value:	no-wrap | wrap | inherit
Initial:	wrap
Applies to:	all elements
Inherited:	yes
Percentages:	N/A
Media:	visual

This property controls whether or not text wraps when it reaches the flow edge of its containing block box.

wrap

Line-breaking occurs if the line overflows the available block width. The specific line breaking algorithm is determined by the 'line-break' and 'word-break' properties.

no-wrap

No line wrapping is performed. In the case when lines are longer than the available block width, the overflow will be treated in accordance with the 'overflow' property specified in the element.

8.2. White-space control: the 'linefeed-treatment', 'space-treatment', 'white-space-treatment' properties and the 'white-space' shortcut property

White-space processing in the context of CSS is the mechanism by which all white-space characters are interpreted for rendering purposes. The white-space set is determined by the XML specification as being a combination of one or more space characters (Unicode value U+0020), carriage returns (U+000D), line feeds (U+000A), or tabs (U+0009).

The amount of white space processing that can be achieved by a user agent that supports CSS is directly related to the CSS processing model, especially the document parsing and validation. After parsing and possible validation, the document tree may contain text nodes that contain unprocessed white space characters, or the document tree may already have been processed in a way that white space characters have been collapsed and partially removed (white space normalization).

In that respect, the CSS properties related to white space processing can only be effective if the CSS processor has access to the white space characters that were originally encoded in the document. However, end-of-line characters are typically handled (as they are by XML processors) in such a way that any arbitrary combination of end-of-line characters is replaced by a single line feed character (U+000A).

Note: XML Schema, through its 'whiteSpace' facet, can constrain exactly the type of white space characters still available to a rendering process like CSS for elements containing the string datatype. In addition, some XML languages like XHTML may have their own white-space processing rules when parsing and validating documents with white-space characters. Therefore, some of the behaviors described below may be affected by these limitations and may be user agent dependent in these contexts.

The typical white-space processing, similar to XHTML-MOD, is as follows:

Leading and trailing white space inside a block element are not rendered.
Line feed characters are rendered as one of the following characters: a space character, a zero width space character (U+200B), or no character (i.e. not rendered). The choice of the resulting character is conditioned by the script property of the characters preceding and following the line feed character.
A sequence of white space characters without any line feed characters is rendered as a single space character.
A sequence of white space characters with one or more line feed characters is rendered similarly to a single line feed character.

Note: These rendering rules make no assumption about the storage model of these white-space character sequences. It is outside the scope of CSS to determine the character code values accessible through a programming interface such as the DOM.

The following properties, 'linefeed-treatment', 'space-treatment' and 'white-space-treatment', allow precise control of that typical behavior. The 'linefeed-treatment' property determines the rendering of line feed characters. The 'space-treatment' property determines the rendering of white space characters (except line feed). The 'white-space-treatment' property determines the treatment of consecutive white-space characters after consideration of the two prior properties. The 'white-space' property is redefined as a shortcut property which sets the values of these three new properties as well as the 'wrap-option' property (the latter for compatibility with earlier versions of CSS).

Name:	'linefeed-treatment'
Value:	auto | ignore | preserve | treat-as-space | treat-as-zero-width-space | inherit
Initial:	treat-as-space
Applies to:	all elements
Inherited:	yes
Percentages:	N/A
Media:	visual

This property specifies the treatment of linefeeds (U+000A characters). Values have the following meanings:

auto

Linefeed characters are transformed for rendering purposes into one of the following characters: a space character, a zero width space character (U+200B), or no character (i.e. not rendered). The choice of the resulting character is conditioned by the script property of the characters that precede and follow the line feed character within the line flow of the same block element. The result of the transformation can be treated by subsequent CSS processing (including white space collapsing).

ignore

Linefeed characters are ignored, i.e. they are transformed for rendering purposes into no character.

preserve

Linefeed characters indicate an end-of-line boundary.

treat-as-space

Linefeed characters are transformed for rendering purposes into a space character (U+0020). The result of the transformation can be treated by subsequent CSS processing (including white space collapsing).

treat-as-zero-width-space

Linefeed characters are transformed for rendering purposes into a zero width space character (U+200B). The result of the transformation can be treated by subsequent CSS processing (including white space collapsing).

Note: The Unicode Standard recommends that the zero width space be considered a valid line-break point. If two characters with a zero width space between them are placed on the same line, they are placed with no space between them; if they are placed on two lines, no additional glyph area, such as for a hyphen, is created at the line-break.
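The fixed values of 'linefeed-treatment' can be sketched as a simple substitution. This is a minimal illustration that assumes line ends have already been normalized to U+000A; the function name `apply_linefeed_treatment` is hypothetical. The 'auto' value is deliberately left out because it depends on the script property of the neighboring characters, and 'inherit' is left out because it depends on the parent element.

```python
ZERO_WIDTH_SPACE = "\u200b"

# Replacement text for each fixed value. For "preserve" the linefeed is kept
# so that a later layout step can treat it as an end-of-line boundary.
_FIXED_TREATMENTS = {
    "ignore": "",
    "preserve": "\n",
    "treat-as-space": " ",
    "treat-as-zero-width-space": ZERO_WIDTH_SPACE,
}


def apply_linefeed_treatment(text: str, value: str) -> str:
    """Transform every linefeed (U+000A) in `text` according to `value`."""
    if value not in _FIXED_TREATMENTS:
        # "auto" and "inherit" need context that this sketch does not model.
        raise ValueError(f"value not covered by this sketch: {value!r}")
    return text.replace("\n", _FIXED_TREATMENTS[value])
```

The result may still contain consecutive white space, which subsequent processing (such as white space collapsing) can then handle.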

Name:	'space-treatment'
Value:	ignore | preserve | ignore-if-before-linefeed | ignore-if-after-linefeed | ignore-if-surrounding-linefeed | inherit
Initial:	preserve
Applies to:	all elements
Inherited:	yes
Percentages:	N/A
Media:	visual

This property specifies the treatment of space (U+0020) and other white space characters except for linefeeds (U+000A), since their treatment is determined by the linefeed-treatment property. Values have the following meanings:

ignore

White space characters, except for linefeeds, are ignored, i.e. they are transformed for rendering purposes into no character.

preserve

All white space characters are rendered as intended. The tab character (U+0009) is rendered as the smallest non-zero number of spaces necessary to line characters up along tab stops that are every 8 characters. The treatment of linefeeds is not determined by this property.

ignore-if-before-linefeed

Specifies that any white space characters, except for linefeeds, that immediately precede a linefeed character shall be discarded. This action shall take place regardless of the setting of the linefeed-treatment property.

ignore-if-after-linefeed

Specifies that any white space characters, except for linefeeds, that immediately follow a linefeed character shall be discarded. This action shall take place regardless of the setting of the linefeed-treatment property.

ignore-if-surrounding-linefeed

Specifies that any white space characters, except for linefeeds, that immediately precede or follow a linefeed character shall be discarded. This action shall take place regardless of the setting of the linefeed-treatment property.
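The 'space-treatment' values can be sketched with regular expressions. This is an illustration under stated assumptions: the white space characters other than linefeed are taken to be SPACE, TAB and CARRIAGE RETURN, the 'preserve' branch models tab rendering with Python's `str.expandtabs(8)`, and 'inherit' is not modeled. The function name `apply_space_treatment` is hypothetical.

```python
import re

_SPACES = r"[ \t\r]+"  # XML white space other than the linefeed (U+000A)


def apply_space_treatment(text: str, value: str) -> str:
    """Apply one 'space-treatment' value to `text` (sketch only)."""
    if value == "ignore":
        return re.sub(_SPACES, "", text)
    if value == "preserve":
        # Each tab becomes the smallest non-zero number of spaces that
        # reaches the next tab stop, with stops every 8 characters.
        return text.expandtabs(8)
    if value == "ignore-if-before-linefeed":
        return re.sub(_SPACES + r"(?=\n)", "", text)
    if value == "ignore-if-after-linefeed":
        return re.sub(r"(?<=\n)" + _SPACES, "", text)
    if value == "ignore-if-surrounding-linefeed":
        return re.sub(_SPACES + r"(?=\n)|(?<=\n)" + _SPACES, "", text)
    raise ValueError(f"value not covered by this sketch: {value!r}")
```

Linefeeds themselves are never touched here, which matches the statement that their treatment is determined by the linefeed-treatment property.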

Name:	'white-space-treatment'
Value:	preserve \| collapse \| inherit
Initial:	collapse
Applies to:	all elements
Inherited:	yes
Percentages:	N/A
Media:	visual

The "white-space-treatment" property specifies the treatment of all consecutive white-space characters (with no exception for linefeed characters, unlike the "space-treatment" property). Values have the following meanings:

preserve

collapse

Specifies that a character should not be rendered when all of the following conditions are true:

· the character is a white space (according to XML), and

· it is not a preserved linefeed (due to linefeed-treatment="preserve"), and

· the immediately preceding (non-ignored) character is a white-space or the immediately following (non-ignored) character is a preserved linefeed.
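The three conditions for 'collapse' can be sketched as a single pass over the text. This is a minimal illustration that assumes any characters ignored by the earlier properties have already been removed, so the neighbors in the string are the "non-ignored" neighbors; the name `collapse_white_space` and the boolean flag `linefeeds_preserved` are hypothetical.

```python
XML_WHITE_SPACE = frozenset(" \t\r\n")


def collapse_white_space(text: str, linefeeds_preserved: bool) -> str:
    """Drop every character that meets all three 'collapse' conditions."""

    def is_preserved_linefeed(ch: str) -> bool:
        return linefeeds_preserved and ch == "\n"

    kept = []
    for i, ch in enumerate(text):
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        not_rendered = (
            ch in XML_WHITE_SPACE                      # condition 1
            and not is_preserved_linefeed(ch)          # condition 2
            and (prev_ch in XML_WHITE_SPACE            # condition 3
                 or is_preserved_linefeed(next_ch))
        )
        if not not_rendered:
            kept.append(ch)
    return "".join(kept)
```

With `linefeeds_preserved` set to true, spaces on either side of a linefeed disappear while the linefeed itself survives; with it set to false, a run of white space is reduced to its first character.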

Name:	'white-space'
Value:	normal | pre | nowrap | inherit
Initial:	normal
Applies to:	all elements
Inherited:	yes
Percentages:	N/A
Media:	visual

This property declares how white-space inside the element is handled. Setting a value on the 'white-space' property sets the respective values on 'wrap-option', 'linefeed-treatment', 'space-treatment' and 'white-space-treatment'.

white-space	wrap-option	linefeed-treatment	space-treatment	white-space-treatment
normal	wrap	auto	preserve	collapse
nowrap	no-wrap	auto	preserve	collapse
pre	no-wrap	preserve	preserve	preserve
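The shortcut expansion in the table above can be sketched as a lookup. This is only an illustration: the mapping copies the table row by row, the name `expand_white_space` is hypothetical, and the 'inherit' value is not modeled because it depends on the parent element.

```python
# One entry per row of the table: shortcut value -> the four properties it sets.
WHITE_SPACE_SHORTCUT = {
    "normal": {
        "wrap-option": "wrap",
        "linefeed-treatment": "auto",
        "space-treatment": "preserve",
        "white-space-treatment": "collapse",
    },
    "nowrap": {
        "wrap-option": "no-wrap",
        "linefeed-treatment": "auto",
        "space-treatment": "preserve",
        "white-space-treatment": "collapse",
    },
    "pre": {
        "wrap-option": "no-wrap",
        "linefeed-treatment": "preserve",
        "space-treatment": "preserve",
        "white-space-treatment": "preserve",
    },
}


def expand_white_space(value: str) -> dict:
    """Return the individual property values set by a 'white-space' value."""
    if value not in WHITE_SPACE_SHORTCUT:
        raise ValueError(f"value not covered by this sketch: {value!r}")
    # Return a copy so callers cannot modify the shared table.
    return dict(WHITE_SPACE_SHORTCUT[value])
```

For instance, `expand_white_space("pre")` reports that linefeeds and other white space are preserved and that no wrapping takes place.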

Example(s):

The following examples show what whitespace behavior is expected from the PRE and P elements, and the "nowrap" attribute in HTML.

PRE        { white-space: pre }

P          { white-space: normal }

TD[nowrap] { white-space: nowrap }

Conforming user agents may ignore the 'white-space' property in author and user style sheets but must specify a value for it in the default style sheet.