NCITS/L2/01-319R

                               Date: September 10, 2001

 

 

 

Title: White-space processing

Source: Michel Suignard (Microsoft)

Action: FYI

Distribution: NCITS/L2 and UTC

 

 

Following are extracts from two documents.

The first one comes from XHTML Modularization (http://www.w3.org/TR/xhtml-modularization/) and states the conformance requirements that apply to white-space processing. It has been augmented with an additional note (Note 1), which is not yet part of XHTML but has been worked on by the W3C I18n Group. Some issues remain about the meaning of the ‘removal’ of white space characters: are they removed from the infoset, or only from the rendering?

The second one comes from a CSS3 Text module (link not yet public) which describes a proposal for white-space processing in the context of future versions of CSS.

These two extracts point to further specification work on classifying Unicode characters according to their white-space processing behavior. The author and the relevant W3C WG would like to get preliminary feedback on the current text before engaging in further development in this area.

 

Michel Suignard

XHTML Modularization

3.5. XHTML Family User Agent Conformance

1…

9. A conforming user agent must meet all of the following criteria (as defined in [XHTML1]):

The XML processor normalizes different systems' line end codes into one single LINE FEED character, that is passed up to the application.

The user agent must process white space characters in the data received from the XML processor as follows:

White space in attribute values is processed according to [XML].

In determining how to convert a LINE FEED character, a user agent must follow the rules below, whereby the script of the characters on either side of the LINE FEED determines the choice of the replacement. The assignment of script names to all characters is done in accordance with the Unicode [UNICODE] technical report TR#24 (Script Names).

1. COMMON script characters (such as punctuation) are treated the same as if they belong to the same script as the character on the other side.

2. INHERITED script characters (such as combining characters) are treated as if they belong to the same script as the previous character if preceding the LINE FEED character or the same script as the character on the preceding side if following the LINE FEED character.

3. If the characters preceding and following the LINE FEED character belong to HAN, HIRAGANA or KATAKANA script, the LINE FEED must be converted into no character.

4. If the characters preceding and following the LINE FEED character belong to KHMER, LAO, MYANMAR or THAI script, the LINE FEED must be converted into a ZERO WIDTH SPACE character (U+200B) or no character. (This rule may be extended in the future to additional scripts, not yet encoded, that do not separate their words by space characters.)

5. If neither condition (3) nor condition (4) is true, the LINE FEED character must be converted into a SPACE character. This covers the case of many scripts like LATIN, CYRILLIC, GREEK, etc. This also covers the case where only the COMMON script has been detected on both sides, and the case where the scripts belong to different categories.
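The five rules above can be sketched in Python as follows. This is a minimal illustration, not the normative algorithm: the standard library has no Script property lookup, so `script_of` is a hypothetical helper that approximates scripts from a few code-point blocks and general categories, and rule (4) is resolved by always choosing ZERO WIDTH SPACE rather than no character. The names `script_of`, `sides_of_linefeed` and `convert_linefeeds` are assumptions made for this sketch.

```python
import unicodedata

# Hypothetical, block-based approximation of the Unicode Script property.
# A real implementation would use the full script data from TR#24.
_SCRIPT_RANGES = [
    (0x0E00, 0x0E7F, "THAI"),
    (0x0E80, 0x0EFF, "LAO"),
    (0x1000, 0x109F, "MYANMAR"),
    (0x1780, 0x17FF, "KHMER"),
    (0x3040, 0x309F, "HIRAGANA"),
    (0x30A0, 0x30FF, "KATAKANA"),
    (0x4E00, 0x9FFF, "HAN"),
]
NO_CHAR_SCRIPTS = {"HAN", "HIRAGANA", "KATAKANA"}   # rule 3
ZWSP_SCRIPTS = {"KHMER", "LAO", "MYANMAR", "THAI"}  # rule 4
ZWSP = "\u200b"


def script_of(ch):
    """Return an approximate script name for one character."""
    cp = ord(ch)
    for lo, hi, name in _SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    category = unicodedata.category(ch)
    if category.startswith("M"):
        return "INHERITED"  # combining marks outside the blocks above
    if category.startswith("L"):
        return "OTHER"      # LATIN, CYRILLIC, GREEK, ...
    return "COMMON"         # punctuation, digits, controls, ...


def sides_of_linefeed(text, i):
    """Resolve the scripts on both sides of the LINE FEED at index i."""
    j = i - 1
    while j >= 0 and script_of(text[j]) == "INHERITED":
        j -= 1              # rule 2: look through marks to the base character
    before = script_of(text[j]) if j >= 0 else "COMMON"
    after = script_of(text[i + 1]) if i + 1 < len(text) else "COMMON"
    if after == "INHERITED":
        after = before      # rule 2: take the script of the preceding side
    if before == "COMMON":
        before = after      # rule 1
    elif after == "COMMON":
        after = before      # rule 1
    return before, after


def convert_linefeeds(text):
    """Replace each LINE FEED according to rules 3 to 5."""
    out = []
    for i, ch in enumerate(text):
        if ch != "\n":
            out.append(ch)
            continue
        before, after = sides_of_linefeed(text, i)
        if before in NO_CHAR_SCRIPTS and after in NO_CHAR_SCRIPTS:
            continue                      # rule 3: no character
        if before in ZWSP_SCRIPTS and after in ZWSP_SCRIPTS:
            out.append(ZWSP)              # rule 4: zero width space
        else:
            out.append(" ")               # rule 5: space
    return "".join(out)
```

Because the script lookup is injected through a single helper, the block table can be swapped for real script data without touching the rule logic.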

Note (informative): Some scripts, such as HAN, HIRAGANA, KATAKANA, KHMER, LAO, MYANMAR and THAI, do not use space characters to delimit word boundaries, but may still use space characters to delimit sentences or sentence fragments. If such a character occurs as the last character before a LINE FEED character, or as the character following a LINE FEED character, it may be eliminated by the white space processing described above. Several solutions are possible:

CSS3

8.1. Text wrapping: the 'wrap-option' property

Name:         'wrap-option'
Value:        no-wrap | wrap | inherit
Initial:      wrap
Applies to:   all elements
Inherited:    yes
Percentages:  N/A
Media:        visual

This property controls whether or not text wraps when it reaches the flow edge of its containing block box.

wrap

Line-breaking occurs if the line overflows the available block width. The specific line breaking algorithm is determined by the 'line-break' and 'word-break' properties.

no-wrap

No line wrapping is performed. In the case when lines are longer than the available block width, the overflow will be treated in accordance with the 'overflow' property specified in the element.

8.2. White-space control: the 'linefeed-treatment', 'space-treatment', 'white-space-treatment' properties and the 'white-space' shortcut property

White-space processing in the context of CSS is the mechanism by which all white-space characters are interpreted for rendering purposes. The XML specification defines white space as a sequence of one or more space characters (U+0020), carriage returns (U+000D), line feeds (U+000A), or tabs (U+0009).

The amount of white space processing that can be achieved by a user agent that supports CSS is directly related to the CSS processing model, especially the document parsing and validation. After parsing and possible validation, the document tree may contain text nodes that contain unprocessed white space characters, or the document tree may already have been processed in a way that white space characters have been collapsed and partially removed (white space normalization).

In that respect, the CSS properties related to white space processing can only be effective if the CSS processor has access to the white space characters that were originally encoded in the document. However, end-of-line characters are typically handled (as XML processors do) in such a way that any arbitrary combination of end-of-line characters is replaced by a single line feed character (U+000A).
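As an illustration of that end-of-line handling, here is a minimal sketch that assumes the XML 1.0 rule, under which each CARRIAGE RETURN + LINE FEED pair and each lone CARRIAGE RETURN becomes a single LINE FEED. The function name `normalize_line_ends` is chosen for this sketch only.

```python
import re


def normalize_line_ends(text):
    """Turn CR LF pairs and lone CRs into a single LF (U+000A)."""
    # The alternation tries "\r\n" first so a pair is not counted twice.
    return re.sub(r"\r\n|\r", "\n", text)
```

After this step, a CSS processor only ever sees U+000A as a line end, which is why the properties below are defined in terms of linefeeds alone.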

Note: XML Schema, through its 'whiteSpace' facet can constrain exactly the type of white space characters still available to a rendering process like CSS for elements containing string datatype. In addition, some XML languages like XHTML may have their own white-space processing rules when parsing and validating documents with white-space characters. Therefore, some of the behaviors described below may be affected by these limitations and may be user agent dependent in these contexts.

The typical white-space processing, similar to XHTML-MOD, is as follows:

Note: These rendering rules make no assumption about the storage model of these white-space character sequences. It is outside the scope of CSS to determine the character code values accessible through programming interfaces such as the DOM.

The properties 'linefeed-treatment', 'space-treatment' and 'white-space-treatment' allow precise control of that typical behavior. The 'linefeed-treatment' property determines the rendering of line feed characters. The 'space-treatment' property determines the rendering of white space characters other than line feeds. The 'white-space-treatment' property determines the treatment of consecutive white-space characters after the two prior properties have been applied. The 'white-space' property is redefined as a shortcut property that sets the values of these three new properties as well as the 'wrap-option' property (the latter for compatibility with earlier versions of CSS).

Name:         'linefeed-treatment'
Value:        auto | ignore | preserve | treat-as-space | treat-as-zero-width-space | inherit
Initial:      treat-as-space
Applies to:   all elements
Inherited:    yes
Percentages:  N/A
Media:        visual

This property specifies the treatment of linefeeds (U+000A characters). Values have the following meanings:

auto

Linefeed characters are transformed for rendering purposes into one of the following characters: a space character, a zero width space character (U+200B), or no character (i.e. not rendered). The choice of the resulting character depends on the script property of the characters that precede and follow the linefeed character within the same line flow, in elements that are part of the same block element. The result of the transformation can be treated by subsequent CSS processing (including white space collapsing).

ignore

Linefeed characters are ignored, i.e. they are transformed for rendering purposes into no character.

preserve

Linefeed characters indicate an end-of-line boundary.

treat-as-space

Linefeed characters are transformed for rendering purposes into a space character (U+0020). The result of the transformation can be treated by subsequent CSS processing (including white space collapsing).

treat-as-zero-width-space

Linefeed characters are transformed for rendering purposes into a zero width space character (U+200B). The result of the transformation can be treated by subsequent CSS processing (including white space collapsing).

Note: The Unicode Standard recommends that the zero width space be considered a valid line-break point. If two characters with a zero width space in between are placed on the same line, they are placed with no space between them. If they are placed on two lines, no additional glyph area, such as for a hyphen, is created at the line-break.
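The 'linefeed-treatment' values can be sketched as a simple text transformation. This is an illustration under stated assumptions: the function name `apply_linefeed_treatment` is invented here, 'inherit' is assumed to have been resolved before the call, and the script-dependent choice for 'auto' is delegated to a caller-supplied `auto_resolver` callback (a hypothetical hook that receives the text before and after the linefeed and returns the replacement string).

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE


def apply_linefeed_treatment(text, value, auto_resolver=None):
    """Transform U+000A characters according to a 'linefeed-treatment' value."""
    if value == "ignore":
        return text.replace("\n", "")
    if value == "preserve":
        return text  # the linefeed stays and marks an end of line
    if value == "treat-as-space":
        return text.replace("\n", " ")
    if value == "treat-as-zero-width-space":
        return text.replace("\n", ZWSP)
    if value == "auto":
        if auto_resolver is None:
            raise ValueError("'auto' needs a script-aware resolver")
        pieces = text.split("\n")
        result = pieces[0]
        for following in pieces[1:]:
            # The resolver returns " ", ZWSP or "" for this linefeed.
            result += auto_resolver(result, following) + following
        return result
    raise ValueError("unknown linefeed-treatment value: " + value)
```

For example, `apply_linefeed_treatment("a\nb", "treat-as-space")` gives `"a b"`, and the result can then be passed on to white space collapsing.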

Name:         'space-treatment'
Value:        ignore | preserve | ignore-if-before-linefeed | ignore-if-after-linefeed | ignore-if-surrounding-linefeed | inherit
Initial:      preserve
Applies to:   all elements
Inherited:    yes
Percentages:  N/A
Media:        visual

This property specifies the treatment of space (U+0020) and other white space characters except for linefeeds (U+000A), since their treatment is determined by the 'linefeed-treatment' property. Values have the following meanings:

ignore

White space characters, except for linefeeds, are ignored, i.e. they are transformed for rendering purposes into no character.

preserve

All white space characters are rendered as intended. The tab character (U+0009) is rendered as the smallest non-zero number of spaces necessary to line characters up along tab stops that are every 8 characters. The treatment of linefeeds is not determined by this property.

ignore-if-before-linefeed

Specifies that any white space characters, except for linefeeds, that immediately precede a linefeed character shall be discarded. This action shall take place regardless of the setting of the 'linefeed-treatment' property.

ignore-if-after-linefeed

Specifies that any white space characters, except for linefeeds, that immediately follow a linefeed character shall be discarded. This action shall take place regardless of the setting of the 'linefeed-treatment' property.

ignore-if-surrounding-linefeed

Specifies that any white space characters, except for linefeeds, that immediately precede or follow a linefeed character shall be discarded. This action shall take place regardless of the setting of the 'linefeed-treatment' property.
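A minimal sketch of the 'space-treatment' values follows. It assumes that the white space characters other than linefeeds are space, tab and carriage return (the XML set minus U+000A), and that a whole run of such characters adjacent to a linefeed is discarded, not only the single nearest character. The function name `apply_space_treatment` is an assumption of this sketch.

```python
import re

# XML white space except the linefeed: space, tab, carriage return.
_RUN = r"[ \t\r]+"


def apply_space_treatment(text, value):
    """Discard non-linefeed white space according to a 'space-treatment' value."""
    if value == "preserve":
        return text
    if value == "ignore":
        return re.sub(_RUN, "", text)
    if value == "ignore-if-before-linefeed":
        return re.sub(_RUN + r"(?=\n)", "", text)
    if value == "ignore-if-after-linefeed":
        return re.sub(r"(?<=\n)" + _RUN, "", text)
    if value == "ignore-if-surrounding-linefeed":
        text = re.sub(_RUN + r"(?=\n)", "", text)
        return re.sub(r"(?<=\n)" + _RUN, "", text)
    raise ValueError("unknown space-treatment value: " + value)
```

The linefeed itself is never touched here, which mirrors the statement that these values apply regardless of the 'linefeed-treatment' setting.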

Name:         'white-space-treatment'
Value:        preserve | collapse | inherit
Initial:      collapse
Applies to:   all elements
Inherited:    yes
Percentages:  N/A
Media:        visual

The 'white-space-treatment' property specifies the treatment of consecutive white-space characters (with no exception for linefeed characters, unlike the 'space-treatment' property). Values have the following meanings:

preserve

All white space characters are rendered as intended. The tab character (U+0009) is rendered as the smallest non-zero number of spaces necessary to line characters up along tab stops that are every 8 characters.

collapse

Specifies that a character should not be rendered if all of the following conditions hold:

· the character is a white space (according to XML), and

· it is not a preserved linefeed (due to linefeed-treatment="preserve"), and

· the immediately preceding (non-ignored) character is a white-space or the immediately following (non-ignored) character is a preserved linefeed.
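The two 'white-space-treatment' values can be sketched as follows. This is one reading of the three conditions above, not a definitive implementation: a white space run that leads up to a preserved linefeed is dropped entirely, and white space that follows a preserved linefeed is dropped because the preceding non-ignored character is itself white space. The names `collapse_white_space` and `apply_white_space_treatment`, and the `linefeed_preserved` flag, are assumptions of this sketch.

```python
XML_WS = " \t\r\n"


def collapse_white_space(text, linefeed_preserved=False):
    """Drop white space characters that meet the three 'collapse' conditions."""
    out = []
    n = len(text)
    for i, ch in enumerate(text):
        # Non-white-space and preserved linefeeds are always rendered.
        if ch not in XML_WS or (ch == "\n" and linefeed_preserved):
            out.append(ch)
            continue
        # Condition 3a: the last rendered character is white space.
        prev_is_ws = bool(out) and out[-1] in XML_WS
        # Condition 3b: the next rendered character is a preserved linefeed.
        j = i + 1
        while j < n and text[j] in " \t\r":
            j += 1
        next_is_preserved_lf = linefeed_preserved and j < n and text[j] == "\n"
        if prev_is_ws or next_is_preserved_lf:
            continue
        out.append(ch)
    return "".join(out)


def apply_white_space_treatment(text, value, linefeed_preserved=False):
    """Apply a 'white-space-treatment' value to already processed text."""
    if value == "preserve":
        # Tabs advance to the next tab stop, with stops every 8 characters.
        return text.expandtabs(8)
    if value == "collapse":
        return collapse_white_space(text, linefeed_preserved)
    raise ValueError("unknown white-space-treatment value: " + value)
```

Python's `str.expandtabs` always emits at least one space per tab, which matches the "smallest non-zero number of spaces" wording for the 'preserve' value.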

Name:         'white-space'
Value:        normal | pre | nowrap | inherit
Initial:      normal
Applies to:   all elements
Inherited:    yes
Percentages:  N/A
Media:        visual

This property declares how white-space inside the element is handled. Setting a value on the 'white-space' property sets the respective values on 'wrap-option', 'linefeed-treatment', 'space-treatment' and 'white-space-treatment'.

white-space   wrap-option   linefeed-treatment   space-treatment   white-space-treatment
normal        wrap          auto                 preserve          collapse
nowrap        no-wrap       auto                 preserve          collapse
pre           no-wrap       preserve             preserve          preserve
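The mapping above can also be written as a lookup table. The following is a small sketch that expands a 'white-space' value into the four properties it sets; the names `WHITE_SPACE_SHORTHAND` and `expand_white_space` are chosen for illustration only, and 'inherit' is assumed to be resolved elsewhere.

```python
# Shorthand expansion, copied row by row from the table above.
WHITE_SPACE_SHORTHAND = {
    "normal": {
        "wrap-option": "wrap",
        "linefeed-treatment": "auto",
        "space-treatment": "preserve",
        "white-space-treatment": "collapse",
    },
    "nowrap": {
        "wrap-option": "no-wrap",
        "linefeed-treatment": "auto",
        "space-treatment": "preserve",
        "white-space-treatment": "collapse",
    },
    "pre": {
        "wrap-option": "no-wrap",
        "linefeed-treatment": "preserve",
        "space-treatment": "preserve",
        "white-space-treatment": "preserve",
    },
}


def expand_white_space(value):
    """Return a copy of the four property values set by a 'white-space' value."""
    return dict(WHITE_SPACE_SHORTHAND[value])
```

Returning a copy keeps the table itself unchanged if a caller later overrides one of the four properties individually.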

Example(s):

The following examples show what whitespace behavior is expected from the PRE and P elements, and the "nowrap" attribute in HTML.

PRE        { white-space: pre }
P          { white-space: normal }
TD[nowrap] { white-space: nowrap }

Conforming user agents may ignore the 'white-space' property in author and user style sheets but must specify a value for it in the default style sheet.