Unicode Technical Report #19

UTF-32

Version	6.1
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	2000-08-31
This Version	http://www.unicode.org/unicode/reports/tr19/tr19-6.1
Previous Version	http://www.unicode.org/unicode/reports/tr19/tr19-6.html
Latest Version	http://www.unicode.org/unicode/reports/tr19

Summary

This document specifies a Unicode transformation format that serializes a Unicode code point as a sequence of four bytes. It also provides a name for the subset of ISO/IEC 10646 UCS-4 values that are valid Unicode code points, from U+0000 to U+10FFFF.

Status

This document has been reviewed by Unicode members and other interested parties, and has been approved by the Unicode Technical Committee as a Unicode Technical Report. It is a stable document and may be used as reference material or cited as a normative reference from another document.

A Unicode Technical Report (UTR) may contain either informative material or normative specifications, or both. Each UTR may specify a base version of the Unicode Standard. In that case, conformance to the UTR requires conformance to that version or higher.

A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/. For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/ .
Please mail corrigenda and other comments to the author(s).

The preferred encoding form for Unicode text is the 16-bit form: UTF-16. There is also an 8-bit encoding form called UTF-8 that can be used to represent Unicode in environments where the 16-bit form is impractical due to compatibility constraints. In addition, some implementations may wish to use a 32-bit form, where each Unicode code point (aka scalar value) corresponds to a single 32-bit unit. Even those applications that do not use this form may want to convert to and from it for interoperability.

The following lists the important features of this encoding form:

- UTF-32 is restricted in values to the range 0..10FFFF₁₆, which precisely matches the range of characters defined in the Unicode Standard (and other standards such as XML).
- Over and above ISO 10646, the Unicode Standard adds a number of conformance constraints on character semantics (see The Unicode Standard, Version 3.0, Chapter 3). Declaring UTF-32 instead of UCS-4 allows implementations to explicitly commit to Unicode semantics.
- The term UTF-32 is parallel to UTF-16 and UTF-8, avoiding some confusion among software developers, especially since the pronunciations of "UTF" and "UCS" are so very similar.
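The range restriction in the first item can be sketched as a simple check. This is an illustration only: the function and constant names are hypothetical, and the exclusion of the surrogate range D800..DFFF is inferred from the term "scalar value" used above rather than spelled out in this list.

```python
MAX_CODE_POINT = 0x10FFFF                 # upper bound stated in the text
SURROGATES = range(0xD800, 0xE000)        # D800..DFFF; assumed excluded as non-scalar values


def is_valid_utf32_unit(value: int) -> bool:
    """Return True if `value` can stand as a single 32-bit UTF-32 unit."""
    if not 0 <= value <= MAX_CODE_POINT:
        return False                      # outside 0..10FFFF
    return value not in SURROGATES        # surrogates are not scalar values


print(is_valid_utf32_unit(0x10FFFF))   # True
print(is_valid_utf32_unit(0x110000))   # False
```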

Definitions

UTF-32BE is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in big-endian format. An initial sequence corresponding to U+FEFF is interpreted as a zero width no-break space.
- In UTF-32BE, the UTF-16 code unit sequence <004D 0061 D800 DC00> (that is, U+004D, U+0061, U+10000) is serialized as <00 00 00 4D 00 00 00 61 00 01 00 00>
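The example above can be reproduced with a short sketch. It first combines the UTF-16 code units into scalar values (D800 DC00 is the surrogate pair for U+10000), then writes each scalar value as four big-endian bytes. The function names are hypothetical and chosen for this illustration; this is one possible way to do it, not a reference implementation.

```python
import struct


def utf16_units_to_scalars(units):
    """Combine UTF-16 code units into Unicode scalar values."""
    scalars = []
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # High surrogate: must be followed by a low surrogate.
            if i + 1 >= len(units) or not 0xDC00 <= units[i + 1] <= 0xDFFF:
                raise ValueError("unpaired high surrogate")
            low = units[i + 1]
            scalars.append(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:
            raise ValueError("unpaired low surrogate")
        else:
            scalars.append(unit)
            i += 1
    return scalars


def encode_utf32be(scalars):
    """Serialize each scalar value as four bytes, most significant byte first."""
    return b"".join(struct.pack(">I", s) for s in scalars)


units = [0x004D, 0x0061, 0xD800, 0xDC00]
encoded = encode_utf32be(utf16_units_to_scalars(units))
print(encoded.hex(" ").upper())  # 00 00 00 4D 00 00 00 61 00 01 00 00
```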

UTF-32LE is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in little-endian format. An initial sequence corresponding to U+FEFF is interpreted as a zero width no-break space.
- In UTF-32LE, the UTF-16 code unit sequence <004D 0061 D800 DC00> (that is, U+004D, U+0061, U+10000) is serialized as <4D 00 00 00 61 00 00 00 00 00 01 00>
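Reading in the opposite direction, the little-endian bytes from the example can be turned back into scalar values. The sketch below is a minimal illustration; `decode_utf32le` is a hypothetical name, and the error handling (raising `ValueError`) is an assumption of this example rather than something the report prescribes.

```python
import struct


def decode_utf32le(data: bytes):
    """Decode UTF-32LE bytes into a list of Unicode scalar values."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of four bytes")
    scalars = []
    for (value,) in struct.iter_unpack("<I", data):  # least significant byte first
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {value:#x}")
        scalars.append(value)
    return scalars


data = bytes.fromhex("4D 00 00 00 61 00 00 00 00 00 01 00")
print([f"U+{s:04X}" for s in decode_utf32le(data)])
# ['U+004D', 'U+0061', 'U+10000']
```

Because UTF-32LE treats an initial U+FEFF as a zero width no-break space, this decoder does not strip a leading <FF FE 00 00>; it returns U+FEFF as ordinary content.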

UTF-32 is the Unicode Transformation Format that serializes a Unicode scalar value as a sequence of four bytes, in either big-endian or little-endian format. An initial sequence corresponding to U+FEFF is interpreted as a byte order mark: it is used to distinguish between the two byte orders. The byte order mark is not considered part of the content of the text. A serialization of Unicode scalar values into UTF-32 may or may not begin with a byte order mark.
- In UTF-32, <004D 0061 D800 DC00> is serialized as <00 00 FE FF 00 00 00 4D 00 00 00 61 00 01 00 00>, <FF FE 00 00 4D 00 00 00 61 00 00 00 00 00 01 00> or <00 00 00 4D 00 00 00 61 00 01 00 00>
- The term UTF-32 can be used ambiguously. When referring to the encoding of Unicode in memory, there is no associated byte orientation, and a BOM is never used. When referring to a serialization of Unicode scalar values into bytes, it may have a BOM and either byte orientation.
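The byte order mark handling described for UTF-32 can be sketched as follows. The decoder checks for an initial U+FEFF in either byte order, uses it to select the byte order, and drops it from the content. The names are hypothetical. Falling back to big-endian when no BOM is present is an assumption of this sketch: it matches the BOM-less example above, but the definition here does not state a default, so the fallback is left as a parameter.

```python
import struct

BOM_BE = b"\x00\x00\xFE\xFF"   # U+FEFF serialized big-endian
BOM_LE = b"\xFF\xFE\x00\x00"   # U+FEFF serialized little-endian


def decode_utf32(data: bytes, default_big_endian: bool = True):
    """Decode a UTF-32 serialization, honoring an optional initial BOM."""
    if len(data) % 4 != 0:
        raise ValueError("UTF-32 data must be a multiple of four bytes")
    if data.startswith(BOM_BE):
        fmt, body = ">I", data[4:]      # BOM is not part of the content
    elif data.startswith(BOM_LE):
        fmt, body = "<I", data[4:]
    else:
        # No BOM: the byte order must come from elsewhere (assumed default here).
        fmt, body = (">I" if default_big_endian else "<I"), data
    scalars = [value for (value,) in struct.iter_unpack(fmt, body)]
    for value in scalars:
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"not a Unicode scalar value: {value:#x}")
    return scalars


samples = [
    "00 00 FE FF 00 00 00 4D 00 00 00 61 00 01 00 00",
    "FF FE 00 00 4D 00 00 00 61 00 00 00 00 00 01 00",
    "00 00 00 4D 00 00 00 61 00 01 00 00",
]
for hex_text in samples:
    print([f"U+{s:04X}" for s in decode_utf32(bytes.fromhex(hex_text))])
# each line prints ['U+004D', 'U+0061', 'U+10000']
```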

Relation to ISO/IEC 10646 and UCS-4

ISO/IEC 10646 defines a 4-byte encoding form called UCS-4. Since UTF-32 is simply a subset of UCS-4 characters, it is conformant to ISO/IEC 10646 as well as to the Unicode Standard.

As of the recent publication of the second edition of ISO/IEC 10646-1, UCS-4 still assigns private use codepoints (E00000₁₆..FFFFFF₁₆ and 60000000₁₆..7FFFFFFF₁₆) that are not in the range of valid Unicode codepoints. To promote interoperability among the Unicode encoding forms, JTC1/SC2/WG2 has approved a motion removing those private use assignments:

Resolution M38.6 (Restriction of encoding space) [adopted unanimously]

"WG2 accepts the proposal in document N2175 towards removing the provision for Private Use Groups and Planes beyond Plane 16 in ISO/IEC 10646, to ensure internal consistency in the standard between UCS-4, UTF-8 and UTF-16 encoding formats, and instructs its project editor [to] prepare suitable text for processing as a future Technical Corrigendum or an Amendment to 10646-1:2000."

While this resolution must still be turned into a Technical Corrigendum or an Amendment to 10646-1:2000, the Unicode Technical Committee has every expectation that, once the text for that Technical Corrigendum or Amendment starts its formal balloting, it will proceed smoothly to formal approval and publication as part of that standard.

Until the formal balloting is concluded, the term UTF-32 can be used to refer to the subset of UCS-4 characters that are in the range of valid Unicode code points. After it passes, UTF-32 will then simply be an alias for UCS-4 (with the extra requirement that Unicode semantics are observed).
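The subset relation described in this section can be illustrated with a small classifier. It is a sketch only: the function name and the returned labels are hypothetical, and the 31-bit upper bound for UCS-4 (7FFFFFFF) is taken as an assumption from the private use range quoted above. The private use ranges are the ones that the resolution proposes to remove.

```python
def classify_ucs4(value: int) -> str:
    """Place a 32-bit value relative to UTF-32 and UCS-4 (labels are illustrative)."""
    if 0 <= value <= 0x10FFFF:
        return "utf-32"                    # valid Unicode code point range
    if 0xE00000 <= value <= 0xFFFFFF or 0x60000000 <= value <= 0x7FFFFFFF:
        return "ucs-4 private use"         # beyond Plane 16, slated for removal
    if 0 <= value <= 0x7FFFFFFF:
        return "ucs-4 only"                # in UCS-4 but outside Unicode
    return "invalid"                       # not representable in UCS-4


print(classify_ucs4(0x10000))      # utf-32
print(classify_ucs4(0xE00000))     # ucs-4 private use
print(classify_ucs4(0x110000))     # ucs-4 only
```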

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.