Draft Unicode Technical Report #19

UTF-32

Revision	5.0
Authors	Mark Davis (mark.davis@us.ibm.com)
Date	1999-11-16
This Version	http://www.unicode.org/unicode/reports/tr19/tr19-5.html
Previous Version	http://www.unicode.org/unicode/reports/tr19/tr19-4.html
Latest Version	http://www.unicode.org/unicode/reports/tr19

Summary

This document specifies an alias that can be used to refer to the subset of UCS-4 values that are valid Unicode code points.

Status

This document contains material which has been considered and approved by the Unicode Technical Committee for publication as a Draft Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative material or as normative specification. Please mail corrigenda and other comments to the author.

The content of all technical reports must be understood in the context of the appropriate version of the Unicode Standard. References in this technical report to sections of the Unicode Standard refer to the Unicode Standard, Version 3.0. See http://www.unicode.org/unicode/standard/versions for more information.

The preferred encoding form for Unicode text is the 16-bit form, UTF-16. There is also an 8-bit encoding form, UTF-8, which can be used to represent Unicode in environments where the 16-bit form is impractical because of compatibility constraints. In addition, some implementations may wish to use a 32-bit form, in which each Unicode code point (also known as a scalar value) corresponds to a single 32-bit unit. Even applications that do not use this form internally may want to convert to and from it for interoperability.
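As an illustration of the three encoding forms described above, the following Python sketch counts the code units needed for the same text in each form, and converts text to and from a list of 32-bit units. The function names code_unit_counts, to_utf32_units, and from_utf32_units are hypothetical and chosen only for this example. The sketch relies on Python's built-in utf-8, utf-16-be, and utf-32-be codecs and assumes big-endian byte order for simplicity.

```python
import struct


def code_unit_counts(text):
    """Return the number of code units needed for text in each encoding form."""
    return {
        "UTF-8": len(text.encode("utf-8")),            # 8-bit units
        "UTF-16": len(text.encode("utf-16-be")) // 2,  # 16-bit units
        "UTF-32": len(text.encode("utf-32-be")) // 4,  # 32-bit units
    }


def to_utf32_units(text):
    """Convert text to a list of 32-bit units, one per code point."""
    data = text.encode("utf-32-be")
    return list(struct.unpack(">%dI" % (len(data) // 4), data))


def from_utf32_units(units):
    """Convert a list of 32-bit units back to text."""
    data = struct.pack(">%dI" % len(units), *units)
    return data.decode("utf-32-be")


if __name__ == "__main__":
    # U+1D11E lies above U+FFFF, so UTF-16 needs two units where UTF-32 needs one.
    sample = "A\U0001D11E"
    print(code_unit_counts(sample))
    print([hex(u) for u in to_utf32_units(sample)])
```

In the 32-bit form the number of units always equals the number of code points, which is the property that makes this form convenient for interchange between implementations that otherwise use UTF-16 or UTF-8.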

Such an encoding form is defined in ISO/IEC 10646, where it is called UCS-4. However, UCS-4 permits values that are outside the range of valid Unicode code points. The term UTF-32 can be used to refer to the subset of UCS-4 values that are within the range of valid Unicode code points. The important features of this encoding form are as follows:

- UTF-32 is restricted in values to the range 0..10FFFF₁₆, which precisely matches the range of characters defined in the Unicode Standard (and other standards such as XML).
  - Although neither the Unicode Consortium nor JTC1/SC2/WG2 ever expects to assign characters above 10FFFF₁₆, UCS-4 currently does not formally restrict the assignment of future characters above that limit.
  - Although not recommended, in UCS-4 the code ranges E00000₁₆..FFFFFF₁₆ and 60000000₁₆..7FFFFFFF₁₆ are defined for private use and are legal in interchange.
- Over and above ISO/IEC 10646, the Unicode Standard adds a number of conformance constraints on character semantics (see The Unicode Standard, Version 2.0, Chapter 3). Declaring UTF-32 instead of UCS-4 allows implementations to commit explicitly to Unicode semantics.
- The term UTF-32 is parallel to UTF-16 and UTF-8, which avoids some confusion among software developers, especially since the pronunciations of "UTF" and "UCS" are so similar.
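The range restriction in the list above can be sketched as a simple check on a 32-bit value. The following Python example is a minimal illustration, not a normative definition. The names UTF32_MAX, UCS4_MAX, UCS4_PRIVATE_USE, is_utf32, and classify_ucs4 are hypothetical. The upper bound 7FFFFFFF₁₆ for UCS-4 is an assumption of this sketch (the 31-bit code space of ISO/IEC 10646) and is not stated in this report. The sketch checks only the range and does not model the conformance constraints on character semantics.

```python
# Upper limit of UTF-32 as described in this report.
UTF32_MAX = 0x10FFFF

# Assumption of this sketch: UCS-4 is a 31-bit code space.
UCS4_MAX = 0x7FFFFFFF

# UCS-4 private-use ranges above the UTF-32 limit, as listed in this report.
UCS4_PRIVATE_USE = (
    (0xE00000, 0xFFFFFF),
    (0x60000000, 0x7FFFFFFF),
)


def is_utf32(value):
    """Return True if value lies in the UTF-32 range 0..10FFFF (hex)."""
    return 0 <= value <= UTF32_MAX


def classify_ucs4(value):
    """Classify a 32-bit value relative to UTF-32 and UCS-4."""
    if is_utf32(value):
        return "utf-32"
    for low, high in UCS4_PRIVATE_USE:
        if low <= value <= high:
            return "ucs-4 private use"
    if 0 <= value <= UCS4_MAX:
        return "ucs-4 only"
    return "invalid"


if __name__ == "__main__":
    for v in (0x41, 0x10FFFF, 0x110000, 0xE00000, 0x60000000):
        print(hex(v), classify_ucs4(v))
```

A converter that accepts UCS-4 input and declares UTF-32 output could use a check of this kind to reject any value that falls outside the "utf-32" class.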

Since UTF-32 is simply a subset of UCS-4, it conforms to ISO/IEC 10646 as well as to the Unicode Standard.

The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained in or accompanying this technical report.

Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.