L2/01-328R3
Version | Unicode 3.1.1
Authors | Toby Phipps (tphipps@peoplesoft.com)
Date | 2001-09-04
This Version |
Previous Version | None
Latest Version |
This document specifies an 8-bit Compatibility Encoding Scheme for UTF-16
(CESU) that is intended as an alternate encoding to UTF-8 for internal use
within systems processing Unicode, in order to provide an ASCII-compatible 8-bit
encoding that preserves UTF-16 binary collation. It is neither intended nor
recommended as an encoding for open information exchange. The
Unicode Consortium does not encourage the use of CESU-8, but does recognize
the existence of data in this encoding and supplies this Technical Report to
clearly define the format and to distinguish it from UTF-8. This encoding does
not replace or amend the definition of UTF-8.
This document has been approved by the Unicode Technical Committee
for public review as a Proposed Draft Unicode Technical Report.
Publication does not imply endorsement by the Unicode Consortium. This is a
draft document which may be updated, replaced, or superseded by other documents
at any time. This is not a stable document; it is inappropriate to cite this
document as other than a work in progress.
A list of current Unicode Technical Reports is found on http://www.unicode.org/unicode/reports/.
For more information about versions of the Unicode Standard, see http://www.unicode.org/unicode/standard/versions/.
Please mail corrigenda and other comments to the author(s).
CESU-8 defines an encoding scheme for Unicode identical to UTF-8
except for its representation of supplementary characters. In CESU-8, supplementary characters are
represented as six-byte sequences resulting from the transformation of each
UTF-16 surrogate code unit into an eight-bit form similar to the UTF-8 transformation,
but without first converting the input surrogate pairs to a scalar value.
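The transformation described above can be sketched in Python as follows. This is a minimal illustration, not a normative implementation: the function name cesu8_encode is hypothetical, and the sketch assumes the standard UTF-8 bit patterns for one-, two- and three-byte sequences, applied to each UTF-16 code unit rather than to the scalar value.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch, hypothetical helper name)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code point: a single UTF-16 code unit.
            units = [cp]
        else:
            # Supplementary code point: split into a UTF-16 surrogate pair.
            v = cp - 0x10000
            units = [0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)]
        for u in units:
            # Each 16-bit code unit is serialized with the UTF-8 bit patterns.
            if u < 0x80:
                out.append(u)
            elif u < 0x800:
                out += bytes((0xC0 | (u >> 6), 0x80 | (u & 0x3F)))
            else:
                out += bytes((0xE0 | (u >> 12),
                              0x80 | ((u >> 6) & 0x3F),
                              0x80 | (u & 0x3F)))
    return bytes(out)
```

For BMP-only input the result is byte-for-byte the same as UTF-8; only supplementary characters differ, each becoming two three-byte sequences (six bytes in total).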
CESU-8 is useful in 8-bit processing environments where binary
collation with UTF-16 is required. It is designed and recommended for use
only within products requiring this UTF-16 binary collation equivalence.
It is neither intended nor recommended for open interchange.
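The collation property can be illustrated with a small Python sketch. The helper to_cesu8 is hypothetical and relies on Python's "surrogatepass" error handler to serialize each UTF-16 code unit separately; the two sample characters (U+FF5E and U+10400) are chosen only for illustration.

```python
import struct

def to_cesu8(text: str) -> bytes:
    """Compact CESU-8 encoder (hypothetical helper, for illustration only)."""
    raw = text.encode("utf-16-be", "surrogatepass")
    units = struct.unpack(f">{len(raw) // 2}H", raw)
    # Encode every 16-bit code unit on its own, surrogates included.
    return b"".join(chr(u).encode("utf-8", "surrogatepass") for u in units)

bmp = "\uff5e"       # a BMP character above the surrogate range
supp = "\U00010400"  # a supplementary character
words = [bmp, supp]

# Binary (byte-wise) sort order under each encoding.
by_utf16 = sorted(words, key=lambda s: s.encode("utf-16-be"))
by_utf8 = sorted(words, key=lambda s: s.encode("utf-8"))
by_cesu8 = sorted(words, key=to_cesu8)
```

In this sketch by_cesu8 agrees with by_utf16 (the supplementary character sorts first, because its surrogates D801 DC00 compare below FF5E), while by_utf8 places the supplementary character last.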
The following lists the important features of this encoding form:
Because only a very small percentage of characters in a typical data stream
are expected to be supplementary characters, there is a strong possibility that
CESU-8 data may be misinterpreted as UTF-8.
Therefore, all use of CESU-8 outside closed implementations, such as emitting
CESU-8 in output files, markup languages, or other open transmission forms,
is strongly discouraged.
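The risk of misinterpretation can be made concrete with a short Python sketch. The function names are hypothetical. The check assumes that a CESU-8 supplementary character appears as a high-surrogate sequence (ED A0..AF xx) immediately followed by a low-surrogate sequence (ED B0..BF xx), which follows from applying the UTF-8 three-byte pattern to the surrogate ranges D800..DBFF and DC00..DFFF.

```python
import re

# High surrogate (ED A0..AF xx) immediately followed by low surrogate (ED B0..BF xx).
_SURROGATE_PAIR = re.compile(
    rb"\xed[\xa0-\xaf][\x80-\xbf]\xed[\xb0-\xbf][\x80-\xbf]"
)

def contains_cesu8_pair(data: bytes) -> bool:
    """True if data holds at least one six-byte CESU-8 surrogate pair."""
    return _SURROGATE_PAIR.search(data) is not None

def is_strict_utf8(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

Data containing only BMP characters passes is_strict_utf8 and fails contains_cesu8_pair whichever encoding produced it, so the two encodings cannot be told apart by inspection until a supplementary character happens to occur.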
The following definitions specify the CESU-8 encoding scheme. CESU-8 is not a
normative part of The Unicode Standard, and therefore the definitions below do
not form part of the standard. Instead, they are
encapsulated in this Unicode Technical Report as an implementation-specific
transformation form for use by implementors of The Unicode Standard.
2.1 | (a) CESU-8 is a Compatibility Encoding Scheme for UTF-16 (CESU) that serializes a Unicode code point as a sequence of one, two, three or six bytes.
2.2 | CESU-8 Bit Distribution
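The reverse direction can be sketched as below. This is only an illustration under stated assumptions: the function name cesu8_decode is hypothetical, the bit patterns are assumed to be the UTF-8 patterns for one-, two- and three-byte sequences applied to UTF-16 code units, and the sketch does not validate continuation bytes or guard against truncated input.

```python
def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 bytes to text (illustrative sketch, minimal validation)."""
    units = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # One byte: 0xxxxxxx
            units.append(b)
            i += 1
        elif b & 0xE0 == 0xC0:
            # Two bytes: 110yyyyy 10xxxxxx
            units.append(((b & 0x1F) << 6) | (data[i + 1] & 0x3F))
            i += 2
        elif b & 0xF0 == 0xE0:
            # Three bytes: 1110zzzz 10yyyyyy 10xxxxxx (includes surrogates)
            units.append(((b & 0x0F) << 12)
                         | ((data[i + 1] & 0x3F) << 6)
                         | (data[i + 2] & 0x3F))
            i += 3
        else:
            # Four-byte lead bytes and stray continuation bytes are rejected:
            # CESU-8 has no four-byte form.
            raise ValueError(f"invalid CESU-8 lead byte 0x{b:02X} at offset {i}")
    # The collected code units are UTF-16; let the codec join surrogate pairs.
    raw = b"".join(u.to_bytes(2, "big") for u in units)
    return raw.decode("utf-16-be")
```

A six-byte sequence therefore decodes in two stages: two three-byte sequences yield a surrogate pair, and the pair is then combined into a single supplementary code point.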
ISO/IEC 10646 and The Unicode Standard define the UTF-8 encoding
form, which is very similar in definition to CESU-8 other than its treatment of
supplementary characters. CESU-8 is an
additional encoding scheme that supplements these definitions, but does not
form part of either ISO/IEC 10646 or The Unicode Standard. It is intended only for use in compatibility
situations where binary collation with UTF-16 is required.
CESU-8 will be registered with the Internet Assigned Numbers
Authority. This section will be updated
with the IANA registered name.
Note: CESU-8 was originally proposed and discussed with the name
UTF-8S, but was renamed CESU-8 by recommendation from the Unicode Technical
Committee to avoid possible confusion with UTF-8.
[Reports] | Unicode Technical Reports
[Versions] | Versions of the Unicode Standard
The following summarizes modifications from the previous version
of this document.
1 |
Copyright © 1999-2001 Unicode,
Inc. All Rights Reserved.
The Unicode Consortium makes no
expressed or implied warranty of any kind, and assumes no liability for errors
or omissions. No liability is assumed for incidental and consequential damages
in connection with or arising out of the use of the information or programs
contained or accompanying this technical report.
Unicode and the Unicode logo are
trademarks of Unicode, Inc., and are registered in some jurisdictions.