Preface

#Why Unicode?

The Unicode Standard and its associated specifications provide programmers with a single universal character encoding, extensive descriptions, and a vast amount of data about how characters function. The specifications and data describe how to form words and break lines; how to sort text in different languages; how to format numbers, dates, times, and other elements appropriate to different languages; and how to display languages whose written form flows from right to left, such as Arabic and Hebrew, or whose written form splits, combines, and reorders, such as the languages of South Asia. These specifications also describe how to deal with security concerns regarding the many “look-alike” characters from alphabets around the world. Without the properties and algorithms in the Unicode Standard and its associated specifications, interoperability between different implementations would be impossible, and much of the vast breadth of the world’s languages would lie outside the reach of modern software.

#Organization of This Standard

This core specification, together with the Unicode code charts, the Unicode Character Database, and the Unicode Standard Annexes, defines the Unicode Standard. The core specification contains the general principles, requirements for conformance, and guidelines for implementers. The character code charts and names are available online.

#Concepts, Architecture, Conformance, and Guidelines. The first five chapters introduce the Unicode Standard and provide the fundamental information needed to produce a conforming implementation. Basic text processing, working with combining marks, encoding forms, and normalization are all described. A special chapter on implementation guidelines answers many common questions that arise when implementing Unicode.

Chapter 1 introduces the standard’s basic concepts, design basis, and coverage and discusses basic text handling requirements.

Chapter 2 sets forth the fundamental principles underlying the Unicode Standard and covers specific topics such as text processes, overall character properties, and the use of combining marks.

Chapter 3 constitutes the formal statement of conformance. This chapter also presents the normative algorithms for several processes, including normalization, Korean syllable boundary determination, and default casing.

Chapter 4 describes character properties in detail, both normative (required) and informative. Additional character property information appears in Unicode Standard Annex #44, “Unicode Character Database.”

Chapter 5 discusses implementation issues, including compression, strategies for dealing with unknown and unsupported characters, and transcoding to other standards.
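
As a small illustration of the encoding forms mentioned above, the following Python sketch counts how many code units a string occupies in UTF-8, UTF-16, and UTF-32. The helper name `code_unit_counts` is hypothetical and introduced only for this example; the sketch relies on Python's built-in codecs rather than on anything defined in this standard.

```python
def code_unit_counts(text: str) -> dict:
    """Return the number of code units needed for `text` in each encoding form.

    `code_unit_counts` is an illustrative helper, not part of any standard API.
    The big-endian codecs are used so that no byte order mark is prepended.
    """
    return {
        "code_points": len(text),
        "utf8": len(text.encode("utf-8")),            # 8-bit code units
        "utf16": len(text.encode("utf-16-be")) // 2,  # 16-bit code units
        "utf32": len(text.encode("utf-32-be")) // 4,  # 32-bit code units
    }


# A base letter followed by a combining mark: two code points.
print(code_unit_counts("e\u0301"))

# A single code point outside the Basic Multilingual Plane.
print(code_unit_counts("\U0001F600"))
```

The same text therefore has a different length depending on the encoding form, which is one reason implementations need to be explicit about which unit they are counting.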

#Character Block Descriptions. Chapters 6 through 23 contain the character block descriptions that provide basic information about each script or group of symbols and may discuss specific characters or pertinent layout information. Some of this information is required to produce conformant implementations of these scripts and other collections of characters.

#Code Charts. Chapter 24 describes the conventions used in the code charts and the list of character names. The code charts contain the normative character encoding assignments, and the names list contains normative information, as well as useful cross references and informational notes.

#Appendices. The appendices contain additional information.

Appendix A documents the notational conventions used by the standard.

Appendix B provides information about Unicode publications and links to other important Unicode resources.

Appendix C details the relationship between the Unicode Standard and ISO/IEC 10646.

Appendix D lists version history.

Appendix E describes the history of Han unification in the Unicode Standard.

Appendix F provides additional documentation for characters encoded in the CJK Strokes block (U+31C0..U+31EF).

#Online Information. A glossary of Unicode terms, the Unicode Character Name Index, and the list of references for the Unicode Standard are located at:

https://www.unicode.org/glossary/

https://www.unicode.org/charts/charindex.html

https://www.unicode.org/references/

#The Unicode Character Database

The Unicode Character Database (UCD) is a collection of data files containing character code points, character names, and character property data. It is described more fully in Section 4.1, Unicode Character Database, and in Unicode Standard Annex #44, “Unicode Character Database.” All versions of the Unicode Character Database, including the most up-to-date version, are found at:

https://www.unicode.org/ucd/

Information on versioning and on all versions of the Unicode Standard can be found at:

https://www.unicode.org/versions/
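
To give a sense of the kind of data the UCD holds, the sketch below looks up a few properties through Python's standard `unicodedata` module, which bundles a subset of the UCD. The `describe` helper is hypothetical and exists only for this example. The UCD version that Python ships depends on the interpreter release (see `unicodedata.unidata_version`) and may not match the version of the standard described here.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Collect a few UCD-derived properties for a single character.

    `describe` is an illustrative helper; the property values come from the
    UCD snapshot bundled with the running Python interpreter.
    """
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch, "<unnamed>"),
        "general_category": unicodedata.category(ch),
        "bidi_class": unicodedata.bidirectional(ch),
        "combining_class": unicodedata.combining(ch),
    }


print("Bundled UCD version:", unicodedata.unidata_version)
for ch in ("A", "\u05D0", "\u0301"):
    print(describe(ch))
```

For complete or current property data, the UCD data files themselves remain the reference; a library snapshot such as this one is a convenience only.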

#Unicode Code Charts

The Unicode code charts contain the character encoding assignments and the names list. The archival, reference set of versioned 16.0 code charts may be found at:

https://www.unicode.org/charts/PDF/Unicode-16.0/

For easy lookup of characters, see the current code charts:

https://www.unicode.org/charts/

An interactive radical-stroke index to CJK ideographs is located at:

https://www.unicode.org/charts/unihanrsindex.html

#Unicode Standard Annexes

The Unicode Standard Annexes form an integral part of the Unicode Standard. Conformance to a version of the Unicode Standard includes conformance to its Unicode Standard Annexes. All versions, including the most up-to-date versions of all Unicode Standard Annexes, are available at:

https://www.unicode.org/reports/index.html#annexes

The following is the list of Unicode Standard Annexes:

Unicode Standard Annex #9, “Unicode Bidirectional Algorithm,” specifies the positioning of characters in text that contains characters flowing from right to left, such as Arabic or Hebrew.

Unicode Standard Annex #11, “East Asian Width,” presents the specification of an informative property for Unicode characters that is useful when interoperating with East Asian legacy character sets.

Unicode Standard Annex #14, “Unicode Line Breaking Algorithm,” presents the specification of line breaking properties for Unicode characters.

Unicode Standard Annex #15, “Unicode Normalization Forms,” describes Unicode normalization and provides examples and implementation strategies for it.

Unicode Standard Annex #24, “Unicode Script Property,” describes two related Unicode code point properties. Both properties share the use of Script property values. The Script property itself assigns single script values to all Unicode code points, identifying a primary script association, where possible. The Script_Extensions property assigns sets of Script property values, providing more detail for cases where characters are commonly used with multiple scripts.

Unicode Standard Annex #29, “Unicode Text Segmentation,” describes algorithms for determining default boundaries between certain significant text elements: grapheme clusters (“user-perceived characters”), words, and sentences.

Unicode Standard Annex #31, “Unicode Identifiers and Syntax,” specifies recommended defaults for the use of Unicode in the definitions of identifiers and in pattern-based syntax.

Unicode Standard Annex #34, “Unicode Named Character Sequences,” defines the concept of a Unicode named character sequence.

Unicode Standard Annex #38, “Unicode Han Database (Unihan),” describes the organization and content of the Unihan Database.

Unicode Standard Annex #41, “Common References for Unicode Standard Annexes,” contains the listing of references shared by other Unicode Standard Annexes.

Unicode Standard Annex #42, “Unicode Character Database in XML,” describes an XML representation of the Unicode Character Database.

Unicode Standard Annex #44, “Unicode Character Database,” provides the core documentation for the Unicode Character Database (UCD). It describes the layout and organization of the Unicode Character Database and how the UCD specifies the formal definition of Unicode character properties.

Unicode Standard Annex #45, “U-Source Ideographs,” describes U-source ideographs as used by the Ideographic Research Group (IRG) in its CJK ideograph unification work.

Unicode Standard Annex #50, “Unicode Vertical Text Layout,” describes the Unicode character property Vertical_Orientation, which can serve as a stable default orientation for characters in reliable document interchange.

Unicode Standard Annex #53, “Unicode Arabic Mark Rendering,” specifies an algorithm that can be utilized during rendering for determining correct display of Arabic combining mark sequences.

Unicode Standard Annex #57, “Unicode Egyptian Hieroglyph Database (Unikemet),” describes the organization and content of the Unikemet Database.
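
As one concrete example from the list above, the normalization forms described in Unicode Standard Annex #15 can be explored with Python's standard `unicodedata.normalize` function. This is a minimal sketch: the helper `canonically_equivalent` is hypothetical, and the results reflect the UCD version bundled with the running interpreter rather than any particular version of the standard.

```python
import unicodedata

# Two different code point sequences for the same user-perceived character.
composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC.

    Illustrative helper only; comparing under NFD would work equally well.
    """
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(composed == decomposed)                        # raw comparison
print(canonically_equivalent(composed, decomposed))  # comparison after NFC

# Compatibility normalization additionally folds presentation variants,
# such as the "fi" ligature at U+FB01.
print(unicodedata.normalize("NFKC", "\uFB01"))
```

Comparing the raw strings treats the two spellings as different, while comparing after normalization treats them as equal, which is the behavior most text-matching code wants.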

#Unicode Technical Standards and Unicode Technical Reports

Unicode Technical Reports and Unicode Technical Standards are separate publications and do not form part of the Unicode Standard. However, several Unicode Technical Standards are versioned synchronously with the Unicode Standard and have newly published versions:

Unicode Technical Standard #10, “Unicode Collation Algorithm,” details how to compare two Unicode strings while remaining conformant to the requirements of the Unicode Standard. It includes the Default Unicode Collation Element Table (DUCET) and conformance tests.

Unicode Technical Standard #39, “Unicode Security Mechanisms,” specifies mechanisms that can be used to detect possible security problems involving Unicode characters. It includes data tables for confusable characters.

Unicode Technical Standard #46, “Unicode IDNA Compatibility Processing,” discusses compatibility between IDNA 2003, IDNA 2008, and current browser practice for domain names. It provides a comprehensive mapping to support current user expectations for casing and other variants of domain names.

Unicode Technical Standard #51, “Unicode Emoji,” defines the structure of Unicode emoji characters and sequences, and provides data to support that structure, such as which characters are considered to be emoji, and which emoji should be displayed by default with a text style versus an emoji style. It also provides design guidelines for improving the interoperability of emoji characters across platforms and implementations.

All versions of all Unicode Technical Reports and Unicode Technical Standards are available at:

https://www.unicode.org/reports/

#Updates and Errata

Errors in the Unicode Standard, including the Unicode Character Database and the Unicode Standard Annexes, may be reported using the reporting form:

https://corp.unicode.org/reporting/error.html

A list of known errata is maintained at:

https://www.unicode.org/errata/

Any currently listed errata will be fixed in subsequent versions of the standard.

#Acknowledgements

The Unicode Standard is the result of the dedication and contributions of numerous people over many years. We would like to acknowledge the individuals whose contributions were central to the design, authorship, and review of this standard. A complete listing of acknowledgements can be found at:

https://www.unicode.org/acknowledgements/standard.html

There is also a page dedicated specifically to acknowledgement of contributors of the many fonts used in production of the Unicode Standard:

https://www.unicode.org/charts/fonts.html

Current editorial contributors can be found at:

https://www.unicode.org/consortium/edcom.html

#About This Publication

The core specification is built as a static website with the Astro framework and Svelte components. The archival PDF version is generated with WeasyPrint. Example glyphs are shaped with harfbuzzjs. The text is mainly set in STIX Two Text. Most of the figures were created with Adobe Illustrator.

The Unicode code charts were produced with Unibook chart formatting software supplied by ASMUS, Inc.