Unicode Frequently Asked Questions

Specifications

Q. The Unicode Standard and related standards contain a number of specifications or guidelines for dealing with different programming tasks. Sometimes it's hard to find these. Is there a central place to look?

A. The following table provides a list of areas where the Unicode Consortium provides specifications, with a shorthand description of each. [MD]

General
Character Properties: common properties such as Name, Alphabetic, Letter, White-Space, General Category, Default-Ignorable, plus those used in other specifications Ch 4
Character Properties for CJK Ideographs: property information specific to CJK ideographs (the Unihan database) UAX 38
Unicode Character Database: general documentation about the UCD UAX 44
UCD in XML: description of the XML representation of the UCD UAX 42
Case Operations: conversion/detection of Upper/Lower/Titlecase, case folding, case matching. See also 4.2 Case. § 3.13
Characters with Unusual Properties: characters that implementers need to pay special attention to § 4.11
Use of Characters in Markup Contexts: guidelines for XML and other markup languages UTR 20
Script Names: usage model for determining text runs in a given script UAX 24
Use of Characters in Mathematical Contexts: guidelines for mathematical usage UTR 25
Unicode Named Character Sequences: specifies the syntax for named character sequences UAX 34
Encodings
Unicode Encoding Forms: UTF-8, UTF-16, UTF-32 conversion and validation § 3.9
Unicode Encoding Schemes: UTF-8, UTF-16 (BE/LE), UTF-32 (BE/LE) conversion and validation § 3.10
Binary Order: UTF-8 order vs. UTF-16 order § 5.17
Character Mapping Markup Language: mapping Unicode to and from legacy code pages UTS 22
A Standard Compression Scheme for Unicode: how to compress Unicode text to about the same size as in legacy encodings UTS 6
UTF-EBCDIC: encapsulating Unicode on EBCDIC systems UTR 16
Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8): a compatibility 8-bit encoding scheme UTR 26
Ideographic Variation Database: repository of variation sequences for specified collections of Han glyphs UTS 37
Comparison
Canonical Equivalence: when character sequences are equivalent; canonical ordering § 3.11
Unicode Normalization Forms: how to normalize text for comparison UAX 15
Unicode Collation Algorithm: the default mechanism for comparing, searching, and matching Unicode text UTS 10
Parsing
Hangul Syllables: boundaries, parsing, (de/)composition, names § 3.12
Decimal Numbers: conversion and validation § 5.5
Unicode Regular Expression Guidelines: the features required in supporting regular expressions with Unicode UTS 18
Identifier and Pattern Syntax: how to parse identifiers UAX 31
Language Information in Plain Text: see also 16.9 Tag Characters § 5.10
Variation Selectors: usage, validation § 16.4
Ideographic Description Sequences: use, validation § 12.2
Segmentation
Newline Guidelines: how to handle newline characters § 5.8
Line Breaking Algorithm: the default way to determine where to linewrap UAX 14
Text Segmentation: the default way to break text into user characters, words, and sentences UAX 29
Rendering
The Bidirectional Algorithm: required for display of Arabic and Hebrew text UAX 9
East Asian Width: the default determination of character width in East Asian contexts UAX 11
Minimal shaping requirements for Arabic, Devanagari, Tamil, etc. Ch 8-10
Locale Data
Locale Data Markup Language (LDML): used for the interchange of locale data for internationalization UTS 35
Common Locale Data Repository (CLDR): a repository of LDML data for hundreds of locales CLDR
Security
Unicode Security Considerations: guidelines for recognizing Unicode security problems and dealing with them UTR 36
Unicode Security Mechanisms: useful tools for detecting spoofs UTS 39
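
Several rows in the table above (Unicode Normalization Forms, Case Operations, Binary Order) describe operations that many standard libraries already expose. As a rough illustration only, here is a minimal sketch in Python using the standard `unicodedata` module and `str.casefold`. The helper names (`nfc_equal`, `caseless_equal`, `utf8_order`, `utf16_order`) are hypothetical and chosen for this example; they are not defined by any Unicode specification. The results also depend on the version of the Unicode Character Database bundled with the Python interpreter.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (see UAX 15)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def caseless_equal(a: str, b: str) -> bool:
    """Sketch of canonical caseless matching (see section 3.13):
    compare NFD(casefold(NFD(x))) for both inputs."""
    def key(s: str) -> str:
        folded = unicodedata.normalize("NFD", s).casefold()
        return unicodedata.normalize("NFD", folded)
    return key(a) == key(b)


def utf8_order(strings):
    """Sort by UTF-8 byte sequences, which matches code point order."""
    return sorted(strings, key=lambda s: s.encode("utf-8"))


def utf16_order(strings):
    """Sort by big-endian UTF-16 bytes, which matches UTF-16 code unit order."""
    return sorted(strings, key=lambda s: s.encode("utf-16-be"))


if __name__ == "__main__":
    # Precomposed e-acute vs. "e" followed by a combining acute accent.
    print(nfc_equal("\u00e9", "e\u0301"))
    # Full case folding maps sharp s to "ss".
    print(caseless_equal("Stra\u00dfe", "STRASSE"))
    # U+FF5E vs. U+10000: the two binary orders disagree (see section 5.17).
    pair = ["\U00010000", "\uff5e"]
    print(utf8_order(pair) == utf16_order(pair))
```

The last comparison shows why Binary Order has its own entry: a supplementary character such as U+10000 is encoded in UTF-16 with surrogate code units (D800..DFFF), which sort below code units in the range E000..FFFF, so the two orderings can differ. For language-sensitive comparison, the Unicode Collation Algorithm (UTS 10) is the relevant specification; none of the helpers above implement it.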

Q. Are all of these normative?

A. No. Some are normative and others are informative. Within The Unicode Standard, the material in Chapter 3, Conformance, and most of Chapter 4, Character Properties, is normative, while material in other chapters is generally informative. The Unicode Standard Annexes (UAX) are formally part of the Unicode Standard, and most of their material is normative unless the annex itself indicates otherwise. Unicode Technical Standards (UTS) are independent standards, and their specifications are normative within those standards. Unicode Technical Reports (UTR) contain informative material. For more information, see About Unicode Technical Reports. [MD, KW]