Re: An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

From: Mark Davis (markdavis34@home.com)
Date: Fri Feb 23 2001 - 10:19:42 EST


many comments

----- Original Message -----
From: "Tom Lord" <lord@emf.net>
To: "Unicode List" <unicode@unicode.org>
Sent: Wednesday, February 21, 2001 21:15
Subject: An Absurdly Brief Introduction to Unicode (was Re: Perception ...)

>
> We've seen several posts about the perception that Unicode is a
> 16 bit character set encoding. Among those, we've heard anecdotes
> about the problems people have introducing newcomers to Unicode.
>
> Here is a chapter of a reference manual I've been working on.
> The original manual can be found at http://www.regexps.com, along
> with some useful Unicode software (a fast regular expression matcher,
> a database for C, and some handy data structures).
>
> The manual as a whole is covered by the GNU Free Documentation License
> (http://www.gnu.org), but the plain-text version in this message
> may be reproduced unconditionally.
>
> Thomas Lord
> regexps.com
>
>
> Absurdly Brief Introduction to Unicode
>
> copyright 2001, Thomas Lord, regexps.com, Pittsburgh PA
> Permission is granted to reproduce this text verbatim, without
> further restrictions, except that this copyright notice and
> permission statement must be included. Permission is
> granted to reproduce this text with modifications, provided
> that this copyright notice and permission statement are
> included, and the copy is clearly marked as "modified from
> the original".
>
>
> This chapter is a very succinct introduction to the Unicode character
> set. It may be useful when trying to read this manual, but it is not
> intended to be a thorough introduction. One place to learn more about
> Unicode is the web site of the Unicode Consortium:
> http://www.unicode.org. The current definition of Unicode is published
> as The Unicode Standard Version 3.0 by the Unicode Consortium.
>
> Characters
>
> Unicode defines a set of _abstract_characters_. Roughly speaking,
> abstract characters represent indivisible marks that people use in

'indivisible' is overstating

> writing systems to convey information. In western alphabets, for
> example, latin small letter A is the name of an abstract
> character. That name doesn't refer to a in a particular font, but
> rather to the idea of small A in general.
>
> Unicode includes a number of abstract characters which are formatting
> marks: they give an indication of how adjacent characters should be

not necessarily adjacent

> rendered but do not themselves correspond to what one might ordinarily
> think of as a "written character".
>
> Unicode includes a number of abstract characters which are control
> characters: they have traditional (and sometimes standard) meaning in
> computing, but do not correspond to any feature of human writing.
>
> Unicode includes a number of abstract characters which are usually
> combined with other characters (such as diacritical marks and vowel
> marks).
>
> The goal of Unicode is to encode the complete set of abstract
> characters used in human writing, sufficient to describe all written
> text.
>
> The situation is complicated by three factors: the necessarily large
> size of a global character set; the occaisionaly arbitrary decisions

spell-check ("occaisionaly" should be "occasionally")

> that must be made about what counts as an abstract character and what
> does not; and the generally acknowledged desirability of supporting
> bijective mappings between a variety of older character sets and

while I like bijective, it is not a commonly understood term.

> subsets of Unicode.
>
> Code Points
>
> A _code_point_ is an integer value which is assigned to an abstract
> character. Each character receives a unique code point.

inaccurate. Multiple *abstract characters* can have a single code point;
multiple code points can correspond to a single *abstract character*.
*Encoded characters* receive a unique code point (but this is a tautology).
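
To make the many-to-one case concrete, here is a minimal Python sketch. It
assumes that canonical equivalence (compared via NFC normalization) is a
fair stand-in for "the same abstract character"; the helper names are mine,
not from the manual.

```python
import unicodedata

# Three encodings of what a reader perceives as one "A with ring above".
precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE
angstrom = "\u212B"      # ANGSTROM SIGN


def nfc(text):
    """Return the NFC (canonical composed) form of text."""
    return unicodedata.normalize("NFC", text)


def canonically_equal(a, b):
    """True if a and b are canonically equivalent (hypothetical helper)."""
    return nfc(a) == nfc(b)
```

The three strings differ as code point sequences, yet canonically_equal()
reports every pair as equivalent.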

>
> By convention, code points are always written in hexadecimal notation,
> prefixed by the string U+. Usually, no less than four hexadecimal
> digits are written.

In Unicode & 10646, at least 4 are required.
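
A small Python sketch of the notation, assuming the usual convention of
uppercase hex digits padded to at least four places (format_code_point is a
name I made up for illustration):

```python
def format_code_point(cp):
    """Format an integer code point in U+ notation (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the Unicode code space: %r" % (cp,))
    # %04X pads to four digits but never truncates longer values.
    return "U+%04X" % cp
```

So 0x61 is written U+0061, while 0x1D11E is written U+1D11E, with no
padding beyond the digits it needs.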

>
> Unicode code points are in the closed range U+0000..U+10FFFF. Thus,
> it requires at least 21 bits to hold a Unicode code point. Sometimes

Needs serious qualification. UTF-8 represents many code points with only 8
bits, as does SCSU.
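
The distinction can be shown in a few lines of Python: 21 bits is the width
of the largest code point as a number, not the storage every character
needs. This is only a sketch; utf8_length is an illustrative name.

```python
MAX_CODE_POINT = 0x10FFFF

# Width of the largest code point when held as a plain integer.
bits_needed = MAX_CODE_POINT.bit_length()


def utf8_length(cp):
    """Number of 8-bit code units UTF-8 uses for one code point."""
    return len(chr(cp).encode("utf-8"))
```

bits_needed comes out as 21, but utf8_length(0x41) is 1: an ASCII letter
still occupies a single byte in UTF-8.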

> people say that "Unicode is a 16-bit character set": that is an
> error.
>
> There are (now and for the foreseeable future) many more code points
> than abstract characters. Revisions to Unicode add new characters and,
> sometimes, recommend against using some old characters, but once a
> code point has been "assigned", that assignment never changes.
>
>
> Some Special Code Points
>
> Unicode code points U+0000..U+007F are essentially the same as ASCII
> code points.

Bad way to put it. The code points are just numbers, and the numbers are
exactly the same. Unicode code points U+0000..U+007F represent precisely the
same characters and assignments as ASCII bytes 00..7F.

>
> Unicode code points U+0000..U+00FF are essentially the same as ISO
> 8859-1 code points ("Latin 1").

ditto
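
Both identities are easy to check in Python. This sketch assumes Python's
"ascii" and "latin-1" codecs are acceptable stand-ins for the two
standards (the latin-1 codec also maps the control ranges straight
through); same_as_code_points is a made-up name.

```python
def same_as_code_points(encoding, limit):
    """True if bytes 0..limit-1 decode to code points 0..limit-1."""
    data = bytes(range(limit))
    text = data.decode(encoding)
    return [ord(ch) for ch in text] == list(range(limit))
```

Calling it with ("ascii", 0x80) and ("latin-1", 0x100) returns True in both
cases: the byte value and the code point are the same number.
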
>
> Two code points represent non-characters. These are U+FFFE and
> U+FFFF. Programs are free to give these values special meaning
> internally.

Not just two: there are 66 noncharacters now!
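
For the curious, the 66 break down as the 32 code points U+FDD0..U+FDEF
plus the last two code points of each of the 17 planes. A Python sketch
that counts them (is_noncharacter is an illustrative name):

```python
def is_noncharacter(cp):
    """True if cp is one of the Unicode noncharacters."""
    in_fdd0_block = 0xFDD0 <= cp <= 0xFDEF
    # U+nFFFE and U+nFFFF for every plane n in 0..16.
    ends_plane = (cp & 0xFFFE) == 0xFFFE
    return in_fdd0_block or ends_plane


noncharacter_count = sum(
    1 for cp in range(0x110000) if is_noncharacter(cp)
)
```

noncharacter_count is 32 + 2 * 17 = 66.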

>
> The code point U+FEFF is assigned to the formatting character
> "zero-width no-break space". This character has a special significance
> when it occurs in certain serialized representations of Unicode
> text. This is described in the next section.
>
> Code points in the range U+D800..U+DFFF are called _surrogates_. They
> are not assigned to abstract characters. Instead, they are used in
> pairs as one way to represent a code point in the range
> U+10000..U+10FFFF. This is also described in the next section.
>
> Encoding Forms
>
> If Unicode code points occupy 21 bits of storage, how is a string of
> Unicode text represented? There are two recommended alternatives
> called UTF-8 and UTF-16. Collectively, systems of representing
> strings are known as _encoding_forms_.
>
> The definition of an encoding form consists of a _code_unit_ (an
> unsigned integer type with a fixed number of bits, usually fewer than 21)
> and a rule describing a bijective mapping between code points and

bijective

> sequences of code units. UTF-8 uses 8-bit code units. UTF-16 uses
> 16-bit code units.

Add UTF-32

>
> In UTF-8, code points in the range U+0000..U+007F are stored in a
> single code unit (one byte). Other code points are represented by a
> sequence of two or more code units, each byte in the range 80..FF. The
> details of these multi-byte sequences are available in countless
> Unicode reference materials.
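
Since the details are waved away here, a compact Python sketch of the
UTF-8 bit layout may help. It is an illustration only (utf8_encode is my
own name); real code would simply call str.encode("utf-8").

```python
def utf8_encode(cp):
    """Encode one code point as UTF-8 bytes (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % (cp,))
    if cp < 0x80:            # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:           # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:         # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])
```

For example, U+20AC (the euro sign) becomes the three bytes E2 82 AC.
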
>
> In UTF-16, code points in the range U+0000..U+FFFF are stored in a
> single 16-bit code unit. Other code points are represented by a pair
> of surrogates, each stored in one code unit. Again, the details of
> multi-code-unit sequences are readily available elsewhere.
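
The surrogate arithmetic is short enough to show. A Python sketch, with
utf16_encode as an illustrative name rather than anything from the manual:

```python
def utf16_encode(cp):
    """Return the list of 16-bit code units for one code point."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %r" % (cp,))
    if cp < 0x10000:
        return [cp]                      # one code unit, value unchanged
    offset = cp - 0x10000                # 20 bits remain
    high = 0xD800 | (offset >> 10)       # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)      # bottom 10 bits
    return [high, low]
```

For example, U+1D11E (MUSICAL SYMBOL G CLEF) becomes the pair D834 DD1E.
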
>
> Not every sequence of 8-bit values is a valid UTF-8 string. Not every
> sequence of 16-bit values is a valid UTF-16 string. Strings that are
> not valid are called "ill-formed".
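
A quick way to see ill-formed input in practice is to let a strict decoder
reject it. This Python sketch assumes the built-in codecs in their default
strict mode; is_well_formed is a made-up helper.

```python
def is_well_formed(data, encoding):
    """True if data decodes cleanly under the given strict codec."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True
```

A lone continuation byte such as b"\x80" fails as UTF-8, and an unpaired
surrogate such as b"\xd8\x00" fails as UTF-16BE.
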
>
> When stored in the memory of a running program, UTF-16 code units are
> almost certainly stored in the native byte order of the machine. In

almost always. 'almost certainly' sounds like a quantum state.

> files and when transmitted, two byte orders are possible. When byte
> order distinctions are important, the names UTF-16be (big-endian) and
> UTF-16le (little-endian) are used.

Uppercase: UTF-16BE, UTF-16LE.

>
> When a stream of text has a UTF-16 encoding form, and when its byte
> order is not known in advance, it is marked with a byte order mark. A
> byte order mark is the formatting character "zero-width no-break
> space" (U+FEFF) occurring as the first character in the stream. By
> examining the first two bytes of such a stream, and assuming that
> those bytes are a byte order mark, programs can determine the
> byte-order of code units within the stream. When a byte order mark is
> present, it is not considered to be part of the text which it marks.
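
The check described above is only a two-byte comparison. A Python sketch,
assuming the stream really does begin with a byte order mark (both function
names are illustrative):

```python
def detect_utf16_byte_order(data):
    """Return 'big', 'little', or None from a leading byte order mark."""
    if data[:2] == b"\xFE\xFF":
        return "big"
    if data[:2] == b"\xFF\xFE":
        return "little"
    return None


def decode_utf16_with_bom(data):
    """Decode UTF-16 bytes, dropping the byte order mark from the text."""
    order = detect_utf16_byte_order(data)
    if order is None:
        raise ValueError("no byte order mark found")
    codec = "utf-16-be" if order == "big" else "utf-16-le"
    return data[2:].decode(codec)
```

The mark is consumed by the decoder and does not appear in the result.
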
>
> Another encoding form has been standardized that may become popular in
> the future: UTF-32. In UTF-32, code units are 32 bits and each code
> point is stored in a single code unit.

No need to put this at end.
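
UTF-32 is also the easiest form to illustrate, which is another reason to
introduce it with the others. A Python sketch of the big-endian variant
(utf32be_encode is a made-up name):

```python
import struct


def utf32be_encode(text):
    """Encode text as UTF-32BE: one 32-bit code unit per code point."""
    return b"".join(struct.pack(">I", ord(ch)) for ch in text)
```

The result is always exactly four bytes per code point, with no
multi-unit sequences to worry about.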

>
> Character Properties
>
> In addition to naming a set of abstract characters, and assigning
> those characters to code points, the definition of Unicode assigns
> each character a collection of _character_properties_.
>
> The possible properties a character may have and their meanings are
> too numerous to list here. Three examples are:
>
> general category -- such as "lowercase letter", "uppercase letter",
> "decimal digit", etc.
>
> decimal digit value -- if the character is used as a decimal digit,
> this property is its numeric value.
>
> case mappings -- the default lowercase character corresponding to an
> uppercase character, and so forth.
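
These three properties can be looked up from Python's unicodedata module,
which is built from the Consortium's data files. A sketch only: describe is
an illustrative name, and str.lower()/str.upper() apply Python's case
mapping, which I am assuming is close enough to the default mappings
described here.

```python
import unicodedata


def describe(ch):
    """Collect a few character properties for a single character."""
    return {
        "general_category": unicodedata.category(ch),
        # None when the character is not a decimal digit.
        "decimal_digit_value": unicodedata.decimal(ch, None),
        "lowercase": ch.lower(),
        "uppercase": ch.upper(),
    }
```

For "A" the general category is "Lu" (uppercase letter); for U+0663
(ARABIC-INDIC DIGIT THREE) it is "Nd" with decimal digit value 3.
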
>
> The Unicode Consortium publishes definitions of various character
> properties and distributes text files listing those properties for
> each code point. For more information, visit http://www.unicode.org.
>


