From: Hans Aberg (haberg@math.su.se)
Date: Thu Apr 28 2005 - 15:17:56 CST
At 18:43 -1000 2005/04/27, Sivakatirswami wrote:
>OK, so my questions are:
>
>1) is the decimal expression for the capital letter A as 65
>exactly correspondent to its integer code point position in the
>total unicode series expressed as a series of integers?
>
>2) How can one ascertain the integer number for a code point
>outside-above base ANSI?
Unicode lacks a clear underlying mathematical model, or at least a
clear description of one. So here is what it should be, regardless of
what it actually is :-):
First there is an intuitive notion of an "abstract character". Unicode
tries to collect such abstract characters; most often they are just
called "characters". Then, given a specific Unicode character, there
are essentially two ways to identify it. One is via its character
name, which is a finite string of metacharacters such as A-Z and " "
(space); here, "meta" indicates that these are to be viewed as
abstract symbols outside the Unicode character set itself. The second
way is by a non-negative integer, which is called the "code point",
but which I prefer to call a character number. This number is likewise
"meta", because it is not only outside the Unicode character set, but
also outside any actual computer representation of numbers. It is
purely abstract.

In order to represent the abstract characters inside a computer using
their character numbers, one needs, since the computer works with
binary numbers, an integer-to-binary translation scheme, which is
called an "encoding". Here it gets tricky, because Unicode bundles the
character numbers and various integer-to-binary translation schemes
together into single logical entities called "character encodings",
which go under the names UTF-8/16/32.
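
As a rough illustration of that distinction (a small Python sketch of
my own, not anything taken from the Unicode standard), the character
number is just an integer, and each UTF is a different
integer-to-binary translation of that same integer:

    # The character number (code point) is just an integer.
    code_point = 0x41               # 65 in decimal

    # Each UTF is a different integer-to-binary translation scheme.
    ch = chr(code_point)            # the abstract character "A"
    print(ch.encode('utf-8'))       # b'A'             -> byte 41 (hex)
    print(ch.encode('utf-16-be'))   # b'\x00A'         -> bytes 00 41
    print(ch.encode('utf-32-be'))   # b'\x00\x00\x00A' -> bytes 00 00 00 41

The big-endian variants are used here only so that the byte order in
the printout is easy to read.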
So now to your questions: The Unicode character "A" has the character
name "LATIN CAPITAL LETTER A", and the character number (or code
point) 65; the latter is just an integer, and you may represent it as
you want. When one writes U+x_1...x_k, that is really a notation
meaning "the Unicode character having character number x_1...x_k in
hexadecimal notation". In your example, the hexadecimal number 41 is
the same as the decimal number 65. So they represent the same
character. Still, these are just abstract numbers. In order to get it
into a computer, one must find a binary representation. In UTF-8, 65
is represented as a binary number 01000001. Such binary numbers can
easily be written using hexadecimal numbers, in which case it is 41.
The clever thing here is that the original ASCII characters were given
Unicode character numbers in such a way that, in UTF-8, they get the
same binary representation as in ASCII. Characters outside the ASCII
range have no such single-byte representation; in UTF-8 they become
sequences of two or more bytes. In the encodings UTF-16 and UTF-32,
one gets the same result for the ASCII characters, if one forgets
about the leading bytes with value 0.
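
To make this concrete, and to answer the second question about
characters above the ASCII range, here is a small Python illustration
(the EURO SIGN is just an arbitrary example character of my choosing):

    import unicodedata

    # Question 1: "A" has character number (code point) 65 = 0x41.
    print(ord('A'))                  # 65
    print(hex(ord('A')))             # 0x41
    print(unicodedata.name('A'))     # LATIN CAPITAL LETTER A

    # Question 2: the character number of any character, e.g. U+20AC.
    euro = '\u20ac'                  # EURO SIGN
    print(ord(euro))                 # 8364, the decimal form of 0x20AC

    # Outside ASCII there is no single-byte UTF-8 form:
    print(euro.encode('utf-8'))      # b'\xe2\x82\xac'  -> bytes E2 82 AC
    print(euro.encode('utf-16-be'))  # b' \xac'         -> bytes 20 AC
    print(euro.encode('utf-32-be'))  # b'\x00\x00 \xac' -> bytes 00 00 20 AC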
-- Hans Aberg