From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 14 2003 - 17:58:29 EDT
Ben,
> could someone confirm if i've got this correct, or not please?:
>
> a 'code unit' could be the same as a 'code point', but there again it
> might not be. it's possible that several 'code units' are required to
> make up a 'code point'? (so code units can be the same size or smaller
> than a code point, but not the other way round)?
Think of it this way.
The code *point* is a number in the codespace, used to encode
an abstract character. For Unicode, it is a number in the
range 0x0000..0x10FFFF (or think of it as 0..1,114,111 expressed
in decimal). These get expressed with the U+ notation in Unicode.
Thus U+0041 is the code point for LATIN CAPITAL LETTER A.
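As a rough illustration of the U+ notation, here is a small Python sketch. The helper name u_plus is made up for this example and is not part of any library; it just formats the number that Python's built-in ord() returns for a one-character string.

```python
import unicodedata

def u_plus(ch):
    """Format the code point of a single character in U+ notation."""
    # ord() gives the code point as a plain integer; pad to at least 4 hex digits.
    return f"U+{ord(ch):04X}"

print(u_plus("A"), unicodedata.name("A"))  # U+0041 LATIN CAPITAL LETTER A
print(0x10FFFF)                            # 1114111, the top of the codespace
```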
The code *unit* is a fixed-width integral data type used in the
context of a particular encoding form. The encoded character is
represented in that encoding form by either a single code unit
or a sequence of code units.
In UTF-8, the code unit is always an 8-bit integer. (0x00..0xFF)
In UTF-16, the code unit is always a 16-bit integer. (0x0000..0xFFFF)
In UTF-32, the code unit is always a 32-bit integer. (0x00000000..0x0010FFFF)
Code units don't "make up a code point".
Rather, a sequence of one or more code units is used to
represent a Unicode encoded character in a particular encoding form.
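To make that concrete, here is a minimal Python sketch that lists the code units for one encoded character in each of the three encoding forms. The function name code_units is hypothetical, chosen for this example. It goes through Python's big-endian codecs only as a convenient way to get at the code unit values; byte order belongs to the encoding scheme, not the encoding form.

```python
import struct

def code_units(ch, form):
    """Return the list of code units for one character in an encoding form."""
    if form == "utf-8":
        # Each byte is one 8-bit code unit.
        return list(ch.encode("utf-8"))
    if form == "utf-16":
        # Unpack the bytes as big-endian 16-bit code units.
        data = ch.encode("utf-16-be")
        return [u for (u,) in struct.iter_unpack(">H", data)]
    if form == "utf-32":
        # Unpack the bytes as big-endian 32-bit code units.
        data = ch.encode("utf-32-be")
        return [u for (u,) in struct.iter_unpack(">I", data)]
    raise ValueError(f"unknown encoding form: {form}")

for ch in ("A", "\U00010400"):
    for form in ("utf-8", "utf-16", "utf-32"):
        units = " ".join(f"{u:X}" for u in code_units(ch, form))
        print(f"U+{ord(ch):04X} {form}: {units}")
```

For U+0041 each form needs a single code unit (41, 0041, 00000041). For U+10400 the same code point takes four code units in UTF-8 (F0 90 90 80), two in UTF-16 (D801 DC00), and one in UTF-32 (00010400).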
--Ken