RE: UTF-8 and UTF-16

From: Marco.Cimarosti@icl.com
Date: Fri Oct 06 2000 - 04:01:44 EDT


George Zeigler wrote:
> someone send me a FAQ page that explains the difference
> between UTF-8 and Unicode (UTF-16 I suppose).

You should perhaps read it again ;-)

> UTF-8 if I understand correctly only supports
> European characters, where as UTF-16 supports all major
> characters world wide.
> I notice that our browser has UTF-7 and UTF-8.

No, sorry, you didn't understand correctly.

The "UTF" in UTF-7, UTF-8, UTF-16 (and UTF-32) stands for "**Unicode**
transformation format". They are all different encodings for Unicode and,
therefore, they all encode the whole Unicode range: from 000000 hex to
10FFFF hex (totalling 1114112 slots).

The difference between them lies, basically, in the size of the *encoding*
*unit*. The number in each UTF name is the encoding unit size in bits. E.g.,
UTF-8 uses 8-bit units (a.k.a. "octets" or "bytes"). This doesn't mean that
UTF-8 is limited to 256 slots; it just means that most Unicode characters
take up more than one unit (byte) to encode.
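
As a quick illustration of the point above, here is a minimal Python sketch that counts how many encoding units one character needs in each format. The helper name `code_units` is made up for this example; the codec names ("utf-8", "utf-16-be", "utf-32-be") are the standard Python ones, and the explicit big-endian variants are used so that no BOM is counted.

```python
def code_units(ch):
    """Count the encoding units needed for one character in each UTF.

    Hypothetical helper for illustration only. The "-be" codecs are used
    so that Python does not prepend a byte-order mark.
    """
    return {
        "UTF-8": len(ch.encode("utf-8")),            # 8-bit units
        "UTF-16": len(ch.encode("utf-16-be")) // 2,  # 16-bit units
        "UTF-32": len(ch.encode("utf-32-be")) // 4,  # 32-bit units
    }


# U+0041, U+00E9, U+20AC and U+1F600, written as escapes to stay ASCII-safe
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    print("U+%06X" % ord(ch), code_units(ch))
```

Every one of these characters is representable in all three formats; only the number of units changes.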

- UTF-32 has 32-bit units ("double words") and uses a single unit per
character. Double words 00110000 to FFFFFFFF are unused, because they exceed
the Unicode range. The encoding exists in 3 variants: little-endian,
big-endian, and a third one where endianness is specified by a leading mark
called BOM (byte-order mark).

- UTF-16 has 16-bit units ("words") and uses 1 or 2 units per character.
Characters 000000 to 00FFFF use the corresponding word; higher values use a
pair of "surrogates", the first one ("high") being in range D800 to DBFF and
the second one ("low") in range DC00 to DFFF. It too exists in the same 3
variants as above: little-endian, big-endian, and BOM-marked.

- UTF-8 has 8-bit units ("bytes") and uses 1 to 4 units per character.
Characters 000000 to 00007F are represented by the corresponding bytes in
range 00 to 7F; all other characters are represented by sequences of 2 or
more bytes in range 80 to FF. UTF-8 has no endianness problem, but it still
exists in two variants: with and without a leading BOM (which, in this case,
simply acts as a signature).

- UTF-7 has 7-bit units ("ASCII bytes"). Most "ASCII bytes" are used to
encode corresponding Unicode characters in range 000000 to 00007F; some
others (namely "+" and "-") are used as escape characters to switch the
meaning of bytes from "ASCII" to a numerical representation of Unicode.
Unicode characters (apart from those encoded as single bytes) take about 3
bytes each; to this you must add the "+"'s and "-"'s used to escape in and
out. The overall number of bytes needed for a Unicode character depends on
context.
UTF-7 has no endianness issue and no signature, but it is limited to values
000000 to 00FFFF, so it assumes a previous UTF-16 transformation to encode
higher values.
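
To make the UTF-16 and UTF-8 items above a bit more concrete, here is a rough Python sketch of the arithmetic behind them. The function names `utf16_units` and `utf8_bytes` are invented for this example and are not part of any library; the sketch assumes the surrogate ranges D800 to DBFF (high) and DC00 to DFFF (low) and the usual 1-to-4-byte UTF-8 layout. In real code you would simply call `str.encode`.

```python
def _check(cp):
    # Valid input: 000000..10FFFF, excluding the surrogate block D800..DFFF
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable Unicode value: %r" % (cp,))


def utf16_units(cp):
    """Return the list of 16-bit units (1 or 2) for code point cp."""
    _check(cp)
    if cp <= 0xFFFF:
        return [cp]                      # the "corresponding word"
    v = cp - 0x10000                     # 20 bits left to split
    high = 0xD800 | (v >> 10)            # top 10 bits -> high surrogate
    low = 0xDC00 | (v & 0x3FF)           # bottom 10 bits -> low surrogate
    return [high, low]


def utf8_bytes(cp):
    """Return the UTF-8 byte sequence (1 to 4 bytes) for code point cp."""
    _check(cp)
    if cp <= 0x7F:                       # 1 byte, in range 00..7F
        return bytes([cp])
    if cp <= 0x7FF:                      # 2 bytes
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:                     # 3 bytes
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),     # 4 bytes
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Cross-check the hand-rolled versions against Python's own codecs
for cp in (0x41, 0xE9, 0x20AC, 0x1F600, 0x10FFFF):
    raw16 = chr(cp).encode("utf-16-be")
    builtin16 = [int.from_bytes(raw16[i:i + 2], "big")
                 for i in range(0, len(raw16), 2)]
    assert utf16_units(cp) == builtin16
    assert utf8_bytes(cp) == chr(cp).encode("utf-8")
```

Note that every byte of a multi-byte UTF-8 sequence has its top bit set, which is why such sequences never collide with the single bytes 00 to 7F.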

Hope this helps.

_ Marco


