RE: Looking For Information

From: Marco.Cimarosti@icl.com
Date: Wed Jun 28 2000 - 05:03:16 EDT


Harry R Aufderheide wrote:
> 1. Is the UTF-8's character set equal to the Latin-1 (ASCII)
> Code Page's? If not, what are the differences?

As Brendan Murray already mentioned, UTF-8 is an encoding form of Unicode,
so it supports *all* Unicode characters.

In case you are wondering how this is possible with only 8 bits: UTF-8 is a
"variable-length" encoding (i.e. different characters take a different
number of bytes), designed to be as compatible as possible with ASCII-based
applications.

Unicode characters U+0000 to U+007F (the "ASCII range") are directly mapped
to bytes 00 to 7F (hex).

All other Unicode characters are represented by sequences of bytes in the
range 80 to FF (hex), in this way:
- U+0080 to U+07FF (European and Middle-East scripts) require *2* bytes,
the first one being in range C2 to DF.
- U+0800 to U+FFFF (the rest of the 16-bit range) require *3* bytes, the
first one being in range E0 to EF.
- U+10000 to U+10FFFF (the new frontier, currently unallocated) require
*4* bytes, the first one being in range F0 to F4.
In every case, the bytes after the first one are in range 80 to BF.
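
The ranges above can be sketched in C roughly as follows. This is only a
minimal illustration, not a reference implementation: the names utf8_encode
and utf8_seq_len are made up for this example, and the code assumes the
caller supplies a buffer of at least 4 bytes. Lead bytes C0 and C1 are
rejected because they could only start an over-long 2-byte form.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: encode one code point as UTF-8.
 * Writes up to 4 bytes into out and returns how many were written,
 * or 0 if cp is a surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                       /* ASCII range: 1 byte */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp <= 0x7FF) {                      /* 2 bytes, lead C2..DF */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp <= 0xFFFF) {                     /* 3 bytes, lead E0..EF */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 4 bytes, lead F0..F4 */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* beyond the Unicode range */
}

/* Hypothetical helper: length of a sequence, judged by its first byte.
 * Returns 0 for a continuation byte (80..BF) or an invalid lead byte. */
int utf8_seq_len(unsigned char lead)
{
    if (lead <= 0x7F)                 return 1;
    if (lead >= 0xC2 && lead <= 0xDF) return 2;
    if (lead >= 0xE0 && lead <= 0xEF) return 3;
    if (lead >= 0xF0 && lead <= 0xF4) return 4;
    return 0;
}
```

For example, U+00E9 (e with acute accent) comes out as the two bytes C3 A9,
and U+20AC (the euro sign) as the three bytes E2 82 AC.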

> What about "C" languages?

The buzzwords to look for are: "wchar_t" (the type that represents wide
characters) and "wchar.h", "wctype.h" (the header files for wide-character
handling). Both are part of standard ANSI/ISO C (the two headers were added
by the 1995 amendment) and should be implemented in any compliant C library.
However, whether they support Unicode or some other sort of "wide
characters" depends on the implementation.
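
As a small sketch of what wide-character code looks like, the function
below walks a wchar_t string and counts its letters with iswalpha() from
"wctype.h". The name count_letters is made up for this example. Only ASCII
letters are safe to rely on here: which other characters count as letters,
and whether a wchar_t value is a Unicode code point at all, depends on the
implementation and the current locale.

```c
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the alphabetic characters in a
 * null-terminated wide string, as classified by the current locale. */
size_t count_letters(const wchar_t *s)
{
    size_t n = 0;
    for (; *s != L'\0'; ++s) {
        if (iswalpha((wint_t)*s))   /* wide counterpart of isalpha() */
            ++n;
    }
    return n;
}
```

Wide string literals are written with an L prefix, so a call looks like
count_letters(L"ab1 c"). The string functions in "wchar.h" mirror the
familiar ones from "string.h", e.g. wcslen() in place of strlen().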

_ Marco



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:05 EDT