Re: Looking For Information

From: John Cowan (jcowan@reutershealth.com)
Date: Tue Jun 27 2000 - 12:12:07 EDT


"AUFDERHEIDE HARRY R. (app1hra)" wrote:

> 1. Is the UTF-8's character set equal to the Latin-1 (ASCII) Code Page's?

No. UTF-8 and UTF-16 support the exact same repertoire of 41,000+ characters,
a superset of essentially every character set now in use.

> If not, what are the differences?

Latin-1 uses a single byte per character and encodes 256 characters.
UTF-8 uses 1 to 4 bytes per character, depending on the character, and encodes
all of Unicode's repertoire. Since every character in the ASCII repertoire
is encoded as the same single byte, UTF-8 is upward compatible with ASCII,
but *not* with Latin-1 as such: the Latin-1 characters in the range 0x80-0xFF
take two bytes each in UTF-8.
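
To make the byte counts concrete, here is a minimal sketch in Python (chosen
only for illustration; the helper name encoded_lengths is made up for this
example). It uses the standard str.encode() method to compare how a few
sample characters come out in Latin-1 and in UTF-8.

```python
# Compare how the same characters are encoded in Latin-1 and in UTF-8.
# encoded_lengths() is a hypothetical helper written for this illustration.

def encoded_lengths(ch):
    """Return (latin1_len or None, utf8_len) for a single character."""
    try:
        latin1_len = len(ch.encode("latin-1"))
    except UnicodeEncodeError:
        latin1_len = None  # character is outside the Latin-1 repertoire
    return latin1_len, len(ch.encode("utf-8"))

# ASCII letter, Latin-1 letter, euro sign, a CJK ideograph, and the first
# code point beyond U+FFFF.
samples = ["A", "\u00e9", "\u20ac", "\u4e2d", "\U00010000"]

for ch in samples:
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```

The ASCII letter is one byte in both encodings, the Latin-1 letter grows from
one byte to two, and the remaining samples cannot be represented in Latin-1
at all.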

> Under the assumption that it is substantially the same; I don't see
> it solving our problems
> as we are currently processing more characters than this can
> support. It certainly doesn't
> appear a solution for handling Chinese, Japanese, etc.
>
> This leads me to the UTF-16 format with its double byte capability.

In UTF-16, essentially all characters are supported in 2 bytes each. Some
not-yet-assigned characters will require two consecutive 2-byte codes.
These special codes ("surrogates") are assigned from a range that does not
conflict with normal characters.
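
As a rough illustration of how a surrogate pair is formed, the sketch below
computes the two 16-bit codes for a code point above U+FFFF. The function
name to_surrogates is made up for this example; the arithmetic follows the
UTF-16 definition (high surrogates in D800-DBFF, low surrogates in DC00-DFFF).

```python
# Split a supplementary code point into a UTF-16 surrogate pair.
# to_surrogates() is a hypothetical helper written for this illustration.

def to_surrogates(code_point):
    """Return the (high, low) surrogate codes for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1D11E)
print(f"{high:04X} {low:04X}")
```

Characters at or below U+FFFF are stored directly as a single 16-bit code, so
the function rejects them rather than inventing a pair.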

> What about "C" languages?

There are excellent libraries freely available for C/C++. Java has built-in
support.

> What else should we be aware of?

Lots; see http://www.unicode.org

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)


