Re: Unicode = 3 ASCII Characters? (again)

From: John Cowan (jcowan@reutershealth.com)
Date: Mon May 15 2000 - 16:05:19 EDT


Wilbur Wong wrote:

> > So can I conclude that 1 Unicode Character = 3 Ascii Characters?

No, but one Unicode character in UTF-8 representation is *at most* 4 bytes.
Current Unicode characters can all be represented in 3 bytes or less,
but this is going to change soon. So allow 40 bytes for a 10-character
UTF-8 value.
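
As a rough illustration of that sizing rule, here is a small Python sketch.
The names MAX_UTF8_BYTES_PER_CHAR and max_utf8_bytes are made up for this
example, and the sample characters are arbitrary picks, one per encoded
length.

```python
# Assumption: at most 4 bytes per character, as stated above.
MAX_UTF8_BYTES_PER_CHAR = 4


def max_utf8_bytes(n_chars: int) -> int:
    """Worst-case number of bytes needed to hold n_chars characters in UTF-8."""
    return n_chars * MAX_UTF8_BYTES_PER_CHAR


# One sample character for each encoded length.
samples = ["A", "\u00e9", "\u20ac", "\U00010000"]
lengths = [len(s.encode("utf-8")) for s in samples]

print(max_utf8_bytes(10))  # 40
print(lengths)             # [1, 2, 3, 4]
```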

> > BTW, is there any way that I can find out the Unicode code number of that
> > Unicode character by the information given in the 3 Ascii Characters?

You need to think of the 1-4 bytes as *bytes*, not as "ASCII characters".
Based on that, the following rules work:

If the 1st byte is 0-127, then it is a Unicode value all by itself.
If the 1st byte is 128-191, it is an error.
If the 1st byte is 192-223, it is a 2-byte Unicode value, namely
        (1st byte - 192) * 64 + (2nd byte - 128).
If the 1st byte is 224-239, it is a 3-byte Unicode value, namely
        (1st byte - 224) * 4096 + (2nd byte - 128) * 64 + (3rd byte - 128).
If the 1st byte is 240-247, it is a 4-byte Unicode value, namely
        (1st byte - 240) * 262144 + (2nd byte - 128) * 4096 +
                (3rd byte - 128) * 64 + (4th byte - 128).
If the 1st byte is 248-255, it is an error.
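
The rules above can be sketched in Python roughly as follows. This is only
an illustration, not a complete validator: the name decode_first is made up
for the example, and the check that trailing bytes fall in 128-191 is an
added assumption that the rules do not spell out. It treats 240-247 as the
4-byte lead bytes, since those are the values of the form 11110xxx. It does
not reject overlong forms or surrogate values.

```python
from typing import Tuple


def decode_first(data: bytes) -> Tuple[int, int]:
    """Return (code point, bytes consumed) for the first character in data."""
    if not data:
        raise ValueError("empty input")
    b0 = data[0]
    if b0 <= 127:
        return b0, 1                      # a Unicode value all by itself
    if b0 <= 191:
        raise ValueError("trailing byte found where a 1st byte was expected")
    if b0 <= 223:
        need, value = 2, b0 - 192
    elif b0 <= 239:
        need, value = 3, b0 - 224
    elif b0 <= 247:
        need, value = 4, b0 - 240
    else:
        raise ValueError("invalid 1st byte")
    if len(data) < need:
        raise ValueError("truncated sequence")
    for b in data[1:need]:
        # Assumption: every byte after the first must be in 128-191.
        if not 128 <= b <= 191:
            raise ValueError("expected a trailing byte in 128-191")
        # Multiplying by 64 at each step reproduces the 64 / 4096 / 262144
        # factors in the formulas.
        value = value * 64 + (b - 128)
    return value, need
```

For example, decode_first(b"\xe2\x82\xac") gives (8364, 3), since
(226 - 224) * 4096 + (130 - 128) * 64 + (172 - 128) = 8364, which is U+20AC.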

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,          || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)


