Re: Unicode = 3 Acsii Characters? (again)

From: John Cowan (jcowan@reutershealth.com)
Date: Mon May 15 2000 - 16:05:19 EDT

Next message: Roozbeh Pournader: "Re: COPYLEFT SIGN"
Previous message: John Cowan: "Re: COPYLEFT SIGN"
Maybe in reply to: Wilbur Wong: "Unicode = 3 Acsii Characters? (again)"

Wilbur Wong wrote:

> > So can I conclude that 1 Unicode Character = 3 ASCII Characters?

No, but one Unicode character in UTF-8 representation is *at most* 4 bytes.
Current Unicode characters can all be represented in 3 bytes or less,
but this is going to change soon. So allow 40 bytes for a 10-character
UTF-8 value.
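
To make the sizing rule concrete, here is a small Python sketch. The helper names (`max_utf8_bytes`, `utf8_length`) and the sample characters are illustrative assumptions, not part of the original message; it relies only on Python's built-in UTF-8 encoder.

```python
# Worst-case sizing: a Unicode character takes at most 4 bytes in UTF-8.
MAX_BYTES_PER_CHAR = 4

def max_utf8_bytes(num_chars):
    """Upper bound on the UTF-8 size of a value holding num_chars characters."""
    return MAX_BYTES_PER_CHAR * num_chars

def utf8_length(ch):
    """Number of bytes Python's built-in encoder uses for one character."""
    return len(ch.encode("utf-8"))

# One sample character for each possible length:
# U+0041 (A), U+00E9 (e with acute), U+20AC (euro sign), U+10348 (Gothic hwair).
samples = ["A", "\u00e9", "\u20ac", "\U00010348"]
lengths = [utf8_length(ch) for ch in samples]

print(lengths)             # [1, 2, 3, 4]
print(max_utf8_bytes(10))  # 40
```

The 4-byte case only arises for characters above U+FFFF, which is why a 3-bytes-per-character allowance is not safe in general.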

> > BTW, is there any way that I can find out the Unicode code number of that
> > Unicode character from the information given in the 3 ASCII Characters?

You need to think of the 1-4 bytes as *bytes*, not as "ASCII characters".
Based on that, the following rules work:

If the 1st byte is 0-127, then it is a Unicode value all by itself.
If the 1st byte is 128-191, it is an error.
If the 1st byte is 192-223, it is a 2-byte Unicode value, namely
        (1st byte - 192) * 64 + (2nd byte - 128).
If the 1st byte is 224-239, it is a 3-byte Unicode value, namely
        (1st byte - 224) * 4096 + (2nd byte - 128) * 64 + (3rd byte - 128).
If the 1st byte is 240-247, it is a 4-byte Unicode value, namely
        (1st byte - 240) * 262144 + (2nd byte - 128) * 4096 +
                (3rd byte - 128) * 64 + (4th byte - 128).
If the 1st byte is 248-255, it is an error.
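
The rules above can be sketched in Python roughly as follows. The function name `decode_utf8_char` is hypothetical, and two things here are assumptions beyond the rules as stated: the check that every trailing byte lies in 128-191, and the rejection of lead bytes 248-255 (a 4-byte sequence can only start with 240-247). The sketch does not reject overlong forms, surrogates, or values above U+10FFFF, so it is not a full validator.

```python
def decode_utf8_char(data):
    """Return (code_point, bytes_used) for the character at the start of data.

    data is a non-empty bytes object. Raises ValueError on a bad lead byte,
    a truncated sequence, or a trailing byte outside 128-191.
    """
    first = data[0]
    if first <= 127:
        return first, 1                      # single byte stands for itself
    if first <= 191:
        raise ValueError("trailing byte where a lead byte was expected")
    if first <= 223:
        length, value = 2, first - 192
    elif first <= 239:
        length, value = 3, first - 224
    elif first <= 247:
        length, value = 4, first - 240
    else:
        raise ValueError("invalid lead byte")
    if len(data) < length:
        raise ValueError("truncated sequence")
    # Multiplying by 64 at each step yields the 64 / 4096 / 262144 weights.
    for b in data[1:length]:
        if not 128 <= b <= 191:              # extra check, assumed here
            raise ValueError("invalid trailing byte")
        value = value * 64 + (b - 128)
    return value, length

# The euro sign is the bytes 226, 130, 172 in UTF-8.
print(decode_utf8_char(bytes([226, 130, 172])))  # (8364, 3)
```

8364 is U+20AC: (226 - 224) * 4096 + (130 - 128) * 64 + (172 - 128) = 8192 + 128 + 44.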

-- 
Schlingt dreifach einen Kreis um dies! || John Cowan <jcowan@reutershealth.com>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,           || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.            -- Coleridge (tr. Politzer)

This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:02 EDT