From: Mark Crispin (mrc+unicode@panda.com)
Date: Tue Aug 25 2009 - 14:12:15 CDT
On Tue, 25 Aug 2009, alopecoid wrote:
> I know that the ASCII characters are the same in UTF-8 as they are in
> ASCII. I also know that, in general, UTF-8 characters can be anywhere
> between 1 and 4 bytes. My question is: can the byte values for the
> ASCII characters appear by chance as the bytes in the 2nd to 4th
> positions of other UTF-8 characters?
No. All continuation bytes have the high order bit set, and the next bit
clear. Put another way, the 2nd to 4th positions are always between 0x80
and 0xbf.
What's more, the 1st byte is NEVER between 0x80 and 0xbf. The 1st byte is
always in the range 0x00 - 0x7f or 0xc2 - 0xf4.
That means that bytes 0xc0 - 0xc1 and 0xf5 - 0xff can never occur in
UTF-8. In theory, 0xf5 - 0xfd could be used to represent 31-bit
codepoints outside of the Unicode space; but that is explicitly NOT UTF-8.
There are other restrictions. Refer to page 104 of the Unicode 5.0
standard.
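To make the byte ranges concrete, here is a rough Python sketch that sorts a single byte value into the role it can play in well-formed UTF-8. The function name classify_byte and the labels it returns are made up for illustration; it only looks at one byte in isolation and does not check the further restrictions the standard places on particular 2nd bytes.

```python
def classify_byte(b):
    """Classify one byte value (0-255) by the role it can play in UTF-8.

    Hypothetical helper for illustration; the labels are not standard terms.
    """
    if not 0 <= b <= 0xFF:
        raise ValueError("not a byte value")
    if b <= 0x7F:
        return "ascii"         # 0xxxxxxx: a complete one-byte character
    if b <= 0xBF:
        return "continuation"  # 10xxxxxx: only ever a 2nd to 4th byte
    if b <= 0xC1:
        return "invalid"       # 0xc0, 0xc1: never occur in UTF-8
    if b <= 0xDF:
        return "lead2"         # 1st byte of a 2-byte character
    if b <= 0xEF:
        return "lead3"         # 1st byte of a 3-byte character
    if b <= 0xF4:
        return "lead4"         # 1st byte of a 4-byte character
    return "invalid"           # 0xf5 - 0xff: never occur in UTF-8


# CR and LF are plain ASCII, so they can never be a continuation byte.
print(classify_byte(0x0D), classify_byte(0x0A))
```

The ASCII range and the continuation range do not overlap, which is the whole point: a byte below 0x80 is always a character by itself.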
> Is it safe to
> assume that if I encounter a CR (carriage return, '\r') byte or a LF
> (line feed, '\n') byte, that this byte belongs to its own single byte
> character value?
Yes.
> Or can the 8-bits that make up a CR or LF byte just
> happen to exist in another multi-byte character as bytes 2 through 4
> of that character?
I assume that you'll be happy to hear that the answer is "no".
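In practice this means you can look for line breaks in the raw bytes without decoding first. A small Python sketch, assuming the input is well-formed UTF-8 (the function name split_lines_raw and the sample text are invented for the example):

```python
import re


def split_lines_raw(data):
    """Split UTF-8 bytes on CRLF, CR, or LF without decoding them first.

    Hypothetical helper; assumes data is well-formed UTF-8.
    """
    # Every byte of a multi-byte character is 0x80 or above, so a 0x0d or
    # 0x0a byte found here can only be a real CR or LF.
    return re.split(rb"\r\n|\r|\n", data)


sample = "na\u00efve\r\n\u65e5\u672c\u8a9e\n\U0001f600 done".encode("utf-8")
pieces = split_lines_raw(sample)

# Each piece is still valid UTF-8, since no character was cut in half.
decoded = [p.decode("utf-8") for p in pieces]
print(len(decoded))
```

The same reasoning applies to any other ASCII delimiter, such as a comma or a NUL byte.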
-- Mark --
http://panda.com/mrc
Democracy is two wolves and a sheep deciding what to eat for lunch.
Liberty is a well-armed sheep contesting the vote.