Re: Do the CR & LF bytes in UTF-8 ONLY exist in this form?

From: Mark Crispin (mrc+unicode@panda.com)
Date: Tue Aug 25 2009 - 14:12:15 CDT


    On Tue, 25 Aug 2009, alopecoid wrote:
    > I know that the ASCII characters are the same in UTF-8 as they are in
    > ASCII. I also know that, in general, UTF-8 characters can be anywhere
    > between 1 and 4 bytes. My question is: can the byte values for the
    > ASCII characters appear by chance as the bytes in the 2nd to 4th
    > positions of other UTF-8 characters?

    No. All continuation bytes have the high order bit set, and the next bit
    clear. Put another way, the 2nd to 4th positions are always between 0x80
    and 0xbf.
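
    As a rough illustration (not part of the original reply), the bit
    pattern described above is 10xxxxxx, which can be tested with a mask.
    The sketch below assumes Python 3; the function names are made up for
    this example. It encodes every Unicode scalar value and collects the
    bytes that appear in positions 2 through 4.

```python
def is_continuation(byte):
    """True if the byte matches the pattern 10xxxxxx (high bit set, next bit clear)."""
    return (byte & 0xC0) == 0x80


def trailing_bytes_in_use():
    """Collect every byte value that appears after the 1st byte of any encoded character."""
    seen = set()
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values and cannot be encoded
        seen.update(chr(cp).encode("utf-8")[1:])
    return seen


if __name__ == "__main__":
    seen = trailing_bytes_in_use()
    # Lowest and highest byte values found in trailing positions.
    print(hex(min(seen)), hex(max(seen)))
    # Every trailing byte matches the mask test, so none can be ASCII.
    print(all(is_continuation(b) for b in seen))
```

    The exhaustive loop takes a moment to run, but it checks the claim
    directly rather than relying on the mask alone.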

    What's more, the 1st byte is NEVER between 0x80 and 0xbf. The 1st byte is
    always in the range 0x00 - 0x7f or 0xc2 - 0xf4.

    That means that bytes 0xc0 - 0xc1 and 0xf5 - 0xff can never occur in
    UTF-8. In theory, 0xf5 - 0xfd could be used to represent 31-bit
    codepoints outside of the Unicode space; but that is explicitly NOT UTF-8.
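
    The byte ranges can be summarized as a small classifier. This is only
    a sketch, assuming Python 3; the function name and the category labels
    are invented for the example. It looks at one byte in isolation and
    does not enforce the further restrictions on particular 2nd bytes.

```python
def classify(byte):
    """Classify a single byte value (0-255) by the role it can play in UTF-8."""
    if byte <= 0x7F:
        return "ascii"         # a complete single-byte character
    if byte <= 0xBF:
        return "continuation"  # 10xxxxxx: only valid in positions 2 through 4
    if byte <= 0xC1:
        return "invalid"       # would only start an overlong 2-byte form
    if byte <= 0xDF:
        return "lead2"         # 1st byte of a 2-byte character
    if byte <= 0xEF:
        return "lead3"         # 1st byte of a 3-byte character
    if byte <= 0xF4:
        return "lead4"         # 1st byte of a 4-byte character
    return "invalid"           # beyond the Unicode range


# Byte values that never appear anywhere in well-formed UTF-8.
NEVER_VALID = [b for b in range(256) if classify(b) == "invalid"]
```

    Since CR (0x0d) and LF (0x0a) fall in the "ascii" class, they can
    never be mistaken for a lead or continuation byte.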

    There are other restrictions. Refer to page 104 of the Unicode 5.0
    standard.

    > Is it safe to
    > assume that if I encounter a CR (carriage return, '\r') byte or a LF
    > (line feed, '\n') byte, that this byte belongs to its own single byte
    > character value?

    Yes.

    > Or can the 8-bits that make up a CR or LF byte just
    > happen to exist in another multi-byte character as bytes 2 through 4
    > of that character?

    I assume that you'll be happy to hear that the answer is "no".
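
    In practice this means a line splitter can work on the raw bytes and
    decode afterwards. A minimal sketch, assuming Python 3 and input that
    is already well-formed UTF-8; the function name is hypothetical:

```python
def split_utf8_lines(data):
    """Split raw UTF-8 bytes at CR, LF, or CRLF, then decode each piece.

    bytes.splitlines() breaks only at ASCII line boundaries, so it looks
    at nothing but the 0x0d and 0x0a byte values.
    """
    return [chunk.decode("utf-8") for chunk in data.splitlines()]


if __name__ == "__main__":
    # Mixed 2-byte and 3-byte characters around CRLF and LF terminators.
    raw = "naïve\r\n日本語\n€".encode("utf-8")
    for line in split_utf8_lines(raw):
        print(line)
```

    Each piece decodes cleanly because no split can land in the middle
    of a multi-byte character.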

    -- Mark --

    http://panda.com/mrc
    Democracy is two wolves and a sheep deciding what to eat for lunch.
    Liberty is a well-armed sheep contesting the vote.


