Re: Do the CR & LF bytes in UTF-8 ONLY exist in this form?

From: John (Eljay) Love-Jensen (eljay@adobe.com)
Date: Tue Aug 25 2009 - 13:48:41 CDT

    Hi alopecoid,

    > can the byte values for the ASCII characters appear by chance as the bytes in
    the 2nd to 4th positions of other UTF-8 characters?

    No. Only bytes in the range 0x80 - 0xBF appear in the 2nd to 4th positions.

    > Is it safe to assume that if I encounter a CR (carriage return, '\r') byte or
    a LF (line feed, '\n') byte, that this byte belongs to its own single byte
    character value?

    Yes.

    > Or can the 8-bits that make up a CR or LF byte just happen to exist in another
    multi-byte character as bytes 2 through 4 of that character?

    No.

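    All "trailing" UTF-8 code units have the bit pattern 10xxxxxx, so they
    always fall in the range 0x80 - 0xBF, safely avoiding '\n' (0x0A) and '\r'
    (0x0D). Lead bytes of multi-byte sequences also have the high bit set, so
    no byte below 0x80 ever appears inside a multi-byte character.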
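
The claim above can be checked by brute force. The following is a minimal sketch, assuming Python 3; the function names (is_continuation_byte, multibyte_encodings_avoid_ascii, split_lines_bytes) are hypothetical and chosen only for illustration. It encodes every code point that needs 2 to 4 bytes and confirms that none of the resulting bytes falls in the ASCII range.

```python
def is_continuation_byte(b: int) -> bool:
    """True if b has the bit pattern 10xxxxxx, i.e. lies in 0x80-0xBF."""
    return (b & 0xC0) == 0x80


def multibyte_encodings_avoid_ascii() -> bool:
    """Exhaustively check every code point that UTF-8 encodes in 2-4 bytes."""
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogate code points cannot be encoded as UTF-8
        encoded = chr(cp).encode("utf-8")
        # No byte of a multi-byte sequence may fall in the ASCII range.
        if any(b < 0x80 for b in encoded):
            return False
        # Every byte after the lead byte must be a continuation byte.
        if not all(is_continuation_byte(b) for b in encoded[1:]):
            return False
    return True


def split_lines_bytes(data: bytes) -> list:
    """Split raw UTF-8 data on LF bytes without decoding it first."""
    return data.split(b"\n")
```

Because a 0x0A or 0x0D byte can only be a complete character on its own, splitting well-formed UTF-8 on those bytes before decoding never cuts a multi-byte character in half; each piece returned by split_lines_bytes can be decoded independently.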

    > I hope my question is clear.

    Yes.

    > Thank you.

    You're welcome.

    Sincerely,
    --Eljay


