From: alopecoid (alopecoid@gmail.com)
Date: Tue Aug 25 2009 - 13:33:18 CDT
Hi,
I am having difficulty finding the answer to this question, so I
figured this might be the best place to ask.
I know that the ASCII characters are encoded the same way in UTF-8 as
they are in ASCII. I also know that, in general, a UTF-8 character is
encoded as anywhere from 1 to 4 bytes. My question is: can the byte
values for the ASCII characters appear by chance as the bytes in the
2nd to 4th positions of other UTF-8 characters?
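
One way to test this empirically would be a brute-force scan like the sketch below. It is Python, and the function name is just something made up for illustration. It encodes every code point above the ASCII range and records any whose UTF-8 bytes include a value below 0x80. If ASCII byte values never reappear inside multi-byte characters, the result should be an empty list, though a scan like this only shows what one encoder does and not what the encoding guarantees.

```python
def ascii_bytes_in_multibyte_chars():
    """Return the code points whose multi-byte UTF-8 encoding
    contains a byte in the ASCII range (below 0x80)."""
    offenders = []
    for cp in range(0x80, 0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            # Surrogate code points cannot be encoded in UTF-8; skip them.
            continue
        encoded = chr(cp).encode("utf-8")
        if any(b < 0x80 for b in encoded):
            offenders.append(cp)
    return offenders
```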
For example, let's say that I would like to read lines from a UTF-8
encoded text file, but I don't need to actually decode each line... I
just need to store the UTF-8 encoded lines somewhere. Is it safe to
assume that if I encounter a CR (carriage return, '\r') byte or an LF
(line feed, '\n') byte, that byte is its own single-byte character?
Or can the 8 bits that make up a CR or LF byte just happen to occur
inside another multi-byte character, as one of bytes 2 through 4 of
that character?
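
To make the scenario concrete, here is a rough sketch (again Python, with a hypothetical helper name) of the kind of byte-level line splitting I have in mind. It never decodes anything. It just scans for the CR and LF byte values and slices the buffer there, so it is only correct if those byte values cannot occur inside a multi-byte character.

```python
def split_lines_raw(data):
    """Split UTF-8 encoded bytes into lines on CR, LF, or CRLF
    without decoding. Returns a list of bytes objects."""
    lines = []
    start = 0
    i = 0
    n = len(data)
    while i < n:
        b = data[i]
        if b == 0x0A or b == 0x0D:
            lines.append(data[start:i])
            # Treat a CR immediately followed by LF as one line break.
            if b == 0x0D and i + 1 < n and data[i + 1] == 0x0A:
                i += 1
            start = i + 1
        i += 1
    # Keep a final line that has no trailing line break.
    if start < n:
        lines.append(data[start:])
    return lines
```

The returned pieces are still raw UTF-8 bytes and can be stored as they are, or decoded later if needed.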
I hope my question is clear.
Thank you.