From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 05 2003 - 06:32:57 EST
From: "Abdij Bhat" <Abdij.Bhat@kshema.com>
> If a UNICODE strings is converted to UTF8, will the UTF8 encoded string
> contain and control character or escape sequences? If so, is it possible to
> eliminate the same?
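The multi-byte UTF-8 sequences used for non-ASCII characters never contain
bytes in the C0 control range (0x00 to 0x1F), but in many cases they do
contain bytes between 0x80 and 0x9F, the values that 8-bit encodings such as
ISO-8859-* use for C1 controls.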
UTF-8 keeps all 7-bit ASCII characters unchanged, and never uses a 7-bit
ASCII byte inside the sequence for a non-ASCII character (every byte of a
multi-byte UTF-8 sequence is 0x80 or above). The conversion to UTF-8 will
therefore never create an escape sequence that was not already in the text.
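
As a quick check of the point above, here is a minimal Python sketch. The
helper name and the three sample characters are my own choices for
illustration, not anything defined by the standard:

```python
def all_bytes_high(ch: str) -> bool:
    """Return True if every byte of the UTF-8 encoding of ch is >= 0x80."""
    return all(b >= 0x80 for b in ch.encode("utf-8"))

# All 128 ASCII code points, including the C0 controls and ESC (0x1B),
# are encoded as the identical single byte.
ascii_text = "".join(chr(c) for c in range(0x80))
ascii_unchanged = ascii_text.encode("utf-8") == bytes(range(0x80))

# Sample non-ASCII characters: U+00E9 (2 bytes), U+20AC (3 bytes),
# U+1F600 (4 bytes). None of their bytes falls in the ASCII range.
samples = ["\u00e9", "\u20ac", "\U0001F600"]
sample_results = [all_bytes_high(ch) for ch in samples]

print(ascii_unchanged, sample_results)
```

Since ESC is itself a 7-bit ASCII byte, it can only appear in the output
where it was already present in the input.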
But be warned that you should not create escape sequences containing bytes
of 0x80 or above after the leading escape (such bytes may conflict with a
UTF-8 decoder). If your escape sequences are made only of 7-bit ASCII
bytes, then this is safe, and you can mix plain-text ASCII, C0 controls,
escape sequences and UTF-8 sequences for non-ASCII characters.
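
The mixing described above can be sketched as follows. The ANSI-style
"bold on/off" sequences are only an assumed example of a 7-bit escape
sequence, and the variable names are hypothetical:

```python
ESC = "\x1b"
bold_on = ESC + "[1m"   # assumed 7-bit escape sequence, for illustration
bold_off = ESC + "[0m"

# Non-ASCII text wrapped in 7-bit escape sequences, then encoded as UTF-8.
text = bold_on + "h\u00e9llo \u20ac" + bold_off
data = text.encode("utf-8")

# The ESC byte occurs only where an escape sequence was written.
esc_count = data.count(b"\x1b")

# The escape sequences can be removed at the byte level without damaging
# the UTF-8 sequences around them.
payload = data.replace(bold_on.encode("ascii"), b"")
payload = payload.replace(bold_off.encode("ascii"), b"")
decoded = payload.decode("utf-8")

# By contrast, a byte >= 0x80 right after ESC is seen by a strict UTF-8
# decoder as a stray continuation byte.
try:
    b"\x1b\x90abc".decode("utf-8")
    high_byte_rejected = False
except UnicodeDecodeError:
    high_byte_rejected = True
```

Here the strict decoder rejects the stream outright. A high byte that
happens to form a valid UTF-8 sequence would instead be silently read as a
character, which is arguably worse.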
Note that the C1 controls of Unicode and ISO-8859-* (U+0080 to U+009F) will
be converted to a pair of bytes in UTF-8, with the first byte being 0xC2 and
the second byte varying between 0x80 and 0x9F (so a C1 control appears in
UTF-8 as a 0xC2 "prefix" followed by the same byte that encodes it in
ISO-8859-*).
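
A small Python sketch of the C1 mapping, assuming Python's "latin-1" codec
as a stand-in for ISO-8859-1 with its C1 controls (variable names are
illustrative):

```python
# Encode each C1 control (U+0080..U+009F) as UTF-8.
c1_encodings = {cp: chr(cp).encode("utf-8") for cp in range(0x80, 0xA0)}

# Each one is exactly two bytes: 0xC2, then the byte equal to the code point.
all_prefixed = all(
    enc == bytes([0xC2, cp]) for cp, enc in c1_encodings.items()
)

# Example with NEL (U+0085): one byte in latin-1, two bytes in UTF-8.
nel_latin1 = "\u0085".encode("latin-1")
nel_utf8 = "\u0085".encode("utf-8")

print(all_prefixed, nel_latin1.hex(), nel_utf8.hex())
```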