From: Otto Stolz (Otto.Stolz@uni-konstanz.de)
Date: Wed Feb 05 2003 - 13:43:25 EST
Erik.Ostermueller@alltel.com wrote:
> I'm dealing with an API that claims it doesn't support unicode characters with embedded nulls.
...
> Test all constituent bytes for 0x00.
This depends on the encoding form you are using (and the API is expecting):
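- UTF-8 encodes a Unicode string into a sequence of bytes;
this sequence contains a 0x00 byte only where the string itself
contains the character U+0000 (NUL); no other character produces one.
Btw., ASCII characters are encoded as the same single bytes as in ASCII.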
- UTF-16 encodes a Unicode string into a sequence of 16-bit units,
hence it makes no sense to look at this encoding bytewise.
If you nevertheless treat a 16-bit unit as a sequence of two bytes
(repeat: this is a no-no), then you will most probably find
0x00 bytes therein; in particular, every ASCII character is
encoded as a sequence of the respective ASCII byte and a 0x00 byte
(either byte order is possible, depending on endianness, cf.
<http://www.unicode.org/faq/utf_bom.html>).
- UTF-32 encodes a Unicode string into a sequence of 32-bit units,
hence it makes no sense to look at this encoding bytewise.
If you nevertheless treat a 32-bit unit as a sequence of four bytes
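(repeat: this is a no-no), then you will certainly find
0x00 bytes therein, because no code point exceeds U+10FFFF and
the most significant byte of every unit is therefore 0x00;
in particular, every ASCII character is encoded as a sequence of
the respective ASCII byte and three 0x00 bytes (again, in either
byte order).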
Best wishes,
Otto Stolz