From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Feb 05 2003 - 13:40:09 EST
Erik Ostermueller wrote:
> I'm dealing with an API that claims it doesn't support
> unicode characters with embedded nulls.
> I'm trying to figure out how much of a liability this is.
If by "embedded nulls" they mean bytes of value zero, that library can
*only* work with UTF-8. The other two UTF's cannot be supported in this way.
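A quick way to see the difference (a sketch at a Python prompt, using its
built-in codecs; any language with Unicode support would show the same):

    # U+0041 'A': no zero byte in its UTF-8 form, but the UTF-16 and
    # UTF-32 forms pad it with zero bytes.
    >>> 'A'.encode('utf-8')
    b'A'
    >>> 'A'.encode('utf-16-be')
    b'\x00A'
    >>> 'A'.encode('utf-32-be')
    b'\x00\x00\x00A'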
But are you sure you understood correctly? Didn't they perhaps write "Unicode
*strings* with embedded nulls"? In that case, they could have meant null
*characters* inside strings: i.e., they don't support strings containing the
Unicode character U+0000, because that code is used as a string terminator.
If so, it would be a common and accepted limitation.
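That is the limitation every NUL-terminated API (e.g., C's strlen and
friends) imposes. A small sketch of the effect, again at a Python prompt:

    # A string containing U+0000 encodes to bytes containing 0x00 ...
    >>> 'ab\u0000cd'.encode('utf-8')
    b'ab\x00cd'

A C-style API handed that buffer would stop at the 0x00 byte and see only
"ab".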
> What is my best plan of attack for discovering precisely
> which code points have embedded nulls
> given a particular encoding? Didn't find it in the maillist archive.
> I've googled for quite a while with no luck.
The question doesn't quite make sense as stated: code points themselves don't
have embedded nulls; it is their *encoded forms* that may contain zero bytes.
However:
UTF-8: Only one character is affected (U+0000 itself);
UTF-16: In the range U+0000..U+FFFF (the Basic Multilingual Plane), there are
of course exactly 511 code points affected (all those of the form U+00xx or
U+xx00), 484 of which are actually assigned. However, eight of these code
points (U+D800, U+D900, ... U+DF00) are high or low surrogates, which means
that many characters in the range U+10000..U+10FFFF are also affected (a
quick count of both figures is sketched after this list).
UTF-32: All characters are affected, because the high byte of a UTF-32 code
unit is always 0x00 (the highest code point, U+10FFFF, fits in three bytes).
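The UTF-16 figures are easy to verify with a few lines of Python (a sketch;
the supplementary-plane figure is computed here, not quoted from any
standard):

    # BMP code points whose single UTF-16 code unit contains a 0x00
    # byte: high byte zero (U+00xx) or low byte zero (U+xx00).
    bmp = sum(1 for cp in range(0x10000)
              if cp <= 0xFF or (cp & 0xFF) == 0)
    print(bmp)    # 511

    # Eight of those 511 are surrogate code points:
    print([hex(cp) for cp in range(0xD800, 0xE000) if (cp & 0xFF) == 0])
    # ['0xd800', '0xd900', '0xda00', '0xdb00',
    #  '0xdc00', '0xdd00', '0xde00', '0xdf00']

    # Supplementary characters whose surrogate pair contains a 0x00 byte:
    supp = sum(1 for cp in range(0x10000, 0x110000)
               if b'\x00' in chr(cp).encode('utf-16-be'))
    print(supp)   # 8176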
> I'll want to do this for a few different versions of unicode
> and a few different encodings.
Most single-byte and double-byte legacy encodings behave like UTF-8 in this
respect (i.e., a zero byte appears only in the encoding of U+0000 itself).
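For any particular legacy encoding this is easy to check exhaustively; here
is a sketch using Python's codecs, with Shift-JIS as an arbitrary example:

    # Find every code point whose Shift-JIS encoding contains a 0x00 byte.
    hits = []
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:      # surrogates are not characters
            continue
        try:
            data = chr(cp).encode('shift_jis')
        except UnicodeEncodeError:
            continue                     # not representable in Shift-JIS
        if b'\x00' in data:
            hits.append(cp)
    print([hex(cp) for cp in hits])      # ['0x0'] -- only U+0000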
> What if I write a program using some of the data files
> available at unicode.org?
> Am I crazy (I'm new at this stuff) or am I getting warm?
> Perhaps this data file:
> http://www.unicode.org/Public/UNIDATA/UnicodeData.txt ?
>
> Algorithm:
> INPUT: Name of unicode code point file
> INPUT: Name of encoding (perhaps UTF-8)
>
> Read code point from file.
> Expand code point to encoded format for the given encoding.
> Test all constituent bytes for 0x00.
> Goto next code point from file.
That would be totally useless, I am afraid.
The only UTF for which this count makes sense is UTF-8, and the result is
"one".
_ Marco