"Valeriy E. Ushakov" wrote on 1999-09-26 17:10 UTC:
> > U+0000 = c0 80
>
> I believe that's exactly what JDK uses to encode U+0000 in utf-8
> encoded NUL terminated C strings to distinguish U+0000 which is part
> of a string from the terminating NUL.
It would probably help to avoid confusion if the Java documentation
introduced a new name for this encoding. Good and clear terminology is
never a bad thing.
Suggestion:
UTF-8Z = zero-free UTF-8 encoding, which differs from
UTF-8 only for one character, namely U+0000 = c0 80
But then, Java uses UTF-8Z only as an internal encoding, and not in its
UTF-8 I/O functions.
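As a minimal sketch of the definition suggested above (not Java's actual
implementation), the following Python functions convert between ordinary
UTF-8 and the zero-free variant. The names encode_utf8z and decode_utf8z
are made up for illustration. The sketch relies on two facts: in valid
UTF-8 the byte 00 occurs only as the encoding of U+0000, and the byte c0
never occurs at all, so a plain byte substitution is unambiguous.

```python
def encode_utf8z(text: str) -> bytes:
    """Encode text as UTF-8, but write U+0000 as the two bytes c0 80."""
    # In valid UTF-8 a 00 byte can only come from U+0000 itself.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def decode_utf8z(data: bytes) -> str:
    """Reverse encode_utf8z: map c0 80 back to 00, then decode as UTF-8."""
    # c0 is never produced by a conforming UTF-8 encoder, so every
    # c0 80 pair in the input must stand for U+0000.
    return data.replace(b"\xc0\x80", b"\x00").decode("utf-8")


if __name__ == "__main__":
    encoded = encode_utf8z("A\x00B")
    print(encoded.hex(" "))
    print(decode_utf8z(encoded) == "A\x00B")
```

The encoded form contains no 00 byte, so it can be stored in a
NUL-terminated C string without being cut short.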
I think it was a curious design decision:
I probably would have selected U+0000 = fe. This is as malformed as
c0 80, but has the big advantage that UTF-8 and UTF-8Z would then always
have had the same length. Note that fe and ff are unused in UTF-8.
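The length argument can be checked with a short Python sketch. The helper
names below are hypothetical, and the fe variant is only the alternative
proposed here, not something any implementation is known to use:

```python
def encode_c0_80_variant(text: str) -> bytes:
    """UTF-8 with U+0000 written as the two-byte sequence c0 80."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


def encode_fe_variant(text: str) -> bytes:
    """Hypothetical alternative: UTF-8 with U+0000 written as the byte fe."""
    # fe never appears in UTF-8 output, so it is free to stand for U+0000.
    return text.encode("utf-8").replace(b"\x00", b"\xfe")


def length_report(text: str) -> tuple[int, int, int]:
    """Return byte lengths as (plain UTF-8, c0 80 variant, fe variant)."""
    return (
        len(text.encode("utf-8")),
        len(encode_c0_80_variant(text)),
        len(encode_fe_variant(text)),
    )


if __name__ == "__main__":
    # A string with two embedded U+0000 characters.
    print(length_report("a\x00b\x00"))
```

With the c0 80 form, each U+0000 adds one byte relative to plain UTF-8;
with the fe form, the two lengths always agree.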
Markus
-- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>