Markus Kuhn wrote:
> [...] LF =
> U+000A = 0x0a = 0xc0 0x8a = 0xe0 0x80 0x8a = 0xf0 0x80 0x80 0x8a = ...
> can be encoded in many ways legally under UTF-8 [...]
Not at all. The Unicode Standard (Appendix A) says to encode
every character in the shortest possible form, so 0x0a is the
only legal UTF-8 encoding of LF.
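
To make the shortest-form rule concrete, here is a minimal sketch in Java. The class and method names (ShortestForm, lenientDecode, isShortestForm) are made up for illustration and come from no library. The "lenient" decoder just trusts the lead byte and concatenates payload bits, which is how the longer sequences quoted above all come out as U+000A. The check then compares the sequence length with the minimum needed for that code point. It does not check for surrogates or values above U+10FFFF.

```java
public class ShortestForm {
    // Minimal number of UTF-8 bytes needed to encode a code point.
    public static int shortestLength(int cp) {
        if (cp < 0x80) return 1;
        if (cp < 0x800) return 2;
        if (cp < 0x10000) return 3;
        return 4;
    }

    // Lenient decode of exactly one sequence: trust the lead byte's length
    // and concatenate the payload bits. Returns -1 if the sequence is
    // structurally malformed (bad lead byte, wrong length, bad trail byte).
    public static int lenientDecode(int... bytes) {
        int lead = bytes[0] & 0xFF;
        int len, cp;
        if (lead < 0x80) { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else return -1;
        if (bytes.length != len) return -1;
        for (int i = 1; i < len; i++) {
            int b = bytes[i] & 0xFF;
            if ((b & 0xC0) != 0x80) return -1;   // not a 10xxxxxx trail byte
            cp = (cp << 6) | (b & 0x3F);
        }
        return cp;
    }

    // True only if the sequence uses the minimum number of bytes.
    public static boolean isShortestForm(int... bytes) {
        int cp = lenientDecode(bytes);
        return cp >= 0 && bytes.length == shortestLength(cp);
    }

    public static void main(String[] args) {
        int[][] candidates = {
            {0x0A}, {0xC0, 0x8A}, {0xE0, 0x80, 0x8A}, {0xF0, 0x80, 0x80, 0x8A}
        };
        for (int[] seq : candidates) {
            System.out.printf("U+%04X from %d byte(s): shortest form? %b%n",
                    lenientDecode(seq), seq.length, isShortestForm(seq));
        }
    }
}
```

All four candidates decode leniently to U+000A, but only the one-byte form passes isShortestForm.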
> The fact that Java abuses the 2-byte encoding of the U+0000 (0xc0 0x80)
> to get C string binary transparency for NUL has effectively established
> the practice of using overlong UTF-8 sequences as a hack. :-(
But only in their private protocol for reading and writing String
objects in *binary* files. The names readUTF() and writeUTF() are
misleading, in that what is read or written there is not real UTF-8.
The UTF-8 codec for InputStreamReader goes by the rules.
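
The difference between the two code paths can be seen with a short Java sketch. The class and helper names (ModifiedUtf8Demo, viaWriteUTF, viaCharset, strictDecoderAccepts) are invented for illustration. It uses StandardCharsets, which assumes Java 7 or later, and the behaviour shown is that of current JDKs rather than necessarily every historical release.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class ModifiedUtf8Demo {
    // Bytes produced by DataOutputStream.writeUTF, including its
    // two-byte big-endian length prefix.
    public static byte[] viaWriteUTF(String s) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        }
        return buf.toByteArray();
    }

    // Bytes produced by the regular UTF-8 charset encoder.
    public static byte[] viaCharset(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // True if a UTF-8 decoder configured to report errors accepts the bytes.
    public static boolean strictDecoderAccepts(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02X ", b & 0xFF));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        String nul = "\u0000";
        System.out.println("writeUTF:       " + hex(viaWriteUTF(nul)));
        System.out.println("UTF-8 charset:  " + hex(viaCharset(nul)));
        System.out.println("decoder accepts C0 80? "
                + strictDecoderAccepts(new byte[] {(byte) 0xC0, (byte) 0x80}));
    }
}
```

On a current JDK, writeUTF should give 00 02 C0 80 for a single NUL (length prefix, then the overlong pair), the charset encoder should give just 00, and the reporting decoder should refuse C0 80.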
-- 
John Cowan   http://www.ccil.org/~cowan   cowan@ccil.org
Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
        -- Coleridge / Politzer