From: Jill Ramonsky (Jill.Ramonsky@Aculab.com)
Date: Tue Nov 04 2003 - 09:37:00 EST
Hi,
What is a conforming application supposed to do if, when decoding a
UTF-8 stream (or indeed a UTF-32 stream, etc.), it encounters a sequence
of bytes which decodes to the code points U+D800, U+DF00?
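
To make the case concrete, here is a minimal Python sketch of the byte sequence in question. It only shows how one particular decoder (CPython's built-in "utf-8" codec) behaves, and is not a ruling on what conformance requires. The names SURROGATE_BYTES, decode_strict, decode_lenient and join_surrogates are made up for this illustration; "surrogatepass" is a standard Python error handler.

```python
# ED A0 80 is the three-byte UTF-8-style form of U+D800,
# ED BC 80 is the three-byte UTF-8-style form of U+DF00.
SURROGATE_BYTES = b"\xed\xa0\x80\xed\xbc\x80"


def decode_strict(data: bytes):
    """Return the decoded text, or None if the strict decoder rejects it."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


def decode_lenient(data: bytes) -> str:
    """Decode, letting surrogate code points through unchanged."""
    return data.decode("utf-8", errors="surrogatepass")


def join_surrogates(text: str) -> str:
    """Reinterpret each high/low surrogate pair as one supplementary code point.

    Hypothetical repair step: round-trips through UTF-16 code units.
    Raises UnicodeDecodeError if the text contains an unpaired surrogate.
    """
    return text.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")


if __name__ == "__main__":
    print(decode_strict(SURROGATE_BYTES))            # strict decoder refuses
    lenient = decode_lenient(SURROGATE_BYTES)
    print([hex(ord(c)) for c in lenient])            # two surrogate code points
    print(hex(ord(join_surrogates(lenient))))        # one code point after joining
```

Whether a decoder should reject the input outright (as the strict path does here) or quietly join the pair is exactly the question being asked; the sketch only shows that both behaviours are easy to implement.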
Of course, if such a sequence were encountered during UTF-16 processing
the answer would be pretty obvious, but I'm not talking about UTF-16 any
more. At least, not directly. Nonetheless, such a sequence could arise if
Application A encodes text to a file using UTF-16, which is then read by
Application B (a very old, legacy application, unaware of the existence
of codepoints above U+FFFF) and re-saved in UTF-8.
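
The scenario above can be sketched in Python as follows. This is an assumption about how such a legacy application might behave (treating every 16-bit unit as a complete code point), not a description of any real program. The function legacy_resave_as_utf8 and the variable names are hypothetical, and the input is assumed to be little-endian UTF-16 with no byte order mark and an even number of bytes.

```python
import struct


def legacy_resave_as_utf8(utf16le: bytes) -> bytes:
    """Mimic Application B: read 16-bit units, write each one out as UTF-8.

    Each unit is treated as a code point in its own right, so the two
    halves of a surrogate pair are encoded separately (three bytes each).
    """
    count = len(utf16le) // 2
    units = struct.unpack(f"<{count}H", utf16le)
    text = "".join(chr(unit) for unit in units)
    # "surrogatepass" lets Python emit the surrogate code points as-is,
    # which is what a surrogate-unaware encoder would do.
    return text.encode("utf-8", errors="surrogatepass")


# Application A writes one supplementary character, U+10300, as UTF-16.
original = "\U00010300"
file_from_a = original.encode("utf-16-le")

# Application B re-saves the file without knowing about surrogates.
file_from_b = legacy_resave_as_utf8(file_from_a)

if __name__ == "__main__":
    print(file_from_a.hex())               # UTF-16 units D800 DF00, little-endian
    print(file_from_b.hex())               # six bytes: two three-byte sequences
    print(original.encode("utf-8").hex())  # the four-byte form a modern encoder writes
```

The six-byte result is the sequence that decodes to U+D800, U+DF00, whereas a surrogate-aware encoder would have written the four-byte form of U+10300 instead.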
This question generalises to ... should /all/ encoding schemes treat
surrogate pairs as surrogate pairs, or just UTF-16?
This question generalises further still, to ... do the phrases
"surrogate character" and "surrogate pair" have any meaning whatsoever
outside UTF-16?
Jill