Markus Kuhn wrote:
> There is however a simple way out of this:
>
> The C library could implement the mbtowc() UTF-8 decoder, such that it
> *NEVER* returns -1 to signal that it encountered a malformed sequence.
> It could by convention just treat every malformed (and overlong) UTF-8
> sequence just like a valid encoding of the REPLACEMENT CHARACTER.
This is almost exactly what the Plan 9 implementation does, except that it uses
a different character, on the grounds that an encoding error is not the same as
an unrepresentable character (the higher-level recovery strategy, if any,
is different). The implementers' specific choice was the (basically)
unused control character U+0080.
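
A minimal sketch of the convention discussed above, in C, for anyone who
wants to see it concretely. It is not the Plan 9 source and not any real
libc's mbtowc(); the names utf8_decode_lenient and DECODE_ERROR are made up
for illustration. DECODE_ERROR is set to 0x80 to mirror the Plan 9 choice
described here; set it to 0xFFFD to get the behaviour Markus proposed. The
sketch also assumes sequences of at most four bytes, a ceiling of U+10FFFF,
and that encoded surrogates count as malformed; none of that is stated in
this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical name. 0x80 mirrors the Plan 9 choice described in the
 * post; use 0xFFFD for the REPLACEMENT CHARACTER variant. */
#define DECODE_ERROR 0x80UL

/*
 * Decode one character from s[0..n-1] (n >= 1) into *out and return the
 * number of bytes consumed. The return value is always >= 1, so there is
 * no -1 error path: malformed, truncated, and overlong sequences all
 * yield DECODE_ERROR.
 */
static size_t utf8_decode_lenient(const unsigned char *s, size_t n,
                                  unsigned long *out)
{
    /* Smallest code point that legitimately needs 2, 3, or 4 bytes. */
    static const unsigned long min_value[5] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long c;
    size_t len, i;

    assert(s != NULL && out != NULL && n > 0);

    if (s[0] < 0x80) {                      /* plain ASCII */
        *out = s[0];
        return 1;
    } else if ((s[0] & 0xE0) == 0xC0) {
        len = 2; c = s[0] & 0x1F;
    } else if ((s[0] & 0xF0) == 0xE0) {
        len = 3; c = s[0] & 0x0F;
    } else if ((s[0] & 0xF8) == 0xF0) {
        len = 4; c = s[0] & 0x07;
    } else {                                /* stray continuation or bad lead byte */
        *out = DECODE_ERROR;
        return 1;
    }

    for (i = 1; i < len; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            /* Truncated or broken sequence: consume only the bytes that
             * were part of it, so decoding resumes at s[i]. */
            *out = DECODE_ERROR;
            return i;
        }
        c = (c << 6) | (s[i] & 0x3F);
    }

    /* Overlong forms, surrogates, and out-of-range values are treated
     * exactly like any other malformed input. */
    if (c < min_value[len] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
        c = DECODE_ERROR;

    *out = c;
    return len;
}
```

Because the function always consumes at least one byte and always stores a
value, a caller can loop over a buffer without a separate error branch. Note
that with DECODE_ERROR set to 0x80, a correctly encoded U+0080 (C2 80) is
indistinguishable from an error in the output, which is presumably acceptable
only because that control character is basically unused.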
-- 
John Cowan   http://www.reutershealth.com   jcowan@reutershealth.com
Schlingt dreifach einen Kreis vom dies / Schliess eurer Aug vor heiliger Schau
Den er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
            -- Coleridge (tr. Politzer)
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:54 EDT