Markus Kuhn wrote on 1999-10-29 11:19 UTC:
> It is actually a shame that when C's mbtowc() discovers a malformed UTF-8
> sequence, it cannot signal back how long this bad sequence is. For instance,
> I find it nicer to treat a UTF-8 sequence with the last byte missing as a
> single malformed sequence, not as a sequence of unexpected bytes. This
> is also how I understood the ISO 10646-1 UTF-8 definition text and what
> xterm implements.
> <http://www.cl.cam.ac.uk/~mgk25/ucs/ISO-10646-UTF-8.html>
I have done some reading in the ISO C standard and Am. 1, and concluded
that the API (mbtowc, mbstowcs, mbrtowc, mbsrtowcs, etc.) does not
provide any facility with which the UTF-8 decoder can signal how long a
single malformed UTF-8 sequence in the sense of ISO 10646-1 section R.7
is. This means that a C program using mbtowc() or the like will always
have to treat a 4-byte UTF-8 sequence with the last byte missing just
like three separate 1-byte malformed sequences, as opposed to a single
malformed sequence.
The only choices that a C program has when it encounters a malformed
multi-byte sequence are the following:
a) Advance the string pointer until the first valid character is decoded
again. This causes any run of consecutive malformed sequences to be
treated as a single malformed sequence.
b) Advance the string pointer by one. This causes every single byte of a
malformed sequence to be treated as a full malformed sequence.
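The difference between the two choices can be sketched in C. This is only an illustration under stated assumptions: strict_len() is a hypothetical, simplified stand-in for mbtowc() in a UTF-8 locale (it is not a library function, handles only sequences of up to 4 bytes, and skips the overlong check), and the two counting functions are invented names that report how many malformed sequences a caller would see under each strategy.

```c
#include <stddef.h>

/* Hypothetical stand-in for mbtowc() in a UTF-8 locale: returns the
 * length of a well-formed sequence at s (n >= 1 bytes available), or
 * -1 if it is malformed. Like mbtowc(), it gives no hint about how
 * long the bad sequence is. Overlong forms are not checked here. */
static int strict_len(const unsigned char *s, size_t n)
{
    int need, i;

    if (s[0] < 0x80) return 1;
    else if ((s[0] & 0xE0) == 0xC0) need = 1;
    else if ((s[0] & 0xF0) == 0xE0) need = 2;
    else if ((s[0] & 0xF8) == 0xF0) need = 3;
    else return -1;                     /* stray continuation or invalid lead byte */

    if ((size_t)need >= n) return -1;   /* truncated at end of buffer */
    for (i = 1; i <= need; i++)
        if ((s[i] & 0xC0) != 0x80) return -1;
    return need + 1;
}

/* Choice (a): skip ahead until something decodes again. A whole run
 * of bad bytes is counted as one malformed sequence. */
size_t count_errors_resync(const unsigned char *s, size_t n)
{
    size_t errors = 0, i = 0;

    while (i < n) {
        int len = strict_len(s + i, n - i);
        if (len > 0) { i += (size_t)len; continue; }
        errors++;
        do { i++; } while (i < n && strict_len(s + i, n - i) < 0);
    }
    return errors;
}

/* Choice (b): advance by one byte per failure. Every bad byte is
 * counted as its own malformed sequence. */
size_t count_errors_bytewise(const unsigned char *s, size_t n)
{
    size_t errors = 0, i = 0;

    while (i < n) {
        int len = strict_len(s + i, n - i);
        if (len > 0) i += (size_t)len;
        else { errors++; i++; }
    }
    return errors;
}
```

For the input E2 82 E2 82 41 (two 3-byte sequences, each with its last byte missing, followed by "A"), choice (a) reports one malformed sequence and choice (b) reports four. Neither gives the answer of two that one would expect from treating each truncated sequence as a single malformed sequence.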
There is however a simple way out of this:
The C library could implement the mbtowc() UTF-8 decoder such that it
*NEVER* returns -1 to signal that it encountered a malformed sequence.
It could by convention treat every malformed (and overlong) UTF-8
sequence just like a valid encoding of the REPLACEMENT CHARACTER
(U+FFFD). I like the idea of having one error condition less to worry
about, and it would ensure that UTF-8 decoded wide character strings
show exactly what xterm also displays. For a malformed UTF-8 sequence
decoded as U+FFFD, mbtowc() can return the length of the sequence and
thereby jump over the rest of a single malformed sequence, just like
xterm does.
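A minimal sketch of such a decoder follows. It is not taken from any C library or from xterm: utf8_decode_lenient() and UTF8_REPLACEMENT are hypothetical names, the interface takes an explicit byte count instead of following the mbtowc() signature, and the sketch assumes UTF-8 restricted to sequences of at most 4 bytes, with surrogates and values above U+10FFFF also mapped to U+FFFD.

```c
#include <stddef.h>
#include <stdint.h>

#define UTF8_REPLACEMENT 0xFFFDu

/* Hypothetical lenient decoder: decodes one character from s, where
 * n >= 1 bytes are available. It never fails: it always stores a code
 * point in *out and returns the number of bytes consumed (>= 1).
 * A malformed or overlong sequence is consumed as a whole and
 * reported as U+FFFD. */
size_t utf8_decode_lenient(const unsigned char *s, size_t n, uint32_t *out)
{
    size_t need, i = 1;   /* need = continuation bytes expected */
    uint32_t cp, min;     /* min  = smallest value that needs this length */

    if (s[0] < 0x80) { *out = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { need = 1; cp = s[0] & 0x1F; min = 0x80; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 2; cp = s[0] & 0x0F; min = 0x800; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 3; cp = s[0] & 0x07; min = 0x10000; }
    else { *out = UTF8_REPLACEMENT; return 1; }  /* unexpected byte */

    /* Consume continuation bytes; stop early if one is missing. */
    while (i <= need && i < n && (s[i] & 0xC0) == 0x80) {
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
        i++;
    }

    if (i <= need                           /* sequence cut short */
        || cp < min                         /* overlong encoding */
        || cp > 0x10FFFF                    /* outside the assumed range */
        || (cp >= 0xD800 && cp <= 0xDFFF))  /* surrogate */
        *out = UTF8_REPLACEMENT;
    else
        *out = cp;
    return i;
}
```

With this interface, the bytes F0 90 80 followed by "A" decode as one U+FFFD with a returned length of 3, then "A", so the caller steps over the truncated sequence in a single call and has no error branch to handle.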
Highly robust handling of UTF-8 is much trickier than one might think at
first, but with hindsight it can be made quite easy by the authors of
libraries.
What do you think?
Markus
-- Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK Email: mkuhn at acm.org, WWW: <http://www.cl.cam.ac.uk/~mgk25/>