Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
[.]
|- in UTF-8, you'll need to look backward between 1 and 3 positions before
|your start position to find the leading 8-bit code unit (>= 0xC0).
|
|In both cases you have to check the value found. If you don't find it, in
|the limited range of positions, the input is not valid UTF-8 or UTF-16 and
|you have to handle an encoding error exception in the input stream.
|
|The Unicode standard does not specify how you'll handle this error situation
|or from where you'll be able to resync the stream, or even if you should
|resync from some further position; this is application-dependent. If the
«Unicode Security Considerations» [1] gives hints on how defective
byte sequences should or could be handled (in «3.6.1 Illegal Input
Byte Sequences»). This talks about conversion, but should be
applicable everywhere.
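
For illustration, the backward search described in the quoted text
could look roughly like the following Python sketch. The function
name find_lead_byte and the choice to raise ValueError are my own
assumptions, not anything taken from the standard or from TR36, and
the sketch does not perform full UTF-8 validation (it does not
reject overlong forms or surrogates, for example).

```python
def find_lead_byte(buf: bytes, pos: int) -> int:
    """Return the index of the code unit that starts the UTF-8 sequence
    containing buf[pos]. Hypothetical helper, for illustration only.

    Raises ValueError if no suitable start is found within 3 positions
    before pos, i.e. the input is not valid UTF-8 around pos.
    """
    if not 0 <= pos < len(buf):
        raise IndexError("pos out of range")

    for back in range(4):            # look at pos, pos-1, pos-2, pos-3
        i = pos - back
        if i < 0:
            break                    # ran off the start of the buffer
        b = buf[i]
        if b & 0xC0 == 0x80:
            continue                 # continuation byte, keep looking
        # Not a continuation byte: work out the declared sequence length.
        if b < 0x80:
            length = 1               # ASCII
        elif b < 0xE0:
            length = 2
        elif b < 0xF0:
            length = 3
        elif b < 0xF8:
            length = 4
        else:
            break                    # 0xF8..0xFF never occur in UTF-8
        if back < length:
            return i                 # the sequence does cover pos
        break                        # sequence ends before pos: stray byte

    raise ValueError("no valid UTF-8 sequence start near position %d" % pos)
```

What the caller does on ValueError (abort, substitute U+FFFD, skip
forward to the next non-continuation byte) is exactly the
application-dependent part.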
[1] <http://www.unicode.org/reports/tr36/>
--steffen
Received on Wed Aug 28 2013 - 04:38:14 CDT