Re: What to backup after corruption of code units? from Steffen on 2013-08-28 (Unicode Mail List Archive)

From: Steffen <sdaoden_at_gmail.com>
Date: Wed, 28 Aug 2013 11:35:00 +0200

Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
[.]
|- in UTF-8, you'll need to look backward 1 to 3 positions from your
|start position to find the leading 8-bit code unit (>= 0xC0).
|
|In both cases you have to check the value found. If you don't find it
|within that limited range of positions, the input is not valid UTF-8 or
|UTF-16 and you have to handle an encoding error in the input stream.
|
|The Unicode standard does not specify how you'll handle this error situation
|or from where you'll be able to resync the stream, or even if you should
|resync from some further position; this is application-dependent. If the

«Unicode Security Considerations» [1] gives hints on how defective
byte sequences should or could be handled (in «3.6.1 Illegal Input
Byte Sequences»). That section is written about encoding conversion,
but the advice should be applicable to any code that has to deal with
malformed input.

[1] <http://www.unicode.org/reports/tr36/>
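 
To make the backward scan from the quoted text concrete, here is a
rough sketch in Python. The function name utf8_sequence_start is made
up for this mail, and the check is deliberately shallow: it only
looks at lead byte ranges and sequence length, and does not reject
every ill-formed sequence (for example overlong three-byte forms).
The last lines show what Python's own decoder does with a defective
sequence when asked to substitute U+FFFD, which as far as i can tell
is in the spirit of [1], though i have not checked it against every
case described there.

```python
def utf8_sequence_start(buf: bytes, pos: int) -> int:
    """Return the index of the first byte of the sequence holding buf[pos].

    Raises ValueError if no plausible lead byte is found within three
    positions before pos (hypothetical helper, shallow validation only).
    """
    if not 0 <= pos < len(buf):
        raise IndexError("pos out of range")

    # Step back over continuation bytes (10xxxxxx), at most 3 positions.
    i = pos
    while i > 0 and pos - i < 3 and (buf[i] & 0xC0) == 0x80:
        i -= 1

    lead = buf[i]
    if lead < 0x80:
        length = 1              # ASCII, stands alone
    elif 0xC2 <= lead <= 0xDF:
        length = 2
    elif 0xE0 <= lead <= 0xEF:
        length = 3
    elif 0xF0 <= lead <= 0xF4:
        length = 4
    else:
        # Still a continuation byte, or a byte that never starts a sequence.
        raise ValueError("no lead byte found before position %d" % pos)

    # The lead byte must announce a sequence long enough to cover pos.
    if pos - i >= length:
        raise ValueError("position %d is not covered by a lead byte" % pos)
    return i


data = "a\u20acb".encode("utf-8")      # b'a\xe2\x82\xacb'
start = utf8_sequence_start(data, 3)   # 3 is the last byte of U+20AC
print(start)                           # 1

# Built-in decoder: the stray lead byte 0xC2 becomes U+FFFD and the
# following '"' is kept instead of being swallowed.
repaired = b'a\xc2"b'.decode("utf-8", errors="replace")
print(repaired == 'a\ufffd"b')         # True
```

Whether to raise, substitute U+FFFD or skip ahead after the error
is the application-dependent part; the sketch raises so that the
caller has to decide.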

--steffen
Received on Wed Aug 28 2013 - 04:38:14 CDT
