RE: What to backup after corruption of code units? from Doug Ewell on 2013-08-28 (Unicode Mail List Archive)

From: Doug Ewell <doug_at_ewellic.org>
Date: Wed, 28 Aug 2013 19:04:45 -0600

I'd take whichever approach gave me the approved behavior for detecting and isolating invalid sequences. Unicode specifies that the minimal invalid sequence be handled, and AFAICT the handling should be the same for <41 C0 42> as it is for <41 80 42>.
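
As a rough illustration only (a sketch that leans on Python's built-in UTF-8 decoder with replacement of invalid bytes; the helper name is made up for this example and nothing here is taken from the standard's text), both sequences come out the same way:

```python
def decode_with_replacement(data: bytes) -> str:
    # Hypothetical helper: decode as UTF-8 and substitute U+FFFD for
    # each invalid sequence instead of raising an exception.
    return data.decode("utf-8", errors="replace")

# <41 C0 42>: 0xC0 can never appear in well-formed UTF-8.
lone_c0 = decode_with_replacement(bytes([0x41, 0xC0, 0x42]))
# <41 80 42>: a continuation byte with no lead byte before it.
lone_80 = decode_with_replacement(bytes([0x41, 0x80, 0x42]))

# In both cases only the single offending byte is isolated, and the
# surrounding 'A' and 'B' survive.
print(lone_c0 == lone_80)
```

Other decoders may group invalid bytes differently, so treat this as one possible behavior rather than the required one.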

--
Doug Ewell | Thornton, CO, USA
http://ewellic.org | @DougEwell
-----Original Message-----
From: "Asmus Freytag" <asmusf_at_ix.netcom.com>
Sent: 8/28/2013 18:52
To: "Doug Ewell" <doug_at_ewellic.org>
Cc: "Ian Clifton" <ian.clifton_at_chem.ox.ac.uk>; "Unicode discussion" <unicode_at_unicode.org>
Subject: Re: What to backup after corruption of code units?
On 8/28/2013 5:19 PM, Doug Ewell wrote:
> Actually 0xC2, according to the rules of UTF-8.
Hmm. What you are referring to is that 0xC0 and 0xC1 don't occur because
of the requirement for minimal length encoding. However, a check for >= 0xC0
will give the correct result for backing up, assuming the data is valid
UTF-8 (or at least locally valid).
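
A minimal sketch of that backing-up step, assuming the buffer is at least locally valid UTF-8 (the function name and the three-step limit are illustrative choices, not taken from any library):

```python
def back_up_to_boundary(buf: bytes, pos: int) -> int:
    """Return the index of the code unit that starts the sequence
    containing buf[pos], looking back at most 3 positions."""
    start = pos
    steps = 0
    # Continuation bytes are 0x80..0xBF. Anything else is either an
    # ASCII byte (< 0x80) or passes the ">= 0xC0" lead-byte check.
    while start > 0 and steps < 3 and 0x80 <= buf[start] <= 0xBF:
        start -= 1
        steps += 1
    return start

data = "a\u00e9\u20ac\U0001F600".encode("utf-8")  # 1-, 2-, 3- and 4-byte sequences
print(back_up_to_boundary(data, 9))  # last byte of the 4-byte sequence
```

Note that this check also stops on 0xC0 or 0xC1, so the boundary is found first and the minimal-length violation has to be caught by a separate validation pass.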
In terms of boundary determination, would you take violating the rule 
about minimal length encoding as evidence for corrupted data, or would 
you first detect the boundary, then decide that a sequence starting with 
0xC0 is in violation?
A./
>
> --
> Doug Ewell | Thornton, CO, USA
> http://ewellic.org | @DougEwell
> ------------------------------------------------------------------------
> From: Ian Clifton <mailto:ian.clifton_at_chem.ox.ac.uk>
> Sent: 8/28/2013 17:34
> To: Unicode discussion <mailto:unicode_at_unicode.org>
> Subject: Re: What to backup after corruption of code units?
>
> On 28/08/13 23:29, Xue Fuqiao wrote:
> > I see.  Thanks for all your replies!
> >
> > BTW I have a further question:
> >
> > On Wed, Aug 28, 2013 at 1:44 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
> >> - in UTF-8, you'll need to look backward between 1 to 3 positions before
> >> your start position to find the leading 8-bit code unit (>= 0xC0).
> > Why should this be >=0xC0?
> >
>
> Because the header (lead) byte of a well‐formed multi-byte UTF-8 sequence
> must start with at least two 1 bits; numerically, the smallest such byte
> is 16#C0# (0xC0).
>
> -- 
> Ian ◎
>
>
>

Received on Wed Aug 28 2013 - 20:06:33 CDT
