Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)
asmusf at ix.netcom.com
Wed Jun 4 14:52:02 CDT 2014
On 6/4/2014 12:21 PM, Richard Wordingham wrote:
> On Wed, 04 Jun 2014 11:40:11 -0700
> Asmus Freytag <asmusf at ix.netcom.com> wrote:
>> On 6/4/2014 11:26 AM, Doug Ewell wrote:
>>> I meant U+FEFF as a zero-width no-break space. Obviously it is very
>>> common to see U+FEFF as a signature or BOM.
>> The semantics of it were chosen at the time to make no sense
>> at the start, and to make the character invisible in most situations.
>> The remnant of its semantics was later taken up by Word Joiner, so that
>> there is now NO use for this as part of text.
>> The use as part of a convention has always been clear. If you stick
>> this at the front, readers will byte-reverse your data; that should
>> weed out accidental use pretty quickly :) Or prevent people from
>> getting "cute" with it in other ways.
> Wrong! If you stick U+FEFF at the start of a file, expect it to be
> stripped. If you stick U+FFFE at the start of a file, then expect to
> see the rest of the text byte-reversed.
Duh. (reminder, have coffee first)
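For anyone who wants to see the difference between the two cases, here is a minimal sketch. It assumes Python's generic "utf-16" codec as a stand-in for a reader that sniffs the signature; the helper names write_be and sniffing_read are made up for illustration, and other readers may behave differently.

```python
# Illustration only: Python's generic "utf-16" codec stands in for a
# signature-sniffing reader. Helper names are hypothetical.

def write_be(text: str) -> bytes:
    """Serialize text as big-endian UTF-16 without adding a signature."""
    return text.encode("utf-16-be")

def sniffing_read(data: bytes) -> str:
    """Decode the way a reader that inspects the first two bytes would."""
    return data.decode("utf-16")

# Leading U+FEFF: bytes FE FF, taken as a big-endian signature and stripped.
with_feff = sniffing_read(write_be("\ufeffhi"))

# Leading U+FFFE: bytes FF FE, taken as a little-endian signature, so every
# following code unit is read with its two bytes swapped.
with_fffe = sniffing_read(write_be("\ufffehi"))

print(repr(with_feff))                   # 'hi'
print([hex(ord(c)) for c in with_fffe])  # ['0x6800', '0x6900']
```

In the second case "h" (0068) and "i" (0069) come back as U+6800 and U+6900, which is the byte-reversal Richard describes.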
>> So, I would think that for this particular code point, you can safely
>> assume that it's buggy or test data.
> The example that's usually given is that of a text file sliced into
> segments to avoid file size limits. In these cases, there is the risk
> that U+FEFF as ZWNBSP will wind up at the start of a segment and be
> stripped. The solution using the Windows command window is to perform a
> *binary* concatenation of the segments; if one doesn't, newlines will
> be inserted between the segments, which is far more severe damage.
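The segment scenario can be sketched as follows. This is an assumption-laden illustration, not a description of any particular tool: Python's "utf-16" codec plays the reader, split_bytes is a made-up helper, and the segment size is chosen so that the ZWNBSP lands exactly at the start of the second segment.

```python
import codecs

def split_bytes(data: bytes, size: int) -> list[bytes]:
    """Slice a byte string into fixed-size segments (hypothetical helper)."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = "ab\ufeffcd"  # U+FEFF used as ZWNBSP in the middle of the text
encoded = codecs.BOM_UTF16_BE + original.encode("utf-16-be")

# 6-byte segments: FE FF 00 61 00 62 | FE FF 00 63 00 64
segments = split_bytes(encoded, 6)

# Decoding each segment as its own file: the U+FEFF that opens segment 2
# is mistaken for a signature and stripped.
per_segment = "".join(seg.decode("utf-16") for seg in segments)

# Binary concatenation first, then one decode: only the real signature
# at the very start is removed, and the ZWNBSP survives.
binary_joined = b"".join(segments).decode("utf-16")

print(per_segment == original)    # False, the text is now "abcd"
print(binary_joined == original)  # True
```

Joining the raw bytes before decoding corresponds to the binary concatenation mentioned above; nothing between the segments is interpreted or altered.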