Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

Doug Ewell doug at
Fri Jun 6 11:15:23 CDT 2014

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

>> If you have an arbitrary fragment of data, don't fiddle with it.
> This is your scenario. The simple concept of a unique "start" of text
> does not exist in live streams that can start anywhere. So you cannot
> always expect that U+FEFF or U+FFFE will only exist once in a stream
> and necessarily at the start of position where you can start reading
> it because you may already be past the initial creation of the stream
> without having any way to come back to the "start".

An "arbitrary fragment of data" -- I'm going to keep using the exact
same phrase until it sinks in -- DOES have a start and an end. THAT is
my scenario.

> Your assumption just assumes that you can always "rewind" your file,

My assumption assumes no such thing.

> Now you will argue: this live stream is not plain text, it has a
> binary structure.

Well, yes.

> Yes but only if your consumer application wants to process the full
> multiplex. Typically clients will demultiplex the stream and pass it
> down to a simpler client that absolutely does not care about the
> transport multiplex format. If that downward client is just used to
> display the incoming text, it will just wait for text that will be
> buffered line by line and displayed immediately where there's a newline
> separator. But even in this case, each line may have been fragmented
> so that each fragment will contain a leading BOM which will not be
> necessarily stripped

Question: Why did the process that broke the stream into fragments add
leading BOMs?
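
To make the question concrete, here is a minimal Python sketch, assuming a
fragmenter that only slices the already-encoded bytes. The helper names
(split_into_fragments, decode_fragments) and the fragment size are
hypothetical and do not come from this thread. Under that assumption, only
the first fragment carries the BOM, and a consumer that decodes the
fragments in order interprets it once.

```python
import codecs

BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

def split_into_fragments(data: bytes, size: int) -> list[bytes]:
    """Cut an encoded byte stream into consecutive slices without altering it."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def decode_fragments(fragments: list[bytes]) -> str:
    """Decode fragments in order; the BOM is interpreted once, at the true start."""
    decoder = codecs.getincrementaldecoder("utf-16")()
    parts = [decoder.decode(fragment) for fragment in fragments]
    parts.append(decoder.decode(b"", final=True))  # flush any buffered bytes
    return "".join(parts)

text = "first line\nsecond line\n"
encoded = text.encode("utf-16")               # the encoder writes one BOM, at the start
fragments = split_into_fragments(encoded, 7)  # odd size, so some code units are split

# Count the fragments that begin with a UTF-16 BOM in either byte order.
with_bom = [f for f in fragments if f.startswith(BOMS)]
print(len(with_bom))
print(decode_fragments(fragments) == text)
```

With this input the script should print 1 and then True. Fragments begin
with a BOM only if the fragmenting process deliberately prepends one, which
is a design decision of that process and not something slicing does on its
own.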

> (you have also incorrectly asuumed that a text stream is necessarily
> transported over a "reliable" protocol like TCP where there can be no
> data loss in the middle

Really. I think you have incorrectly asuumed my asuumption.

> Texts are inherently fragmentable. Initially they are transcripts of
> human communication and nobody in real life is permanently connected
> to someone else and able to remember everything that was said by
> someone else.

OK, I think we are far enough removed from Unicode to end this.

Doug Ewell | Thornton, CO, USA | @DougEwell
