Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored? from Costello, Roger L. via Unicode on 2017-07-24 (Unicode Mail List Archive)

From: Costello, Roger L. via Unicode <unicode_at_unicode.org>
Date: Mon, 24 Jul 2017 14:39:40 +0000

Hello Unicode Experts!

Suppose an application splits a UTF-8 multi-octet sequence. The application then sends the split sequence to a client. The client must restore the original sequence.

Question: is it possible to split a UTF-8 multi-octet sequence in such a way that the client cannot unambiguously restore the original sequence?

Here is the source of my question:

The iCalendar specification [RFC 5545] says that long lines should be folded:

        Long content lines SHOULD be split
        into a multiple line representations
        using a line "folding" technique.
        That is, a long line can be split between
        any two characters by inserting a CRLF
        immediately followed by a single linear
        white-space character (i.e., SPACE or HTAB).
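
To make the folding rule concrete, here is a minimal sketch of the kind of "very simple" folder that works on octets and ignores character boundaries. The function name fold_naive and the parameter limit are hypothetical, chosen only for illustration. The default of 75 octets is the line length that RFC 5545 recommends not exceeding.

```python
def fold_naive(line: bytes, limit: int = 75) -> bytes:
    """Fold one content line at fixed octet offsets.

    Illustrative only: it counts octets and does not check whether a cut
    falls inside a UTF-8 multi-octet sequence.
    """
    if limit < 2:
        raise ValueError("limit must leave room for the leading SPACE")
    chunks = [line[:limit]]
    pos = limit
    # Each continuation line starts with one SPACE, so it can carry
    # only limit - 1 octets of content.
    while pos < len(line):
        chunks.append(line[pos:pos + limit - 1])
        pos += limit - 1
    # A fold is CRLF immediately followed by a single SPACE.
    return b"\r\n ".join(chunks)
```

With a line such as ("SUMMARY:" + "\u00e9" * 40).encode("utf-8"), the first cut lands between the two octets of an é, which is the situation the RFC note warns about.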

The RFC says that, when parsing a content line, folded lines must first be unfolded using this technique:

        Unfolding is accomplished by removing
        the CRLF and the linear white-space
        character that immediately follows.
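
The unfolding rule can be sketched directly on the raw octets. This is a minimal illustration, not a full iCalendar parser, and the name unfold is hypothetical. It removes every CRLF that is immediately followed by a SPACE or HTAB, together with that one white-space character.

```python
import re

# CRLF followed by exactly one linear white-space octet (SPACE or HTAB).
_FOLD = re.compile(rb"\r\n[ \t]")

def unfold(data: bytes) -> bytes:
    """Remove each fold marker in a single pass over the octets."""
    return _FOLD.sub(b"", data)

folded = (b"DESCRIPTION:This is a lo\r\n"
          b" ng description\r\n"
          b"  that exists on a long line.")
print(unfold(folded).decode("utf-8"))
```

Only one white-space character is removed per fold, so the second space on the last physical line survives as the space between "description" and "that".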

The RFC acknowledges that simple implementations might generate improperly folded lines:

        Note: It is possible for very simple
        implementations to generate improperly
        folded lines in the middle of a UTF-8
        multi-octet sequence. For this reason,
        implementations need to unfold lines
        in such a way to properly restore the
        original sequence.
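
As an illustration of the note above, the sketch below folds in the middle of the three-octet sequence for the euro sign (E2 82 AC) and compares two orders of operation. The helper names are hypothetical. It covers this one case only and assumes the receiver has access to the raw octets before any decoding takes place.

```python
import re

_FOLD = re.compile(rb"\r\n[ \t]")

euro = "\u20ac".encode("utf-8")  # three octets: E2 82 AC
# Improper fold: the CRLF + SPACE lands after the first octet of the sequence.
folded = b"SUMMARY:Price 5 " + euro[:1] + b"\r\n " + euro[1:]

def unfold_then_decode(data: bytes) -> str:
    """Remove the fold markers at the octet level, then decode as UTF-8."""
    return _FOLD.sub(b"", data).decode("utf-8")

def decode_lines_first(data: bytes) -> str:
    """Decode each physical line on its own; fails on a split sequence."""
    return "".join(part.decode("utf-8") for part in data.split(b"\r\n"))

print(unfold_then_decode(folded))

try:
    decode_lines_first(folded)
except UnicodeDecodeError as exc:
    print("decoding before unfolding failed:", exc.reason)
```

In this example the original text comes back intact when unfolding happens on the octets first, because the octets 0D, 0A, 20 and 09 are all below 0x80 and so never occur inside a UTF-8 multi-octet sequence. Decoding each physical line separately raises an error instead.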

Can you provide an example of folding a UTF-8 multi-octet sequence such that there is no unambiguous way to restore the original sequence?

/Roger
Received on Mon Jul 24 2017 - 09:40:03 CDT
