Re: Roundtripping in Unicode

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 11 2004 - 17:19:17 CST

    My view of this "problem" of roundtripping is that if data that is supposed
    to contain only valid UTF-8 sequences contains some invalid byte sequences,
    and those bytes still need to be mapped to some internal "code point" that
    can later be converted back to the original invalid byte sequence, then
    these invalid bytes MUST NOT be converted to valid code points.

    An implementation based on an internal UTF-32 code unit representation
    could use, for private purposes only, the range which is NOT assigned to
    valid Unicode code points; such an application would need to convert these
    bytes into values higher than 0x10FFFF. But the same application will no
    longer conform to strict UTF-32 requirements: it will represent this way
    binary data which is NOT bound to Unicode rules and which can't be valid
    plain-text.
    For example, {0xFF0000+n} where n is the byte value to encapsulate. Don't
    call it "UTF-32", because it MUST remain for private use only!
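
    The {0xFF0000+n} idea can be sketched in Python as follows. This is only
    an illustration under stated assumptions: the function names and the
    PRIVATE_BASE constant are hypothetical, and Python's "surrogateescape"
    error handler is used merely as a convenient way to locate the invalid
    bytes.

```python
from typing import List

PRIVATE_BASE = 0xFF0000  # offset taken from the example above; beyond 0x10FFFF

def decode_private32(data: bytes) -> List[int]:
    """Map bytes to 32-bit units; each invalid byte n becomes PRIVATE_BASE + n."""
    units = []
    # surrogateescape turns every undecodable byte n into the lone surrogate
    # U+DC00+n, which strict UTF-8 decoding can never produce otherwise.
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            units.append(PRIVATE_BASE + (cp - 0xDC00))
        else:
            units.append(cp)
    return units

def encode_private32(units: List[int]) -> bytes:
    """Inverse mapping: private units give back the raw byte they carry."""
    out = bytearray()
    for u in units:
        if PRIVATE_BASE <= u <= PRIVATE_BASE + 0xFF:
            out.append(u - PRIVATE_BASE)
        else:
            out += chr(u).encode("utf-8")
    return bytes(out)
```

    Bytes read from outside survive decode_private32 followed by
    encode_private32 unchanged; the resulting list of integers is a private
    structure, not UTF-32.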

    This will be more complex if the application uses UTF-16 code units, because
    there are only TWO code units that can be used to recognize such
    invalid-text data within a text stream. It is possible to do that, but with
    MUCH care:
    For example, by encoding 0xFFFE before each byte value converted to some
    16-bit code unit. The problem is that backward parsing of strings just
    checks whether a code unit is a low surrogate, to see if a second backward
    step is needed to reach the leading high surrogate. So U+FFFE would need to
    be used (privately only) as another leading high surrogate with a special
    (internal) meaning for roundtrip compatibility, and the best choice for the
    code unit encoding the invalid byte value would be a standard low surrogate
    storing this byte. So a qualifying internal representation would be
    {0xFFFE, 0xDC00+n} where n is the byte value to encapsulate.
    Don't call this "UTF-16", because it is not UTF-16.
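
    A possible sketch of the {0xFFFE, 0xDC00+n} representation, again in
    Python. The names to_units16 and from_units16 are made up for the
    illustration, and the decoder assumes that surrogate pairs in its input
    are well formed.

```python
from typing import List

def to_units16(data: bytes) -> List[int]:
    """Bytes -> 16-bit units; each invalid byte n becomes 0xFFFE, 0xDC00+n."""
    units = []
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:          # escaped invalid byte
            n = cp - 0xDC00
            units += [0xFFFE, 0xDC00 + n]
        elif cp >= 0x10000:                 # ordinary surrogate pair
            v = cp - 0x10000
            units += [0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)]
        else:
            units.append(cp)
    return units

def from_units16(units: List[int]) -> bytes:
    """Inverse mapping; 0xFFFE acts as a private lead unit before a low surrogate."""
    out = bytearray()
    i = 0
    while i < len(units):
        u = units[i]
        nxt = units[i + 1] if i + 1 < len(units) else None
        if u == 0xFFFE and nxt is not None and 0xDC00 <= nxt <= 0xDCFF:
            out.append(nxt - 0xDC00)
            i += 2
        elif 0xD800 <= u <= 0xDBFF:         # assumed to be followed by a low surrogate
            cp = 0x10000 + ((u - 0xD800) << 10) + (nxt - 0xDC00)
            out += chr(cp).encode("utf-8")
            i += 2
        else:
            out += chr(u).encode("utf-8")
            i += 1
    return bytes(out)
```

    Because every escaped byte is stored in a low surrogate preceded by
    0xFFFE, code that steps backward over a low surrogate always finds a lead
    unit in front of it.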

    An implementation that uses UTF-8 for valid strings could use the invalid
    ranges for lead bytes to encapsulate invalid byte values. Note however that
    the invalid bytes you would need to represent have 256 possible values, but
    UTF-8 has only 2 reserved lead byte values (0xC0 and 0xC1), each covering 64
    codes, if you want to use a two-byte encoding. The alternative would be
    to use the UTF-8 lead byte values which were initially assigned to byte
    sequences longer than 4 bytes, and that are now unassigned/invalid in
    standard UTF-8. For example: {0xF8+(n/64); 0x80+(n%64)}.
    Here also it will be a private encoding, that should NOT be named UTF-8, and
    the application should clearly document that it will not only accept any
    valid Unicode string, but also some invalid data which will have some
    roundtrip compatibility.
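
    The {0xF8+(n/64); 0x80+(n%64)} scheme might look like this in Python,
    with n/64 read as integer division. The function names are hypothetical,
    and unwrap_private8 assumes its input was produced by wrap_private8, so
    that the bytes 0xF8..0xFB occur only as private lead bytes.

```python
def wrap_private8(data: bytes) -> bytes:
    """Copy valid UTF-8 as is; replace each invalid byte n by a private pair."""
    out = bytearray()
    for ch in data.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:          # escaped invalid byte
            n = cp - 0xDC00
            out += bytes([0xF8 + n // 64, 0x80 + n % 64])
        else:
            out += ch.encode("utf-8")
    return bytes(out)

def unwrap_private8(data: bytes) -> bytes:
    """Inverse mapping: 0xF8..0xFB never occur in valid UTF-8, so they mark a pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        b = data[i]
        if 0xF8 <= b <= 0xFB:
            out.append((b - 0xF8) * 64 + (data[i + 1] - 0x80))
            i += 2
        else:
            out.append(b)
            i += 1
    return bytes(out)
```

    The output of wrap_private8 is deliberately not valid UTF-8 whenever the
    input contained an invalid byte, so it should not be handed to components
    that expect strict UTF-8.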

    So what is the problem? Suppose that the application, internally, starts to
    generate strings containing occurrences of such private sequences. Then
    it will be possible for the application to generate on its output a byte
    stream that does NOT have roundtrip compatibility back to the private
    representation. Roundtripping would only be guaranteed for streams
    converted FROM UTF-8 input in which some invalid sequences are present and
    must be preserved by the internal representation. So the transformation is
    not bijective as you would think, and this potentially creates lots of
    security issues.
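
    A small Python demonstration of this asymmetry, using the {0xFF0000+n}
    units as an example. The helper names are hypothetical and the code is
    only a sketch: two private units built internally happen to carry the
    bytes of a valid UTF-8 sequence, so the output decodes to something else.

```python
def decode(data: bytes) -> list:
    # Invalid bytes become 0xFF0000 + n; everything else is a code point.
    return [0xFF0000 + ord(c) - 0xDC00 if 0xDC80 <= ord(c) <= 0xDCFF else ord(c)
            for c in data.decode("utf-8", errors="surrogateescape")]

def encode(units: list) -> bytes:
    # Private units give back their byte; code points are written as UTF-8.
    return b"".join(bytes([u - 0xFF0000]) if 0xFF0000 <= u <= 0xFF00FF
                    else chr(u).encode("utf-8") for u in units)

# Bytes coming from outside roundtrip as expected.
external = b"x\xffy"
assert encode(decode(external)) == external

# Units generated internally do not: 0xC3 0xA9 is valid UTF-8 for U+00E9.
internal = [0xFF00C3, 0xFF00A9]
written = encode(internal)
reread = decode(written)
assert written == b"\xc3\xa9"
assert reread == [0xE9]
assert reread != internal
```

    Two different internal strings can therefore produce the same output
    bytes, which is exactly the kind of collision that comparisons and
    security checks have to worry about.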

    So for such an application, it would be much more appropriate to use
    different datatypes and structures to represent either streams of binary
    bytes or streams of characters, and to recognize them independently. The
    need for a bijective representation means that the input stream must
    contain an encapsulation that tells *exactly* whether the stream is text or
    binary.
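
    One way to keep the two kinds of data apart is a pair of distinct types,
    sketched here in Python. TextName, BinaryName, classify and to_bytes are
    hypothetical names chosen for the illustration.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class TextName:
    value: str      # strictly valid text

@dataclass(frozen=True)
class BinaryName:
    value: bytes    # opaque bytes, never interpreted as characters

Name = Union[TextName, BinaryName]

def classify(raw: bytes) -> Name:
    """Tag the input once, at the boundary, as either text or binary."""
    try:
        return TextName(raw.decode("utf-8"))
    except UnicodeDecodeError:
        return BinaryName(raw)

def to_bytes(name: Name) -> bytes:
    """Exact inverse of classify, because the tag says how to serialize."""
    if isinstance(name, TextName):
        return name.value.encode("utf-8")
    return name.value
```

    The tag exists only inside the application here; storing or transmitting
    it is the encapsulation that the outer format would have to provide.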

    If the application is a filesystem storing filenames and there's no place in
    the filesystem to record whether a filename is binary or text, then you are
    left without any secure solution!

    So the best thing you can do to secure your application is to REJECT/IGNORE
    all files whose names do not match the strict UTF-8 encoding rules that your
    application expects. Everything will happen as if those files were not
    present, but this may still create security problems if an application that
    does not see any file in a directory wants to delete that directory,
    assuming it is empty. In that case the application must be ready to accept
    the presence of directories without any content, and must not depend on the
    presence of a directory to determine that it has some contents. Anyway, on
    secured filesystems such things could happen due to access restrictions
    completely unrelated to the encoding of filenames, and it is not
    unreasonable to prepare the application to behave correctly when faced with
    inaccessible files or directories, so that it will also correctly handle a
    filesystem that contains both non-plain-text and inaccessible filenames.
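
    The REJECT/IGNORE policy can be sketched as a simple filter in Python.
    The function names are hypothetical; os.listdir is given a bytes path so
    that it returns the raw bytes of each name.

```python
import os
from typing import Iterable, List

def strict_utf8_names(raw_names: Iterable[bytes]) -> List[str]:
    """Keep only names that decode as strict UTF-8; silently skip the others."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode("utf-8"))   # strict error handling by default
        except UnicodeDecodeError:
            continue                            # treated as if the file were absent
    return names

def list_text_names(path: str) -> List[str]:
    """List a directory, ignoring entries whose names are not valid UTF-8."""
    return strict_utf8_names(os.listdir(os.fsencode(path)))
```

    An empty result from list_text_names does not prove that the directory is
    empty, which is the pitfall described above.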

    Anyway, the solutions described above demonstrate that there's absolutely
    NO need to reserve in Unicode some range to represent pure binary data, even
    for roundtripping in situations like the one you describe. Instead,
    programmers must clearly study the impact of using invalid byte sequences
    and how they can be tracked throughout the application. The various Unicode
    encoding forms leave enough space to allow such implementations, but Unicode
    will NOT assign code points for binary data which doesn't have character
    semantics, because such assignments would become characters valid in
    plain-text!

    So why all these discussions? Because there are various interpretations
    about what is or is not "plain-text". As soon as a system imports some
    plain-text specification but applies some restrictions on it, then it is NOT
    plain-text. A filename in a filesystem is NOT plain text because it is only
    a restricted subset of what plain-text can represent. For Unicode and for
    ISO/IEC 10646, a plain-text data is ANY sequence of characters in the valid
    Unicode range (U+0000 to U+10FFFD minus the surrogates and all
    non-characters), in ANY order, or with ANY size. Plain-text also gives a few
    mandatory interpretations for some characters, notably the end-of-line
    characters (CR, LF, NL, LS, PS...), but no other interpretation for any
    other characters, and no limitation on line-lengths (measured in whatever
    unit such as bytes, code units, characters, combining sequences, grapheme
    clusters, ems, millimeters/picas, pixels, percentage of a container
    width...): plain text is just an ordered list of line records, containing
    valid characters (including control characters, not to be confused with
    control bytes), and optionally terminated by end-of-line character(s).
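
    The code point range given in this definition can be expressed as a small
    predicate. This is a sketch in Python with hypothetical function names;
    the noncharacter test covers U+FDD0..U+FDEF and the last two code points
    of every plane.

```python
def is_surrogate(cp: int) -> bool:
    return 0xD800 <= cp <= 0xDFFF

def is_noncharacter(cp: int) -> bool:
    # U+FDD0..U+FDEF, plus U+xxFFFE and U+xxFFFF in each of the 17 planes
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def is_plain_text_code_point(cp: int) -> bool:
    """True for code points allowed in plain text by the definition above."""
    return (0 <= cp <= 0x10FFFF
            and not is_surrogate(cp)
            and not is_noncharacter(cp))
```

    With this predicate U+10FFFD is the highest accepted value, which matches
    the upper bound quoted in the paragraph.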

    For more strict definitions of plain-text, you need to create a
    specification, and make sure that this specification comes first in the
    encapsulation encoding/decoding; if this encapsulation allows any plain-text
    to be represented using some escaping mechanism, this mechanism MUST be a
    mandatory part of the protocol specification. This is where most of the
    legacy filesystems have been failing: their specification is incomplete, or
    is simply wrong when they say that filenames are plain-text. In fact, all
    filesystems have restrictions on valid filenames, because they also need
    filenames to be encapsulated into other text protocols, or even in text
    files that have other restrictions (for example within shell commands, or in
    URLs, or on single lines), and they don't want these filenames to be
    complicated to specify in these external applications (for example,
    encapsulating a filename within a URI using the "tricky" URI-escaping
    mechanism). But I do think this is a bad argument, made only for lazy
    programmers, who often don't use the mandatory parts of these
    specifications that document how encapsulation can be safely performed.
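
    As an illustration of an escaping mechanism that is part of a
    specification, URI percent-encoding can carry arbitrary filename bytes
    and give them back exactly. The sketch below uses Python's urllib.parse;
    the wrapper names escape_name and unescape_name are hypothetical.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

def escape_name(raw: bytes) -> str:
    """Percent-encode every byte outside the unreserved set, including '/'."""
    return quote_from_bytes(raw, safe="")

def unescape_name(escaped: str) -> bytes:
    """Recover the original bytes from the percent-encoded form."""
    return unquote_to_bytes(escaped)

# A name containing an invalid UTF-8 byte, a space and a slash.
raw = b"caf\xe9 a/b"
escaped = escape_name(raw)
assert escaped == "caf%E9%20a%2Fb"
assert unescape_name(escaped) == raw
```

    The escaped form is pure ASCII text, so it can be embedded in a URI while
    the decision about whether the bytes are text is left to the receiver.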

    Notably, the concept of filenames is a legacy and badly designed concept,
    inherited from times when storage space was very limited, and the designers
    wanted to create a compact (but often cryptic) representation.

    The concept of filenames combines too many independent things:
    - a unique moniker used to reference files, and allowing the creation of
    links with possible security restrictions.
    - a summary identification of the content type (with file extensions present
    on most filesystems, including Unix as a nearly universal but unreliable
    convention).
    - sometimes a version identifier or number (on VMS devices, or on
    CDFS/ISO9660), for archival purpose.
    - sometimes a data channel identifier (on filesystems that support multiple
    data streams with independent datatypes and storage, for the same file, such
    as NTFS streams, and MacOS data/resource forks); however this concept is
    quite similar to the concept of hierarchical folders that can be used as
    valid resources with default contents.
    - a description of the content (which is meta-data), but in a form so
    truncated that it must nearly always be interpreted relative to the
    description of the directory in which the filename is stored.

    ---
    Lars Kristan wrote:
    > > Furthermore, I was proposing this concept to be used, but not
    > > unconditionally. So, you can, possibly even should, keep using
    > > whatever you are using.
    >
    > So you prefer to make programs misbehave in unpredictable ways
    > (when they pass the data from a component which uses relaxed rules
    > to a component which uses strict rules) rather than have a clear and
    > unambiguous notion of a valid UTF-8?
    I am not particularly thrilled about it. In fact it should be discussed.
    Constructively. Simply assuming everything will break is not helpful. But if 
    you want an answer, yes, I would go for it. Actually, there are fewer 
    concerns involved than people think. Security is definitely an issue. But 
    again, one shouldn't assume it breaks just like that. Let me risk a bold 
    statement: security is typically implicitly centralized. And if comparison 
    is always done in the same UTF, it won't break. A simple fact that two 
    different UTF-16 strings compare equal in UTF-8 (after relaxed conversion), 
    does not introduce a security issue. Today, two invalid UTF-8 strings 
    compare the same in UTF-16, after a valid conversion (using a single 
    replacement char, U+FFFD) and they compare different in their original form, 
    if you use strcmp. But you probably don't. Either you do everything in 
    UTF-8, or everything in UTF-16. Not always, but typically. If comparisons 
    are not always done in the same UTF, then you need to validate. And not 
    validate while converting, but validate on its own. And now many designers 
    will remember that they didn't. So, all UTF-8 programs (of that kind) will 
    need to be fixed. Well, might as well adopt my broken conversion and fix all 
    UTF-16 programs. Again, of that kind, not all in general, so there are few. 
    And even those would not be all affected. It would depend on which 
    conversion is used where. Things could be worked out. Even if we would start 
    changing all the conversions. Even more so if a new conversion is added and 
    only used when specifically requested.
    There is cost and there are risks. Nothing should be done hastily. But let's 
    go back and ask ourselves what are the benefits. And evaluate the whole.
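
    The comparison behaviour mentioned above (two invalid UTF-8 strings that
    differ as bytes but become equal after conversion with U+FFFD) can be
    reproduced with a short Python sketch. The function name compare_both_ways
    and the sample byte strings are made up for the illustration.

```python
def compare_both_ways(a: bytes, b: bytes):
    """Return (equal as raw bytes, equal after lossy conversion to UTF-16)."""
    # errors="replace" substitutes U+FFFD for each undecodable byte.
    a16 = a.decode("utf-8", errors="replace").encode("utf-16-le")
    b16 = b.decode("utf-8", errors="replace").encode("utf-16-le")
    return (a == b, a16 == b16)

# Two names that differ only in one invalid byte.
result = compare_both_ways(b"file\xff", b"file\xfe")
assert result == (False, True)
```

    A byte-wise comparison such as strcmp sees two names, while a comparison
    made after the lossy conversion sees one.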
    >
    > > Perhaps I can convert mine, but I cannot convert all filenames on
    > > a user's system.
    >
    > Then you can't access his files.
    Yes, this is where it all started. I cannot afford not to access the files. 
    I am not writing a notepad.
    >
    > With your proposal you couldn't as well, because you don't make them
    > valid unconditionally. Some programs would access them and some would
    > break, and it's not clear what should be fixed: programs or filenames.
    It is important to have a way to write programs that can. And, there is 
    definitely nothing to be fixed about the filenames. They are there and 
    nobody will bother to change them. It is the programs that need to be fixed. 
    And if Unicode needs to be fixed to allow that, then that is what is 
    supposed to happen. Eventually.
    Lars 
    

