From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Dec 11 2004 - 17:19:17 CST
Subject: RE: Roundtripping in Unicode

My view about this "problem" of roundtripping is the following: if data
that is supposed to contain only valid UTF-8 sequences in fact contains
some invalid byte sequences, and those bytes still need to be mapped to
some internal "code point" so that the original invalid byte sequence can
be recovered later, then these invalid bytes MUST NOT be converted to
valid code points.
An implementation based on an internal UTF-32 code-unit representation
could use, privately only, the range that is NOT assigned to valid Unicode
code points; such an application would convert these bytes into code
points higher than 0x10FFFF. But the application would then no longer
conform to strict UTF-32 requirements: it would be representing binary
data which is NOT bound to Unicode rules and which cannot be valid
plain text.
For example, {0xFF0000+n}, where n is the byte value to encapsulate. Don't
call it "UTF-32", because it MUST remain for private use only!
This will be more complex if the application uses UTF-16 code units,
because only TWO code units are available to flag such invalid-text data
within a text stream. It is possible to do it, but with MUCH care:
for example, by encoding 0xFFFE before each byte value converted to some
16-bit code unit. The problem is that backward parsing of strings just
checks whether a code unit is a low surrogate to decide whether a second
backward step is needed to reach the leading high surrogate; so U+FFFE
would need to be used (privately only) as another lead "high surrogate"
with a special internal meaning for roundtrip compatibility, and the best
choice for the code unit carrying the invalid byte value is a standard
low surrogate. A qualifying internal representation would thus be
{0xFFFE, 0xDC00+n}, where n is the byte value to encapsulate.
Don't call this "UTF-16", because it is not UTF-16.
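A sketch of that scheme (invented names; strings modeled as lists of
16-bit code units), showing both the encapsulation and how a scanner can
take the stream apart again:

    MARKER = 0xFFFE  # private lead code unit, never valid in plain text

    def encapsulate_byte(n: int) -> list[int]:
        # The trailing unit is a standard low surrogate, so a backward
        # parser that sees it steps back once more and finds the marker.
        return [MARKER, 0xDC00 + n]

    def decapsulate(units: list[int]) -> list:
        """Split a code-unit stream into text (str) and raw bytes (int)."""
        out, i = [], 0
        while i < len(units):
            u = units[i]
            if (u == MARKER and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDCFF):
                out.append(units[i + 1] - 0xDC00)  # recovered raw byte
                i += 2
            elif (0xD800 <= u <= 0xDBFF and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDFFF):
                # ordinary surrogate pair -> one supplementary code point
                out.append(chr(0x10000 + ((u - 0xD800) << 10)
                               + (units[i + 1] - 0xDC00)))
                i += 2
            else:
                out.append(chr(u))
                i += 1
        return out

    assert decapsulate([0x0041] + encapsulate_byte(0xFF)) == ['A', 0xFF]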
An implementation that uses UTF-8 for valid strings could use the invalid
lead-byte ranges to encapsulate invalid byte values. Note however that the
invalid bytes you would need to represent have 256 possible values, while
UTF-8 has only two reserved lead bytes (0xC0 and 0xC1), each covering 64
codes, if you want an encoding on two bytes. The alternative is to use the
UTF-8 lead byte values that were initially assigned to byte sequences
longer than 4 bytes, and that are now unassigned/invalid in standard
UTF-8. For example: {0xF8+(n/64), 0x80+(n%64)}.
Here also it will be a private encoding that should NOT be named UTF-8,
and the application should clearly document that it will accept not only
any valid Unicode string, but also some invalid data with roundtrip
compatibility.
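That two-byte escape is trivial to implement; a small sketch (function
names invented) using the now-invalid 5-byte lead range 0xF8..0xFB:

    def escape_invalid_byte(n: int) -> bytes:
        """Encapsulate one invalid byte as {0xF8 + n//64, 0x80 + n%64}."""
        assert 0 <= n <= 0xFF
        return bytes([0xF8 + (n >> 6), 0x80 + (n & 0x3F)])

    def unescape_pair(lead: int, trail: int) -> int:
        """Recover the original byte from a private {lead, trail} pair."""
        assert 0xF8 <= lead <= 0xFB and 0x80 <= trail <= 0xBF
        return ((lead - 0xF8) << 6) | (trail & 0x3F)

    # All 256 byte values round-trip through the private pair:
    assert all(unescape_pair(*escape_invalid_byte(n)) == n for n in range(256))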
So what is the problem? Suppose that the application internally starts to
generate strings containing occurrences of such private sequences; it then
becomes possible for the application to emit on its output a byte stream
that does NOT have roundtrip compatibility back to the private
representation. Roundtripping is only guaranteed for streams converted
FROM a UTF-8 source in which some invalid sequences are present and must
be preserved by the internal representation. So the transformation is not
as bijective as you might think, and this potentially creates lots of
possible security issues.
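To make the hazard concrete, here is a worked example (names and values
invented) under the UTF-16-style scheme sketched earlier: an internally
generated escape pair for a byte that happens to be valid UTF-8 collides
on output with the plain character, so serialization is not injective and
the round trip fails.

    def serialize(units: list[int]) -> bytes:
        out, i = bytearray(), 0
        while i < len(units):
            if (units[i] == 0xFFFE and i + 1 < len(units)
                    and 0xDC00 <= units[i + 1] <= 0xDCFF):
                out.append(units[i + 1] - 0xDC00)  # emit the raw byte
                i += 2
            else:
                out += chr(units[i]).encode('utf-8')
                i += 1
        return bytes(out)

    plain   = [0x0041]                  # the genuine text "A"
    spoofed = [0xFFFE, 0xDC00 + 0x41]   # private escape pair for 0x41

    assert serialize(plain) == serialize(spoofed) == b'A'
    # b'A' is valid UTF-8, so any decoder maps it back to [0x0041] only:
    # the spoofed string does not round-trip, and two distinct internal
    # strings have become indistinguishable on output.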
So for such an application, it would be much more appropriate to use
different datatypes and structures to represent streams of binary bytes
and streams of characters, and to recognize them independently. The need
for a bijective representation means that the input stream must carry an
encapsulation that indicates *exactly* whether the stream is text or
binary.
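A minimal sketch of that separation (the types and the one-byte tags 'T'
and 'B' are invented for this illustration): distinct datatypes, with an
explicit tag in the serialized form so the receiver knows exactly which
one it is reading.

    from dataclasses import dataclass

    @dataclass
    class TextRecord:
        value: str       # always a valid Unicode string

    @dataclass
    class BinaryRecord:
        value: bytes     # arbitrary bytes, never interpreted as text

    def dump(record) -> bytes:
        if isinstance(record, TextRecord):
            return b'T' + record.value.encode('utf-8')  # tag + strict UTF-8
        return b'B' + record.value                      # tag + raw bytes

    def load(data: bytes):
        tag, payload = data[:1], data[1:]
        if tag == b'T':
            return TextRecord(payload.decode('utf-8'))  # strict: raises on junk
        if tag == b'B':
            return BinaryRecord(payload)
        raise ValueError('unknown tag')

    assert load(dump(BinaryRecord(b'\xff\xfe'))) == BinaryRecord(b'\xff\xfe')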
If the application is a filesystem storing filenames, and there's no place
in the filesystem to record whether a filename is binary or text, then you
are left without any secure solution!
So the best thing you can do to secure your application is to
REJECT/IGNORE all files whose names do not match the strict UTF-8 encoding
rules that your application expects: everything happens as if those files
were not present. This may still create security problems: an application
that sees no files in a directory may want to delete that directory,
assuming it is empty. In that case the application must be ready to accept
the presence of directories without any visible content, and must not
depend on the presence of a directory to conclude that it has accessible
contents. Anyway, on secured filesystems such things can already happen
due to access restrictions completely unrelated to the encoding of
filenames, so it is not unreasonable to prepare the application to behave
correctly when faced with inaccessible files or directories; it will then
also correctly handle a filesystem containing filenames that are not
plain text and therefore inaccessible.
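A sketch of this reject/ignore policy for a POSIX system, where listing a
directory by a bytes path returns the raw byte filenames (the function
name is invented):

    import os

    def visible_entries(path: bytes) -> list[str]:
        names = []
        for raw in os.listdir(path):
            try:
                names.append(raw.decode('utf-8', errors='strict'))
            except UnicodeDecodeError:
                continue  # invalid name: behave as if the file were absent
        return names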
Anyway, the solutions exposed above demonstrate that there's absolutely NO
need to reserve some range in Unicode to represent pure binary data, even
for roundtripping in situations like the one you expose. Instead,
programmers must clearly study the impact of using invalid byte sequences
and how they can be tracked throughout the application. The various
Unicode encoding forms leave enough space to allow such implementations,
but Unicode will NOT assign code points to binary data that has no
character semantics, because the assigned code points would then become
characters valid in plain text!
So why all these discussions? Because there are various interpretations of
what is or is not "plain text". As soon as a system imports a plain-text
specification but applies some restrictions to it, it is no longer
handling plain text. A filename in a filesystem is NOT plain text, because
it is only a restricted subset of what plain text can represent. For
Unicode and for ISO/IEC 10646, plain-text data is ANY sequence of
characters in the valid Unicode range (U+0000 to U+10FFFD, minus the
surrogates and all noncharacters), in ANY order and of ANY length. Plain
text also mandates a few interpretations for some characters, notably the
end-of-line characters (CR, LF, NL, LS, PS...), but no other
interpretation for any other character, and no limitation on line lengths
(measured in whatever unit: bytes, code units, characters, combining
sequences, grapheme clusters, ems, millimeters/picas, pixels, percentage
of a container width...). Plain text is just an ordered list of line
records containing valid characters (including control characters, not to
be confused with control bytes), each optionally terminated by end-of-line
character(s).
For stricter definitions of plain text, you need to create a
specification, and make sure that this specification comes first in the
encapsulation encoding/decoding; if the encapsulation allows any plain
text to be represented through some escaping mechanism, that mechanism
MUST be a mandatory part of the protocol specification. This is where most
legacy filesystems have been failing: their specification is incomplete,
or is simply wrong when it says that filenames are plain text. In fact,
all filesystems place restrictions on valid filenames, because filenames
also need to be encapsulated into other text protocols, or even into text
files that carry their own restrictions (for example within shell
commands, in URLs, or on single lines), and nobody wants these filenames
to be complicated to specify in those external applications (for example,
encapsulating a filename within a URI using the "tricky" URI-escaping
mechanism). But I do think this is a bad argument, made only for lazy
programmers, who often don't use the mandatory parts of these
specifications, which document how encapsulation can be performed safely.
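For the URI case, the "tricky" escaping is actually a one-liner with a
standard library; for example (a sketch, with an invented filename whose
0xE9 byte is not valid UTF-8 on its own):

    from urllib.parse import quote, unquote_to_bytes

    raw_name = b'r\xc3\xa9sum\xe9.txt'
    escaped = quote(raw_name, safe='')             # 'r%C3%A9sum%E9.txt'
    assert unquote_to_bytes(escaped) == raw_name   # lossless round trip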
Notably, the concept of filenames is a legacy and badly designed one,
inherited from times when storage space was very limited and designers
wanted a compact (but often cryptic) representation.
The concept of filenames combines too many independent things:
- a unique moniker used to reference files, allowing the creation of
links, with possible security restrictions;
- a summary identification of the content type (with file extensions,
present on most filesystems, including Unix, as a nearly universal but
unreliable convention);
- sometimes a version identifier or number (on VMS devices, or on
CDFS/ISO9660), for archival purposes;
- sometimes a data channel identifier (on filesystems that support
multiple data streams with independent datatypes and storage for the same
file, such as NTFS streams and MacOS data/resource forks); however, this
concept is quite similar to that of hierarchical folders usable as valid
resources with default contents;
- a description of the content (which is metadata), but in a form so
truncated that it must nearly always be interpreted relative to the
description of the directory in which the filename is stored.
--- Lars Kristan wrote:

> > Furthermore, I was proposing this concept to be used, but not
> > unconditionally. So, you can, possibly even should, keep using
> > whatever you are using.
>
> So you prefer to make programs misbehave in unpredictable ways
> (when they pass the data from a component which uses relaxed rules
> to a component which uses strict rules) rather than have a clear and
> unambiguous notion of a valid UTF-8?

I am not particularly thrilled about it. In fact it should be discussed.
Constructively. Simply assuming everything will break is not helpful. But
if you want an answer, yes, I would go for it.

Actually, there are fewer concerns involved than people think. Security is
definitely an issue. But again, one shouldn't assume it breaks just like
that. Let me risk a bold statement: security is typically implicitly
centralized. And if comparison is always done in the same UTF, it won't
break. The simple fact that two different UTF-16 strings compare equal in
UTF-8 (after a relaxed conversion) does not introduce a security issue.
Today, two invalid UTF-8 strings compare the same in UTF-16, after a valid
conversion (using a single replacement char, U+FFFD), and they compare
different in their original form, if you use strcmp. But you probably
don't. Either you do everything in UTF-8, or everything in UTF-16. Not
always, but typically.

If comparisons are not always done in the same UTF, then you need to
validate. And not validate while converting, but validate on its own. And
now many designers will remember that they didn't. So, all UTF-8 programs
(of that kind) will need to be fixed. Well, might as well adopt my broken
conversion and fix all UTF-16 programs. Again, of that kind, not all in
general, so there are few. And even those would not all be affected. It
would depend on which conversion is used where. Things could be worked
out. Even if we would start changing all the conversions. Even more so if
a new conversion is added and only used when specifically requested.

There is cost and there are risks. Nothing should be done hastily. But
let's go back and ask ourselves what the benefits are. And evaluate the
whole.

> > Perhaps I can convert mine, but I cannot convert all filenames on
> > a user's system.
>
> Then you can't access his files.

Yes, this is where it all started. I cannot afford not to access the
files. I am not writing a notepad.

> With your proposal you couldn't as well, because you don't make them
> valid unconditionally. Some programs would access them and some would
> break, and it's not clear what should be fixed: programs or filenames.

It is important to have a way to write programs that can. And there is
definitely nothing to be fixed about the filenames. They are there and
nobody will bother to change them. It is the programs that need to be
fixed. And if Unicode needs to be fixed to allow that, then that is what
is supposed to happen. Eventually.

Lars
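The U+FFFD collision Lars describes is easy to reproduce (a two-line
sketch; the byte strings are invented examples): two different invalid
UTF-8 inputs become equal after a valid but lossy conversion.

    a, b = b'abc\xff', b'abc\xfe'
    assert a != b
    assert (a.decode('utf-8', errors='replace')
            == b.decode('utf-8', errors='replace') == 'abc\ufffd')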
This archive was generated by hypermail 2.1.5 : Sat Dec 11 2004 - 17:20:57 CST