From: Philippe Verdy (email@example.com)
Date: Sun Nov 27 2005 - 11:16:45 CST
From: "Marcin 'Qrczak' Kowalczyk" <firstname.lastname@example.org>
> - When converting from Unicode to byte strings, only U+0000 followed
> by another U+0000 or by a character between U+0080 and U+00FF gets
> converted. U+0000 followed by any other character is an error.
> - This encoding has the properties that any byte string converted
> to Unicode will yield back to the same byte string, that any valid
> UTF-8 byte string not containing 0x00 will be converted to the
> same Unicode string as with UTF-8, and that any Unicode string not
> containing U+0000 will be converted to the same byte string as with
If you want to keep compatibility with null-terminated byte strings, maybe
the alternative of using true noncharacter code points might help. So I
would use something like U+FFFE followed by a code point in U+0080..U+00FF.
But this means that the *valid* UTF-8 encoded byte string that represents
U+FFFE would have to be escaped when converted to a string of code points,
and it may still happen that a valid string of code points containing U+FFFE
needs to be stored in a byte string to create filenames.
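
A minimal sketch of such an escape scheme, in Python, may make the
trade-off concrete. The names ESCAPE, decode_name and encode_name are
hypothetical, and escaping the encoded marker byte by byte is only one
possible choice, assumed here for illustration:

```python
ESCAPE = "\ufffe"  # noncharacter used as the escape marker (assumption)


def decode_name(raw: bytes) -> str:
    """Convert any byte string to code points; never fails."""
    out = []
    i = 0
    while i < len(raw):
        ch = None
        # Try to decode exactly one strict UTF-8 sequence starting at i.
        for length in (1, 2, 3, 4):
            try:
                candidate = raw[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                continue
            if len(candidate) == 1:
                ch = candidate
                break
        if ch is None or ch == ESCAPE:
            # Invalid byte, or a byte of the encoded marker itself:
            # emit the marker followed by the byte value as a code point.
            out.append(ESCAPE + chr(raw[i]))
            i += 1
        else:
            out.append(ch)
            i += length
    return "".join(out)


def encode_name(text: str) -> bytes:
    """Convert code points back to bytes; rejects a dangling marker."""
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == ESCAPE:
            nxt = text[i + 1] if i + 1 < len(text) else ""
            if not nxt or not "\u0080" <= nxt <= "\u00ff":
                raise ValueError("U+FFFE must be followed by U+0080..U+00FF")
            out.append(ord(nxt))
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)
```

With this sketch every byte string survives the round trip through code
points, but the reverse is not true: the code point string U+FFFE U+00C3
U+FFFE U+00A9 encodes to the bytes C3 A9, which decode back to the single
character U+00E9. Two different code point strings therefore name the same
file.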
Really, you cannot reach a full bijection for those cases: as soon as you
know that a string of valid code points can contain any occurrence of U+0000
or U+FFFE (internally stored as 16-bit or 32-bit code units or even with
UTF-8, it does not matter here), you're working in a gray area where your
algorithm is not working only with characters. We are speaking there about
code point strings, which are effectively a superset of Unicode character
strings.
So the good question for designing any API is to ask whether it has to handle
only characters, or code points. Regarding the Unix filesystem APIs, it is
clear that they do not work at the character level (as Windows does), but at
the byte stream level, which defines its own superset of the code point
string level. If you realize that this byte stream level is a superset,
there's simply no way to create a bijection even with the code point string
level. You're definitely in a gray area where uniqueness cannot be guaranteed.
And yes this creates a security risk as soon as you perform a conversion
from code point strings to byte streams, i.e. when trying to access the
filesystem from a valid code point string. The only way to avoid such risk
is to restrict the access to the filesystem by only allowing code point
strings that are valid character strings.
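
As a rough illustration of that restriction, a program could refuse to
convert anything that is not a plain character string before it reaches the
filesystem. This is only a sketch: the function names are hypothetical, and
treating "valid character string" as "no U+0000, no surrogates, no
noncharacters" is an assumption made here.

```python
def is_character_string(name: str) -> bool:
    """Return True if name contains only code points usable as characters."""
    for ch in name:
        cp = ord(ch)
        if cp == 0:
            return False  # U+0000 cannot appear in a null-terminated name
        if 0xD800 <= cp <= 0xDFFF:
            return False  # lone surrogate, not a character
        if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
            return False  # noncharacter (U+FDD0..U+FDEF, U+xxFFFE, U+xxFFFF)
    return True


def to_filesystem_name(name: str) -> bytes:
    """Encode a name as UTF-8, rejecting anything that is not pure text."""
    if not is_character_string(name):
        raise ValueError("file name is not a valid character string")
    return name.encode("utf-8")
```

Because only valid character strings are accepted, the conversion to bytes
is plain UTF-8 and two different accepted names can never map to the same
byte string.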
This effectively means that users of that interface won't be able to access
every file on the filesystem, and only administrators of that system will
have the tools to interact with it at the byte stream level, to manage the
case of existing filenames with invalid UTF-8 sequences: this could be
performed by tools like "fsck" run by sys-admins on Unix/Linux that will
correct these filenames to enforce this security, by renaming them into
non-conflicting names (possibly with a leading ".#" prefix to "hide" them in
user interfaces, and with an extra numeric extension in case of conflict).
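
The renaming step of such a repair tool could look roughly like the sketch
below. The function name repaired_name is hypothetical, and replacing each
invalid sequence with "_" is an assumption; only the ".#" prefix and the
numeric extension come from the description above.

```python
def repaired_name(raw: bytes, existing: set) -> bytes:
    """Propose a valid UTF-8 replacement for a badly encoded file name.

    raw      -- the name as stored in the directory entry
    existing -- names already present in the same directory
    """
    try:
        raw.decode("utf-8")
        return raw  # already valid UTF-8: leave it alone
    except UnicodeDecodeError:
        pass

    # Replace each undecodable sequence with "_" (illustrative choice).
    text = raw.decode("utf-8", errors="replace").replace("\ufffd", "_")
    base = b".#" + text.encode("utf-8")

    # Append a numeric extension until the name no longer conflicts.
    candidate = base
    n = 0
    while candidate in existing:
        n += 1
        candidate = base + b"." + str(n).encode("ascii")
    return candidate
```

A real tool would perform the rename itself inside the filesystem check, so
that no other program can create a conflicting name between the test and the
rename.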
So I see absolutely no need to add more complexity to programs, and what
Java does looks very valid from this perspective. Personally, I see no
interest in making programs more complex. They should be written to treat filenames
as character strings, not byte strings. This means that APIs that read
directory entries should silently discard and ignore the discovered names
that are incorrectly encoded (not trying to disguise them as these files
won't be openable or deletable under these modified names!), and the API
that attempts to delete what seems to be an empty directory should simply
return an error if there remains a file (all programs should already be
ready to handle such error, because filesystems can be used concurrently by
other users or programs that could link perfectly valid filenames into the
Really, don't try to disguise things the way you do: you add new security
risks instead of mitigating them. It's up to the filesystem or system tools to
ensure that filenames are correctly encoded, as they should be. A program
running in a UTF-8 locale would then encounter no errors, and if it runs from
another locale, it should detect that and already be prepared for the fact
that it won't be able to see and handle all files present on a filesystem.
Don't forget that even with no encoding errors in a filesystem, all programs
should be ready to accept the fact that they won't see all files present in
a filesystem, due to user access restrictions.
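
The directory-reading behaviour argued for above, where badly encoded names
are silently ignored and deleting a non-empty directory simply fails, could
be sketched as follows. The function names are hypothetical; os.listdir,
os.fsencode and os.rmdir are the standard Python calls, and UTF-8 is assumed
to be the expected filename encoding.

```python
import os


def decode_valid_names(raw_names):
    """Keep only the names that are valid UTF-8; drop the others silently."""
    names = []
    for raw in raw_names:
        try:
            names.append(raw.decode("utf-8"))
        except UnicodeDecodeError:
            continue  # not a character string: invisible at this level
    return names


def list_directory(path):
    """List a directory as character strings, hiding badly encoded names."""
    # Passing a bytes path makes os.listdir return the raw byte names.
    return decode_valid_names(os.listdir(os.fsencode(path)))


def remove_directory_if_empty(path):
    """Try to remove a directory; report failure instead of guessing why."""
    try:
        os.rmdir(path)
        return True
    except OSError:
        # Fails, among other reasons, when entries remain, including
        # entries this program cannot see.
        return False
```

A caller that sees an empty listing but gets False from
remove_directory_if_empty is in the same situation as with files hidden by
access restrictions or created concurrently by another program.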