Feedback on the proposal to change U+FFFD generation when decoding ill-formed UTF-8
Alastair Houghton via Unicode
unicode at unicode.org
Tue May 16 10:30:09 CDT 2017
On 16 May 2017, at 14:23, Hans Åberg via Unicode <unicode at unicode.org> wrote:
> You don't. You have a filename, which is an octet sequence of unknown encoding, and want to deal with it. Therefore, valid Unicode transformations of the filename may result in it no longer being reachable.
> It only matters that the correct octet sequence is handed back to the filesystem. All current filesystems, as far as experts could recall, use octet sequences at the lowest level; whatever encoding is used is built in a layer above.
HFS(+), NTFS and VFAT long filenames are all encoded in some variation on UCS-2/UTF-16. FAT 8.3 names are also encoded, but the encoding isn’t specified (more specifically, MS-DOS and Windows assume an encoding based on your locale, which could cause all kinds of fun if you swapped disks with someone from a different country, and IIRC there are some shenanigans for Japan because of the use of 0xe5 as a deleted file marker). There are some less widely used filesystems that require a particular encoding also (BeOS’ BFS used UTF-8, for instance).
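To make the 0xe5 point concrete, here is a rough sketch based on my understanding of the FAT directory entry layout. It is not taken from any real driver, and the helper names are made up for illustration. A first byte of 0xE5 marks a deleted entry, so a name that genuinely starts with 0xE5 (a valid Shift-JIS lead byte) is stored with 0x05 in that position instead:

```python
from typing import Optional

DELETED_MARKER = 0xE5  # first byte of a deleted FAT directory entry
KANJI_ESCAPE = 0x05    # stored in place of a genuine leading 0xE5


def encode_first_byte(name: bytes) -> bytes:
    """Hypothetical helper: prepare a short name for storage on disk."""
    if name and name[0] == DELETED_MARKER:
        # Swap the leading byte so the entry is not mistaken for a deleted one.
        return bytes([KANJI_ESCAPE]) + name[1:]
    return name


def decode_first_byte(raw: bytes) -> Optional[bytes]:
    """Hypothetical helper: return the real name, or None for a deleted entry."""
    if raw and raw[0] == DELETED_MARKER:
        return None
    if raw and raw[0] == KANJI_ESCAPE:
        # Undo the substitution made when the name was written.
        return bytes([DELETED_MARKER]) + raw[1:]
    return raw
```

Note that the substitution happens on raw bytes, before any locale-dependent decoding, which is part of why the on-disk name cannot be treated as text without knowing the code page.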
Also, Mac OS X and iOS use UTF-8 at the BSD layer; if a filesystem is in use whose names can’t be converted to UTF-8, the Darwin kernel uses a percent encoding scheme(!)
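I haven't checked the exact rules the kernel applies, so the following is only a sketch of what a percent-escaping fallback of this general kind could look like, not Darwin's actual algorithm. The function name is hypothetical, and this version does not escape a literal '%', so it is not reversible in general:

```python
def percent_escape_name(raw: bytes) -> str:
    """Hypothetical helper: turn any byte string into valid Unicode text."""
    # surrogateescape maps each undecodable byte 0xNN to the lone surrogate U+DCNN.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            # Recover the original byte value and write it as %XX.
            out.append("%{:02X}".format(cp - 0xDC00))
        else:
            out.append(ch)
    return "".join(out)
```

With this sketch, a Latin-1 name such as b"caf\xe9.txt" would come out as "caf%E9.txt", while a name that is already valid UTF-8 passes through unchanged.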
It looks like Apple has changed its mind for APFS and is going with the “bag of bytes” approach that’s typical of other systems; at least, that’s what it appears to have done on iOS.
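For what it's worth, the practical consequence of the "bag of bytes" model can be sketched in a few lines of Python. This assumes names are treated as UTF-8 where possible and uses Python's surrogateescape error handler to carry the undecodable bytes; the helper names are made up for illustration. The original octets survive a round trip through text, but a perfectly valid Unicode transformation such as NFC normalization produces a different octet sequence, which such a filesystem would treat as a different name:

```python
import unicodedata


def to_text(raw: bytes) -> str:
    """Hypothetical helper: view filename bytes as text without losing any byte."""
    return raw.decode("utf-8", errors="surrogateescape")


def to_bytes(text: str) -> bytes:
    """Hypothetical helper: recover the exact bytes to hand back to the filesystem."""
    return text.encode("utf-8", errors="surrogateescape")


# 'e' followed by U+0301 (combining acute), plus a stray 0xff byte.
raw = b"re\xcc\x81sume\xff"
text = to_text(raw)

# Round trip: the filesystem gets back exactly the octets it gave us.
assert to_bytes(text) == raw

# NFC composes 'e' + U+0301 into U+00E9, so the octets no longer match.
normalized = unicodedata.normalize("NFC", text)
assert to_bytes(normalized) != raw
```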