Re: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 09 2004 - 10:04:13 CST

    From: "Antoine Leca" <Antoine10646@leca-marti.org>
    > Err, not really. MS-DOS *needs to know* the encoding to use, a bit like a
    > *nix application that displays filenames needs to know the encoding to use
    > the correct set of glyphs (but the constraints are much heavier.) Also
    > Windows NT Unicode applications know it, because it can't be changed :-).
    >
    > But when it comes to other Windows applications (still the more common)
    > that happen to operate in 'Ansi' mode, they are subject to the hazard of
    > codepage translations. Even if Windows 'knows' the encoding used for the
    > filesystem (as when it uses NTFS or Joliet, or VFAT on NT kernels; in the
    > other cases it does not even know it, much like with *nix kernels), the
    > only usable set is the _intersection_ of the set used to write and the set
    > used to read; that is, usually, it is restricted to US ASCII, very much
    > like the usable set in *nix cases...

    True, but this applies to FAT-only filesystems, which store filenames in an
    "OEM" charset that is not recorded explicitly on the volume. This is a known
    caveat even for Unix: consider the tricky details of supporting Windows file
    sharing through Samba, when the client requests a file by a "short" 8.3
    name, which a partition used by Windows is supposed to support.

    In fact, this nightmare comes from Windows' support for compatibility with
    legacy DOS applications, which don't know these details and don't use the
    Win32 APIs with Unicode support. Note that DOS applications use an "OEM"
    charset which is part of the user settings, not part of the system settings
    (see the effects of the CHCP command in a DOS command prompt).

    FAT32 and NTFS help reconcile these incompatible charsets because these
    filesystems also store an "LFN" (Long File Name) for the same files (in that
    case the short name, encoded in some ambiguous OEM charset, is just an
    alias, acting exactly like a hard link on Unix created in the same directory
    and referencing the same file). "LFN" names are UTF-16 encoded and support
    mostly the same names as on NTFS volumes.

    However, on FAT32 volumes, the short names are mandatory, unlike on NTFS
    volumes, where short names can be created "on the fly" by the filesystem
    driver, according to the current user settings for the selected OEM charset,
    without being stored explicitly on the volume. Windows contains, in CHKDSK,
    a way to verify that short names on FAT32 filesystems are properly encoded
    with a coherent OEM charset, using the UTF-16 encoded LFN names as a
    reference. If needed, corrections for the OEM charset can be applied...

    This nightmare of incompatible OEM charsets does happen on Windows
    98/98SE/ME when the "autoexec.bat" file, which defines the current user
    profile, does not execute the proper "CHCP" command as it should, or when
    this autoexec.bat file has been modified or erased: in that case, the
    default OEM charset (codepage 437) is used, and short filenames are
    incorrectly encoded.
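
    As a rough illustration of that failure, here is a minimal Python sketch.
    The codepage pair (850 for the intended setting, 437 for the fallback) and
    the sample filename are assumptions chosen for the example, not details
    taken from any particular installation.

```python
# Sketch: a short name written under one OEM codepage and read under another.
# The codepages (cp850 intended, cp437 fallback) and the name are assumptions.
name = "FENÊTRE.TXT"

stored = name.encode("cp850")      # bytes written while CHCP 850 was in effect
misread = stored.decode("cp437")   # same bytes interpreted with codepage 437

print(stored)    # raw bytes as they would sit in the directory entry
print(misread)   # what a host falling back to codepage 437 would display
```

    The bytes on disk are unchanged; only the interpretation differs, which is
    why the damage is invisible until a host with another setting reads them.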

    Another complexity is that Win32 applications that use a fixed (not
    user-settable) "ANSI" charset and don't use the Unicode API depend on
    the conversion from the ANSI charset to the current OEM charset. But if a
    file is handled through directory shares by multiple hosts that have
    distinct ANSI charsets (i.e. Windows hosts running different localizations
    of Windows, such as a US installation and a French version on the same LAN),
    these hosts will create incompatible encodings on the same shared volume.
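
    The ANSI-to-OEM step can be sketched in Python as follows. The helper name
    ansi_to_oem is hypothetical, and the two host configurations (Western
    European cp1252/cp850 and Greek cp1253/cp737) are assumptions picked only
    because their ANSI charsets clearly differ.

```python
def ansi_to_oem(raw: bytes, ansi: str, oem: str) -> bytes:
    """Hypothetical helper: reinterpret ANSI bytes in a host's OEM codepage.

    Characters that the OEM codepage cannot represent become b"?".
    """
    return raw.decode(ansi).encode(oem, errors="replace")

# Bytes produced by a non-Unicode application on a cp1252 host (assumption).
raw = "été.txt".encode("cp1252")

host_a = ansi_to_oem(raw, "cp1252", "cp850")   # Western European host
host_b = ansi_to_oem(raw, "cp1253", "cp737")   # Greek host, same input bytes

print(host_a)
print(host_b)
```

    Both hosts start from identical bytes, yet each would derive a different
    short name, because each decodes with its own ANSI charset first.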

    So the only "stable" subset for short names, one that is not affected by OS
    localization or user settings, is the intersection of all possible ANSI and
    OEM charsets that can be set in all versions of Windows! Needless to say,
    this leaves only the printable ASCII characters for short 8.3 names. Long
    filenames are not affected by this problem.

    Conclusion: to use international characters outside ASCII in filenames used
    by Windows, make sure that the name is not in an 8.3 short format, so
    that a long filename, in UTF-16, will be created on FAT32 filesystems or on
    SMBFS shares (Samba on Unix/Linux, Windows servers)... Or use NTFS (but then
    resolve the interoperability problems with Linux/Unix client hosts, which
    can't reliably access these filesystems for now; NTFS is also not
    completely emulated by the Unix filesystems used by Samba, due to
    limitations of the LanMan sharing protocol, and to limitations of Unix
    filesystems as well, which rarely use UTF-8 as their preferred encoding...)
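
    A quick way to tell whether a name risks being stored only as a short
    name is to test it against the 8.3 pattern. The Python sketch below is a
    deliberately simplified assumption (uppercase ASCII letters, digits and a
    few symbols); the real rules applied by Windows are more involved (case
    handling, reserved device names), and fits_8_3 is a hypothetical helper.

```python
import re

# Simplified assumption about the characters allowed in a short 8.3 name.
_CHARS = r"[A-Z0-9!#$%&'()@^_`{}~-]"
SHORT_83 = re.compile(rf"{_CHARS}{{1,8}}(\.{_CHARS}{{1,3}})?")

def fits_8_3(name: str) -> bool:
    """Return True if the name matches the simplified 8.3 pattern above."""
    return SHORT_83.fullmatch(name) is not None

for candidate in ["README.TXT", "RÉSUMÉ.TXT", "long file name.txt"]:
    print(candidate, fits_8_3(candidate))
```

    Under this sketch, a name that does not fit the pattern is one for which a
    long filename would be expected alongside the short alias.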


