When a Major Software Company, which sells the Well-Known Operating
System that I and a few other people use and develop for, decides to add
character-encoding metadata to the file system of that OS, and when
versions of that file system that support encoding metadata are
widespread enough that I no longer need to target my apps to previous
versions, then I too will consider encoding detection to be a thing of
the past.
--
Doug Ewell | Thornton, Colorado, USA
http://www.ewellic.org | @DougEwell

-----Original Message-----
From: Philippe Verdy
Sent: Friday, October 19, 2012 16:45
To: Doug Ewell
Cc: Stephan Stiller; unicode_at_unicode.org
Subject: Re: texteditors that can process and save in different encodings

2012/10/20 Doug Ewell <doug_at_ewellic.org>:
> Suppose I have a file called 'karenina.txt' on my flash drive. Let's
> assume we can trust from the .txt extension that it really is a text
> file of some sort (that is metadata). Now, what encoding is this file
> in?

Maybe you can't know that; maybe the filesystem still stores that
information (it can do so independently of the given and visible
filename).

> See Stephan's comment again about the editor doing charset
> detection.

I don't like charset detection at all. I'm a strong supporter of
separately stored metadata. It is always possible in every filesystem,
even if it requires a convention for organizing the content of that
filesystem.

> Right, but you talked about "saving them as ASCII (i.e. saving this
> charset information in the metadata)". This is explicit metadata, not
> the implicit type that you're talking about now.

Why? He saves in ASCII because that is what the editor will perform.
There is not necessarily a choice about it: the storage as ASCII will
still occur even if the editor does not itself store that metadata
along with the file content at the same time. (The user may store the
metadata himself, for later processing in other tools or by other
users, to avoid wrong "guesses" -- including the hazardous guesses
performed by automatic charset detectors, which I dislike because
sooner or later they will always fail silently with a wrong guess.)

As a strong supporter of metadata, I believe editors should never
ignore metadata where it is accessible (notably when it is part of the
storage properties and capabilities of the filesystem).
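[The "separately stored metadata" idea can be illustrated with a minimal
Python sketch. The sidecar-file convention below -- a "<name>.charset"
file holding the encoding name -- is hypothetical, invented here for
illustration; it is not an existing standard, only one example of the
kind of convention being described. The point is that the reader never
guesses: it reads the recorded charset back before decoding.]

```python
import os
import tempfile

def save_text(path: str, text: str, charset: str) -> None:
    """Write the file, and record its charset in a sidecar file."""
    with open(path, "wb") as f:
        f.write(text.encode(charset))
    # Hypothetical convention: "<name>.charset" holds the encoding name.
    with open(path + ".charset", "w", encoding="ascii") as meta:
        meta.write(charset)

def load_text(path: str) -> str:
    """Read the charset back from the sidecar; no guessing involved."""
    with open(path + ".charset", "r", encoding="ascii") as meta:
        charset = meta.read().strip()
    with open(path, "rb") as f:
        return f.read().decode(charset)

# Round-trip 'karenina.txt' through an explicitly recorded charset.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "karenina.txt")
save_text(path, "Анна Каренина", "utf-8")
print(load_text(path))
```

[A filesystem with native extended attributes could store the same name
in an attribute instead of a sidecar file; the convention differs, the
principle does not.]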
Each time a user needs to re-supply the missing metadata, using his own
guesses or some automatic detector implemented in his software, this
will inevitably break. Just as you want to work on a file only once and
encode it only once, you should never depend later on future guesses,
even if (and notably when) the file is later transparently converted to
a more convenient encoding for some other editors or tools. The
metadata is as important to preserve and transmit as the content.

A "text file" without a specification of the metadata describing how it
is encoded is absolutely not "plain text" to me. It is just a binary
stream, even if it has a "file name" or a basic extension (like ".txt")
that does not specify correctly how to read and process it.

Received on Sat Oct 20 2012 - 16:44:18 CDT
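[The "fail silently" point is easy to demonstrate with a short Python
sketch, independent of any particular detector: every possible byte
value is a valid Latin-1 character, so decoding UTF-8 bytes under a
wrong Latin-1 guess raises no error at all -- it just yields silently
wrong text.]

```python
# UTF-8 bytes for the word 'naïve'.
data = "naïve".encode("utf-8")   # b'na\xc3\xafve'

# Latin-1 maps every byte to a character, so a wrong guess of
# Latin-1 produces no exception -- only mojibake.
wrong = data.decode("latin-1")   # 'naÃ¯ve', no error raised
right = data.decode("utf-8")     # 'naïve'

print(wrong)
print(right)
```

[This is exactly the failure mode that stored metadata avoids: the
correct decoding and the wrong one are both "valid", and only the
recorded charset can tell them apart.]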
This archive was generated by hypermail 2.2.0 : Sat Oct 20 2012 - 16:44:20 CDT