From: Hans Aberg (haberg@math.su.se)
Date: Wed Jan 19 2005 - 17:51:30 CST
At 21:15 +0100 2005/01/19, Lars Kristan wrote:
>Hans Aberg wrote:
>> On 2005/01/19 01:56, Peter Kirk at peterkirk@qaya.org wrote:
>>
>> > On 19/01/2005 00:09, Hans Aberg wrote:
>> >> UTF-8 BOM's seem pointless.
>>
>> > Maybe. Nevertheless, they exist, not only as a result of unintelligent
>> > conversion from UTF-16 or UTF-32 to UTF-8, but also because at least one
>> > UTF-8 editor, Notepad on Windows 2000 (and XP?), always emits a BOM at
>> > the start of a UTF-8 file.
>>
>> Well, it seems easier to change that single editor, then. Or write a program
>> that removes it at need.
>
>At first, one would think that the UTF-8 'BOM' emitted by Notepad is an
>oversight, a bug. But that is not the case.
>
>A long time ago, Notepad worked on 8-bit legacy encoded files. Always in your
>current Windows codepage.
>
>Then Notepad was rewritten in Unicode and got the ability to save files in
>'Unicode' (UCS-2). When opening a file, it used the BOM to distinguish the two
>flavors of text files.
>
>Now Notepad got the ability to save UTF-8 files. And the UTF-8 'BOM' is emitted
>for the same purpose - to be able to distinguish the UTF-8 files from legacy
>encoded files. So, you always get the text you saved back, displayed properly.
>But yes, you cannot use Notepad to edit UNIX files, or UTF-8 html files.
>
>It's a question of what Notepad is - is it a plain text editor or is it an
>editor for "Text documents"? From Microsoft perspective it's probably the
>latter, since Windows practically doesn't have any text files at all. Except
>those generated as "Text documents". For everything else (like html), you have
>tools.
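
The BOM-based detection described in the quoted message can be sketched roughly as follows. This is only an illustration in Python, not Notepad's actual logic: the function name sniff_encoding and the cp1252 fallback code page are assumptions made for the example.

```python
import codecs


def sniff_encoding(data: bytes, legacy: str = "cp1252") -> str:
    """Guess a text encoding from a leading BOM (hypothetical helper).

    Falls back to a legacy 8-bit code page when no BOM is present.
    The default "cp1252" is an assumption, not something the thread states.
    """
    if data.startswith(codecs.BOM_UTF8):
        # Python's "utf-8-sig" codec drops the BOM while decoding.
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # Python's "utf-16" codec reads the BOM to pick the byte order.
        return "utf-16"
    return legacy


raw = codecs.BOM_UTF8 + "Ångström".encode("utf-8")
text = raw.decode(sniff_encoding(raw))
```

Without the BOM, the same bytes would be taken for the legacy code page, which is the ambiguity the signature is meant to resolve.
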
It is clear that the program produces files in an in-house file format for
handling text, not in a plain text format. As the format is platform
specific, when a file is transferred off the platform onto, say, the Internet,
the BOM should be removed so that it becomes a plain text file. Unicode should
have pointed this out to MS. One can compare this with, for example, Mac OS,
which also uses additional resources to record file information such as the
file format, which program is used to handle it, etc. When such a file is
transferred onto the Internet as plain text, all that extra data has to be
removed. Unicode does not provide support for such extra file information
for Mac OS, nor for any other platform. So the MS OS should not be treated
specially in this respect.
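
A program that removes the BOM when a file leaves the platform could be as small as the following Python sketch. The function name strip_utf8_bom is hypothetical, and the sketch assumes the file content is already available as bytes.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    # No BOM at the start: leave the bytes untouched.
    return data


notepad_style = codecs.BOM_UTF8 + b"<html></html>\n"
plain = strip_utf8_bom(notepad_style)
```

Only a BOM at the very start is removed; any later U+FEFF is left alone, since in that position it is ordinary text content.
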
Hans Aberg