Re: Names for UTF-8 with and without BOM - pragmatic

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Nov 06 2002 - 12:47:43 EST


    Lars Kristan wrote:
    > Markus Scherer wrote:
    >
    >>If software claims that it does not modify the contents of a
    >>document *except* for initial U+FEFF
    >>then it can do with initial U+FEFF what it wants. If the
    >>whole discussion hinges on what is allowed
    >><em>if software claims to not modify text</em> then one need
    >>not claim that so absolutely.
    >
    > That seems pretty straightforward, but only as long as your "software" is an
    > editor and your "document" is a single file. How about a case where
    > "software" is a copy or cat command, and instead of a document you have
    > several (plain?) text files that you concat? What does "initial" mean here?

    Initial for each piece, as each is assumed to be a complete text file before concatenation. Nothing
    prevents copy/cp/cat and other commands from recognizing Unicode signatures, as long as they
    don't claim to preserve initial U+FEFF.
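    The idea above can be sketched in code. This is a minimal illustration of mine, not anything from the thread: a BOM-aware "cat" that strips the UTF-8 signature from the start of each input file before concatenating, so no U+FEFF ends up in the middle of the combined stream (the function name is made up for the example).

```python
# A BOM-aware "cat": strips the UTF-8 signature from each file's start
# before concatenating. Such a tool must not claim to preserve an
# initial U+FEFF.
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8

def cat_stripping_boms(paths, out):
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        # Strip the signature only when it is file-initial; a U+FEFF
        # anywhere else is treated as ordinary content and kept.
        if data.startswith(UTF8_BOM):
            data = data[len(UTF8_BOM):]
        out.write(data)
```

    Usage would look like `cat_stripping_boms(["a.txt", "b.txt"], sys.stdout.buffer)`.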

    > What happens next is: some software lets an initial BOM get through and
    > appends such string to a file or a stream. If other software treats it as a
    > character, the data has been modified. On the other hand, if we want to
    > allow software to disregard BOMs in the middle of character streams then we
    > have another set of security issues. And not removing is equally bad because
    > of many consequences (in the end, we could end up with every character being
    > preceded by a BOM).

    All true, and all well known, and the reason why the UTC and WG2 added U+2060 Word Joiner. This
    becomes less of an issue if and when they decide to remove/deprecate the ZWNBSP semantics from U+FEFF.

    However, in a situation where you cannot be sure about the intended purpose of an initial U+FEFF, I
    don't think that the "pragmatic" approach is any less safe than any other, while it increases usability.

    >>.txt    UTF-8    require    We want plain text files to
    >>                            have BOM to distinguish
    >>                            from legacy codepage files
    >
    > Hmmmm, what does "plain" mean?! ...

    Your response to this takes it out of context. I am not trying to prescribe general semantics of
    .txt plain text files.

    If you read the thread carefully, you will see that I am just taking the file checker configuration
    file from Joseph Boyle and suggesting a modification to its format that makes it not rely on having
    charset names that indicate any particular BOM handling. I am sorry to not have made this clearer.

    > True, UTF-16 files do need a signature. Well, we just need to abandon the
    > idea that UTF-16 can be used for plain text files. Let's have plain text
    > files in UTF-8. Look at it as the most universal code page. Plain text files
    > never contained information about the code page, why would there be such
    > information in UTF-8 plain text files?!

    UTF-16 files do not *need* a signature per se. However, it is very useful to prepend Unicode plain
    text *files* with Unicode signatures so that tools have a chance to figure out if those files are in
    Unicode at all - and which Unicode charset - or in some legacy charset. With "plain text files" I
    mean plain text documents without any markup or other meta information.
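    The kind of check this makes possible can be sketched as follows. This is my own illustration, not code from the thread: it maps an initial Unicode signature to a charset name, testing longer signatures first so that UTF-32 files are not mistaken for UTF-16 (the UTF-32LE signature begins with the UTF-16LE one).

```python
# Detect which Unicode charset a plain text file is in by inspecting
# its initial signature bytes. Longest signatures are checked first.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UTF-16BE"),
    (b"\xff\xfe",         "UTF-16LE"),
]

def detect_signature(data: bytes):
    """Return the charset named by an initial Unicode signature, or
    None when there is no signature (possibly a legacy codepage)."""
    for sig, name in SIGNATURES:
        if data.startswith(sig):
            return name
    return None
```

    A file with no signature is ambiguous: it may be signature-less Unicode or a legacy codepage, which is exactly why prepending a signature is useful.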

    The fact is that Windows uses UTF-8 and UTF-16 plain text files with signatures (BOMs) very simply,
    gracefully, and successfully. It has applied what I called the "pragmatic" approach here for about
    10 years. It just works.

    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    


    This archive was generated by hypermail 2.1.5 : Wed Nov 06 2002 - 13:33:54 EST