Re: Names for UTF-8 with and without BOM - pragmatic

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Wed Nov 06 2002 - 12:47:43 EST


    Lars Kristan wrote:
    > Markus Scherer wrote:
    >
    >>If software claims that it does not modify the contents of a
    >>document *except* for initial U+FEFF
    >>then it can do with initial U+FEFF what it wants. If the
    >>whole discussion hinges on what is allowed
    >><em>if software claims to not modify text</em> then one need
    >>not claim that so absolutely.
    >
    > That seems pretty straightforward, but only as long as your "software" is an
    > editor and your "document" is a single file. How about a case where
    > "software" is a copy or cat command, and instead of a document you have
    > several (plain?) text files that you concat? What does "initial" mean here?

    Initial for each piece, as each is assumed to be a complete text file before concatenation. Nothing
    prevents copy/cp/cat and other commands from recognizing Unicode signatures, as long as they
    don't claim to preserve initial U+FEFF.
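    The idea above can be sketched in code. This is a minimal illustration of mine, not anything from the thread: a BOM-aware "cat" that strips the UTF-8 signature from the start of each input file before concatenating, so no U+FEFF ends up in the middle of the combined stream (the function name is made up for the example).

```python
# A BOM-aware "cat": strips the UTF-8 signature from each file's start
# before concatenating. Such a tool must not claim to preserve an
# initial U+FEFF.
UTF8_BOM = b"\xef\xbb\xbf"  # U+FEFF encoded in UTF-8

def cat_stripping_boms(paths, out):
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        # Strip the signature only when it is file-initial; a U+FEFF
        # anywhere else is treated as ordinary content and kept.
        if data.startswith(UTF8_BOM):
            data = data[len(UTF8_BOM):]
        out.write(data)
```

    Usage would look like `cat_stripping_boms(["a.txt", "b.txt"], sys.stdout.buffer)`.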

    > What happens next is: some software lets an initial BOM get through and
    > appends such string to a file or a stream. If other software treats it as a
    > character, the data has been modified. On the other hand, if we want to
    > allow software to disregard BOMs in the middle of character streams then we
    > have another set of security issues. And not removing is equally bad because
    > of many consequences (in the end, we could end up with every character being
    > preceded by a BOM).

    All true, and all well known, and the reason why the UTC and WG2 added U+2060 Word Joiner. This
    becomes less of an issue if and when they decide to remove/deprecate the ZWNBSP semantics from U+FEFF.

    However, in a situation where you cannot be sure about the intended purpose of an initial U+FEFF, I
    don't think that the "pragmatic" approach is any less safe than any other, while it increases usability.

    >>.txt    UTF-8    require    We want plain text files to
    >>                            have BOM to distinguish
    >>                            from legacy codepage files
    >
    > Hmmmm, what does "plain" mean?! ...

    Your response to this takes it out of context. I am not trying to prescribe general semantics of
    .txt plain text files.

    If you read the thread carefully, you will see that I am just taking the file checker configuration
    file from Joseph Boyle and suggesting a modification to its format that makes it not rely on having
    charset names that indicate any particular BOM handling. I am sorry to not have made this clearer.

    > True, UTF-16 files do need a signature. Well, we just need to abandon the
    > idea that UTF-16 can be used for plain text files. Let's have plain text
    > files in UTF-8. Look at it as the most universal code page. Plain text files
    > never contained information about the code page, why would there be such
    > information in UTF-8 plain text files?!

    UTF-16 files do not *need* a signature per se. However, it is very useful to prepend Unicode plain
    text *files* with Unicode signatures so that tools have a chance to figure out if those files are in
    Unicode at all - and which Unicode charset - or in some legacy charset. With "plain text files" I
    mean plain text documents without any markup or other meta information.
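    The kind of check this makes possible can be sketched as follows. This is my own illustration, not code from the thread: it maps an initial Unicode signature to a charset name, testing longer signatures first so that UTF-32 files are not mistaken for UTF-16 (the UTF-32LE signature begins with the UTF-16LE one).

```python
# Detect which Unicode charset a plain text file is in by inspecting
# its initial signature bytes. Longest signatures are checked first.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UTF-16BE"),
    (b"\xff\xfe",         "UTF-16LE"),
]

def detect_signature(data: bytes):
    """Return the charset named by an initial Unicode signature, or
    None when there is no signature (possibly a legacy codepage)."""
    for sig, name in SIGNATURES:
        if data.startswith(sig):
            return name
    return None
```

    A file with no signature is ambiguous: it may be signature-less Unicode or a legacy codepage, which is exactly why prepending a signature is useful.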

    The fact is that Windows uses UTF-8 and UTF-16 plain text files with signatures (BOMs) very simply,
    gracefully, and successfully. It has applied what I called the "pragmatic" approach here for about
    10 years. It just works.

    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    


    This archive was generated by hypermail 2.1.5 : Wed Nov 06 2002 - 13:33:54 EST