Re: Names for UTF-8 with and without BOM - pragmatic

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Nov 05 2002 - 16:52:31 EST


    Mark Davis wrote:
    > Little probability that right double quote would appear at the start of a
    > document either. Doesn't mean that you are free to delete it (*and* say that
    > you are not modifying the contents).

    This points to a pragmatic way to deal with this issue:

    If software claims that it does not modify the contents of a document *except* for initial U+FEFF
    then it can do with initial U+FEFF what it wants. If the whole discussion hinges on what is allowed
    *if software claims to not modify text* then one need not claim that so absolutely.
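
    A minimal sketch of such a narrowly scoped claim, assuming the text is already decoded into a
    Python string. The function name is hypothetical. It removes at most one U+FEFF at the very
    start and leaves every other character, including any later U+FEFF, untouched.

```python
BOM = "\ufeff"  # U+FEFF, used as a signature only at the start of the text


def strip_initial_bom(text: str) -> str:
    """Remove a single leading U+FEFF; change nothing else."""
    if text.startswith(BOM):
        return text[len(BOM):]
    return text
```

    Software built this way can state exactly what it modifies: the initial U+FEFF and nothing
    more.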

    Similarly, software may claim to not modify text contents _except_ that it may transform line
    endings into LS or any other convention.
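
    The line-ending exception could be sketched the same way. This assumes that "LS" means U+2028
    LINE SEPARATOR and that only CR, LF and CRLF count as line endings; the function name is
    hypothetical.

```python
import re

LS = "\u2028"  # U+2028 LINE SEPARATOR

# CRLF is listed first so that it becomes one LS, not two.
_LINE_ENDING = re.compile(r"\r\n|\r|\n")


def line_endings_to_ls(text: str) -> str:
    """Replace CR, LF and CRLF with LS; leave all other characters alone."""
    return _LINE_ENDING.sub(LS, text)
```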

    Not all software claims to not modify text, nor needs to claim that, and a lot of software does
    modify text.

    > I agree that when the UTC decides that a BOM is *only* to be used as a
    > signature, and that it would be ok to delete it anywhere in a document (like
    > a non-character), then we are in much better shape. This was, as a matter of
    > fact proposed for 3.2, but not approved. If we did that for 4.0, then there
    > would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
    > 'withoutBOM'.

    This would be good. The above would still be useful.

    Joseph's request is actually different from the discussion of what is "the right thing": He mostly
    wants to have labels that distinguish between different things to be done. If there is no consensus
    for such labels here, then Joseph may need to use selectors in his configuration file that are
    separate from charset labels.

    For example:

    Type   charset   BOM       Comment
    .txt   UTF-8     require   We want plain text files to have BOM to distinguish
                               from legacy codepage files
    .xml   UTF-8     forbid    Some XML processors may not cope with BOM
    .htm   UTF-8     maybe     We want HTML to be UTF-8 but will not insist on BOM
    .rc    not UTF   n/a       Unfortunately compiler insists on these being codepage.
    .rc    UTF-16    require   Alternative to the previous line.
    .swt   ASCII     n/a       Nonlocalizable internal format, must be ASCII.
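
    A rough sketch of how a tool might apply such a table, keeping the BOM rule separate from the
    charset label. The names POLICY and bom_ok are hypothetical. For .rc the sketch picks the
    UTF-16 alternative, and it treats "maybe" and "n/a" as "no BOM check", which is an assumption.
    It checks only the signature bytes, not whether the data is valid in the charset.

```python
import codecs
from pathlib import PurePath

# Hypothetical policy table: file extension -> (charset label, BOM rule).
POLICY = {
    ".txt": ("UTF-8", "require"),
    ".xml": ("UTF-8", "forbid"),
    ".htm": ("UTF-8", "maybe"),
    ".rc": ("UTF-16", "require"),  # the UTF-16 alternative from the table
    ".swt": ("ASCII", "n/a"),
}

# Byte signatures per charset label; charsets without an entry have none.
SIGNATURES = {
    "UTF-8": (codecs.BOM_UTF8,),
    "UTF-16": (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE),
}


def bom_ok(filename: str, data: bytes) -> bool:
    """Return True if the file's initial bytes satisfy its BOM rule."""
    charset, rule = POLICY[PurePath(filename).suffix.lower()]
    # startswith() with an empty tuple is always False.
    has_bom = data.startswith(SIGNATURES.get(charset, ()))
    if rule == "require":
        return has_bom
    if rule == "forbid":
        return not has_bom
    return True  # "maybe" and "n/a": no BOM check (assumption)
```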

    markus

    -- 
    Opinions expressed here may not reflect my company's positions unless otherwise noted.
    

