Re: Names for UTF-8 with and without BOM - pragmatic

From: Markus Scherer (markus.scherer@jtcsv.com)
Date: Tue Nov 05 2002 - 16:52:31 EST

Next message: Peter_Constable@sil.org: "Re: ct, fj and blackletter ligatures"

Previous message: David Hopwood: "Re: In defense of Plane 14 language tags (long)"
In reply to: Mark Davis: "Re: Names for UTF-8 with and without BOM"
Next in thread: William Overington: "Re: Names for UTF-8 with and without BOM"

Mark Davis wrote:
> Little probability that right double quote would appear at the start of a
> document either. Doesn't mean that you are free to delete it (*and* say that
> you are not modifying the contents).

This points to a pragmatic way to deal with this issue:

If software claims that it does not modify the contents of a document *except* for initial U+FEFF
then it can do with initial U+FEFF what it wants. If the whole discussion hinges on what is allowed
*if software claims to not modify text* then one need not claim that so absolutely.

Similarly, software may claim to not modify text contents _except_ that it may transform line
endings into LS or any other convention.

Not all software claims to not modify text, nor needs to claim that, and a lot of software does
modify text.
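
A minimal sketch of what such a limited claim could look like in practice, assuming Python. The
function names are hypothetical and only illustrate the two exceptions named above (initial U+FEFF,
and line endings converted to LS); everything else in the text is passed through unchanged.

```python
BOM = "\ufeff"  # U+FEFF, treated here only as an initial signature
LS = "\u2028"   # U+2028 LINE SEPARATOR

def strip_initial_bom(text: str) -> str:
    """Remove U+FEFF only if it is the very first character."""
    return text[1:] if text.startswith(BOM) else text

def normalize_line_endings(text: str, newline: str = LS) -> str:
    """Map CRLF, CR and LF to one convention (LS by default)."""
    # CRLF is handled first so that it does not become two line ends.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", newline)
```

A U+FEFF anywhere other than the start is deliberately left alone, since the claim made above
covers only the initial one.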

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document (like
> a non-character), then we are in much better shape. This was, as a matter of
> fact proposed for 3.2, but not approved. If we did that for 4.0, then there
> would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
> 'withoutBOM'.

This would be good. The pragmatic approach above would still be useful.

Joseph's request is actually different from the discussion of what is "the right thing": he mostly
wants labels that distinguish between the different things to be done. If there is no consensus
here for such labels, then Joseph may need to use selectors in his configuration file that are
separate from charset labels.

For example:

Type   charset   BOM      Comment
.txt   UTF-8     require  We want plain text files to
                          have BOM to distinguish
                          from legacy codepage files
.xml   UTF-8     forbid   Some XML processors may not cope with BOM
.htm   UTF-8     maybe    We want HTML to be UTF-8 but
                          will not insist on BOM
.rc    not UTF   n/a      Unfortunately compiler insists on
                          these being codepage.
.rc    UTF-16    require  Alternative to the previous line.
.swt   ASCII     n/a      Nonlocalizable internal format, must be ASCII.
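
One way such a configuration could be consumed, sketched in Python under stated assumptions: the
POLICY mapping and check_bom are hypothetical names, the .rc rows are left out because the table
gives two alternatives for them, and "n/a" is taken to mean that the BOM selector is simply not
checked. The point is only that the BOM selector is kept separate from the charset label.

```python
import codecs
from pathlib import PurePath

# File type -> (charset label, separate BOM selector), taken from the table.
POLICY = {
    ".txt": ("UTF-8", "require"),
    ".xml": ("UTF-8", "forbid"),
    ".htm": ("UTF-8", "maybe"),
    ".swt": ("ASCII", "n/a"),
}

def check_bom(filename: str, data: bytes) -> bool:
    """Return True if the file's leading bytes satisfy its BOM selector."""
    _charset, bom = POLICY[PurePath(filename).suffix.lower()]
    has_bom = data.startswith(codecs.BOM_UTF8)  # bytes EF BB BF
    if bom == "require":
        return has_bom
    if bom == "forbid":
        return not has_bom
    # "maybe" accepts both; "n/a" is assumed to mean "not checked here".
    return True
```

For example, check_bom("notes.txt", data) fails for a plain text file without the signature, while
the same bytes under an .htm name would pass.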

markus

-- 
Opinions expressed here may not reflect my company's positions unless otherwise noted.

