Re: Names for UTF-8 with and without BOM

From: Mark Davis (mark.davis@jtcsv.com)
Date: Sun Nov 03 2002 - 15:25:14 EST

Next message: Michael (michka) Kaplan: "Re: Names for UTF-8 with and without BOM"

Previous message: Mark Davis: "Re: Header Reply-To"
In reply to: Doug Ewell: "Re: Names for UTF-8 with and without BOM"
Next in thread: Michael (michka) Kaplan: "Re: Names for UTF-8 with and without BOM"
Reply: Michael (michka) Kaplan: "Re: Names for UTF-8 with and without BOM"
Reply: Doug Ewell: "Re: Names for UTF-8 with and without BOM"
Reply: Markus Scherer: "Re: Names for UTF-8 with and without BOM - pragmatic"

There is little probability that a right double quote would appear at the
start of a document either. That doesn't mean that you are free to delete it
(*and* say that you are not modifying the contents).

I agree that once the UTC decides that a BOM is *only* to be used as a
signature, and that it is OK to delete it anywhere in a document (like
a non-character), we will be in much better shape. This was, as a matter of
fact, proposed for 3.2, but not approved. If we did that for 4.0, then there
would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
'withoutBOM'.
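
To make the ambiguity concrete, here is a minimal Python sketch. It assumes
Python 3 and its standard "utf-8" and "utf-8-sig" codecs; the helper names
starts_with_signature and strip_signature are hypothetical, introduced only
for illustration. It shows that the same three leading bytes are either kept
as a U+FEFF character or dropped as a signature, depending purely on which
interpretation the reader picks, so dropping them does change the content.

```python
import codecs

# b'\xef\xbb\xbf': the UTF-8 encoding of U+FEFF, whether it is meant
# as a signature (BOM) or as a real ZWNBSP content character.
UTF8_SIGNATURE = codecs.BOM_UTF8


def starts_with_signature(data: bytes) -> bool:
    """Hypothetical helper: True if the bytes begin with EF BB BF."""
    return data.startswith(UTF8_SIGNATURE)


def strip_signature(data: bytes) -> bytes:
    """Hypothetical helper: drop one leading EF BB BF, if present.

    This *assumes* the leading U+FEFF was a signature; the bytes
    themselves cannot confirm that.
    """
    if starts_with_signature(data):
        return data[len(UTF8_SIGNATURE):]
    return data


sample = "\ufeffhello".encode("utf-8")

# The plain "utf-8" codec treats the leading U+FEFF as content and keeps it.
as_content = sample.decode("utf-8")

# The "utf-8-sig" codec treats it as a signature and removes it.
as_signature = sample.decode("utf-8-sig")

print(len(as_content), len(as_signature))
print(strip_signature(sample) == sample)
```

The two decodes disagree by one code point, which is exactly the
distinction a 'withBOM' / 'withoutBOM' label would have to carry.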

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Doug Ewell" <dewell@adelphia.net>
To: "Unicode Mailing List" <unicode@unicode.org>
Cc: "Mark Davis" <mark.davis@jtcsv.com>; "Murray Sargent"
<murrays@exchange.microsoft.com>; "Joseph Boyle" <Boyle@siebel.com>
Sent: Saturday, November 02, 2002 13:27
Subject: Re: Names for UTF-8 with and without BOM

> Mark Davis <mark dot davis at jtcsv dot com> wrote:
>
> > That is not sufficient. The first three bytes could represent a real
> > content character, ZWNBSP or they could be a BOM. The label doesn't
> > tell you.
>
> I have never understood under what circumstances a ZWNBSP would ever
> appear as the first character of a file. It wouldn't make any sense. A
> ZWNBSP prevents a word break between the preceding and following
> characters. If there *is* no preceding character, then what is the
> point of the ZWNBSP?
>
> Every time this topic comes up, I have asked why a true ZWNBSP would
> ever appear as the first character of a file. The only responses I've
> heard are:
>
> 1. It might not be a discrete file, but the second (or successive)
> piece of a file that was split up for some reason (transmission, etc.).
>
> In that case, the interpreting process should take its encoding cue from
> the first fragment, and should NEVER reinterpret fragments broken up at
> arbitrary points. (Imagine a process modifying a GIF or JPEG file, or
> converting CR/LF, based on fragments!) But this is not the point being
> discussed anyway; the point is whole files.
>
> 2. It could happen; Unicode allows any character to appear anywhere.
>
> Well, almost anywhere. But even so, the likelihood of a U+FEFF as
> ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly
> small compared to the likelihood that the U+FEFF was intended to be a
> signature. The rare case is just too rare to invalidate the heuristic
> for the much more common case.
>
> In addition, as Michka points out, we now have U+2060 WORD JOINER, whose
> entire purpose in life is to be used as U+FEFF was formerly used, as a
> ZWNBSP. Any new Unicode text should use U+2060 and not U+FEFF as a word
> joiner. It's hard to imagine that UTC and WG2 would have standardized
> this if there was a lot of real-world text that used U+FEFF as ZWNBSP.
>
> -Doug Ewell
> Fullerton, California
>
>
>

This archive was generated by hypermail 2.1.5 : Sun Nov 03 2002 - 15:59:57 EST