Re: (Informational only: UTF-8 BOM and the real life)

From: Leif H Silli <xn--mlform-iua_at_xn--mlform-iua.no>
Date: Sat, 28 Jul 2012 08:52:59 +0300

Steven Atreju on 28/7/'12, 0:22:
> "Doug Ewell" wrote:

> |> Well, i still see a bug in the Unicode Standard here.
> |> Whereas for the multioctet UTFs there is «The BOM is not
> |> considered part of the content of the text» (Conformance, 3.10,
> |> D98, D101), i cannot find any such clarifying text for its usage
> |> as a signature.
> |
> |There really isn't as much difference between using U+FEFF "as a byte
> |order mark" and using it "as a signature" as this makes it seem. The
> |definitions you quote have to do with whether U+FEFF is treated as a
> |BOM/signature or as a zero-width no-break space.
 
> I really think that a clarification in equal spirit to those of
> D98 and D101 (but maybe with different content :) would be an
> improvement of the Unicode Standard.
>
> Once more i want to point out that on Unix/POSIX systems the file
> content can be seen as a whole, and i hope and think that this
> will not change. This situation is completely different than on
> Windows, which had textfiles with appended (separated by ^Z or so)
> meta information that was invisible in normal text editors already
> in the nineties (or even earlier, but i don't know).
>
> I.e., this is why we do have this messy text OR binary file I/O
> distinction like O_BINARY (for open(2)), "b" (for fopen(3)) or
> binmode (perl(1)). Because without those a text file will see
> End-Of-File at the ^Z, not at the real end of the file. (Which
> raises the immediate question why the Microsoft programmers did not
> embed the meta information in this section at the end of the file.
> But i don't really want to know.)
> Anyway. On Unix a UTF-8 file *will* show the BOM, because it is
> file content.
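
To make the quoted point concrete, here is a minimal Python sketch (my own illustration, not taken from anyone's post) of the two ways the same bytes can be read. It assumes only the standard library's "utf-8" and "utf-8-sig" codecs; the sample text "hello" is made up.

```python
import codecs

# Three signature bytes (EF BB BF) followed by ordinary UTF-8 text.
raw = codecs.BOM_UTF8 + "hello\n".encode("utf-8")

# The plain "utf-8" codec treats the bytes as file content:
# U+FEFF survives as the first character of the decoded text.
as_content = raw.decode("utf-8")

# The "utf-8-sig" codec treats a leading EF BB BF as a signature
# and leaves it out of the decoded text.
as_signature = raw.decode("utf-8-sig")

print(repr(as_content))
print(repr(as_signature))
```

Which of the two readings is "right" is exactly what the thread is arguing about; the bytes on disk are the same either way.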

I agree with Doug that there is no enormous difference between "BOM" and "encoding signature". In XML 1.0 the BOM is in fact described as a signature, regardless of which Unicode encoding it is used with:

http://www.w3.org/TR/xml/#charencoding
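
As a rough illustration of the "signature" reading, the sketch below looks only at the leading bytes and reports which encoding they announce. The helper name sniff_signature and the table are hypothetical, and this is a simplification: it covers just UTF-8 and UTF-16, not the fuller (non-normative) autodetection that the XML specification describes.

```python
import codecs

# Signature bytes mapped to the encoding they announce.
# UTF-32 is deliberately left out: its little-endian signature begins
# with the same two bytes as the UTF-16 one, so it would need a longer
# check placed first.
SIGNATURES = [
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]

def sniff_signature(data: bytes):
    """Return the encoding named by a leading signature, or None."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name
    return None

print(sniff_signature(b"\xef\xbb\xbf<?xml version='1.0'?>"))
print(sniff_signature(b"<?xml version='1.0'?>"))
```

In all three cases it is the same character, U+FEFF, that does the announcing; only its byte serialization differs.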

Also, whether UTF-16 is one or two encodings is a question of definition. (Microsoft at one time defined it as two encodings.)

--
Leif Halvard Silli
Received on Sat Jul 28 2012 - 00:59:04 CDT
