Re: MS/Unix BOM FAQ again (small fix)

From: Mark Davis (mark@macchiato.com)
Date: Tue Apr 09 2002 - 15:15:54 EDT


> A Unicode text file beginning with FEFF is
> big-endian, and a file beginning with FFFE (not a legal Unicode
> character for any other purpose) is little-endian.

This is incorrect. Here is a summary of the meaning of those bytes at
the start of text files with different Unicode encoding forms.

beginning with bytes FE FF:
- UTF-16 => big endian, omitted from contents
- UTF-16BE => ZWNBSP
- UTF-16LE, UTF-8, UTF-32, UTF-32BE, UTF-32LE => malformed, file
corrupted

beginning with bytes FF FE:
- UTF-16 => little endian, omitted from contents
- UTF-16LE => ZWNBSP
- UTF-32 => little endian (if followed by bytes 00 00), omitted from
contents
- UTF-32LE => different code points, depending on following bytes
- UTF-16BE, UTF-8, UTF-32BE => malformed, file corrupted
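
The two lists above can be written down as a lookup table. The following is only a sketch in Python, not anything taken from the standard or the FAQ; the names SCHEMES, PREFIX_MEANING and prefix_meaning are made up for illustration, and the strings just restate the summary.

```python
# Encoding schemes covered by the summary above.
SCHEMES = {"UTF-8", "UTF-16", "UTF-16BE", "UTF-16LE",
           "UTF-32", "UTF-32BE", "UTF-32LE"}

# Meaning of the first two bytes of a file under each declared scheme.
# Any scheme not listed for a prefix is treated as "malformed".
PREFIX_MEANING = {
    b"\xfe\xff": {
        "UTF-16": "BOM: big endian, omitted from contents",
        "UTF-16BE": "ZWNBSP, part of contents",
    },
    b"\xff\xfe": {
        "UTF-16": "BOM: little endian, omitted from contents",
        "UTF-16LE": "ZWNBSP, part of contents",
        "UTF-32": "BOM: little endian if followed by 00 00, omitted from contents",
        "UTF-32LE": "different code points, depending on following bytes",
    },
}


def prefix_meaning(data: bytes, scheme: str) -> str:
    """Describe what a leading FE FF or FF FE means for the declared scheme."""
    if scheme not in SCHEMES:
        raise ValueError(f"unknown encoding scheme: {scheme}")
    table = PREFIX_MEANING.get(data[:2])
    if table is None:
        return "no FE FF or FF FE prefix"
    return table.get(scheme, "malformed")
```

Python's own codecs show the first distinction: b"\xfe\xff\x00A".decode("utf-16") gives "A" (the BOM is consumed), while the same bytes decoded as "utf-16-be" give "\ufeffA" (the ZWNBSP stays in the contents).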

> In addition, a Unicode encoding scheme named UTF-7, which was meant
> as

Worth mentioning that SCSU also has a BOM.

Mark

—————

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

----- Original Message -----
From: "Shlomi Tal" <shlompi@hotmail.com>
To: <unicode@unicode.org>
Sent: Tuesday, April 09, 2002 10:43
Subject: MS/Unix BOM FAQ again (small fix)

> A small fix for the FAQ; specifically, a fix for the typo/braino of
> construing 0x071F as little-endian 1F 70 instead of (the now fixed)
> 1F 07.
> Thanks to Wladislaw Vaintroub for pointing it out for me.
>
> --- BEGIN ---
>
> Microsoft Unicode Text File Byte Order Mark (BOM) FAQ
>
> by Shlomi Tal (shlompi@hotmail.com)
>
> Contents
>
> 1. What is a BOM?
> 2. Why does it matter?
> 3. Is the BOM mandatory or optional?
> ---------------------------------------------------------------------
>
> 1. What is a BOM?
> ^^^^^^^^^^^^^^^^^
>
> BOM, or Byte-Order Mark, is a signature at the beginning of a
> Unicode text file. Since different processors handle sequences of
> bytes in a particular way, the BOM is used to mark which byte-order
> the text file was written in.
>
> Processors are either big-endian or little-endian. The former put
> the most significant byte first, and the latter put the least
> significant byte first. So the 16-bit number 0x071F is serialized
> as:
>
> Big-endian 07 1F
> Little-endian 1F 07
>
> Obviously a code with the value 0x071F will be interpreted as 0x1F07
> if it passes from a processor of different byte-order without
> information about its original state. This is what the Unicode BOM
> seeks to avoid.
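
The 0x071F example can be reproduced with Python's integer byte conversions. This is just an illustrative sketch; the variable names are arbitrary.

```python
value = 0x071F

# Serialize the same 16-bit number in both byte orders.
big = value.to_bytes(2, "big")        # bytes 07 1F
little = value.to_bytes(2, "little")  # bytes 1F 07

# Reading big-endian bytes as if they were little-endian swaps the value.
misread = int.from_bytes(big, "little")  # 0x1F07
```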
>
> The Unicode standard permits the character U+FEFF (Zero-Width
> Non-Breaking Space) at the beginning of the file as a mark for the
> byte order of the file. A Unicode text file beginning with FEFF is
> big-endian, and a file beginning with FFFE (not a legal Unicode
> character for any other purpose) is little-endian.
>
> All this is relevant to the 16-bit and 32-bit encodings of Unicode
> characters - UTF-16 and UTF-32 respectively. Thus:
>
> FE FF is UTF-16 Big-Endian
> FF FE is UTF-16 Little-Endian
> 00 00 FE FF is UTF-32 Big-Endian
> FF FE 00 00 is UTF-32 Little-Endian
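
A minimal Python sketch of checking a file's first bytes against the four signatures listed here. The names SIGNATURES and sniff_bom are hypothetical. The four-byte signatures have to be tested first, since FF FE 00 00 also starts with FF FE; the result is only a guess, because those same four bytes could equally be a UTF-16 little-endian BOM followed by U+0000.

```python
# Longest signatures first, so FF FE 00 00 is not mistaken for FF FE.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32 Big-Endian"),
    (b"\xff\xfe\x00\x00", "UTF-32 Little-Endian"),
    (b"\xfe\xff", "UTF-16 Big-Endian"),
    (b"\xff\xfe", "UTF-16 Little-Endian"),
]


def sniff_bom(data: bytes):
    """Return the name matching the leading signature, or None if there is none."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```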
>
> There is another, very common Unicode encoding scheme called UTF-8,
> which maps the Unicode repertoire into sequences of bytes. Since the
> order of bytes (as opposed to words of more than one byte) is the
> same for all processors, UTF-8 does not require a BOM. It can have
> one, though.
>
> In addition, a Unicode encoding scheme named UTF-7, which was meant
> as a mail-safe encoding but is now nearly obsolete, can have a BOM
> as well. Here too the BOM is not mandatory.
>
> 2. Why does it matter?
> ^^^^^^^^^^^^^^^^^^^^^^
>
> It matters because Microsoft tools (most prominently Windows
> Notepad) prefix the BOM to Unicode text files regularly, whereas
> other systems and environments (Unix, Linux, web pages) are better
> off without the BOM, especially in the case of UTF-8 text files.
>
> Unix systems, for example, search for an initial #! in a shell
> script file in order to determine the interpreter for it. An initial
> BOM coming instead of the #! could easily disrupt this convention.
> Also, and this applies particularly to databases, and not only in
> Unix, the BOM can cause disorder when files are merged. Web pages
> usually use UTF-8, and although they can handle the BOM, it may
> appear as a strange character (a blank square or a question mark) on
> a browser that doesn't recognize it, and may also cause the above
> troubles when the file is saved to the local disk.
>
> Most of the Unicode text meant for open transfer between various
> systems (and the Web) is encoded in UTF-8. Unix systems regularly
> form UTF-8 text files without the BOM, but Windows systems prefix
> the BOM as usual. Here follows an explanation of when the Unicode
> BOM can or cannot be removed from text files on Microsoft Windows
> systems.
>
> 3. Is the BOM mandatory or optional?
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Microsoft Windows, beginning with the Unicode-supporting operating
> systems Windows 2000 and Windows XP, can handle UTF-16
> Little-Endian, UTF-16 Big-Endian, UTF-8 and old 8-bit "ANSI"
> (Microsoft's non-standard name for its 8-bit Windows codepages,
> consisting of the ASCII repertoire for the first 128 characters and
> varying characters for the other 128). The native encoding for these
> systems is UTF-16 Little-Endian, which when saving under Notepad is
> called "Unicode". UTF-16 Big-Endian is called "Unicode Big-Endian",
> and UTF-8 keeps its name.
>
> Upon saving a Unicode text file in Notepad, the BOM is always
> prefixed. Thus, opening such a file with a text editor which is not
> Unicode-aware (such as edit.com) or doing a hexdump on it, you will
> see UTF-16 Little-Endian ("Unicode") starting with FF FE, UTF-16
> Big-Endian ("Unicode Big-Endian") starting with FE FF, and UTF-8
> starting with the UTF-8 encoding of the BOM: EF BB BF.
>
> For the first two encoding schemes (UTF-16), the user MUST NOT
> remove the BOM manually. Removing the BOM using an external tool
> (such as edit.com) and then opening the file with Notepad will
> reveal a pile of gibberish. Then, saving the file will corrupt it
> beyond recovery. This is because the BOM is necessary for the system
> to read the 16-bit values as they are and ignore their values as
> 8-bit sequences. Without the BOM, an 8-bit sequence forming part of
> a 16-bit Unicode character will be given its special ASCII value,
> which may be a control character. Many of these are transcoded into
> graphic ASCII characters when the file is saved again, and thus the
> original text is lost. Since UTF-16 text files are not meant for
> open transfer anyway, this is not an important issue. As for
> database applications and other situations where text files are
> merged, a Unicode-aware application should be able to discard all
> following U+FEFF characters.
>
> For UTF-8, Windows Notepad prefixes the sequence EF BB BF, but it is
> not mandatory. The sequence does not signal byte-order, but just
> that the file is in UTF-8 encoding, and strictly speaking is not
> necessary at all. In fact, Notepad can identify a text file as UTF-8
> if it contains no illegal UTF-8 sequences. One Latin-1 accented
> European vowel standing alone in the text already prevents the text
> from being recognized as UTF-8. See for yourself: type ALT+0206
> ALT+0177 (that is, those numbers with the ALT key held) on an empty
> text file, save and close it. The next time you open the file you
> will see a Greek small letter alpha in it - the file has been
> converted to UTF-8, though the BOM has not yet been added. Writing
> more and saving the file a second time will cause the BOM to be
> prefixed.
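
The ALT+0206 ALT+0177 experiment can be followed byte by byte in Python. This is only a sketch: the strict "utf-8" decoder stands in for whatever check Notepad really performs (an assumption), cp1252 is assumed as the "ANSI" codepage, and looks_like_utf8 is a made-up helper name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the bytes contain no illegal UTF-8 sequences."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# ALT+0206 and ALT+0177 enter the cp1252 characters with codes 0xCE and 0xB1.
typed = "\u00ce\u00b1".encode("cp1252")   # bytes CE B1

# The same two bytes are also the UTF-8 encoding of U+03B1,
# Greek small letter alpha.
reopened = typed.decode("utf-8")

# A single accented vowel (here U+00E9, byte E9) is not legal UTF-8 by itself.
lone_vowel = "\u00e9".encode("cp1252")
```

Here looks_like_utf8(typed) is True, while looks_like_utf8(lone_vowel) is False.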
>
> Thus, when writing UTF-8 files for open transfer, it is best to keep
> the BOM until the text file is complete, and then the BOM can be
> safely removed (the author does so for all his HTML files: writing
> with the BOM until completion, then removing it using the Vim
> editor, which since version 6.0 can handle UTF-8). Upon making
> further changes to the file, remember to remove the BOM again.
>
> So the rules are:
>
> 1) Do not remove the BOM (FF FE or FE FF) from UTF-16 files.
> 2) Removing the BOM (EF BB BF) from UTF-8 is allowed.
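
Rule 2 can be sketched in Python as follows. The function name strip_utf8_bom is hypothetical. It removes only the UTF-8 signature EF BB BF and, in line with rule 1, leaves FF FE and FE FF alone.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Drop a leading EF BB BF if present; return other data unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

For example, strip_utf8_bom(b"\xef\xbb\xbf#!/bin/sh\n") returns b"#!/bin/sh\n", so the #! is again the first thing in the file. When decoding rather than rewriting a file, Python's "utf-8-sig" codec skips the signature if there is one.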
>
> Finally, as a side note, and not of any importance, UTF-7 files can
> have a BOM too: 2B 2F 76 38 2D (ASCII +/v8-). UTF-7 files are no
> special type under Windows, they are saved as "ANSI", as if they
> were regular ASCII or Latin-1 text. The UTF-7 BOM is useful only for
> testing a UTF-7 encoded text file when dragging it into Internet
> Explorer (5 and upwards), which recognizes the BOM and promptly sets
> its encoding to UTF-7. However, given that the UTF-7 encoding has so
> little use (in our day of 8-bit clean systems, which let data with
> the high bit on pass uncorrupted), this can only serve as a piece of
> trivia.
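
As a quick check of that byte sequence, a sketch using Python's built-in "utf-7" codec (the variable names are arbitrary):

```python
# The five bytes quoted above, written out as hex.
utf7_bom = bytes.fromhex("2B 2F 76 38 2D")

# They are plain ASCII, which is why such a file can pass for "ANSI" text.
as_ascii = utf7_bom.decode("ascii")

# Decoded as UTF-7, they yield the single character U+FEFF.
decoded = utf7_bom.decode("utf-7")
```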
>
> --- END ---



This archive was generated by hypermail 2.1.2 : Tue Apr 09 2002 - 16:26:53 EDT