(informative) Explanation of Microsoft Windows Text-File Modes

From: Shlomi Tal (shlompi@hotmail.com)
Date: Fri May 31 2002 - 10:00:41 EDT


Another FAQ-like essay of mine. Request for corrections.

---------

Explanation of Microsoft Windows Text-File Modes

by Shlomi Tal (shlompi@hotmail.com)

Contents

1. Concepts
2. ANSI Mode
3. Unicode Mode
4. UTF-8 Mode
----------------------------------------------------------------------

Preliminary note: Windows 9x is shorthand for Microsoft Windows 95, 98
and ME; Windows XP is shorthand for Microsoft Windows NT 4.0, 2000 and
XP.

1. Concepts
^^^^^^^^^^^

The more legacy-free line of Microsoft Windows operating systems is
designed to use Unicode for all text internally, while providing
other representation modes for interoperability with other
environments. The modes described here are specifically those that
appear in the Windows XP text editor (Notepad), but they apply as
general concepts.

Text files can be divided according to the bit-stream representation
they have, and according to the repertoire of characters they
potentially hold. Bit-stream representation is the number and order of
bits and bytes for encoding the text. Repertoire determines what
characters are legal to use in a text file. Bit-stream and repertoire
are closely linked, though the relations are not always
straightforward.

Microsoft Windows can handle text in at least one of three modes:

1. 8-bit stream with 256-character repertoire
2. 16-bit stream with 65536-character repertoire
3. 8-bit stream with 65536-character repertoire

The first is the only option for Windows 9x, and the second is the
native internal mode of Windows XP. The first involves switching the
repertoire by changing 8-bit codepages, whereas the second has a fixed
16-bit repertoire. The third mode is a hybrid, carrying the
65536-character repertoire in a single extended 8-bit codepage.
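
As a rough illustration of the three modes, the following Python sketch
encodes one short string in each representation. Python and its codec
names (cp1252, utf-16-le, utf-8) are not part of this essay; they are
assumed here only as a convenient stand-in for the conversions that
Windows performs.

```python
# Illustrative only: Python codecs stand in for the Windows conversions.
text = "Gr\u00f6\u00dfe"  # 5 characters, two of them outside ASCII

ansi = text.encode("cp1252")           # mode 1: 8-bit stream, 256-character codepage
unicode_le = text.encode("utf-16-le")  # mode 2: 16-bit stream
utf8 = text.encode("utf-8")            # mode 3: 8-bit stream, full repertoire

# Show the size and the raw bytes of each representation.
for name, data in (("ANSI", ansi), ("Unicode", unicode_le), ("UTF-8", utf8)):
    print(f"{name:8} {len(data):2} bytes  {data.hex(' ')}")
```

The same five characters take 5, 10 and 7 bytes respectively; only the
two non-ASCII characters differ in size between ANSI and UTF-8.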

2. ANSI Mode
^^^^^^^^^^^^
The oldest mode for text files in Microsoft Windows, and the only
option for the Windows 9x family, is ANSI mode, in which the system
recognizes 256 characters. Half of these (the ASCII range, 00 to 7F)
are constant, and the other half (80 to FF) change according to the
particular language version of the system. ANSI mode enables the use
of only two scripts: Basic Latin plus one more codeset. Other codesets
cannot be used in ANSI mode without changing the codepage (which, as
regards Windows 9x, means installing a different version of the
operating system).

In this area there is a notable difference between the "enabled" and
the "localized" versions of Windows 9x. "Enabled" means supporting a
codepage and input methods that make it possible to write in a
particular language. For example, the US version of Windows 9x is also
French enabled, for it has characters for French in the second half of
its codepage (CP1252 in this case). "Localized" means that the whole
interface has been translated to a different language. Localized is
inherently enabled, and there are more different localized versions
than enabled versions.

The practical consequence of ANSI mode is that text files are not
viewed uniformly between operating system versions when characters
from the second half of the codepage are used. For example, German
o-umlaut (U+00F6 LATIN SMALL LETTER O WITH DIAERESIS) will appear as
Hebrew Tsadi (U+05E6 HEBREW LETTER TSADI) when the text file
containing it is opened on an enabled or localized Hebrew Windows 9x
system. This is because German o-umlaut is located on the same integer
in the codepage map of CP1252 as is Hebrew Tsadi in CP1255 (the
MS-Windows Latin/Hebrew codepage). There is no way of entering
o-umlaut in a Hebrew Windows 9x version except through special
applications.
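
The o-umlaut/Tsadi collision can be reproduced with a minimal sketch,
assuming Python's cp1252 and cp1255 codecs match the Windows codepages
of the same names for this byte:

```python
import unicodedata

raw = bytes([0xF6])  # one byte from the upper half of the codepage

as_western = raw.decode("cp1252")  # Western European codepage
as_hebrew = raw.decode("cp1255")   # Latin/Hebrew codepage

# The same byte yields two different characters.
for ch in (as_western, as_hebrew):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```

The byte on disk never changes; only the codepage used to interpret it
does.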

Windows XP abandons ANSI mode and uses Unicode mode instead (see the
next section), but for compatibility with Windows 9x and other
codepage-based environments it emulates ANSI mode for one codepage at
a time. That
is, an option of "system locale" or "system default language" is
chosen to determine which one of the 8-bit codepages Windows XP
supports. This has the consequence that, for example, German o-umlaut
will appear as Hebrew Tsadi if it is in an ANSI-mode text file and the
system default language is set to Hebrew (more exactly to CP1255). All
Windows 9x applications running on Windows XP will exhibit such
behaviour. This applies mainly to the interface (menus, captions) of
applications.

Windows XP does not use ANSI mode internally, but it can write text to
a file in that external representation by saving it as "ANSI". The
file is saved without loss only if it contains no character outside
the system's default ANSI codepage. If it does, Notepad warns the user
to save as Unicode instead, and going ahead with the ANSI save
corrupts the original data (transcoding or conversion to question
marks).
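
The lossy save can be imitated with the sketch below. The helper
save_as_ansi is hypothetical and is not how Notepad is implemented; it
only mimics the outcome described above, with Python's "replace" error
handler standing in for the conversion to question marks. Python does
not perform the transcoding to similar-looking characters that Windows
may apply.

```python
def save_as_ansi(text: str, codepage: str = "cp1252") -> bytes:
    """Hypothetical helper: encode text for an ANSI-mode file."""
    try:
        # Lossless when every character exists in the codepage.
        return text.encode(codepage)
    except UnicodeEncodeError:
        # Lossy fallback: unmappable characters become '?'.
        return text.encode(codepage, errors="replace")

inside = save_as_ansi("caf\u00e9")                  # fits in CP1252
outside = save_as_ansi("\u05e9\u05dc\u05d5\u05dd")  # Hebrew word, not in CP1252

print(inside)
print(outside)
```

Once the question marks are written, the original characters cannot be
recovered from the file.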

3. Unicode Mode
^^^^^^^^^^^^^^^

Windows XP handles text internally as UTF-16 (16 bits per character,
plus support for surrogates from Windows 2000 onwards), and can store
text as UTF-16 in either of little-endian or big-endian byte orders.
The native byte order for the Intel x86 processors is little-endian.

Unicode mode is not a codepage, but a totally different stream method
for text in Windows XP. It is such that typing the command

cmd /u

opens a command prompt in which text is piped in and out as UTF-16
little-endian. Text in Unicode mode can contain any character, and can
be converted to any 8-bit codepage that covers its characters (a few
languages, such as Hindi and Georgian, are Unicode only and have no
8-bit codepage).

The meanings of bytes change in Unicode mode; for example, the byte
pair 0x03 0xA1 (in big-endian order) denotes a single Greek letter
(U+03A1 GREEK CAPITAL LETTER RHO) rather than a control character
followed by a symbol. To identify double-bytes as having this meaning,
text files in Unicode mode (either "Unicode", which means UTF-16
little-endian, or "Unicode big endian", which means UTF-16 big-endian)
must have a byte order mark (U+FEFF) prefixed to them. Removing the
BOM results in a return to interpretation of the bytes as 8-bit
codepage byte sequences, and may lead to corruption (see the author's
Microsoft/Unix BOM FAQ for further details).
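
The role of the BOM can be sketched as follows. The function
decode_text_file is hypothetical and deliberately simplified: it looks
only at the UTF-16 byte order marks and otherwise falls back to an
assumed ANSI codepage (CP1252 here), ignoring UTF-8 altogether.

```python
import codecs

def decode_text_file(data: bytes, ansi_codepage: str = "cp1252") -> str:
    """Hypothetical sketch: choose a decoding from the leading BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_LE):  # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):  # bytes FE FF
        return data[2:].decode("utf-16-be")
    # No BOM: the bytes are read as 8-bit codepage text.
    return data.decode(ansi_codepage)

rho = b"\x03\xa1"  # U+03A1 in big-endian byte order

with_bom = decode_text_file(codecs.BOM_UTF16_BE + rho)
without_bom = decode_text_file(rho)

print([f"U+{ord(c):04X}" for c in with_bom])     # one character
print([f"U+{ord(c):04X}" for c in without_bom])  # two characters
```

With the BOM the two bytes are read as one Greek letter; without it
they become the control character U+0003 followed by U+00A1.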

4. UTF-8 Mode
^^^^^^^^^^^^^
UTF-8 mode is a hybrid: Windows XP treats it as an 8-bit codepage,
providing conversion to UTF-16 little-endian internally just as in
ANSI mode, but allows the whole repertoire of Unicode characters.
UTF-8 is a codepage into which Unicode mode text can be converted; it
is not a stream method by itself. That is, Windows XP does not support
"UTF-8 Mode" in itself for any application. For example, the command
prompt is either in 8-bit codepage mode (by starting it with "cmd") or
in 16-bit Unicode mode (by starting it with "cmd /u"), but there is no
UTF-8 mode for the command prompt, although UTF-8 display and input
are supported through codepage 65001.

Windows XP does not provide a way of manipulating UTF-8 strings
directly; it supports UTF-8 by storing it externally (on disk) and
converting it to UTF-16 little-endian for all other operations. This
should not be a problem for interoperability with other environments,
but Unicode-enabled programs for Windows must use UTF-16, not UTF-8
(unlike Unix). Old 8-bit text manipulation tools such as MS-DOS
edit.com can handle UTF-8 strings without corrupting the file; this is
not so with UTF-16, where merely saving the file can transcode its
control character values (such as 0x00 NULL to 0x20 SPACE).
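
The sketch below does not reproduce edit.com itself; it only shows,
under the assumption that such tools work byte by byte, why UTF-8 is
safer for them: ASCII text is byte-for-byte unchanged in UTF-8, whereas
in UTF-16 every ASCII character carries a 0x00 byte.

```python
text = "plain ASCII"

utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")

# UTF-8 leaves ASCII text untouched and introduces no NUL bytes.
print(utf8 == text.encode("ascii"))
print(0 in utf8)

# UTF-16 adds one 0x00 byte per ASCII character.
print(utf16.count(0), "NUL bytes in", len(utf16), "bytes")
```

A tool that rewrites 0x00 as 0x20 therefore damages every ASCII
character of a UTF-16 file but leaves a UTF-8 file alone.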

Unlike UTF-16 text files, UTF-8 files do not require a byte order mark
in order to be identified as such. Although Windows XP does prefix the
BOM as a regular procedure, it can be safely removed without
corrupting the file. UTF-8 text is identified in Windows XP
heuristically, that is, by the presence of legal UTF-8 sequences and
absence of illegal sequences.
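
A minimal sketch of such a heuristic follows. The functions
looks_like_utf8 and strip_utf8_bom are hypothetical, and the exact
rules Windows XP applies are not documented here; the sketch simply
treats a file as UTF-8 when it decodes strictly without error. Note
that pure ASCII text passes the check too.

```python
import codecs

def looks_like_utf8(data: bytes) -> bool:
    """Hypothetical check: True if data is entirely well-formed UTF-8."""
    try:
        data.decode("utf-8")  # strict decoding rejects illegal sequences
        return True
    except UnicodeDecodeError:
        return False

def strip_utf8_bom(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM (EF BB BF) if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

utf8_file = codecs.BOM_UTF8 + "\u00f6".encode("utf-8")
ansi_file = b"\xf6"  # o-umlaut in CP1252, not legal UTF-8

print(looks_like_utf8(strip_utf8_bom(utf8_file)))
print(looks_like_utf8(ansi_file))
```

The UTF-8 file is still recognized after the BOM is stripped, while
the lone CP1252 byte is rejected.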




This archive was generated by hypermail 2.1.2 : Fri May 31 2002 - 08:30:56 EDT