Re: (informative) Explanation of Microsoft Windows Text-File Modes

From: Michael (michka) Kaplan (michka@trigeminal.com)
Date: Fri May 31 2002 - 10:38:54 EDT


From: "Shlomi Tal" <shlompi@hotmail.com>

> Another FAQ-like essay of mine.

Very interesting....

> Request for corrections.

Ok, if you insist. :-)

> Microsoft Windows can handle text in at least one of three modes:
>
> 1. 8-bit stream with 256-character repertoire
> 2. 16-bit stream with 65536-character repertoire
> 3. 8-bit stream with 65536-character repertoire

#1 fails to take into account the CJK "ANSI" code pages, which support far
more than 256 characters. Also, if you move beyond Notepad to text editors
that can save in other encodings, there is even GB18030.
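To make the DBCS point concrete, here is a minimal Python sketch (not part of the original essay). It assumes Python's "cp932", "cp936" and "gb18030" codecs are close enough to the corresponding Windows code pages for illustration; the helper name and the sample characters are my own choices.

```python
# Sketch: how many bytes each character occupies in a few Windows "ANSI"
# code pages. Helper name and sample text are illustrative assumptions.

def bytes_per_char(text, encoding):
    """Return the encoded length in bytes of each character in text."""
    return [len(ch.encode(encoding)) for ch in text]

sample = "A\u3042\u6f22"  # Latin 'A', one hiragana letter, one kanji

# cp932 (Japanese) and cp936 (Simplified Chinese) are double-byte code
# pages: ASCII stays one byte, the other characters take two bytes each.
sjis_lengths = bytes_per_char(sample, "cp932")
gbk_lengths = bytes_per_char(sample, "cp936")

# GB18030 also has four-byte sequences, used here for a character
# outside the Basic Multilingual Plane (U+10400).
gb18030_lengths = bytes_per_char("A\u6f22\U00010400", "gb18030")

print(sjis_lengths, gbk_lengths, gb18030_lengths)
```

An 8-bit stream in one of these code pages therefore carries a repertoire of thousands of characters, not 256.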

> 2. ANSI Mode
> ^^^^^^^^^^^^
> The oldest mode for text files in Microsoft Windows, and the only
> option for the Windows 9x family, is ANSI mode, in which the system
> recognizes 256 characters. Half of these (the ASCII range, 00 to 7F)
> are constant, and the other half (80 to FF) change according to the
> particular language version of the system. ANSI modes enable the use
> of only two scripts: Basic Latin plus one more codeset. Other codesets
> cannot be used in ANSI mode without changing the codepage (which, as
> regards Windows 9x, means installing a different version of the
> operating system).

See above -- DBCS code pages cannot be denied...

> Windows XP abandons ANSI mode and uses Unicode mode instead (see
> next), but for compatibility with Windows 9x and other codepage-based
> environment it emulates the ANSI mode for one codepage at a time.

XP abandons? The abandonment started in NT 3.1, and continued with NT 3.5,
NT 3.51, NT 4.0, Windows 2000, Windows XP, and Windows .NET Server.

Now I know you had a preliminary note, but you are missing more than half of
the relevant products.

You might want to consider using "NT" or "WinNT" for the shorthand rather
than XP/WinXP -- this is much more common usage. If you just say "XP" maybe
you mean Office XP? NT and 9x are clearly referring to Windows platforms,
though.

> opens a command prompt in which text is piped in and out as UTF-16
> little-endian. Text in Unicode mode can contain any character, and can
> be converted to any 8-bit codepage (except for a few such as Hindi and
> Georgian which are Unicode only).

This part needs a little work. It is not really true that text can be
converted to *any* code page: any character the target code page lacks is
converted to "?", and that covers most characters outside of ASCII in most
code pages. Unicode-only languages have no code pages to convert to -- though
note that there are the ISCII code pages, which can convert Indic languages
to an 8-bit code page.
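A small Python sketch of the "?" problem, again only an illustration: it assumes Python's "cp1252" and "cp1255" codecs stand in for the Windows code pages, and that errors="replace" approximates the default-character substitution Windows performs. The function name and the Hebrew sample are hypothetical.

```python
# Sketch: converting Unicode text to an 8-bit code page that lacks the
# characters. The function name and sample string are assumptions.

def to_code_page(text, code_page):
    """Encode text, substituting '?' for anything the code page lacks."""
    return text.encode(code_page, errors="replace")

sample = "abc \u05e9\u05dc\u05d5\u05dd"  # "abc" plus four Hebrew letters

# Western European code page: the Hebrew letters are lost.
as_western = to_code_page(sample, "cp1252")

# Hebrew code page: every character survives and round-trips.
as_hebrew = to_code_page(sample, "cp1255")

print(as_western)                              # b'abc ????'
print(as_hebrew.decode("cp1255") == sample)    # True
```

The conversion "succeeds" in both cases, but only the second one preserves the text.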

MichKa

Michael Kaplan
Trigeminal Software, Inc. -- http://www.trigeminal.com/


