RE: Japanese Windows Code Page?

From: Chris Pratley (chrispr@MICROSOFT.com)
Date: Wed Feb 09 2000 - 01:26:33 EST


I'll take a shot at replying to this in a comprehensible way, but I may not
make it out alive...

As on any platform, plain text is just a stream of bytes, and figuring out
what it "really is" is quite difficult. If I gave you a file in ISO-8859-x,
you might have some difficulty determining which one of the flavours it was,
especially if it had very little content, or used almost entirely
characters from the "ASCII" set. So this isn't a Windows-specific issue.

In the description below, I am going to use the term "ANSI text" (as opposed
to Unicode text) to describe text stored in bytes that should be interpreted
using the codepage associated with the system locale of the particular
Windows system. On a system with a US locale, the system's "ANSI" codepage
is windows-1252. On a Japanese system the ANSI codepage is 932. And so on.
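
As an illustration of why the system locale matters for "ANSI text", the
following Python sketch takes bytes written under codepage 932 and reads
them back under windows-1252. Python's "cp932" and "cp1252" codecs stand in
here for the Windows codepages; the sample word is just an example.

```python
# "Japan" in kanji, as it would be stored as ANSI text on a Japanese system.
text = "\u65e5\u672c"
raw = text.encode("cp932")

# Interpreted with the codepage it was written in, the text round-trips.
as_japanese = raw.decode("cp932")

# The same bytes interpreted on a US-locale system (windows-1252) come out
# as unrelated Western characters; nothing in the bytes says which is right.
as_western = raw.decode("cp1252", errors="replace")

print(raw.hex())
print(as_japanese)
print(as_western)
```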

When referring to Notepad, it is important to note which version of Notepad
you are using, as it has been upgraded over time. It has always been a
Unicode application on NT (and Windows2000), but never on
Windows9x/Millennium.

Notepad on Win9x/Millennium handles only plain text in the Windows code page
of the system it is running on. But it is an ANSI application, so you are
really just displaying bytes in a particular font. If you apply a font with
a particular charset that is different from the system's ANSI codepage, I
think you can get it to display different characters, but it has been a long
time since I used Notepad on a Win9x machine for multilingual stuff so
perhaps this is not true any more.

Notepad on NT3.x and NT4 is a Unicode (UCS-2 internal) application and it
can open and save plain text in the current system locale's codepage, or
UCS-2 little-endian. I believe that version of Notepad relied on the BOM to
detect Unicode text files. When you save the file, there is a checkbox to
allow you to save as Unicode text.

Notepad on Windows2000 is UTF-16 internally and can open and save the same
things as the older Notepad for NT, but adds UTF-8 and UCS-2 big-endian. When
you go to save the file, there is a little "encoding" dropdown that lets you
pick among these encodings. UTF-16 is covered as well if you pick either 16
bit "Unicode" flavour (not UTF-8 obviously).

Notepad on Windows2000 writes a BOM at the start of Unicode files, even
UTF-8 ones, and uses it to detect the encoding when opening them.
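
BOM-based detection of this kind can be sketched in a few lines of Python.
This is only an illustration of the general technique, not Notepad's actual
code; the function names and the windows-1252 fallback are assumptions made
for the example.

```python
import codecs

# BOM byte sequences and the codec to use for the bytes that follow them.
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return (codec name, BOM length), or (None, 0) if there is no BOM."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

def decode_with_bom(data: bytes, fallback: str = "cp1252") -> str:
    """Decode using the BOM if present, else the (assumed) ANSI codepage."""
    name, skip = sniff_bom(data)
    if name is None:
        return data.decode(fallback)
    return data[skip:].decode(name)

print(decode_with_bom(codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
print(decode_with_bom(b"caf\xe9"))
```

A file without a BOM falls through to the fallback codepage, which is the
case where guessing (or asking the user) becomes necessary.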

OEM (also known as "DOS") codepages are rarely used anymore. They may (or
may not - you'd have to get an NT person to confirm) be used in console mode
on NT, but are essentially limited to legacy operations.
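
For readers unfamiliar with the OEM/ANSI split, here is a small Python
sketch of how the two differ. It assumes a US system, where the OEM
codepage is 437 and the ANSI codepage is windows-1252; other locales pair
different codepages.

```python
word = "caf\u00e9"  # "cafe" with an acute accent on the e

ansi_bytes = word.encode("cp1252")  # e-acute is byte 0xE9 in windows-1252
oem_bytes = word.encode("cp437")    # e-acute is byte 0x82 in codepage 437

# Reading ANSI bytes as if they were OEM text picks a different character.
misread = ansi_bytes.decode("cp437")

print(ansi_bytes, oem_bytes, misread)
```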

Word2000 can open/save plain text files in a wide variety of encodings.
Experiment with the file type "encoded text" when doing an Open or a Save.
Encodings supported include all Windows and DOS-OEM encodings, many ISO
encodings, and some other vendor specific (IBM/Mac/Unix) or regional
encodings (e.g. KOI-8R). The list varies somewhat with the international
support installed on the system.

Word2000 uses a complex algorithm to automatically detect the encoding of a
text or HTML file. Part of the method is algorithmic, and part is
statistical. Accuracy increases with content in the file. Some files are
easy to detect since they have a BOM or other distinctive start sequences -
others are extremely difficult to detect accurately. UTF-8 is actually one
of the easier ones, because its multi-byte sequences follow a strict pattern
that rarely occurs by chance in text in other encodings. Users do get
a preview to verify that the automatically detected encoding is accurate,
and can choose another if they like. On save, you can pick the target
encoding and you can also see the characters that will not survive
conversion to the destination encoding marked in red.
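
Two of the ideas above are easy to sketch in Python. This is not Word's
algorithm (which is not described here beyond "algorithmic plus
statistical"); it is a minimal illustration, with hypothetical function
names, of (a) using strict UTF-8 validity as evidence and (b) listing the
characters that would not survive a save to a given target encoding.

```python
def utf8_evidence(data: bytes) -> str:
    """Classify bytes as 'utf-8', 'ascii' (no evidence), or 'not-utf-8'."""
    try:
        data.decode("utf-8")  # strict: any malformed sequence raises
    except UnicodeDecodeError:
        return "not-utf-8"
    if all(b < 0x80 for b in data):
        return "ascii"  # valid UTF-8, but also valid in most codepages
    return "utf-8"

def lossy_characters(text: str, encoding: str) -> list:
    """Return the characters of text that cannot be encoded in encoding."""
    lost = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost

print(utf8_evidence("\u00e9".encode("utf-8")))   # well-formed multi-byte sequence
print(utf8_evidence(b"caf\xe9"))                 # windows-1252 bytes, malformed as UTF-8
print(lossy_characters("na\u00efve \u65e5\u672c", "cp1252"))
```

A real detector would need far more than this for the legacy codepages,
since most byte sequences are valid in all of them; that is where the
statistical part comes in.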

Word97 supports saving plain text in the current OEM and Windows ANSI
codepages, based on the system locale setting. Word97 can also save as
Unicode UCS-2 little endian plain text. If you hijack the HTML import, there
is a trick that lets you import plain text in any HTML encoding: place an
<HTML> at the start of the file, turn on Tools/Options/General/Confirm
conversions at open, and pick HTML as the file type when you open the file
and are asked to confirm. If the encoding is wrong, make the change in
File/Properties.

Wordpad is a wrapper around the rich edit text control, which has also
changed over time. Early versions (riched32.dll) were not Unicode, but
everything based on riched20.dll is. (This includes the older ver 2.0 and
the newer 3.0 found on Win2000). On Win98 I believe riched20.dll as used in
Wordpad attempts to mimic the older riched32.dll for compatibility, but I
could be quite wrong on that (Murray knows best). On NT, Riched20.dll and
Wordpad are fully Unicode. Wordpad allows saving to OEM, Windows ANSI, and
"Unicode" text. I believe the Unicode it supports is only UCS-2
little-endian. Wordpad has changed little in Windows2000 from NT4.

IE5 can also act as a plain text conversion tool. You can open plain text
files in IE5 and then save them to other encodings using File/Save As.

I hope that helped clarify things a little bit. Look for even more encodings
to be supported in the next major release of Word.

Chris Pratley
Group Program Manager
Microsoft Word

-----Original Message-----
From: Frank da Cruz [mailto:fdc@watsun.cc.columbia.edu]
Sent: Monday, February 07, 2000 3:03 PM
To: Unicode List
Subject: Re: Japanese Windows Code Page?

> As far as encoding goes (not considering input or complex
> rendering issues), Word 97 uses Unicode. That is the encoding
> it uses in its internal memory representation, regardless of OS
> it is running on. Ditto for later versions. The encoding form
> it uses is UCS-2; the next version of Word will support UTF-16.
> It can also input and output UTF-8.
>
So (what I meant was) suppose there is a plain-text file, foo.txt (or
foo.doc) and I open it in the File menu. How do:

  Word
  WordPad
  NotePad

know whether it's Unicode (and if Unicode, whether it's UTF-8 or UTF-16) or
a Windows Code Page? By inspection / statistical analysis? I can see how
this might work for telling the difference between a Windows Code Page and
UCS-2/UTF-16 (since the latter would include a lot of NULs), but once you
allow UTF-8 into the picture, it gets muddier, no?

What if the application guesses wrong? Can the user specify the encoding?
A short tour through the menus of Word 97 and WordPad didn't show any way
to do this; maybe I missed it, or there is a way in later versions?

Windows has "three worlds" of character sets: OEM code pages, Windows
code pages, and Unicode. I'm trying to get an idea of the degree to which
they overlap, and the degree to which they are distinct and separate:

 . Across applications (Edit, NotePad, WordPad, Word)
 . Across OS's (Win 95, 98, NT, 2000)

For example, does Edit in NT use OEM Code pages? Which versions of
WordPad and/or NotePad can read a Unicode file? etc etc. My motivation for
asking this is to write a clear statement of how plain text should be
imported into Windows from non-Microsoft platforms such as Unix, VMS, etc.
How much does it depend on the application that will be using it, the
version of that application, and the Windows OS it runs on?

Thanks again!

- Frank


