RE: Developing multilingual web sites

From: Chris Pratley (chrispr@MICROSOFT.com)
Date: Wed Mar 22 2000 - 00:38:05 EST


Aaron,

You have a few options.

One is to use a text editor that can support output in many different text
encodings. There are a few of these around, and one of these is Microsoft
Word2000. If you treat Word as a plain text editor, and save as "Encoded
text" (*.txt), you can edit text files in any encoding. There are no doubt
many other text editors on the level of Notepad or more that will do the
same thing, since plain text editing is fairly trivial. The feature most
editors lack is the ability to save to many other encodings (usually the
editor needs to be a Unicode editor in order to do this), but I think a
handful of them do have this feature. People on the list can probably
suggest a few if you don't have Word2000.

Once you save a file as plain text (txt), it becomes merely a stream of
bytes with no information regarding its encoding (this is why it is called
"plain" text - it is really just bytes with no other meta-information such
as encoding, fonts, etc.). When the file is loaded into an editor, what
happens depends on the editor. Notepad, being a Unicode application (on
NT/Win2000), will attempt to convert the bytes to Unicode, making the
assumption that the bytes are in the ANSI code page of the system. So these
bytes need to be translated to the equivalent Unicode points using a mapping
table, and this is what Notepad does. The exception is if the file appears
to be Unicode UCS-2 or UTF-8, which Notepad determines by calling a system
API that takes a guess based on the nature of the bytes being loaded. If the
file appears to be UTF-8 or UCS-2, then Notepad loads the file
appropriately. So effectively, Notepad supports four encodings: UTF-8, UCS-2
(little-endian and big-endian), and "current ANSI code page", which changes
depending on the system locale. If you are trying to edit GBK and Big5 text
files, you will therefore run into problems since Notepad can only handle
one of these in a particular boot session of Windows2000.
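
To make the loading behaviour above concrete, here is a rough Python sketch.
It is only an illustration, not how Notepad is actually implemented: the
function name load_text and its ansi_codepage parameter are made up for this
example, and a simple byte-order-mark check plus a strict UTF-8 trial stands
in for the system API's guess.

```python
import codecs

def load_text(data: bytes, ansi_codepage: str) -> str:
    """Hypothetical loader: BOM check, then strict UTF-8, then ANSI fallback."""
    # A byte order mark identifies the Unicode encodings unambiguously.
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    try:
        # Stand-in for the "guess": accept the bytes only if they are valid UTF-8.
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume the bytes are in the system's ANSI code page.
        return data.decode(ansi_codepage, errors="replace")

text = "中文"
big5_bytes = text.encode("big5")  # plain bytes; nothing records "Big5"

read_on_big5_system = load_text(big5_bytes, "big5")  # the original text
read_on_gbk_system = load_text(big5_bytes, "gbk")    # same bytes, wrong characters

utf16_bytes = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
read_utf16 = load_text(utf16_bytes, "gbk")           # BOM wins over the ANSI guess
```

The same four bytes come back as the original text only when the assumed
code page matches the one used to save them, which is why a Big5 file looks
wrong on a system whose ANSI code page is GBK.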

Word2000 is similar to Notepad, except that it handles many more encodings
for both input and output, and is not beholden to the system code page. Like
Notepad, it makes an assumption that non-Unicode text is *probably* stored
in the current ANSI system code page, but it also runs some auto-detection
routines on the stream of bytes in the text file to see if that assumption
is wrong. You can also force Word to open the file in any encoding you like.
Likewise upon saving, Word will save the text file in the encoding used to
open the file, again allowing you to change this if you like. The cool thing
about all this is that if the auto detection works well for you (it is
pretty good), then you can work on text files of various encodings without
having to reboot anything or remember what encoding the file is. Just open
the files and edit them, then save them as encoded text, and you should be
fine: Big5, GBK, or whatever. You also get an indication when you save of
what characters are not supported in the encoding you have chosen to save
in. I actually designed this feature in Word2000 to support exactly what you
are trying to do, so I hope it works for you.
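
The open/edit/save round trip, and the warning about characters that the
target encoding cannot hold, can be sketched in Python as follows. This is
an illustration under assumptions, not Word's implementation: the helper
name unsupported_chars is invented here, and the encoding is given
explicitly rather than auto-detected.

```python
def unsupported_chars(text: str, encoding: str) -> list:
    """Return the distinct characters in text that encoding cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing

# Open a Big5 file, edit it, and save it back in the encoding used to open it.
original_bytes = "繁體中文".encode("big5")
edited = original_bytes.decode("big5") + "!"
problems = unsupported_chars(edited, "big5")  # nothing to warn about
saved_bytes = edited.encode("big5")

# Saving to an encoding that lacks a character should be flagged first.
euro_problems = unsupported_chars("price: 5€", "latin-1")
```

Checking before encoding lets the editor tell the user which characters
would be lost instead of silently replacing them.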

Another option you might consider (I sound like a salesman) is
FrontPage2000. You can edit in HTML view in FrontPage, see a WYSIWYG preview
and everything, and then save in any encoding you like. This is a lot easier
than using a plain text editor, and if you are concerned about how "clean"
the HTML is, FrontPage does not add extra markup to your file, so it is OK to
edit in HTML (as opposed to plain text). Using HTML makes it easy to track and
maintain the encoding of your files since the encoding is tagged right in
the file. English FrontPage also supports Chinese even on English Win2000.
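
As a small illustration of the encoding being tagged in the file, the Python
sketch below reads the charset declaration out of an HTML file's bytes and
uses it to decode the rest. The helper name sniff_html_charset and the
cp1252 default are assumptions for this example; real browsers and editors
use a more thorough algorithm.

```python
import re

def sniff_html_charset(data: bytes, default: str = "cp1252") -> str:
    """Find charset=... near the top of the file (assumes an ASCII-compatible encoding)."""
    head = data[:1024].decode("ascii", errors="ignore")
    match = re.search(r'charset\s*=\s*["\']?([A-Za-z0-9_\-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else default

page = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=big5">'
    '</head><body>中文</body></html>'
)
data = page.encode("big5")          # the bytes as saved on disk

charset = sniff_html_charset(data)  # read the tag before decoding anything else
decoded = data.decode(charset)
```

Because the declaration itself is plain ASCII, it can be read before the
encoding is known, which is what a bare .txt file cannot offer.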

Let me know if you still have trouble. (You can mail me off-list if you
like.)
Chris

Sent with Office2000 SR1 wordmail

-----Original Message-----
From: Aaron Delwiche [mailto:aaron@lemon.com.hk]
Sent: Tuesday, March 21, 2000 9:03 PM
To: Chris Pratley; Unicode List
Subject: Re: Developing multilingual web sites

Hi Chris,

Thanks for pointing out that I've been using the phrase "native code"
incorrectly. I was actually referring to the ANSI code pages (e.g. Big 5,
Shift JIS, KSC) for particular regions. My problem was that I did not want
my HTML files to be in Unicode format. I want to edit files with text
editors such as Notepad or UltraEdit, and I want to save these files in .txt
format without losing the reference to the appropriate ANSI code page.

Before I figured out how to change the default system locale, I was able to
cut and paste from Outlook Express to Notepad without any difficulties. Yet,
after saving the data in text format, the emulated encoding was apparently
lost the next time I retrieved it. Changing to a regionally appropriate font
did not remedy the situation.

I guess this is one part of the big picture that I don't really understand.
If I have a file called "test.txt" that contains emulated encoding, how is
this file associated with a particular ANSI code page? Is it determined
solely by the character set that is being used by the application? Or is it
determined by the default system locale? Why would saving as text from
Notepad cause my files to (apparently) lose their emulated encoding?

In your message, you wrote:

> In any case, if your goal is to *paste* ANSI (non-Unicode) text into your
> applications, and they really do only handle ANSI plain text and not even
> HTML, then you need to change the system locale as you mentioned.

Does this mean that I definitely need to reboot each time I switch to a
different system locale?

Thanks again for your help. I realize that these are basic questions and
apologize to regular readers of this list for retreading familiar ground.

Aaron


