converting a large website to Unicode: advice sought

From: Anatoly Vorobey (mellon@pobox.com)
Date: Wed Dec 12 2001 - 15:11:21 EST


Hello,

I'm in charge of preparing the conversion of a large DB-driven
dynamic website to using UTF-8 uniformly. There are some thorny
issues that I'm having difficulty deciding how to handle, so
I'll be very grateful for any advice. A description of these
issues follows.

The website in question is http://www.livejournal.com, which is a
free service allowing one to keep an online journal conveniently
(and much more). All the software is written in Perl, is open-source,
and uses MySQL as the DB server. There is a large userbase (>300,000
active users), so any solution must leave the existing journal entries and
other kinds of text usable.

Currently the code and the site in general are completely 8-bit
clean and just as completely encoding-unaware. The vast majority of
users are Americans who use ASCII and sometimes Latin-1 characters;
there is, however, a fair number of international users as well, who use
whatever 8-bit encodings they're accustomed to and set their browsers
accordingly.

The site, however, could benefit enormously from being converted to use
Unicode; for just one example, one of the most attractive features of the
site is "friends views", where you see on one page all latest entries
entered by people from your subscription list, in reverse chronological
order; obviously if you have "friends" writing in different languages, you
cannot view their entries correctly on one page now. We'd also like to offer
our users the ability to export their journals in XML, and other features
which demand encoding knowledge.

The modifications I'm writing will make every page on the site be built and
output in UTF-8, including pages with HTML forms, so that new entries and other
information submitted by users via these forms will automatically be submitted
by their browsers in UTF-8 and stored that way in the database (are there any
gotchas to be aware of here?). I'm planning to treat UTF-8 strings as opaque
strings to store in the database and handle in the code (they almost
never need to be formatted), i.e. I'm not planning to use Perl's
native UTF-8 support or MySQL's not-yet-existent UTF-8 support. This
part seems relatively easy; the main problems I'm encountering are
with the existing data, which is in various 8-bit encodings I know nothing
about and therefore can't translate to UTF-8 automatically in the database.
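
To make that concrete, here is roughly what I have in mind for the output
side; this is only a sketch, not actual LiveJournal code, and the form
fields and action are placeholders:

  # Every page handler would emit the charset in its Content-Type header,
  # so that browsers interpret the page, and submit its forms, as UTF-8.
  print "Content-Type: text/html; charset=utf-8\r\n\r\n";

  # Forms could also carry accept-charset as an extra hint; whether this
  # is needed, or honored by all browsers, is part of my "gotchas" question.
  print qq{<form method="post" action="/update.bml" accept-charset="utf-8">\n};
  print qq{  <input type="text" name="subject">\n};
  print qq{  <textarea name="event"></textarea>\n};
  print qq{  <input type="submit" value="Update Journal">\n};
  print qq{</form>\n};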

Almost all of the text stored in the database consists of journal entries
and comments to journal entries; I plan to add a new column to the
appropriate tables which marks whether the entry or comment in question is
in UTF-8 or not. If not, the code which needs to display the text will
check the user's properties for a new "default encoding" property users will
be able to set in their profiles; if there is a default encoding, the code
will translate the text to UTF-8 on-the-fly, and if there's no default
encoding, the code will refuse to display the text (unless it's pure ASCII).
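
In code, the display path I have in mind looks roughly like this; the column
and property names are made up for illustration, and I'm assuming something
like Text::Iconv (a wrapper around iconv(3)) for the actual conversion:

  use Text::Iconv;

  # Return text ready to be embedded in a UTF-8 page, or undef if we
  # have to refuse to display it.
  sub text_for_display {
      my ($row, $user) = @_;   # $row: the entry/comment, $user: its author
      return $row->{text} if $row->{is_utf8};                    # new data, already UTF-8
      return $row->{text} unless $row->{text} =~ /[\x80-\xff]/;  # pure ASCII is fine as-is
      my $enc = $user->{default_encoding} or return undef;       # no declared encoding: refuse
      my $conv = eval { Text::Iconv->new($enc, "UTF-8") } or return undef;
      return $conv->convert($row->{text});                       # on-the-fly translation
  }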

This seems to take care well enough of most data, and leads me to my main
difficulty: how to deal with a lot of small miscellaneous text data left
in the database: user names, profile information entered by the user such
as a biography or an interest list, text in per-user to-do lists (another
feature of the site), and so on. There are a dozen or two places in
the database where small pieces of user-entered text are stored,
and they're all currently encoding-unaware 8-bit text. I can't deal with
them by adding a new column for each such kind of data to mark whether it's
UTF-8, as I'm doing for actual journal entries -- that would seriously bloat
the database and complicate the code. I can, I guess, try to translate them
to UTF-8 on-the-fly using the user's default encoding, but I still need
some way to distinguish between, e.g., user names written in native 8-bit
encodings, which need such translation, and new user names entered after the
site has been converted to UTF-8, which are already in UTF-8 and need no
conversion.
 
Should I use some kind of identifying mark inside the string (the BOM, maybe)?
Or should I perhaps check every string for UTF-8 correctness and assume that
if it's non-ASCII 8-bit text, it'll fail this test? Or is there some other
well-known solution to this problem?
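
If the correctness test is the way to go, I imagine it would be the
well-known "is this well-formed UTF-8?" check, something like the sketch
below: it accepts ASCII plus the legal multibyte sequences and rejects
overlong forms and surrogates, and I'd be relying on legacy 8-bit text
being very unlikely to pass it by accident (though short strings
presumably could):

  # Does the byte string consist entirely of well-formed UTF-8 sequences?
  sub looks_like_utf8 {
      my $s = shift;
      return $s =~ /^(?:
            [\x00-\x7f]                          # ASCII
          | [\xc2-\xdf][\x80-\xbf]               # non-overlong 2-byte
          |  \xe0[\xa0-\xbf][\x80-\xbf]          # 3-byte, excluding overlongs
          | [\xe1-\xec\xee\xef][\x80-\xbf]{2}    # straight 3-byte
          |  \xed[\x80-\x9f][\x80-\xbf]          # excluding surrogates
          |  \xf0[\x90-\xbf][\x80-\xbf]{2}       # planes 1-3
          | [\xf1-\xf3][\x80-\xbf]{3}            # planes 4-15
          |  \xf4[\x80-\x8f][\x80-\xbf]{2}       # plane 16
      )*$/x;
  }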

Moreover, I'd like to give users a way to have such miscellaneous
information translated to UTF-8 in the database, using an encoding the
user tells us the data is in (we won't do unprompted translation, only
translation at the user's request), but I can't think of a good way to do
this from the UI point of view. How can I show users their 8-bit data and
say: select the encoding this data is in, then preview whether it displays
correctly when translated to UTF-8 from that encoding, given that the HTML
pages implementing this conversion interface should themselves be in UTF-8?
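
The preview step itself seems simple enough once an encoding has been
chosen; here is a sketch, with escape_html standing in for whatever
HTML-escaping helper we'd actually use, and $chosen_encoding and
$raw_bytes being the user's selection and the stored 8-bit data:

  my $conv = eval { Text::Iconv->new($chosen_encoding, "UTF-8") };
  my $utf8 = $conv ? $conv->convert($raw_bytes) : undef;
  if (defined $utf8) {
      print "<p>Preview (interpreted as $chosen_encoding):</p>\n";
      print "<blockquote>", escape_html($utf8), "</blockquote>\n";
  } else {
      print "<p>The stored text isn't valid in $chosen_encoding.</p>\n";
  }

What I can't see is how to arrange the pages around this: showing the user
their data at all before a correct encoding has been picked is the part
that puzzles me, since the page is already declared as UTF-8.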

Finally, one other technical problem I'd like advice about is how to mark
all the pages on the site as containing UTF-8 text: in the HTTP headers, in a
<meta> tag inside the HTML HEAD section, or in both. Since the site is
completely dynamic, I can do it any way I want; but I want to do the Right
Thing (TM). I found some pages on the web strongly advising against meta tags,
e.g. because of recoding proxies along the way; but those pages are quite old
and I don't know whether this is still something to worry about. Aside from
that, someone reported to me that using just the HTTP header without the meta
tag doesn't work in some browsers, but I was unable to reproduce this.
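
Concretely, the choice is between something like these two lines, or both:

  # (a) declare the charset in the HTTP response header:
  print "Content-Type: text/html; charset=utf-8\r\n\r\n";

  # (b) declare it in the markup, inside <head>:
  print qq{<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n};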

Many thanks in advance for any advice!

Yours,
Anatoly.

-- 
Anatoly Vorobey,
my journal (in Russian): http://www.livejournal.com/users/avva/
mellon@pobox.com http://pobox.com/~mellon/
"Angels can fly because they take themselves lightly" - G.K.Chesterton


