Re: UTF-8 and Big5

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Jun 01 2006 - 13:27:02 CDT

    From: "Samuel Thibault" <samuel.thibault@ens-lyon.org>
    > Philippe Verdy, le Thu 01 Jun 2006 04:59:27 +0200, a écrit :
    >> Except that this is often tedious to add to each edited web page, and the user said that he would set it in the server-side settings (meaning that UTF-8 would be declared in the HTTP headers).
    >
    > There's a default charset setting in the Apache server configuration
    > file:
    >
    > AddDefaultCharset Big5
    > should be fine.

    No, that won't solve the problem that was raised. With only this setting, UTF-8 files will be labelled as Big5 in the HTTP headers, and this will break browsers downloading those UTF-8 resources, as they will be decoded as Big5!

    The user said that he would like to use UTF-8 as the default, while still being able to serve Big5-encoded resources.

    I think that your suggestion is already what the user has (a Big5-only server that he wants to progressively migrate to UTF-8).

    My suggestion is better because it allows keeping the resources unchanged, without needing to reencode them all (especially when there's a huge number of Big5 resources).

    Note that renaming files is not the only option. In Apache you can also define separate MIME types for resources in separate URL-directories.

    So you're not required to rename files on the server: you can also create a second virtual URL-directory for Big5 resources, and it will be even simpler, notably if the Big5 resources are located in a distinct physical directory or database on the server, and the UTF-8 resources are stored separately.

    Each virtual URL-directory in Apache can have its own set of MIME-type definitions, providing its own distinct charset (so no need to edit HTML files to change the meta tags, no need to reencode them, no need to rename files or move them into other storage directories on the server).
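
    As a minimal sketch of this per-directory approach (not a tested configuration): the URL prefix "/big5/" and the filesystem path "/var/www/legacy-big5" are made-up names for illustration, and access-control directives are left out because they differ between Apache versions. AddDefaultCharset only affects responses whose content type is text/html or text/plain.

    ```apache
    # Server-wide default: label text/html and text/plain responses as UTF-8
    AddDefaultCharset UTF-8

    # Hypothetical virtual URL-directory exposing the untouched Big5 files
    Alias /big5/ "/var/www/legacy-big5/"

    <Directory "/var/www/legacy-big5">
        # Override the default charset for everything served from this directory
        AddDefaultCharset Big5
    </Directory>
    ```

    With something like this, the Big5 files stay where they are and keep their encoding; only the charset announced in the HTTP Content-Type header differs between the two trees.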

    The file renaming option (with a distinct extension) is sometimes preferable if the resources link to each other: moving them into separate directories would break those URLs, which would then have to be fixed by editing the HTML resources containing them. Checking the URLs embedded in lots of resources is often a very lengthy and risky process.

    But changing the file extension on the server, only for those resources that have been converted, does not require changing these URLs, because the actual physical file extension on the server can be invisible in the URLs. (Note for example how Apache describes and can use a secondary file extension to honor the language preferences set in the browser: "*.zh.html" for Chinese, "*.en.html" for English, and "*.html" for a default language, used only in URLs but mapped on the server to one of the localized HTML files chosen as the default.)
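
    The extension-based variant could be sketched roughly as follows. The ".utf8" extension and the "/var/www/html" path are assumptions chosen for illustration, and the interaction between MultiViews and charset extensions should be checked against the Apache documentation for the version in use. Here Big5 stays the default during the migration; the two roles can be swapped once most files have been converted.

    ```apache
    # Files that have not been converted yet keep their names and are labelled Big5
    AddDefaultCharset Big5

    # Converted files get an extra extension (hypothetical: ".utf8"),
    # which mod_mime maps to the UTF-8 charset
    AddCharset UTF-8 .utf8

    <Directory "/var/www/html">
        # With MultiViews, a request for page.html can be answered with
        # page.html.utf8 when no file named exactly page.html exists
        Options +MultiViews
    </Directory>
    ```

    Under these assumptions, converting a page means reencoding it and renaming "page.html" to "page.html.utf8"; links that point to "page.html" would not need to be edited.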

    Philippe.


