From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat May 10 2008 - 00:58:22 CDT
It looks like this rapid growth of Unicode comes mostly from China, and from
the need to support Chinese on websites operated outside China: the GB
encodings were not chosen, and even China does not very actively support its
own GB encodings.
With the exploding number of Chinese Internet users, this looks
decisive. We'll still have a minority of ASCII-only web pages, but it's true
that even in Europe the ISO8859-based charsets are not enough, as many, many
websites are now multilingual.
The UCS has strong advantages everywhere, and the conversion from ISO8859-*
to UTF-8 is quite easy now, given the number of tools and libraries capable
of working with Unicode.
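To illustrate how little work the conversion takes, here is a minimal sketch in Python (my choice of language for illustration; the function name latin1_to_utf8 is hypothetical). It assumes the source encoding is really known; for ISO-8859-1 every byte maps to exactly one code point, so decoding cannot fail.

```python
def latin1_to_utf8(data: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode bytes from a legacy single-byte charset to UTF-8.

    Assumption: the caller knows the real source encoding.
    """
    text = data.decode(source_encoding)   # legacy bytes -> Unicode string
    return text.encode("utf-8")           # Unicode string -> UTF-8 bytes


legacy = "Café à Besançon".encode("iso-8859-1")
converted = latin1_to_utf8(legacy)
print(converted.decode("utf-8"))
```

Non-ASCII characters grow from one byte to two in the process, which is the main visible cost of the conversion for Western European text.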
The main problem that remains is with some popular tools that still don't
come bundled and preinstalled with support for the full UCS; PHP is probably
one of these tools, as its Unicode support is either poor or slow.
By contrast, .NET, Java, and JavaScript have native support for the UCS
(at least in the BMP, where surrogates don't have to be treated specially;
almost all the characters needed for modern usage are in the BMP, except
some "less frequent" characters occasionally taken from the Supplementary
Ideographic Plane).
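The difference between BMP and supplementary characters in UTF-16 can be sketched as follows, in Python for illustration (the helper name utf16_code_units is hypothetical). It only shows how many 16-bit code units each character takes; it does not reproduce the internals of .NET, Java, or JavaScript.

```python
def utf16_code_units(ch: str) -> list:
    """Return the 16-bit code units used to represent `ch` in UTF-16."""
    data = ch.encode("utf-16-be")  # big-endian, no byte order mark
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_char = "\u4e2d"       # U+4E2D, a CJK ideograph inside the BMP
sip_char = "\U00020000"   # U+20000, first code point of the Supplementary Ideographic Plane

print([hex(u) for u in utf16_code_units(bmp_char)])   # one code unit
print([hex(u) for u in utf16_code_units(sip_char)])   # two code units: a surrogate pair
```

Code that treats one 16-bit unit as one character is therefore correct for the first character and wrong for the second.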
For languages that absolutely need support for the full UCS, going to a
32-bit internal encoding is still possible, but most developers do not
even care about it: it matters only for a limited number of pages or
resources, in comparison to the tons of pages and sites that don't need any
character outside the BMP.
So the sharp increase in UTF-8 is highly correlated with the progressive
abandonment of GB-* by millions (a billion soon?) of new Chinese Internet users.
Is the support of GB18030 still mandatory for products sold in China, if the
support of Unicode offers the same coverage benefits with the addition of
more interoperability?
If only this page on the Google blog could help convince European
administrations or organizations to stop making pages encoded with ISO8859-*
(and often not labelled at all!). Many French administrations and commercial
websites are still using ISO-8859-1 without even labelling it explicitly, so
their pages don't display properly: the heuristic algorithms used by
browsers to "guess" the encoding frequently "detect" a JIS encoding, so that
runs of characters containing an accented letter appear replaced by
ideographs.
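The effect of a wrong guess can be sketched in Python. This is only an illustration under assumptions: Shift_JIS stands in for "a JIS encoding", the function decode_page and its parameters are hypothetical, and real browser heuristics are far more elaborate than a fixed fallback.

```python
def decode_page(body: bytes, declared_charset=None, guessed_charset="shift_jis"):
    """Decode with the declared charset if there is one, otherwise with a guess.

    Assumption: `guessed_charset` simulates a browser heuristic that guessed wrong.
    """
    charset = declared_charset or guessed_charset
    # errors="replace" keeps going on invalid sequences, as a browser would.
    return body.decode(charset, errors="replace")


page = "réseau ferré".encode("iso-8859-1")   # unlabelled ISO-8859-1 bytes

print(decode_page(page, declared_charset="iso-8859-1"))  # text comes back intact
print(decode_page(page))                                 # accented runs are garbled
```

In Shift_JIS the byte used for "é" in ISO-8859-1 is the first half of a two-byte character, so it swallows the letter that follows it. An explicit label removes the guesswork entirely.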
Let's convince everyone to go with UTF-8 on the web; everything else will
follow the UCS path, including documents and databases (even if they are
handled internally with UTF-16, possibly leaving some bugs for incorrectly
handled surrogates; these bugs are simple to solve)...
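As one example of such a surrogate bug and its fix, here is a sketch in Python of truncating UTF-16 data without cutting a surrogate pair in half. All function names are hypothetical, and the code works on explicit lists of 16-bit code units to mimic a UTF-16 based system.

```python
def to_utf16_units(text):
    """Split a string into its UTF-16 code units."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def from_utf16_units(units):
    """Rebuild a string from well-formed UTF-16 code units."""
    return b"".join(u.to_bytes(2, "big") for u in units).decode("utf-16-be")


def safe_truncate(units, max_units):
    """Keep at most max_units code units without splitting a surrogate pair."""
    cut = min(max_units, len(units))
    # If the last kept unit is a high surrogate (D800..DBFF), its low
    # surrogate would be cut off, so drop the high surrogate as well.
    if 0 < cut < len(units) and 0xD800 <= units[cut - 1] <= 0xDBFF:
        cut -= 1
    return units[:cut]


units = to_utf16_units("a\U00020000b")   # 'a', high surrogate, low surrogate, 'b'
print(from_utf16_units(safe_truncate(units, 2)))   # the pair does not fit, so it is dropped whole
print(from_utf16_units(safe_truncate(units, 3)))   # the pair fits and is kept whole
```

A naive units[:2] would leave a lone high surrogate at the end, which is exactly the kind of ill-formed data that later fails to convert to UTF-8.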
_____
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of Mark Davis
Sent: Monday, May 5, 2008 21:25
To: Unicode
Subject: FYI: Google posting about U5.1
FYI, we have a posting on Google's official blog
(http://googleblog.blogspot.com/) about Unicode 5.1 and the growth of
Unicode that we're seeing.
-- Mark
This archive was generated by hypermail 2.1.5 : Sat May 10 2008 - 09:05:49 CDT