From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Thu Jan 20 2005 - 13:09:18 CST
Hans Aberg wrote:
>> Something like 99% of text data uses only BMP characters for which
>> UTF-16 is pretty efficient.
>
> One can achieve better efficiency, if needed, using data compression
> methods. So there is no reason to use UTF-16 for such reasons.
OTOH, a variable-length encoding such as UTF-8 is a nightmare when it comes to
efficiency for a number of applications. So what? The answer looks
different depending on the needs: different needs, different answers.
UTF-16 is a compromise, as is UTF-8, as is UCS in a lot of ways.
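A minimal sketch of the efficiency point above (the function and sample string are my own illustration, not from the original post): with a fixed-width encoding, the byte offset of the n-th code point is a multiplication, but with UTF-8 it has to be found by scanning, since each code point occupies 1 to 4 bytes.

```python
def utf8_offset_of(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point in UTF-8 data (linear scan)."""
    offset = 0
    for _ in range(n):
        first = data[offset]
        if first < 0x80:
            offset += 1          # 1-byte sequence (ASCII)
        elif first < 0xE0:
            offset += 2          # 2-byte sequence
        elif first < 0xF0:
            offset += 3          # 3-byte sequence
        else:
            offset += 4          # 4-byte sequence
    return offset

text = "a\u00e9\u0915b"          # 'a', 'e-acute', Devanagari KA, 'b'
data = text.encode("utf-8")
# The fourth code point ('b') starts after 1 + 2 + 3 = 6 bytes:
assert utf8_offset_of(data, 3) == 6
```

For UCS-4 the same lookup would just be `4 * n`; that is the kind of trade-off the "compromise" remark is about.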
> According to my memory, [...].
> Linux also uses UTF-8.
Well, among other character sets. The main distributions are heading this way,
yes, probably. Are all Linux boxes running UTF-8? Certainly not. Heaven
forbid.
> So in that domain, I think there is little use of UTF-16.
So what?
> The main problem is that in some domains, UTF-16 is already at use.
> So there, one would need time to change.
This assumes that UTF-16 is 'wrong', doesn't it? And furthermore, that UNIX
(whatever you are hiding behind this word) is 'right'.
> In the case of the C++
> standard, one knows it takes at least a few years for a new
> version to come forth. I do not remember the exact wording for a
> feature that is still in the standard, but to be phased out in a
> later version.
'Deprecated'. For example, ANSI C (1989) deprecated the use of K&R-style
function definitions. But they are still part of the current standard, so they will be
*required* to be supported by all conforming compilers until at least 2009.
In other words, do not hold your breath.
The life cycle of ISO standards has little in common with high-tech evolution.
UTF-16 (and UTF-8, and the BOM) are part of an ISO standard. So... do not hold
your breath!
> My guess is that UTF-8 will be widespread file and
> external stream format, because it is more compact,
AH AH AH!
I happen to work with Indic content. Taking the natural/legacy encoding
(ISCII) as 1, UTF-16 is roughly 2, and UTF-8 is more than 2.7. That is, it
becomes bigger than the (Unicode) Latin transcriptions using a lot of
accents!
East Asians users have similar concerns.
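The ratios above are easy to check (the sample word below is my own illustration, not from the original post): Devanagari code points lie in U+0900-U+097F, so each one takes 3 bytes in UTF-8 and 2 bytes in UTF-16, against 1 byte per character in ISCII. Pure Devanagari text therefore gives roughly 3 : 2 : 1; ASCII spaces and punctuation mixed in pull the UTF-8 ratio down toward the ~2.7 quoted.

```python
text = "\u0928\u092e\u0938\u094d\u0924\u0947"    # "namaste" in Devanagari
utf8 = text.encode("utf-8")
utf16 = text.encode("utf-16-le")                  # payload only, no BOM

assert len(utf8) == 3 * len(text)    # 3 bytes per Devanagari code point
assert len(utf16) == 2 * len(text)   # 2 bytes per BMP code point
```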
Granted, UTF-8 is more compact in Europe (where I am perfectly happy with
Latin-9, BTW), or for applications such as... GCC, a compiler.
So what?
Antoine
This archive was generated by hypermail 2.1.5 : Thu Jan 20 2005 - 13:14:54 CST