At 11:02 AM 1/18/02 -0800, Barry Caplan wrote:
>I've always been under the impression that one of the original goals of
>the Unicode effort was to do away with the sort of multi-width encodings we
>are all too familiar with (EUC, JIS, SJIS, etc.). This was to be
>accomplished by using a fixed width encoding. In my mind, everything other
>than that in order to increase space (but not necessarily to save
>bandwidth) is a kluge, and a compromise, because it means code still has
>to be aware of the details of the encoding scheme.
That was one of the original goals - however, it was to be achieved by
'composing' a lot more characters than are composed (or even composable)
now. Such an ideal Unicode would have had no Arabic Presentation forms
(saves about 1K codes), no Hangul syllables (saves 11K codes), no Han
variants (large, but harder to estimate savings), no polytonic Greek and
many fewer Latin pre-composed characters (ideally, it should have had none
in which case the savings would have been another K or so).
Under such a system, all the scripts coded today would have fit in the BMP,
with lots of room to spare. However, the majority of all users would have
needed systems that could handle variable-length character code sequences
for many common tasks.
By using a variable-length encoding like UTF-16, which front-loads
practically all the commonly used characters into the single-unit case,
whose sequences, unlike combining character sequences, come in only two
possible lengths (1 or 2 units), whose length can be determined from the
first unit of the sequence, and whose code positions don't alias between
single-unit code points and parts of double-unit sequences -- in other
words, by doing all this, the fact that the majority of software *cannot*
handle variable-length sequences has become much less important than it
would have been with "idealCode".
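To make those properties concrete, here is a small Python sketch (mine, not
part of the original message; it assumes the input is already split into
16-bit code units): the length of each sequence is determined by its first
unit alone, and the surrogate ranges never overlap single-unit characters.

```python
def utf16_len(unit: int) -> int:
    """Length in code units of the sequence starting at `unit`.

    High surrogates (0xD800-0xDBFF) lead a two-unit sequence; every
    other unit stands alone. Low surrogates (0xDC00-0xDFFF) never
    begin a well-formed sequence, so no unit is ambiguous.
    """
    return 2 if 0xD800 <= unit <= 0xDBFF else 1


def decode_utf16(units):
    """Decode a list of UTF-16 code units into code points."""
    i, out = 0, []
    while i < len(units):
        u = units[i]
        if utf16_len(u) == 2:
            lo = units[i + 1]  # trail surrogate
            out.append(0x10000 + ((u - 0xD800) << 10) + (lo - 0xDC00))
            i += 2
        else:
            out.append(u)
            i += 1
    return out
```

For example, U+1D11E (a supplementary-plane character) decodes from the
surrogate pair 0xD834, 0xDD1E, while BMP characters pass through as single
units.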
Conversely, since some scripts require the use of character sequences even
now, the gains of going to UTF-32 are limited. Yes, the lower-level
infrastructure is a bit easier than with UTF-16, but supporting complex
scripts, or advanced usage of not-so-complex scripts, isn't.
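A quick Python illustration of that point (my example, not the original
author's): a decomposed accented letter renders as one glyph but remains a
two-code-point sequence, so even UTF-32 hands the application a sequence to
deal with.

```python
# 'e' followed by U+0301 COMBINING ACUTE ACCENT displays as a single
# glyph, yet stays two code points in every encoding form -- UTF-32
# included. One unit per code point is not one unit per "character".
s = "e\u0301"
print(len(s))                           # 2 code points
print(len(s.encode("utf-32-le")) // 4)  # 2 UTF-32 code units
```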
Incidentally, using twice as much space for your character codes seems
unimportant, when you look at network bandwidth and disk space, media that
are shared with gigantic amounts of non-character data. Things begin to
look a lot different when you are trying to process large amounts of
character data and suddenly realize that your high-speed cache can fit
nearly twice as many characters when you use UTF-16 as when you use UTF-32.
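For a rough sense of the numbers (again a sketch of mine, not part of the
original message): text made of BMP characters costs 2 bytes per character
in UTF-16 and 4 in UTF-32, so the same cache holds twice as many characters.

```python
# BMP-only sample text: every character is a single UTF-16 code unit.
text = "Network bandwidth and disk space " * 1000
utf16 = text.encode("utf-16-le")
utf32 = text.encode("utf-32-le")
print(len(utf16), len(utf32))  # UTF-32 is exactly twice the size here
```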
A./
This archive was generated by hypermail 2.1.2 : Fri Jan 18 2002 - 16:59:15 EST