Re: UTF8 vs. Unicode (UTF16) in code

From: Allan Chau (achau@rsasecurity.com)
Date: Fri Mar 09 2001 - 14:46:00 EST


Yves Arrouye wrote:

> > > On 03/08/2001 07:40:25 PM "Ayers, Mike" wrote:
> > >
> > > > If you really want to finish the job, there's always
> > > UTF-32, which
> > > > >should do rather nicely until we meet the space aliens with the
> > > >4,293,853,186 character alphabet!
> > >
> > > Um... no. The 1,113,023 character alphabet (one more than the
> > > encodable
> > > scalar values in the codespace supported by UTF-8 / 16 / 32).
> > >
> >
> > Um... no. The UTF-32 CES can handle much more than the current
> > space of the Unicode CCS. As far as I can tell, it's good to
> > go until we
> > need more than 32 bits to represent the ACR. I'm actually
> > surprised that
> > this comment was so misunderstood. Ah, well...
>
> Since the U in UTF stands for Unicode, UTF-32 cannot represent more than
> what Unicode encodes, which is 1+ million code points. Otherwise, you're
> talking about UCS-4. But I
> thought that one of the latest revs of ISO 10646 explicitly specified that
> UCS-4 will never encode more than what Unicode can encode, and thus
> definitely not these 4 billion characters you're alluding to.
>
> YA

Isn't the U in UTF for UCS (Universal Character Set)? It was my understanding
that except for possibly some header information & endianness, UTF-32 is the
same as UCS-4 valuewise. The difference is that UTF-32 is an encoding, while UCS-4
is a character representation scheme. Is this correct? If so, I don't think
it's absolutely correct to say what range UTF-32 can represent.
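
To put numbers on the range question, here is a minimal Python sketch. It assumes the present-day definition of the Unicode codespace (U+0000 through U+10FFFF) and uses Python's built-in "utf-32-le" codec as a stand-in for "what UTF-32 accepts"; the helper name decodes_as_utf32 is made up for illustration. It shows that a 32-bit unit can hold far more values than the codespace defines, and that this particular decoder refuses values past the top of it.

```python
import struct

MAX_CODE_POINT = 0x10FFFF          # top of the Unicode codespace (assumed definition)
SURROGATES = 0xE000 - 0xD800       # 2,048 code points reserved for UTF-16 pairs

code_points = MAX_CODE_POINT + 1           # size of the codespace
scalar_values = code_points - SURROGATES   # code points that are not surrogates
raw_32bit_values = 2 ** 32                 # what a bare 32-bit unit could hold


def decodes_as_utf32(value):
    """Hypothetical helper: True if Python's UTF-32 decoder accepts this 32-bit value."""
    try:
        struct.pack("<I", value).decode("utf-32-le")
        return True
    except UnicodeDecodeError:
        return False


print(code_points, scalar_values, raw_32bit_values)
print(decodes_as_utf32(0x10FFFF))   # last code point in the codespace
print(decodes_as_utf32(0x110000))   # first value past it
```

So the 32-bit container and the range a UTF-32 implementation accepts are two different things, at least for this codec.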

There are so many Uxx acronyms around that it's very confusing for a beginner.
Correct me if I'm wrong - UCS-2, UCS-4, & Unicode are for talking about
character representations. The UTF-x names mean encodings.
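
One way to see the split between a character's number and its encodings is to take a single code point and serialize it three ways. This is only a sketch using Python's standard codecs; the choice of U+10400 is arbitrary (any character above U+FFFF shows the UTF-16 surrogate pair), and the helper name encodings_of is made up for illustration.

```python
def encodings_of(ch):
    """Hypothetical helper: map encoding name -> hex bytes for one character."""
    names = ("utf-8", "utf-16-be", "utf-32-be")
    return {name: ch.encode(name).hex() for name in names}


ch = "\U00010400"            # one character, one code point
print(hex(ord(ch)))          # the code point itself, independent of any encoding

for name, hex_bytes in encodings_of(ch).items():
    print(name, hex_bytes)   # same code point, different byte sequences
```

The code point stays 0x10400 throughout; only the byte sequences differ (4 bytes in UTF-8, a surrogate pair in UTF-16, one 32-bit unit in UTF-32).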

BTW, thanks to all who have contributed comments to this thread. I'm finding
the feedback very helpful.

-allan




