Re: UTF8 vs. Unicode (UTF16) in code

From: Yves Arrouye (yves@realnames.com)
Date: Fri Mar 09 2001 - 23:53:25 EST


> > Since the U in UTF stands for Unicode, UTF-32 cannot
> represent more than
> > what Unicode encodes, which is 1+ million code points.
> Otherwise, you're
> > talking about UCS-4. But I
> > thought that one of the latest revs of ISO 10646
> explicitly specified that
> > UCS-4 will never encode more than what Unicode can encode, and thus
> > definitely not these 4 billion characters you're alluding to.
>
> As far as I know the U in UTF stands for Universal, not Unicode.
> ISO 10646 can encode characters beyond UTF-16, and should retain
> this capability. There is a proposal to restrict UTF-8 to
> only encompass the same values as UTF-16, but UCS-4 still encodes
> the 31-bit code space.

Page 12 of the Unicode Standard 3.0 says:

    "UTF-8 (Unicode Transformation Format-8) [...]"

which is where my understanding of what the U stands for comes from.
But I may be wrong.

Thanks for clarifying my confusion: the proposal is to restrict UTF-8,
not UCS-4. So if ISO never said that they will not encode things beyond
what Unicode can encode, and if UTF-8 is restricted, they may someday
need a UCSTF-8 (or whatever) to encode all of UCS-4, right? And the only
difference between UTF-8 and that UCSTF-8 might be the semantics of what
can be encoded and what is legal after decoding.
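To make the distinction concrete: the original ISO 10646 UTF-8 algorithm
covers the full 31-bit UCS-4 space with sequences of up to 6 bytes, while
a Unicode-restricted UTF-8 simply refuses anything above U+10FFFF (i.e.
never emits 5- or 6-byte sequences). A minimal sketch in Python of the
unrestricted scheme (the function name is my own, not from any spec):

```python
def utf8_encode_31bit(cp: int) -> bytes:
    """Encode a code point using the original ISO 10646 UTF-8 scheme,
    which covers the full 31-bit UCS-4 space with up to 6 bytes.
    A Unicode-restricted encoder would reject cp > 0x10FFFF instead."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])  # ASCII: single byte, high bit clear
    # (sequence length, exclusive upper limit, lead-byte marker)
    for nbytes, limit, lead in ((2, 0x800, 0xC0), (3, 0x10000, 0xE0),
                                (4, 0x200000, 0xF0), (5, 0x4000000, 0xF8),
                                (6, 0x80000000, 0xFC)):
        if cp < limit:
            out = bytearray(nbytes)
            # Fill continuation bytes (10xxxxxx), 6 bits each, low to high.
            for i in range(nbytes - 1, 0, -1):
                out[i] = 0x80 | (cp & 0x3F)
                cp >>= 6
            out[0] = lead | cp  # remaining high bits go in the lead byte
            return bytes(out)
    raise ValueError("beyond the 31-bit UCS-4 space")
```

Up to U+10FFFF this agrees byte-for-byte with ordinary UTF-8, but it will
also happily produce the 6-byte sequence FD BF BF BF BF BF for the 31-bit
maximum 0x7FFFFFFF, which is exactly what a restricted UTF-8 forbids.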

YA



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:20 EDT