Re: UniCode website is confusing

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed May 29 2002 - 16:27:39 EDT


Theodore Smith said:

> I find the unicode website very confusing.

Not a very useful observation. If you have a specific suggestion for
improvement, make it to the webmaster at:

http://www.unicode.org/unicode/reporting.html

>
> Is it that to get any useful non confusing information,
> we have to buy your huge book?

I guess you may not have spent enough time on the site to notice
that the huge book is online:

http://www.unicode.org/unicode/uni2book/u2.html

Look in Chapter 3 for the UTF-8 definition.

> What with all the addendums, addendums
> to addendums and addendums to addendums to addendums, crossings out,
> etc etc. It becomes impossible to work out what you are really saying.
>
> Why not just make one technical standard, like w3.org do for XML?

Because we cannot publish a new version of the entire book
every year. The editors are well aware of the fact that amended
text to the standard in the online publications for Unicode 3.1
and Unicode 3.2 makes some sections difficult to follow for the
most recent edition. That is why we *do* republish the entire text
of the standard at intervals, for the major editions, when we can.

>
> My problem is, I'm trying to work out how UTF8, and UTF16 are
> encoded.

This is all available in the online version.
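
As a rough illustration only (this is not text from the standard, and the
function names are made up for the example), the bit layouts of UTF-8 and
UTF-16 can be sketched in Python as follows. The sketch assumes the input is
a single code point in the range 0..10FFFF and rejects surrogate code points.

```python
from typing import List


def _check_scalar(cp: int) -> None:
    """Reject values that are not encodable code points (hypothetical helper)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range: %#x" % cp)
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code point: %#x" % cp)


def utf8_encode(cp: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 bytes."""
    _check_scalar(cp)
    if cp < 0x80:       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def utf16_encode(cp: int) -> List[int]:
    """Encode one code point as one or two 16-bit code units."""
    _check_scalar(cp)
    if cp < 0x10000:
        return [cp]                      # a single code unit
    v = cp - 0x10000                     # 20 bits left to distribute
    return [0xD800 | (v >> 10),          # high surrogate: top 10 bits
            0xDC00 | (v & 0x3FF)]        # low surrogate: bottom 10 bits
```

For example, utf16_encode(0x10000) yields the surrogate pair D800 DC00, and
utf8_encode(0x20AC) yields the three bytes E2 82 AC.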

> I heard that UTF32 can have surrogate pairs!

You have heard incorrectly. See:

http://www.unicode.org/unicode/reports/tr19/

UAX #19 "UTF-32":

"An irregular UTF-32 code unit sequence is an eight-byte sequence where
the first four bytes correspond to a high surrogate, and the next four
bytes correspond to a low surrogate. As a consequence of C12, these
irregular UTF-32 sequences shall not be generated by a conformant
process."

> This is pretty
> crazy I think because UTF can only encode 10FFFF (a nice number, comes
> to 1114111 a nicer number) values. While 4 bytes can hold over 4
> billion values. So what's the use of surrogates with UTF32?

None.
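
To make that concrete, here is a minimal Python sketch (an illustration, not
text from UAX #19) of a UTF-32 encoder: each code point becomes exactly one
32-bit code unit, so a supplementary character never needs a pair. The
function name and the choice of big-endian byte order are assumptions made
for the example.

```python
import struct


def utf32_encode_be(code_points):
    """Encode an iterable of code points as big-endian UTF-32 bytes.

    Hypothetical helper: surrogate code points are rejected, so the
    irregular eight-byte sequences described above are never generated.
    """
    out = bytearray()
    for cp in code_points:
        if not 0 <= cp <= 0x10FFFF:
            raise ValueError("code point out of range: %#x" % cp)
        if 0xD800 <= cp <= 0xDFFF:
            raise ValueError("surrogate code point: %#x" % cp)
        out += struct.pack(">I", cp)   # one 4-byte code unit per code point
    return bytes(out)


# The arithmetic from the question: 0x10FFFF is the highest code point,
# so there are 0x110000 code points in total, far fewer than 2**32.
HIGHEST_CODE_POINT = 0x10FFFF              # 1114111
TOTAL_CODE_POINTS = HIGHEST_CODE_POINT + 1 # 1114112
```

U+10000, which takes two code units in UTF-16, is just the four bytes
00 01 00 00 here.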

>
> I can't find this information.

Now you have it.

> I have found addendums to addendums
> that might or might not be the final answer, or the complete answer,
> but I can't tell because it's not all compiled into one standard
> definition.

Unicode 4.0 will all be compiled into one huge book again. But
then I wonder if you actually want to "buy [our] huge book" in
any case. ;-)

--Ken

BTW, it is "Unicode" -- not "UniCode".

>
>
> --
> Theodore H. Smith - Macintosh Consultant / Contractor.
> My website: <www.elfdata.com/>


