On 07/21/2000 04:42:05 AM <rosenne@qsm.co.il> wrote:
>Unicode is the code, which is based on 16 bit chunks of ether or whatever,
>and UTF-8 is a biased transformation format...
That's too simple to capture the current reality, as others have been
indicating. The full story is available in UTR #17, and *everybody* on this
list ought to read and digest it - of all the UTRs, it's probably the one
that's most useful for the broadest audience to read.
http://www.unicode.org/unicode/reports/tr17/
In a nutshell, Unicode started life as a fixed-width 16-bit character set,
but the need to extend it and to merge with ISO 10646 made life more
complicated. At this point, there is no real option but to say that Unicode
is a 21-bit (or 20.1-bit)* character set combined with various encoding
forms and schemes based on 8-, 16- or 32-bit code units.
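To make that distinction concrete, here is a rough sketch (Python, purely
illustrative, not normative) of one encoded character, U+10330 GOTHIC
LETTER AHSA, serialized in the 8-, 16- and 32-bit encoding schemes:

    # One encoded character, three byte serializations (illustration only).
    ch = "\U00010330"                       # U+10330, a supplementary-plane character

    print(ch.encode("utf-8").hex(" "))      # f0 90 8c b0  (four 8-bit code units)
    print(ch.encode("utf-16-be").hex(" "))  # d8 00 df 30  (two 16-bit code units, a surrogate pair)
    print(ch.encode("utf-32-be").hex(" "))  # 00 01 03 30  (one 32-bit code unit)

Same abstract character, same code point, three different sequences of code
units on the wire.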
* The codespace for the encoded character set takes a little explanation.
The simplification is that it's 0 - 10FFFF (which takes 21 bits to
represent but doesn't go as far as 21 bits would allow - that would be
1FFFFF). Strictly, you also have to remove from this the surrogate range
D800 - DFFF and the 34 values whose last four hex digits are FFFE or FFFF
(i.e. U+nFFFE and U+nFFFF in each of the 17 planes).
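If you want that rule in executable form, here's a rough sketch (again
Python, purely illustrative; the function name is mine):

    def in_unicode_codespace(cp):
        """The rule described above: 0..10FFFF, minus the surrogates
        D800..DFFF and the 34 values ending in FFFE or FFFF."""
        if not 0x0000 <= cp <= 0x10FFFF:
            return False        # outside the 0..10FFFF codespace
        if 0xD800 <= cp <= 0xDFFF:
            return False        # surrogate code points
        if (cp & 0xFFFE) == 0xFFFE:
            return False        # nFFFE / nFFFF in each plane (17 x 2 = 34)
        return True

    # e.g. in_unicode_codespace(0x0041) -> True, in_unicode_codespace(0xFFFE) -> False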
- Peter
---------------------------------------------------------------------------
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>