From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Feb 27 2003 - 14:38:36 EST
Stefan Persson suggested:
> >Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
> >sequences. There were two types:
> >
> >   a. 0xC0 0x80 for U+0000 (instead of 0x00)
> >   b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80)
> >  
> >
> Ah, but encoding NULL as a surrogate character and then encoding those 
> two surrogates as three bytes each, for a total of 6 bytes per character, 
> would also be technically possible (though not legal), right?
I'm not sure what you are talking about, here.
First of all, there is no such thing as a "surrogate character",
under the terminology currently adopted by the standard.
There are surrogate code points: U+D800..U+DFFF. Those can
*never* be assigned to any abstract character.
Then there are surrogate code units: 0xD800..0xDFFF. Those are
used in pairs in the UTF-16 encoding form to represent a single
supplementary character (one encoded outside the BMP, in U+10000..U+10FFFF).
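
To make the pairing concrete, here is a minimal Python sketch of how a
supplementary code point maps to two surrogate code units in UTF-16. The
function name to_surrogate_pair is made up for this illustration and is not
taken from the standard. The final lines use Python's built-in codecs to show
that a strict UTF-8 decoder rejects the six-byte form quoted above.

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into two UTF-16 code units.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits: high surrogate code unit
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits: low surrogate code unit
    return high, low


high, low = to_surrogate_pair(0x10000)
print(hex(high), hex(low))             # 0xd800 0xdc00

# The well-formed encodings of U+10000:
utf16 = "\U00010000".encode("utf-16-be")   # D8 00 DC 00
utf8 = "\U00010000".encode("utf-8")        # F0 90 80 80

# Encoding each surrogate separately as three bytes gives the six-byte form.
# Python only produces it when asked to bypass its checks ("surrogatepass").
six_bytes = "\ud800\udc00".encode("utf-8", "surrogatepass")

# A strict UTF-8 decoder refuses that sequence.
try:
    six_bytes.decode("utf-8")
    six_byte_form_accepted = True
except UnicodeDecodeError:
    six_byte_form_accepted = False
```
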
NULL is U+0000. 
  Its representation in UTF-32 is <0x00000000>.
  Its representation in UTF-16 is <0x0000>.
  Its representation in UTF-8  is <0x00>.
  
Period. End of story. Anything else is nonconformant to the standard.
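
As a quick check of the three representations above, here is a small Python
sketch using the built-in codecs. The helper name is_strict_utf8 is
hypothetical. The big-endian codec variants are used only to avoid a byte
order mark in the output.

```python
nul = "\u0000"

utf32 = nul.encode("utf-32-be")   # one 32-bit code unit: 00 00 00 00
utf16 = nul.encode("utf-16-be")   # one 16-bit code unit: 00 00
utf8 = nul.encode("utf-8")        # one 8-bit code unit:  00


def is_strict_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes.

    Hypothetical helper for illustration only.
    """
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(is_strict_utf8(b"\x00"))      # True
print(is_strict_utf8(b"\xc0\x80"))  # False: non-shortest form of U+0000
```
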
--Ken