RE: Surrogate pairs and UTF-8

From: Kenneth Whistler (kenw@sybase.com)
Date: Mon Jun 26 2006 - 16:40:33 CDT

Next message: Andrew Cunningham: "Re: References on Perl & Unicode"

Previous message: Richard Wordingham: "Re: Finnegans Wake, was Re: comment on L2/06-215"
Maybe in reply to: Pavils Jurjans: "Surrogate pairs and UTF-8"
Next in thread: Mark Davis: "Re: Surrogate pairs and UTF-8"
Reply: Mark Davis: "Re: Surrogate pairs and UTF-8"

> > One essential detail being that UTF-16 surrogates are excluded
> > from the valid Unicode codepoints, while UTF-8 "surrogates"
> > have binary values that are also valid Unicode codepoints.
>
> I almost added that but held back because it seemed to me that that's
> not really a difference in these encoding forms but rather is just a
> fact about the coded character set. But then, IIRC UTF-16 is not able to
> represent code points U+D800..U+DFFF while UTF-8 is.

Nope. Neither can.

0xD800 is ill-formed in UTF-16.

0xED 0xA0 0x80 is ill-formed in UTF-8.

For that matter, 0x0000D800 is ill-formed in UTF-32.

Look it up.

Now, anybody could put those values into a Unicode string
and claim to be representing U+D800, but as a famous
former president said, they "would be wrong." *hehe*
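
For anyone who would rather check than look it up, here is a minimal
sketch in Python 3 (3.4 or later, where the UTF-16 and UTF-32 codecs
also reject surrogates). The helper names can_encode and can_decode are
made up for this illustration; only the built-in str.encode and
bytes.decode are real API. It assumes the strict (default) error handler
and says nothing about what other libraries do.

```python
# Hypothetical helpers for illustration; not part of any library.

def can_encode(text, encoding):
    """Return True if text encodes under the strict error handler."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


def can_decode(data, encoding):
    """Return True if data decodes under the strict error handler."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# A Python str can hold a lone surrogate code point, but no UTF will take it.
lone = "\ud800"
encode_results = {enc: can_encode(lone, enc)
                  for enc in ("utf-8", "utf-16-be", "utf-32-be")}

# The three sequences from the message, as big-endian bytes.
# The UTF-16 case appends U+0041 so the surrogate is clearly unpaired
# rather than merely truncated input.
decode_results = {
    "utf-8": can_decode(b"\xed\xa0\x80", "utf-8"),
    "utf-16-be": can_decode(b"\xd8\x00\x00\x41", "utf-16-be"),
    "utf-32-be": can_decode(b"\x00\x00\xd8\x00", "utf-32-be"),
}

# Contrast: 0xD800 is fine in UTF-16 as half of a surrogate pair, which
# represents a supplementary code point (here U+10000), not U+D800.
pair = "\U00010000".encode("utf-16-be")

print(encode_results)
print(decode_results)
print(pair.hex())
```

Every entry in both dictionaries should come out False, while the
surrogate pair for U+10000 encodes without complaint.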

--Ken

This archive was generated by hypermail 2.1.5 : Mon Jun 26 2006 - 16:48:23 CDT