Re: Surrogate pairs and UTF-8

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Jun 26 2006 - 19:32:50 CDT

Next message: Alexandros Diamantidis: "Greek reversed iota and upsilon with tilde"

Previous message: Andrew Cunningham: "Re: References on Perl & Unicode"
In reply to: Kenneth Whistler: "RE: Surrogate pairs and UTF-8"

I agree with Ken, with one clarification. It is not possible to represent
D800 in any well-formed UTF. D800 may well occur in a Unicode string (see
definitions D29a-d

On 6/26/06, Kenneth Whistler <kenw@sybase.com> wrote:
>
> > > One essential detail being that UTF-16 surrogates are excluded
> > > from the valid Unicode codepoints, while UTF-8 "surrogates"
> > > have binary values that are also valid Unicode codepoints.
> >
> > I almost added that but held back because it seemed to me that that's
> > not really a difference in these encoding forms but rather is just a
> > fact about the coded character set. But then, IIRC UTF-16 is not able to
> > represent code points U+D800..U+DFFF while UTF-8 is.
>
> Nope. Neither can.
>
> 0xD800 is ill-formed in UTF-16.
>
> 0xED 0xA0 0x80 is ill-formed in UTF-8.
>
> For that matter, 0x0000D800 is ill-formed in UTF-32.
>
> Look it up.
>
> Now, anybody could put those values into a Unicode string
> and claim to be representing U+D800, but as a famous
> former president said, they "would be wrong." *hehe*
>
> --Ken
>
>
>
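
For anyone who wants to check Ken's three examples without opening the standard, here is a minimal sketch in Python. It assumes CPython 3.4 or later, where the strict UTF-16 and UTF-32 codecs reject surrogate code points. The helper names is_well_formed and can_encode are made up for this illustration, and big-endian byte order is an arbitrary choice so that no BOM is needed.

```python
def is_well_formed(data: bytes, encoding: str) -> bool:
    """Return True if the strict decoder accepts `data` in `encoding`."""
    try:
        data.decode(encoding)  # error handling is "strict" by default
        return True
    except UnicodeDecodeError:
        return False


def can_encode(text: str, encoding: str) -> bool:
    """Return True if the strict encoder accepts `text` in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# The three sequences from Ken's message (big-endian, no BOM).
lone_utf16 = bytes([0xD8, 0x00])              # code unit 0xD800 alone
lone_utf8 = bytes([0xED, 0xA0, 0x80])         # 0xED 0xA0 0x80
lone_utf32 = bytes([0x00, 0x00, 0xD8, 0x00])  # 0x0000D800

# For contrast: a proper surrogate pair, which encodes U+10000 in UTF-16.
paired_utf16 = bytes([0xD8, 0x00, 0xDC, 0x00])

results = {
    "utf-16 lone": is_well_formed(lone_utf16, "utf-16-be"),
    "utf-8 lone": is_well_formed(lone_utf8, "utf-8"),
    "utf-32 lone": is_well_formed(lone_utf32, "utf-32-be"),
    "utf-16 pair": is_well_formed(paired_utf16, "utf-16-be"),
}

# A Python str is a sequence of code points, so it can hold a lone D800,
# but none of the strict encoders will turn it into a UTF.
lone_string = "\ud800"
encodable = {
    enc: can_encode(lone_string, enc)
    for enc in ("utf-8", "utf-16-be", "utf-32-be")
}

if __name__ == "__main__":
    for name, ok in results.items():
        print(name, "well-formed" if ok else "ill-formed")
    for enc, ok in encodable.items():
        print(enc, "can encode lone D800" if ok else "cannot encode lone D800")
```

The string half of the sketch is the clarification above in miniature: the lone D800 sits happily in an in-memory string, yet it has no well-formed representation in UTF-8, UTF-16, or UTF-32. Other languages and older Python versions may be more lenient, so treat this as a check of one implementation, not of the standard itself.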

