Re: Surrogate pairs and UTF-8

From: Mark Davis (mark.davis@icu-project.org)
Date: Mon Jun 26 2006 - 19:32:50 CDT

    I agree with Ken, with one clarification. It is not possible to represent
    D800 in any well-formed UTF. D800 may well occur in a Unicode string (see
    definitions D29a-d

    On 6/26/06, Kenneth Whistler <kenw@sybase.com> wrote:
    >
    >
    >
    > > > One essential detail being that UTF-16 surrogates are excluded
    > > > from the valid Unicode codepoints, while UTF-8 "surrogates"
    > > > have binary values that are also valid Unicode codepoints.
    > >
    > > I almost added that but held back because it seemed to me that that's
    > > not really a difference in these encoding forms but rather is just a
    > > fact about the coded character set. But then, IIRC UTF-16 is not able to
    > > represent code points U+D800..U+DFFF while UTF-8 is.
    >
    > Nope. Neither can.
    >
    > 0xD800 is ill-formed in UTF-16.
    >
    > 0xED 0xA0 0x80 is ill-formed in UTF-8.
    >
    > For that matter, 0x0000D800 is ill-formed in UTF-32.
    >
    > Look it up.
    >
    > Now, anybody could put those values into a Unicode string
    > and claim to be representing U+D800, but as a famous
    > former president said, they "would be wrong." *hehe*
    >
    > --Ken
    >
    >
    >
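    For anyone who would rather check than look it up, here is a minimal
    sketch in Python 3 (3.4 or later is assumed, since earlier versions were
    laxer about surrogates in the UTF-16 and UTF-32 codecs). It only shows how
    one implementation's strict codecs treat the three byte sequences Ken
    lists; it is not a statement about every library. The helper names
    encodes and decodes are made up for this example. Note that a Python str
    is a sequence of code points rather than 16-bit code units, so it is only
    a loose analogue of a "Unicode string" that happens to contain D800.

```python
# A lone surrogate can sit inside a Python str, even though no
# well-formed UTF can represent it.
lone = "\ud800"


def encodes(text, codec):
    """Return True if `text` encodes with the default strict error handler."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def decodes(data, codec):
    """Return True if `data` decodes with the default strict error handler."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# Try to encode the lone surrogate in each encoding form.
results_encode = {c: encodes(lone, c) for c in ("utf-8", "utf-16-le", "utf-32-le")}

# Try to decode the byte sequences from Ken's message
# (little-endian byte order is chosen here for UTF-16 and UTF-32).
results_decode = {
    "utf-8": decodes(b"\xed\xa0\x80", "utf-8"),
    "utf-16-le": decodes(b"\x00\xd8", "utf-16-le"),
    "utf-32-le": decodes(b"\x00\xd8\x00\x00", "utf-32-le"),
}

print(len(lone), hex(ord(lone)))
print(results_encode)
print(results_decode)
```

    With the strict handlers, every entry in both dictionaries should come
    back False, while the str itself still holds the single code point D800.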


