Re: Surrogate pairs and UTF-8

From: Mike (mike-list@pobox.com)
Date: Wed Jun 21 2006 - 18:33:54 CDT

Next message: J Andrew Lipscomb: "Re: unicode Digest V6 #134"

Previous message: Rick McGowan: "IUC 30 Program Announced"
In reply to: Pavils Jurjans: "Surrogate pairs and UTF-8"
Next in thread: Addison Phillips: "RE: Surrogate pairs and UTF-8"
Reply: Addison Phillips: "RE: Surrogate pairs and UTF-8"

If you come across a surrogate in a UTF-8 stream, you should treat
it as an error (e.g. throw an exception).
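
As a rough illustration of that rule: a surrogate can only show up in
UTF-8 as a three-byte sequence, so the check fits in a three-byte
decoder. This is a minimal sketch, not a complete decoder; the names
is_surrogate and utf8_decode3 are made up for this example, and -1 is
just the error convention chosen here.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cp lies in the surrogate range U+D800..U+DFFF. */
bool is_surrogate(uint32_t cp)
{
    return cp >= 0xD800 && cp <= 0xDFFF;
}

/* Decode one three-byte UTF-8 sequence starting at s.
 * Returns the code point, or -1 if the bytes are malformed,
 * overlong, or encode a surrogate. (Hypothetical helper.) */
int32_t utf8_decode3(const unsigned char *s)
{
    uint32_t cp;

    /* Lead byte must be 1110xxxx, trail bytes must be 10xxxxxx. */
    if ((s[0] & 0xF0) != 0xE0 ||
        (s[1] & 0xC0) != 0x80 ||
        (s[2] & 0xC0) != 0x80)
        return -1;

    cp = ((uint32_t)(s[0] & 0x0F) << 12) |
         ((uint32_t)(s[1] & 0x3F) << 6) |
          (uint32_t)(s[2] & 0x3F);

    if (cp < 0x800)        /* overlong: fits in fewer bytes */
        return -1;
    if (is_surrogate(cp))  /* surrogates are not allowed in UTF-8 */
        return -1;
    return (int32_t)cp;
}
```

For instance, the bytes ED A0 80 would decode to U+D800, so the sketch
rejects them, while E2 82 AC decodes to U+20AC.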

However, if you are converting from UTF-16 to UTF-8, you will
need both surrogates (a high surrogate and a low surrogate) to
determine which character is encoded. Table 3-4 in the link you
cited shows how to convert from surrogate pairs to code points.
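
The pair-to-code-point arithmetic can be sketched as follows. The
function name combine_surrogates and the -1 error return are
assumptions for this example only; the ranges are the usual ones
(high surrogates D800..DBFF, low surrogates DC00..DFFF).

```c
#include <stdint.h>

/* Combine a UTF-16 high/low surrogate pair into one code point.
 * Returns -1 if hi is not a high surrogate or lo is not a low
 * surrogate. (Hypothetical helper for illustration.) */
int32_t combine_surrogates(uint16_t hi, uint16_t lo)
{
    if (hi < 0xD800 || hi > 0xDBFF)
        return -1;
    if (lo < 0xDC00 || lo > 0xDFFF)
        return -1;

    /* Each surrogate carries 10 bits; the result is offset by 0x10000. */
    return 0x10000 + (((int32_t)(hi - 0xD800) << 10) |
                       (int32_t)(lo - 0xDC00));
}
```

As an example of a pair seen in practice, D834 DD1E combines to
U+1D11E (MUSICAL SYMBOL G CLEF).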

Once you know which code point is encoded, use Table 3-5 to compute
the byte values in the UTF-8 sequence.
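
A minimal sketch of that bit distribution, assuming a caller-supplied
four-byte buffer. The name utf8_encode and the convention of returning
0 for values that cannot be encoded (surrogates, anything above
U+10FFFF) are choices made for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the UTF-8 encoding of cp into out and return the number of
 * bytes written (1-4), or 0 if cp is a surrogate or out of range.
 * (Hypothetical helper for illustration.) */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                       /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                      /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                    /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                      /* lone surrogate: refuse */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                  /* 11110xxx + three trail bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                              /* beyond U+10FFFF */
}
```

With this sketch, U+1D11E comes out as the four bytes F0 9D 84 9E and
U+20AC as E2 82 AC.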

Mike

P.S. Here is an array you should find useful in determining how to
decode UTF-8 sequences:

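/* Indexed by the first byte of a sequence: the sequence length in
   bytes, or 0 for a continuation byte (0x80..0xBF). */
const unsigned char Utf8Length[256] = {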
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
     4, 4, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 6, 6, 7, 8
};

Use the first byte of the UTF-8 sequence as an index into this
array. The value is the length n of the UTF-8 sequence in bytes. If
it is not in the range 1-4, then you have run into an error. Once you
have the length, if it is in the range 2-4, check the next n-1
bytes to make sure each of them maps to 0 in the same array (that
is, each one is a continuation byte).
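
The check described above might look roughly like this. To keep the
sketch self-contained, utf8_lead_length is a stand-in that returns the
same values as the array lookup; both function names and the "return 0
on error" convention are assumptions for this example. Note that this
is only a structural check: lead bytes C0, C1 and F5-F7 pass it even
though they never occur in valid UTF-8, and overlong forms and
surrogates still have to be rejected after decoding.

```c
#include <stddef.h>

/* Stand-in for the array lookup: same value for every byte
 * (0 marks a continuation byte, 5-8 mark invalid lead bytes). */
unsigned utf8_lead_length(unsigned char b)
{
    if (b < 0x80) return 1;
    if (b < 0xC0) return 0;
    if (b < 0xE0) return 2;
    if (b < 0xF0) return 3;
    if (b < 0xF8) return 4;
    if (b < 0xFC) return 5;
    if (b < 0xFE) return 6;
    return b == 0xFE ? 7 : 8;
}

/* Return the length (1-4) of the sequence starting at s, or 0 if the
 * lead byte is invalid, the input is truncated, or one of the next
 * n-1 bytes is not a continuation byte. avail is the number of bytes
 * readable at s. */
size_t utf8_check_sequence(const unsigned char *s, size_t avail)
{
    size_t n, i;

    if (avail == 0)
        return 0;
    n = utf8_lead_length(s[0]);
    if (n < 1 || n > 4 || n > avail)
        return 0;
    for (i = 1; i < n; i++) {
        if (utf8_lead_length(s[i]) != 0)
            return 0;
    }
    return n;
}
```

A caller would typically advance by the returned length on success and
report an error (or resynchronize) when the result is 0.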

Pavils Jurjans wrote:
> Hello all,
>
> I am a developer who needs to write a UTF-8 encoder and decoder in
> JavaScript. I've found the encoding form in the link
> http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703 , and that
> is pretty much what I need to do the job. However, I am completely
> lacking in-depth information about surrogate pairs and how to handle
> them in UTF-8. So, here are the questions I am looking to answer:
> - I have read the theoretical definition of what a surrogate pair is.
>   However, I have never seen any in "life". Can you give an example of
>   some surrogate pairs, and what their respective characters look like?
> - The guides on the unicode.org site talk only about surrogate pairs
>   and UTF-16 conversion. How about UTF-8?
>
> Thank you for any clues.
>
> With kind regards,
> Pavils Jurjans
>

This archive was generated by hypermail 2.1.5 : Wed Jun 21 2006 - 18:57:07 CDT