Re: unicode Digest V6 #134

From: J Andrew Lipscomb (ewwa@chattanooga.net)
Date: Wed Jun 21 2006 - 18:46:08 CDT

Next message: Pierpaolo BERNARDI: "Re: Surrogate pairs and UTF-8"

Previous message: Mike: "Re: Surrogate pairs and UTF-8"

> I am a developer who needs to write a UTF-8 encoder and decoder in
> JavaScript.
> I've found the encoding form at
> http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf#G31703
> and that is pretty much what I need to do the job. However, I am
> completely lacking in-depth information about surrogate pairs and
> how to handle them in UTF-8. So, here are my questions:
> - I have read the theoretical definition of what a surrogate pair is.
> However, I have never seen one in real life. Can you give an example
> of some surrogate pairs, and what their respective characters look
> like?
> - The guides on the unicode.org site talk only about surrogate pairs
> and UTF-16 conversion. How about UTF-8?

Surrogate pairs don't exist in UTF-8. Surrogate pairs are a concept
unique to UTF-16 for representing characters beyond the Basic
Multilingual Plane. To take an example, consider the character
U+1D444 (mathematical italic capital Q). Since this requires more than
16 bits, UTF-16 uses the surrogate pair concept, which works as follows:
First, subtract &h10000 (using &h to denote hexadecimal), leaving
&hD444.
Next, write that out as a 20-bit binary number (0000 1101 0100 0100
0100). Now, divide that into two groups of 10 bits (0000110101
0001000100).
Then stick 110110 in front of the high half (1101 1000 0011 0101),
and convert to &hD835. That codepoint is the high surrogate.
Similarly for the low half, using 110111 as the prefix, you get 1101
1100 0100 0100 or &hDC44. Those two codepoints together represent
U+1D444 in UTF-16.
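
The steps above can be sketched in JavaScript roughly as follows. This is only an illustration, not a complete converter: the function name toSurrogatePair is made up for this example, and it assumes the input is a code point in the range U+10000 to U+10FFFF.

```javascript
// Hypothetical helper (name invented for this sketch): split a code point
// beyond the Basic Multilingual Plane into a UTF-16 surrogate pair.
// Assumes 0x10000 <= codePoint <= 0x10FFFF.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("not a supplementary code point");
  }
  var offset = codePoint - 0x10000;      // 20-bit value; 0xD444 for U+1D444
  var high = 0xD800 | (offset >> 10);    // prefix 110110 + top 10 bits
  var low = 0xDC00 | (offset & 0x3FF);   // prefix 110111 + bottom 10 bits
  return [high, low];
}

var pair = toSurrogatePair(0x1D444);
console.log(pair[0].toString(16), pair[1].toString(16)); // d835 dc44
```

Note that 0xD800 and 0xDC00 are just the prefixes 110110 and 110111 followed by ten zero bits, so OR-ing in each 10-bit half is the same as sticking the prefix in front of it.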

But in UTF-8, you just represent it in the same pattern as any other
character. The bit pattern is 11101010001000100, which breaks into
sixes (counting from the right) as 11101 010001 000100. But since the
first part has five bits, and the lead byte of a three-byte character
in UTF-8 has room for only four, you have to make it a four-byte
character, zero-padding to 000 011101 010001 000100. Adding the
appropriate prefixes (11110 on the lead byte, 10 on each of the
others) yields 11110000 10011101 10010001 10000100, which is the
UTF-8 representation of the character (F0 9D 91 84).
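
As a rough JavaScript sketch of that encoding pattern, assuming the input is already a single code point rather than a UTF-16 code unit (the function name encodeUtf8 is invented for this example):

```javascript
// Hypothetical helper (name invented for this sketch): encode one Unicode
// code point as an array of UTF-8 byte values.
// Surrogate code points (U+D800..U+DFFF) are rejected, because they are
// not valid in UTF-8.
function encodeUtf8(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF ||
      (codePoint >= 0xD800 && codePoint <= 0xDFFF)) {
    throw new RangeError("not an encodable code point");
  }
  if (codePoint < 0x80) {
    // One byte: 0xxxxxxx
    return [codePoint];
  }
  if (codePoint < 0x800) {
    // Two bytes: 110xxxxx 10xxxxxx
    return [0xC0 | (codePoint >> 6),
            0x80 | (codePoint & 0x3F)];
  }
  if (codePoint < 0x10000) {
    // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
    return [0xE0 | (codePoint >> 12),
            0x80 | ((codePoint >> 6) & 0x3F),
            0x80 | (codePoint & 0x3F)];
  }
  // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [0xF0 | (codePoint >> 18),
          0x80 | ((codePoint >> 12) & 0x3F),
          0x80 | ((codePoint >> 6) & 0x3F),
          0x80 | (codePoint & 0x3F)];
}

var bytes = encodeUtf8(0x1D444).map(function (b) { return b.toString(16); });
console.log(bytes.join(" ")); // f0 9d 91 84
```

Since JavaScript strings are themselves UTF-16, a string-to-UTF-8 converter would first have to combine each high/low surrogate pair back into one code point before calling something like this; encoding the two surrogates separately would not give valid UTF-8.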

