From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jun 21 2006 - 15:15:10 CDT
Pavils Jurjans asked:
> - I have read the theoretical definition of what a surrogate pair is.
> However, I have never seen any in "life". Can you give an example of some
> surrogate pairs, and how do their respective character look like?
> - The guides on unicode.org site talk only about surrogate pair and
> UTF-16 conversion. How about the UTF-8?
"Surrogate pairs" don't exist in UTF-8.
A surrogate pair is the sequence of two 16-bit code units required
to represent a Unicode code point in the range U+10000..U+10FFFF in UTF-16.
That same range of code points is represented by 4-byte
sequences in UTF-8, as defined by Table 3-5 and Table 3-6, which
you were referring to in The Unicode Standard, Version 4.0.
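A rough sketch of the arithmetic behind a surrogate pair, for anyone who
wants to compute one by hand. The function name to_surrogate_pair is made
up for illustration. The constants 0x10000, 0xD800 and 0xDC00 come from
the UTF-16 definition: subtract 0x10000, then split the remaining 20 bits
into two 10-bit halves.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 code units.

    Hypothetical helper for illustration only; not part of any library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits -> low surrogate
    return high, low


high, low = to_surrogate_pair(0x10302)
print(f"{high:04X} {low:04X}")           # prints "D800 DF02"
```

Code points below U+10000 are rejected here because UTF-16 represents
them with a single code unit, so no pair is involved.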
Look at Table 3-3, Examples of Unicode Encoding Forms.
U+10302 is represented in UTF-32 by the 32-bit code unit: 0x00010302
U+10302 is represented in UTF-8 by the 4-byte sequence: <F0 90 8C 82>
U+10302 is represented in UTF-16 by the sequence of two 16-bit code
units: <D800 DF02>
That last encoding, in UTF-16 only, is referred to as a "surrogate pair".
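If you want to check the three forms above yourself, here is a minimal
sketch using Python 3's built-in codecs. The big-endian codec names are
chosen only so that the bytes print in the same order as the code units
listed above and no byte order mark is added; the variable names are
arbitrary.

```python
ch = chr(0x10302)                     # the code point U+10302

utf32 = ch.encode("utf-32-be")        # one 32-bit code unit
utf8 = ch.encode("utf-8")             # four 8-bit code units
utf16 = ch.encode("utf-16-be")        # two 16-bit code units

print(utf32.hex())                    # prints "00010302"
print(utf8.hex())                     # prints "f0908c82"
print(utf16.hex())                    # prints "d800df02"
```

Only the UTF-16 output contains the code units D800 and DF02. The UTF-8
bytes are derived directly from the code point, not from the surrogates.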
--Ken