From: Rick Cameron (Rick.Cameron@businessobjects.com)
Date: Wed Jun 21 2006 - 14:23:47 CDT
However, if you are converting between UTF-8 and UTF-16 you do need to
take surrogates into account.
Perhaps the best approach is to go via UTF-32: for example, when
converting from UTF-16 to UTF-8, iterate through the array of UTF-16
code units, decoding each code point into a single UTF-32 code unit,
then encode that UTF-32 code unit as UTF-8. When iterating, check
whether the current UTF-16 code unit is the start of a surrogate pair,
and consume one or two code units as appropriate.
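
A minimal sketch of that loop in Python, for illustration only. The
function name utf16_to_utf8 is made up here, the input is assumed to be
a list of integers holding UTF-16 code units, and raising an error on an
unpaired surrogate is just one possible policy (substituting U+FFFD is
another):

```python
def utf16_to_utf8(units):
    """Convert a sequence of UTF-16 code units (ints) to UTF-8 bytes."""
    out = bytearray()
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: it must be followed by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            # Combine the pair into one code point (the UTF-32 value).
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            i += 2  # consumed two code units
        elif 0xDC00 <= u <= 0xDFFF:
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            cp = u  # BMP code point: the code unit is the code point
            i += 1  # consumed one code unit

        # Encode the code point as 1 to 4 UTF-8 bytes.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out.append(0xC0 | (cp >> 6))
            out.append(0x80 | (cp & 0x3F))
        elif cp < 0x10000:
            out.append(0xE0 | (cp >> 12))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
        else:
            out.append(0xF0 | (cp >> 18))
            out.append(0x80 | ((cp >> 12) & 0x3F))
            out.append(0x80 | ((cp >> 6) & 0x3F))
            out.append(0x80 | (cp & 0x3F))
    return bytes(out)
```

For example, the code units 0x0041 0xD834 0xDD1E ("A" followed by
U+1D11E) come out as the bytes 41 F0 9D 84 9E. Note that the surrogate
pair becomes a single four-byte sequence, not two three-byte ones.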
- rick
-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
Behalf Of Mike Ayers
Sent: Wednesday, 21 June 2006 11:31
To: Pavils Jurjans
Cc: unicode@unicode.org
Subject: Re: Surrogate pairs and UTF-8
Pavils Jurjans wrote:
> - The guides on unicode.org <http://unicode.org/> site talk only about
> surrogate pair and UTF-16 conversion. How about the UTF-8?
Surrogates do not exist in UTF-8. They are the mechanism by which
UCS-2 (which encodes 16 bits) was simultaneously restricted and extended
to become UTF-16 (which encodes 21 bits). Surrogates are not
characters. They are code points reserved for use in UTF-16 only.
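
To make the mechanism concrete, here is a small Python sketch of how a
code point above U+FFFF is split into a surrogate pair for UTF-16. The
name to_surrogate_pair is hypothetical, chosen only for this example:

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF need a surrogate pair")
    v = cp - 0x10000              # 20-bit offset into the supplementary range
    high = 0xD800 + (v >> 10)     # top 10 bits    -> 0xD800..0xDBFF
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits -> 0xDC00..0xDFFF
    return high, low
```

With this, U+1D11E maps to the pair D834 DD1E. UTF-8 has no such step:
it encodes the code point U+1D11E directly, as F0 9D 84 9E.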
HTH,
/|/|ike