Thanks, Doug, for the comments.
>And I don't think you're supposed to exclude the surrogate code space
>(0xD800 through 0xDFFF) from normal processing. (This is the "D29
>conundrum" -- all UTFs must support encoding of non-characters, including
>unpaired surrogates, even though UTF-16 cannot do this.) The code you
>provided encodes unpaired surrogates in four bytes -- by pushing them down
>to the final "else" -- which is wrong in any event and almost certainly
>not what the programmer intended.
Yes, this is a goof. (I wrote a pseudo-code algorithm for going from
Unicode scalar values to UTF-8 and assumed "surrogate" USVs are not valid.
I wasn't anticipating at the time what a programmer would do with it.)
Any suggestions on the right way to deal with "surrogate" codepoints
in this algorithm? They should not occur in the data, but what if they do?
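To make the question concrete, here is a minimal sketch in Python of one way
the algorithm could treat the surrogate range as an explicit case instead of
letting it fall through to the final "else". The function name encode_utf8
and the allow_surrogates flag are made up for illustration and are not part
of the original pseudo-code. Rejecting by default is an assumption, not
something settled in this thread; the flag shows the alternative of encoding
an unpaired surrogate in three bytes like any other value below 0x10000.

```python
def encode_utf8(cp: int, allow_surrogates: bool = False) -> bytes:
    """Encode a single code point as UTF-8 bytes (illustrative sketch)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {cp:#x}")

    # Handle the surrogate range explicitly so it can never reach the
    # four-byte branch by accident.
    if 0xD800 <= cp <= 0xDFFF and not allow_surrogates:
        raise ValueError(f"unpaired surrogate: U+{cp:04X}")

    if cp < 0x80:  # one byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:  # two bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:  # three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

With this shape, a caller that meets a surrogate in the data gets an error
it can act on (skip, substitute, or abort), and a caller that really wants
the three-byte form has to ask for it explicitly.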
- Peter
---------------------------------------------------------------------------
Peter Constable
Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <peter_constable@sil.org>
This archive was generated by hypermail 2.1.2 : Fri Nov 09 2001 - 11:47:10 EST