From: Kornkreismuster@web.de
Date: Thu Mar 16 2006 - 09:05:51 CST
Hi! Here is a small discussion I had privately.
I'm having trouble understanding how it is possible to encode
Hex10FFFF characters with UTF-16. When I try to calculate the range of
UTF-16, I always get a maximum number of Hex10F7FF.
Calculation:
(DBFF - D7FF) * (DFFF - DBFF) + D7FF        + FFFF - DFFF
(High Surr.)    (Low Surr.)     (0 to D7FF)   (E000 to FFFF)
Please tell me how to encode Hex10FFFF characters.
Regards,
KKM
********************************************************
Your formula is right, and so is Ken. There are 1024 x 1024 = 1048576
code points accessible by surrogates, plus another 65536 in the BMP,
but you have to subtract the 2048 surrogate code points. These are
permanently reserved because of their use in UTF-16.
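The arithmetic above can be checked with a short Python sketch. The variable names are illustrative only, and the ranges are the standard surrogate blocks (D800 to DBFF and DC00 to DFFF).

```python
# Count the code points reachable through UTF-16 (names are illustrative).
HIGH_SURROGATES = range(0xD800, 0xDC00)  # 1024 high (lead) surrogates
LOW_SURROGATES = range(0xDC00, 0xE000)   # 1024 low (trail) surrogates

# Every high/low combination addresses one code point beyond the BMP.
supplementary = len(HIGH_SURROGATES) * len(LOW_SURROGATES)

bmp = 0x10000  # U+0000 through U+FFFF
surrogates = len(HIGH_SURROGATES) + len(LOW_SURROGATES)  # reserved values

# Number of values that can actually be encoded as characters.
encodable = supplementary + bmp - surrogates

# Highest code point: numbering continues right after the BMP.
highest = bmp + supplementary - 1

print(supplementary, bmp, surrogates)
print(hex(encodable), hex(highest))
```

The count of encodable values and the highest code point are different quantities: the first excludes the 2048 reserved surrogates, while the second is simply the last number in the range, which is why the maximum is still 10FFFF.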
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
********************************************************
Hi! Thank you very much for your response. I already thought I was
being dumb.
So in the Unicode charts, all characters above FFFF are double-coded,
by themselves and by the surrogate pairs?
Can you also use the surrogate pairs in UTF-32?
Regards,
KKM
********************************************************
KKM,
No, nothing is double-coded. Each code point is uniquely identified by
a single Unicode Scalar Value, including those beyond FFFF. When using
UTF-16, code points beyond FFFF are encoded with a surrogate pair; when
using UTF-32, they are encoded as a single 32-bit value.
Take, for example, the character U+10000 LINEAR B SYLLABLE B008 A (𐀀).
This is encoded as follows:
UTF-8: F0 90 80 80
UTF-16: D800 DC00
UTF-32: 00010000
It is an error to use the surrogate pairs in UTF-32, that is, to encode
the Linear B character above as 0000D800 0000DC00. (And, of course, it
is impossible to encode the hex value 10000 directly in a 16-bit word.)
The practice of describing Unicode code points above FFFF in terms of
their surrogate pairs, instead of by the scalar value, dates back to
earlier years, when UTF-16 was considered the standard form of Unicode
and all others were considered "transformations."
Please feel free to ask these questions on the list instead of
privately. I wanted to post this answer on the list, but that would
have been a violation of netiquette since your message was private.
--
Doug Ewell
Fullerton, California, USA
http://users.adelphia.net/~dewell/
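The three encodings quoted for U+10000 can be reproduced with a short Python sketch. The helper name to_surrogate_pair is made up for illustration; the byte values themselves come from Python's built-in utf-8, utf-16-be and utf-32-be codecs.

```python
def to_surrogate_pair(code_point):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20-bit offset past the BMP
    high = 0xD800 + (offset >> 10)     # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits select the low surrogate
    return high, low


def hex_bytes(data):
    """Format bytes as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


ch = chr(0x10000)  # LINEAR B SYLLABLE B008 A

utf8 = hex_bytes(ch.encode("utf-8"))
utf16 = hex_bytes(ch.encode("utf-16-be"))
utf32 = hex_bytes(ch.encode("utf-32-be"))

high, low = to_surrogate_pair(0x10000)

print("UTF-8: ", utf8)
print("UTF-16:", utf16, f"(pair {high:04X} {low:04X})")
print("UTF-32:", utf32)
```

Calling to_surrogate_pair(0x10FFFF) gives the last possible pair, DBFF DFFF, so the full range up to 10FFFF is covered by the 1024 x 1024 combinations.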
This archive was generated by hypermail 2.1.5 : Thu Mar 16 2006 - 15:01:11 CST