From: Ruszlan Gaszanov (ruszlan@ather.net)
Date: Sat Jan 20 2007 - 16:52:08 CST
Why would we need a new UTF? Well, of all currently available encoding schemes for Unicode, only UTF-32 is fixed-length. However, while it might be convenient for internal processing on 32/64-bit platforms, 11 spare bits per code unit is much too wasteful for long-term storage and interchange. And if we have spare bits anyway, why not make them useful for, let’s say, error detection or avoiding undesired sequences (like NUL)?
So, in order to encode all code points of ISO-10646, we need a minimum of 21 bits (well, actually that would’ve been 20 if we didn’t have to waste an extra bit on being able to address the completely useless, IMHO, Plane 16). But be that as it may, we still have to deal with octet-oriented data interchange and storage technology. So whether 20 or 21 bits, we are stuck with a minimum of 3 octets for all practical purposes, which leaves 3 bits unused. But then again, if we distribute our 21 bits among 3 octets evenly, we’ll get 7 data bits per octet. This is much more useful than having 2 octets with 8 data bits each and the remaining octet with 3 spare bits, considering we still have some 7-bit gateways around.
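The bit counting above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the variable names are made up for illustration.

```python
# Highest code point (end of Plane 16) and highest code point of Plane 15.
MAX_CODE_POINT = 0x10FFFF
MAX_WITHOUT_PLANE_16 = 0xFFFFF

bits_needed = MAX_CODE_POINT.bit_length()                 # bits for the full range
bits_without_plane_16 = MAX_WITHOUT_PLANE_16.bit_length() # bits if Plane 16 did not exist

# Round up to whole octets, then see how many bits are left over.
octets_needed = -(-bits_needed // 8)
spare_bits = octets_needed * 8 - bits_needed

# Spreading the data bits evenly over the octets.
data_bits_per_octet = bits_needed // octets_needed

print(bits_needed, bits_without_plane_16, octets_needed, spare_bits, data_bits_per_octet)
```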
So, this is what I propose to call UTF-21A:
UTF-32: 00000000 000utsrq ponmlkji hgfedcba
UTF-21A: zutsrqpo ynmlkjih xgfedcba
Lowercase letters a-u are used here to denote each data bit, in order from least significant to most significant. The 8th bit of each octet (x, y and z) has no function here and may be used for parity if the protocol so desires.
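For illustration, the UTF-21A layout could be sketched in Python roughly as follows. The function names are hypothetical, and the sketch assumes that the spare bits x, y and z are simply left at 0 and that the high octet comes first, as in the diagram.

```python
def utf21a_encode(cp: int) -> bytes:
    """Split a code point into three 7-bit octets, high octet first.

    Assumption: the spare 8th bit of each octet (x, y, z) is left at 0.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    high = (cp >> 14) & 0x7F    # bits u..o
    middle = (cp >> 7) & 0x7F   # bits n..h
    low = cp & 0x7F             # bits g..a
    return bytes([high, middle, low])


def utf21a_decode(unit: bytes) -> int:
    """Reassemble a code point, ignoring the 8th bit of each octet."""
    if len(unit) != 3:
        raise ValueError("a UTF-21A code unit is exactly 3 octets")
    high, middle, low = (b & 0x7F for b in unit)
    return (high << 14) | (middle << 7) | low
```

For example, `utf21a_encode(0x10FFFF)` gives the octets 43 7F 7F, and decoding them returns the original code point.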
However, since we have those 3 spare bits, we might just as well make them useful in an 8-bit environment, for instance for byte order and error detection. Let’s say bit z will always be set to 1, bit x to 0, and bit y shall be used as a parity bit across the 3 octets of each code unit.
UTF-24A: 1utsrqpo Pnmlkjih 0gfedcba
UTF-24, in contrast to UTF-21, means that all 24 bits are made useful. Uppercase P (parity bit) is not to be confused with lowercase p (one of the 21 data bits). Although isolated 3-octet blocks having such properties might occur in any random data, continuous ranges of such blocks with consistent byte order would be quite distinct as a UTF-24A signature. Not to mention that each code unit will act as a BOM, so we’ll no longer need the infamous U+FEFF. Also, the decoder would be able to resync at the next good character in case some octets are dropped or a random octet is inserted. So we wouldn’t have the rest of the stream scrambled like with UTF-16 and UTF-32 in such situations.
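To make the UTF-24A layout and the resync idea concrete, here is a minimal Python sketch. The function names are made up for illustration, and even parity over all 24 bits is an assumption, since the description does not say whether P is even or odd parity. The decoder slides forward one octet at a time until a unit passes the checks, which is just one possible way to resync; a damaged stream can still produce an occasional false match.

```python
def _ones(value: int) -> int:
    """Count the 1 bits in an integer."""
    return bin(value).count("1")


def utf24a_encode(cp: int) -> bytes:
    """Encode as 1utsrqpo Pnmlkjih 0gfedcba (assumed: even parity overall)."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    high = 0x80 | ((cp >> 14) & 0x7F)   # z is always 1
    middle = (cp >> 7) & 0x7F
    low = cp & 0x7F                     # x is always 0
    if (_ones(high) + _ones(middle) + _ones(low)) % 2:
        middle |= 0x80                  # set P so the total number of 1 bits is even
    return bytes([high, middle, low])


def utf24a_is_valid(unit: bytes) -> bool:
    """Check marker bits, parity and plane range of one 3-octet unit."""
    if len(unit) != 3:
        return False
    high, middle, low = unit
    if not high & 0x80 or low & 0x80:
        return False
    if sum(_ones(b) for b in unit) % 2:
        return False
    return (high & 0x7F) >> 2 <= 16     # Plane ID must be 0..16


def utf24a_decode_stream(data: bytes) -> list:
    """Decode code points, skipping one octet at a time after damage."""
    result = []
    i = 0
    while i + 3 <= len(data):
        unit = data[i:i + 3]
        if utf24a_is_valid(unit):
            high, middle, low = unit
            result.append(((high & 0x7F) << 14) | ((middle & 0x7F) << 7) | low)
            i += 3
        else:
            i += 1                      # try to resync at the next octet
    return result
```

With this sketch, dropping the first octet of an encoded "AB" still recovers the "B", and a stray octet inserted in front of the stream is skipped.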
But that’s not all the wasted bit space that can be made useful. Note that because we need an extra bit just to be able to address Plane 16, we have 15 wasted combinations in the 5-bit Plane ID (bits utsrq): only 17 of the 32 possible values are ever used. So, let’s say we use them to avoid the notorious NUL octets (00000000), which are known to cause problems in text-processing tools. For instance, if we have those NUL octets in our UTF-21A sequence, we set both bits u and t to 1, which wouldn’t be a valid Plane ID by definition. Then we set bit s to 1 if the high octet is NUL, bit r to 1 if the middle octet is NUL and bit q to 1 if the low octet is NUL.
Since the high octet being NUL means that we have Plane ID = 00000 (BMP), we don’t need to encode this information anywhere. Otherwise we can encode the Plane ID in one of the other NUL octets (since we already know that it is NUL):
UTF-21A: z0000000 ynmlkjih xgfedcba
UTF-21B: z11100po ynmlkjih xgfedcba
UTF-21A: zutsrqpo y0000000 xgfedcba
UTF-21B: z11010po y11utsrq xgfedcba
UTF-21A: zutsrqpo ynmlkjih x0000000
UTF-21B: z11001po ynmlkjih x11utsrq
UTF-21A: zutsrqpo y0000000 x0000000
UTF-21B: z11011po y11utsrq x1111111
Where the Plane ID is moved to another octet, the unused bits 6 and 7 are set to 1, just to make sure it doesn’t stay NUL anyway. NUL octets not used to encode the Plane ID are set to 1111111 for the sake of simplicity. And then we could have UTF-24B for the 8-bit environment, which does to UTF-21B just what UTF-24A does to UTF-21A.
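The NUL-avoidance rules can be sketched in Python roughly like this. The function names are hypothetical and the spare bits x, y and z are left at 0. The sketch applies the flag rule as stated (s, r and q each set whenever the corresponding octet is NUL, so r and q are both set when the middle and low octets are both NUL). It also assumes that when the high octet is NUL together with other octets, those other octets are simply filled with 1111111, a combination not listed among the examples.

```python
def utf21b_encode(cp: int) -> bytes:
    """Encode a code point in three 7-bit octets without any NUL octet."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    high, middle, low = (cp >> 14) & 0x7F, (cp >> 7) & 0x7F, cp & 0x7F
    if high and middle and low:
        return bytes([high, middle, low])      # nothing to escape
    plane = high >> 2                          # bits utsrq
    flags = ((high == 0) << 2) | ((middle == 0) << 1) | (low == 0)  # s, r, q
    out = [0b1100000 | (flags << 2) | (high & 0b11)]                # 11srqpo
    plane_stored = high == 0                   # BMP is implied, nothing to store
    for octet in (middle, low):
        if octet:
            out.append(octet)
        elif not plane_stored:
            out.append(0b1100000 | plane)      # 11utsrq
            plane_stored = True
        else:
            out.append(0x7F)                   # filler for a remaining NUL octet
    return bytes(out)


def utf21b_decode(unit: bytes) -> int:
    """Reverse utf21b_encode for one 3-octet unit."""
    high, middle, low = (b & 0x7F for b in unit)
    if high >> 5 != 0b11:                      # u and t not both 1: plain unit
        return (high << 14) | (middle << 7) | low
    s, r, q = (high >> 4) & 1, (high >> 3) & 1, (high >> 2) & 1
    plane = 0
    if not s:                                  # Plane ID sits in the first NUL slot
        plane = (middle if r else low) & 0x1F
    high7 = 0 if s else (plane << 2) | (high & 0b11)
    middle7 = 0 if r else middle
    low7 = 0 if q else low
    return (high7 << 14) | (middle7 << 7) | low7
```

Under these assumptions, U+0041 comes out as the octets 78 7F 41 and U+10000 as 6C 61 7F, with no zero octet in either.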
Finally, I’ve thought up another encoding scheme based on UTF-21A, which lets us avoid the C0/C1 controls (0x00-0x1F/0x80-0x9F) along with 0x7F (DEL), which may be desirable under certain circumstances. Unfortunately, I couldn’t see how to pack that mechanism into 21 bits (so, no 7-bit-safe version here), but UTF-24C could work like this:
UTF-21A: zutsrqpo ynmlkjih x00edcba
UTF-24C: 0utsrqpo 0nmlkjih 101edcba
UTF-21A: zutsrqpo ynmlkjih x1111111
UTF-24C: 0utsrqpo 0nmlkjih 11111110
UTF-21A: zutsrqpo y00lkjih xgfedcba
UTF-24C: 0utsrqpo 110lkjih 0gfedcba
UTF-21A: zutsrqpo y1111111 xgfedcba
UTF-24C: 0utsrqpo 11111101 0gfedcba
UTF-21A: z00srqpo ynmlkjih xgfedcba
UTF-24C: 111srqpo 1nmlkjih 0gfedcba
Note that since 11111 is not a valid Plane ID, we do not need the 0x7F replacement for the high octet. Also note that bits 6 and 7 of the C0 escapes, as well as bits 1 and 2 of the DEL replacements, differ depending on the octet. Thus any 3-octet sequence with at least 2 “special” octets can be used as a BOM, and a sequence with at least 1 “special” octet as a possible resync point, as long as the byte order is known. This BOM/resync mechanism might be a bit less reliable than in UTF-24B. But since both the high and middle octets will remain “special” in all characters up to U+0FFF (from ASCII to the Tibetan range), and the high octet will still be “special” up to Plane 7, we’ll normally have plenty of such sequences in any text.
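The ranges quoted here can be checked with a small Python sketch. It does not implement UTF-24C itself; it only reports which of the three 7-bit UTF-21A octets of a code point fall into the C0 range or equal DEL, i.e. which ones would need an escape. The function name is made up for illustration.

```python
def special_octets(cp: int) -> tuple:
    """Return (high, middle, low) flags for one code point.

    A flag is True when the corresponding 7-bit UTF-21A octet is a
    C0 control (0x00-0x1F) or DEL (0x7F).
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("code point out of range")
    octets = ((cp >> 14) & 0x7F, (cp >> 7) & 0x7F, cp & 0x7F)
    return tuple(o < 0x20 or o == 0x7F for o in octets)


# High and middle octets are both special for everything up to U+0FFF ...
both_special_up_to_0fff = all(
    special_octets(cp)[0] and special_octets(cp)[1] for cp in range(0x1000)
)
# ... and the high octet stays special through the end of Plane 7.
high_special_up_to_plane_7 = all(
    special_octets(cp)[0] for cp in range(0, 0x80000, 0x1000)
)
# The high octet can never be 1111111, so it never needs a DEL replacement.
high_never_del = all(
    (cp >> 14) & 0x7F != 0x7F for cp in range(0, 0x110000, 0x4000)
)
```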
If someone is interested, I can post sample encoder functions in JScript.
Any comments?
P.S. I've tried to send a slightly different version of this a few days ago, but apparently it didn't get thru. So, enjoy the revised one ;) And don't yell at me if the listserv eventually spits out the old version in a month or so. After all, I'm not supposed to know whether it got stuck in some really long queue for some reason or some server just decided to store it in /dev/nul.