From: Ruszlan Gaszanov (ruszlan@ather.net)
Date: Sun Jan 21 2007 - 07:41:56 CST
Frank Ellermann wrote:
> Some of your arguments like "won't need a BOM anymore" don't make
> sense for me...
Well, since conversion between UTF-21/24 and UTF-32 (and UTF-16 for BMP characters) is trivial - much more so than with UTF-8 -
some application designers might prefer to use the same byte order for UTF-21/24 as they use for UTF-16/32 in order to make
processing faster. Hence we might get BE/LE varieties of UTF-21/24 and have to deal with the BOM issue. Therefore, the error detection
mechanisms I proposed for the UTF-24 varieties also allow automatic byte order detection.
> One disadvantage of your scheme, unlike UTF-8 it can't be directly
> expressed in CharmapML, the parity bit destroys simple patterns,
> and an enumeration of 2**21 (minus surrogates) code points won't
> fly.
Well, all proposed UTF-24 varieties, while useful for long-term storage and interchange, might not be very well suited for actual
text processing in their pure form, since the presence of parity bits (in A and B) or resequenced combinations (in B and C) would
make some otherwise trivial tasks computationally intensive. However, algorithmic conversion of UTF-24A to either UTF-21A or UTF-32 is
trivial:
utf21a = utf24a & 0x7F7F7F
utf32 = (utf24a & 0x7F) | ((utf24a & 0x7F00) >> 1) | ((utf24a & 0x7F0000) >> 2)
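As a rough illustration, the two expressions above can be wrapped into functions like the following. This is only a sketch: it treats a code unit as a 24-bit integer whose three bytes each carry 7 data bits in their low positions (which is what the masks above imply), and the function names, as well as the inverse helper utf32_to_utf21a, are my own and not part of any proposal.

```python
DATA_MASK = 0x7F7F7F  # low 7 bits of each of the three bytes


def utf24a_to_utf21a(unit: int) -> int:
    """Clear the high bit of every byte, leaving three 7-bit groups."""
    return unit & DATA_MASK


def utf24a_to_utf32(unit: int) -> int:
    """Pack the three 7-bit groups into one contiguous 21-bit value."""
    return (unit & 0x7F) | ((unit & 0x7F00) >> 1) | ((unit & 0x7F0000) >> 2)


def utf32_to_utf21a(cp: int) -> int:
    """Hypothetical inverse: spread 21 contiguous bits over three 7-bit groups."""
    return (cp & 0x7F) | ((cp & 0x3F80) << 1) | ((cp & 0x1FC000) << 2)
```

For example, U+1D11E spreads out to 0x07221E, and packing that value again gives back 0x1D11E, whatever the high bit of each byte happens to be.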
Recalculating the parity bit (XORing the 21 data bits with each other) when converting back to UTF-24A is not a
computationally intensive task either.
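To give an idea of the cost, XORing the 21 data bits together can be done with a handful of shifts. The sketch below only computes the parity value from a 21-bit scalar value. Where that bit is then stored within the three bytes is not restated in this message, so the sketch deliberately leaves placement out. The name parity21 is made up for the example.

```python
def parity21(data: int) -> int:
    """Return the XOR of the low 21 bits (1 if an odd number of them are set)."""
    data &= 0x1FFFFF      # keep only the 21 data bits
    data ^= data >> 16    # fold the upper bits onto the lower ones
    data ^= data >> 8
    data ^= data >> 4
    data ^= data >> 2
    data ^= data >> 1
    return data & 1       # bit 0 now holds the XOR of all 21 bits
```

The result is the same whether the 21 bits are taken from the packed UTF-32 value or from the three 7-bit groups, since only the number of set bits matters.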
Therefore it would make much more sense to use either UTF-32 or UTF-21A for internal processing (each code unit would still occupy a
32-bit dword on 32/64-bit architectures) while storing and interchanging data in UTF-24A format. This is similar to 7-bit ASCII, where
the parity bit was reset for internal processing and then recalculated for storage/transmission purposes, so we don't take it into
account when making conversion tables for US-ASCII.
Conversions from/to UTF-21B, UTF-24B and UTF-24C require a bit more processing, but those are special-purpose encoding schemes for
restricted environments, and their conversion algorithms are no more complex (if not less complex) than those of other encoding
schemes of similar purpose, like Java-UTF-8 and UTF-7.
Note that UTF-24A retains the fixed-length properties of UTF-32 (while requiring less space) and provides a built-in error detection
mechanism like UTF-8 (while generally requiring less processing and even beating it in terms of space consumption for East-Asian
texts). Although UTF-24A can't beat UTF-16 in terms of either processing or space requirements for BMP-only texts, it might be much
more attractive for texts making extensive use of characters outside the BMP, since we won't have to deal with surrogates.
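The space comparison can be checked with a few lines of Python. This is a sketch only: it assumes a flat three bytes per code point for any UTF-24 variety, counts no BOM, and uses Python's standard codecs for the existing encoding forms. The function name encoded_sizes is invented for the example.

```python
def encoded_sizes(text: str) -> dict:
    """Bytes needed for `text` in each encoding form, without a BOM."""
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-24": 3 * len(text),  # assumed: fixed three bytes per code point
        "UTF-32": len(text.encode("utf-32-le")),
    }
```

With three BMP ideographs such as "\u65e5\u672c\u8a9e" this gives 9, 6, 9 and 12 bytes respectively, so the three-byte form ties with UTF-8 there. For a character outside the BMP such as "\U00020000" it gives 4, 4, 3 and 4 bytes, which is where the three-byte form comes out ahead.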
Ruszlan