From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 06 2007 - 17:53:58 CST
Doug Ewell wrote:
> Sent: Thursday, December 6, 2007, 21:35
> To: Unicode Mailing List
> Cc: William J Poser; aprilop2007@trashmail.net
> Subject: Re: Rot13 and letters with accents
>
> William J Poser <wjposer at ldc dot upenn dot edu> wrote:
>
> > But then even better would be to Unicode-ify rot13 so that it affects
> > non-ASCII characters. For example, restricting ourselves to the BMP,
> > we could have rot7FFF, which would produce meaningless strings of CJK
> > characters from (extended) Latin text.
>
> (This is not quite the same thing, but you might find it interesting
> nonetheless:
> http://www.mindspring.com/~markus.scherer/unicode/base16k.html )
One of the design goals for Base16k is:
* The characters should be inert under most Unicode text transformations,
especially normalization, but ideally also case mapping etc., so that such
an encoding of binary data does not get corrupted by common processing.
Accordingly, Base16k chooses a subset of the BMP that is contiguous and
immune to almost all Unicode transforms (including normalization). But I'm
not sure that the chosen subset (in the Han ideographic block) is
effectively immune to these transforms.
I would probably have chosen a block that is really immune to all Unicode
transforms and mappings, and used the large PUA block of the BMP to
implement such a binary encoding, but another of the design goals says:
* Unassigned and private-use code points should be avoided because they are
often restricted and could be affected by future or custom processing.
While I agree with this statement as far as unassigned code points are
concerned, I don't understand the justification for excluding PUAs, which
are standard in Unicode. The restrictions affecting some applications could
just as well apply to the Han ideographs, which are also not immune to
"custom processing", such as decomposition into component radicals and
strokes.
So if we want a binary transform that is really immune, just use the PUA of
the BMP. Yes, there are fewer than 16k code points there (the BMP PUA spans
U+E000..U+F8FF, i.e. 6,400 code points), but using a block of 4k code
points would not severely degrade the efficiency in terms of extra encoded
length.
So suppose you have 4k code points allocated in the PUA for this purpose:
each character then encodes 12 bits of binary data, i.e. one byte and a
half, and only the last encoded character or two (which may or may not
encode a complete final group of bytes) need special handling (you could
also use the method of Base64, where an extra padding "=" sign is appended
to complete a sequence). In addition, such processing would be much simpler
than Base16k's 14-bit packing.
Number of      Number of
binary bytes   Base4k characters
------------   ---------------------------------------
3N             2N
3N+1           2N+1
3N+2           2N+2 (+ optionally 1 padding character)
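In code, the character count follows directly from the byte count; a tiny
sketch (counting the optional padding character via a flag):

def base4k_char_count(nbytes: int, padded: bool = False) -> int:
    # Two characters per full 3-byte group, then 0, 1 or 2 more for a
    # trailing group of 0, 1 or 2 bytes, plus the optional padding
    # character after a two-byte trailing group.
    full, rem = divmod(nbytes, 3)
    return 2 * full + rem + (1 if padded and rem == 2 else 0)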
If UTF-16 is used, the transport stream has this length:
Number of      Number of Base4k characters               Byte length in the
binary bytes   (1 UTF-16 code unit each)                 transport stream
------------   ---------------------------------------   -------------------
3N             2N                                        4N
3N+1           2N+1                                      4N+2
3N+2           2N+2 (+ optionally 1 padding character)   4N+4 (+2 if padded)
Unlike Base64 but like Base16k, the number of characters per binary byte is
not constant, so the exact byte count cannot always be recovered from the
character count alone. In the (3N+1) case the last character carries only 8
bits of data, so 4 bits remain unused; in the (3N+2) case the last two
characters carry 16 bits out of 24, so 8 bits remain unused. An odd-length
output (2N+1 characters) unambiguously encodes 3N+1 bytes, but an
even-length output of 2M characters could encode either 3M bytes (full
groups only) or 3M-1 bytes (ending in a two-byte group). You could solve
the problem by appending a single extra character after a trailing two-byte
group, to tell it apart from a full three-byte group. You could also try to
use one of the 4 or 8 unused bits of the trailing sequence as a flag
indicating whether it encodes one or two binary bytes, so that no padding
character is needed; note, however, that a full three-byte group uses all
24 of its bits, so a decoder cannot reliably distinguish such a flag from
ordinary data, and the padding character (or the length prefix discussed
below) remains the safe option. (For decoding, testing the parity of the
length in characters already settles the one-trailing-byte case.)
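To make this concrete, here is a minimal sketch of such a Base4k codec,
using the padding-character option to resolve the even-length ambiguity.
The code point choices are illustrative assumptions only (nothing in this
proposal fixes them): U+E000..U+EFFF for the 4,096 data characters, and
U+F8FF as the padding character.

BASE = 0xE000           # assumed start of the 4,096-character data block
PAD = "\uF8FF"          # assumed padding character (marks a two-byte tail)

def base4k_encode(data: bytes) -> str:
    chars = []
    full = len(data) - len(data) % 3
    for i in range(0, full, 3):
        # Each full 3-byte group (24 bits) becomes two 12-bit characters.
        n = data[i] << 16 | data[i + 1] << 8 | data[i + 2]
        chars.append(chr(BASE + (n >> 12)))
        chars.append(chr(BASE + (n & 0xFFF)))
    rem = len(data) % 3
    if rem == 1:
        # One trailing byte: one character, 4 bits left unused.
        chars.append(chr(BASE + data[-1]))
    elif rem == 2:
        # Two trailing bytes: two characters (8 bits unused), plus the
        # padding character so the decoder can tell this tail apart from
        # a full three-byte group.
        n = data[-2] << 8 | data[-1]
        chars.append(chr(BASE + (n >> 4)))
        chars.append(chr(BASE + (n & 0xF)))
        chars.append(PAD)
    return "".join(chars)

def base4k_decode(text: str) -> bytes:
    padded = text.endswith(PAD)
    body = text[:-1] if padded else text
    vals = [ord(c) - BASE for c in body]
    tail = 2 if padded else len(vals) % 2   # 0, 1 or 2 trailing bytes
    out = bytearray()
    for i in range(0, len(vals) - tail, 2):
        n = vals[i] << 12 | vals[i + 1]
        out += bytes([n >> 16, (n >> 8) & 0xFF, n & 0xFF])
    if tail == 1:
        out.append(vals[-1])
    elif tail == 2:
        n = vals[-2] << 4 | vals[-1]
        out += bytes([n >> 8, n & 0xFF])
    return bytes(out)

A quick round-trip check, base4k_decode(base4k_encode(data)) == data, holds
for any byte length, and the output lengths match the tables above.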
So encoding the total length as a decimal number prefix is not absolutely
necessary. It could be done optionally and verified in the decoder, which
would accept such leading digits when present, since they are clearly
separable from the 4k characters used for the binary encoding.
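For completeness, a decoder that tolerates such an optional decimal length
prefix could look like this sketch (reusing the hypothetical base4k_decode
above; ASCII digits cannot collide with the PUA data characters):

def base4k_decode_checked(text: str) -> bytes:
    # Split an optional leading run of decimal digits from the Base4k
    # payload, decode, then verify the declared length.
    i = 0
    while i < len(text) and "0" <= text[i] <= "9":
        i += 1
    declared, body = text[:i], text[i:]
    data = base4k_decode(body)
    if declared and int(declared) != len(data):
        raise ValueError("length prefix does not match decoded data")
    return data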
If UTF-8 is used, the transport stream has this length:
Number of      Number of Base4k characters               Byte length in the
binary bytes   (3 UTF-8 bytes each)                      transport stream
------------   ---------------------------------------   -------------------
3N             2N                                        6N
3N+1           2N+1                                      6N+3
3N+2           2N+2 (+ optionally 1 padding character)   6N+6 (+3 if padded)
(in UTF-8, then, this is clearly worse than Base64, which needs only 4N
bytes for 3N binary bytes)
Efficiency comparison, as payload bytes per encoded byte (assuming that the
4k characters are allocated in a BMP block where each one requires 3 bytes
in UTF-8, and 2 bytes in UTF-16 or SCSU):
          UTF-8    UTF-16   SCSU
base64    75.0%    37.5%    75.0%
base4k    50.0%    75.0%    75.0%
base16k   58.3%    87.5%    87.5%
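These percentages are plain payload-to-output ratios per full group; a
quick sketch to recompute them under the per-character byte costs just
assumed:

# (payload bytes per group, characters per group, bytes per character)
schemes = {
    "base64":  (3, 4, {"UTF-8": 1, "UTF-16": 2, "SCSU": 1}),
    "base4k":  (3, 2, {"UTF-8": 3, "UTF-16": 2, "SCSU": 2}),
    "base16k": (7, 4, {"UTF-8": 3, "UTF-16": 2, "SCSU": 2}),
}
for name, (payload, chars, costs) in schemes.items():
    row = "  ".join(f"{enc} {payload / (chars * cost):.1%}"
                    for enc, cost in costs.items())
    print(f"{name:8} {row}")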
The main interest would not be for the transport stream, where 8-bit bytes
and byte order independence are the important features, so UTF-8 with
Base64 would be better there. But for local data storage and management,
Base4k is significantly better than Base64, assuming that UTF-16 code units
are preserved and their byte order is predictable. What this suggests is
that Base64 and Base4k would work in concert: Base64 used only for
transport, and local management using Base4k over UTF-16 instead, as it is
twice as efficient (75.0% versus 37.5%) but still very simple to decode.
(Note that for computing addresses, the division by 3 needed by Base4k is
easier to implement than Base16k's division by 7, even if you use the
division-by-multiplication-and-shift trick; a sketch follows.)
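For instance, the multiply-and-shift form of the division by 3 can be as
simple as this sketch for unsigned 32-bit values; the constant
(2**33 + 1) // 3 = 0xAAAAAAAB is exact over this whole range, whereas a
division by 7 has no equally convenient constant in 32-bit arithmetic and
needs an extra correction step:

def div3(n: int) -> int:
    # floor(n / 3) == (n * 0xAAAAAAAB) >> 33 for all 0 <= n < 2**32,
    # because 0xAAAAAAAB == (2**33 + 1) // 3 and the rounding error
    # n / (3 * 2**33) stays below 1/6.
    assert 0 <= n < 2**32
    return (n * 0xAAAAAAAB) >> 33

assert all(div3(n) == n // 3 for n in (0, 1, 2, 3, 100, 2**32 - 1))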
This archive was generated by hypermail 2.1.5 : Thu Dec 06 2007 - 17:58:38 CST