From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Feb 21 2011 - 14:44:39 CST
Don't use the proposed C code as a reference: it contains obvious
buffer overflow problems (the necessary buffer length is computed
incorrectly) and handles large files inefficiently (a file does not
need to be read entirely into a single buffer), in addition to an
unusual command-line syntax (not recommended because of its
ambiguity). In any case, this list is not the place for posting
implementation code.
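Just to illustrate the buffering point (and not as reference code
either): a converter can work from a fixed-size buffer, carrying over
at most a few bytes of an incomplete sequence between reads. The
convert_chunk() placeholder and the MAX_SEQ bound below are my own
assumptions, not taken from the proposal:

#include <stdio.h>
#include <string.h>

#define BUF_SIZE 65536
#define MAX_SEQ  4   /* assumed upper bound on one encoded sequence */

/* Placeholder for the proposal's real converter: it should consume only
   complete sequences from in[0..len) and return how many input bytes it
   used.  Here it simply passes the bytes through so the sketch compiles
   and runs. */
static size_t convert_chunk(const unsigned char *in, size_t len, FILE *out)
{
    fwrite(in, 1, len, out);
    return len;
}

int convert_stream(FILE *in, FILE *out)
{
    unsigned char buf[BUF_SIZE + MAX_SEQ];
    size_t kept = 0, got;

    while ((got = fread(buf + kept, 1, BUF_SIZE, in)) > 0) {
        size_t avail = kept + got;
        size_t used  = convert_chunk(buf, avail, out);
        kept = avail - used;            /* at most MAX_SEQ - 1 bytes */
        memmove(buf, buf + used, kept); /* carry the split sequence over */
    }
    return kept == 0 ? 0 : -1;          /* input ended mid-sequence */
}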
Finally, the proposal is ill-conceived in the way it implements its
BOM (one for each supported page): why not stick to the standard
U+FEFF, or use one of the unused scalar values (e.g. encoding each
page code number as a scalar >= 0x110000)? Remember that the ASCII
controls are reserved for something else and have many restrictions
for use in portable plain text (notably MIME compatibility: don't use
FS, GS, RS, US for that).
And why do you actually need an enumeration of codepage numbers, when
all that is needed is to encode the page base as the scalar value of
the first character of each supported page? Since every page is
aligned on its own 64-character boundary, the 21 bits of a scalar
value reduce to 15 bits of page index. You can easily map any of these
15-bit values onto a base page selector encoded as one of the many
unused scalar values (>= 0x110000, which leaves plenty of available
space, even with this encoding), without requiring any "magic" lookup
table (and in that case a data converter could even select the best
page automatically); see the sketch below.
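A minimal sketch of that arithmetic mapping; the 64-character page
size and the 0x110000 selector base are my assumptions (64-character
alignment is what makes the 15-bit figure work out, since only
0x110000 / 64 = 0x4400 selector values are needed):

#define PAGE_SIZE   64u        /* assumed page size and alignment */
#define SELECTOR_0  0x110000u  /* first value beyond the Unicode scalar range */

/* Map a page, identified by the scalar value of its first character
   (aligned on a PAGE_SIZE boundary), to a selector encoded as an
   out-of-range scalar value; no lookup table is needed. */
static unsigned long page_selector(unsigned long page_base)
{
    return SELECTOR_0 + page_base / PAGE_SIZE;
}

/* Recover the page base from a decoded selector. */
static unsigned long selector_page(unsigned long selector)
{
    return (selector - SELECTOR_0) * PAGE_SIZE;
}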
But note that non-ASCII bytes still require tests to determine
whether they are the leading or trailing byte of a multibyte sequence.
ASCII bytes are self-synchronizing, and middle bytes are easy to
process in the backward or forward direction to find the leading or
trailing byte with a small, bounded number of tests; BUT a long run of
characters encoded as 2-byte sequences requires an unbounded number of
tests, until either a middle byte or an ASCII byte is found, just to
know whether a given byte with the binary 10xxxxxx pattern is a
leading or a trailing byte.
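To make the ambiguity concrete, here is a sketch of that backward
resynchronization; the byte classes used below are my reading of this
discussion, not the proposal's exact layout:

/* Assumed byte classes (not normative):
     0xxxxxxx  ASCII, self-synchronizing
     10xxxxxx  lead OR trail byte of a 2-byte sequence (ambiguous alone)
     11xxxxxx  byte of a longer sequence, classifiable with bounded tests */
#include <stddef.h>

static int is_ambiguous(unsigned char b) { return (b & 0xC0) == 0x80; }

/* Decide whether the ambiguous byte at buf[pos] is the lead byte of a
   2-byte sequence.  This assumes the ambiguous run starts on a sequence
   boundary (e.g. right after an ASCII byte or at the buffer start).
   Note that the backward scan is NOT bounded: a long run of 2-byte
   characters must be walked entirely before the parity is known. */
static int is_lead_of_pair(const unsigned char *buf, size_t pos)
{
    size_t run = 0;                 /* ambiguous bytes before pos */
    while (pos > 0 && is_ambiguous(buf[pos - 1])) {
        pos--;
        run++;
    }
    return (run % 2) == 0;          /* even offset => lead byte */
}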
But as the characters encoded as 2-byte sequences are limited to the
U+0080..U+1080 range (in this proposal, when the selected page
encodable as one byte does not fall within this range), this will not
be a problem for texts written in the Latin script (given the high
frequency of ASCII letters); it will be a problem, however, for the
other alphabetic scripts in this range (over any stretch with no
space, punctuation or control) if they do not use a page selection for
the most frequent letters of their alphabet.
Even if all scripts in this range use spaces and standard ASCII
punctuation, this is still a problem for implementing fast searching
algorithms (though the additional compression compared to UTF-8 would
still improve data locality a bit for memory caches and I/O when
handling large amounts of text to compare, so the additional tests
would certainly have a lower impact there).
If I had one comment to make, I think I would even have allowed
splitting the 64-character page into two separate 32-character
subpages (mappable individually), in order to support the ranges of
1-byte codes used by C1 controls or by Windows codepages for very
small scripts: here again, this only requires a few special codes
mappable onto the many unused scalar values (0xD800..0xDFFF,
0x110000..).
One point of interest is that (without using any base page selector)
it can represent the C0 controls, the US-ASCII printable characters,
and the full alphabetic area of ISO-8859-1 as single bytes (instead of
2 bytes with UTF-8). And it can never be longer than UTF-8, and it
also allows bidirectional parsing of the encoded stream (including
from a random position, with very few tests to resynchronize in either
direction).
But I admit that this proposal has its merit: once corrected (for the
above deficiencies) it can correctly map ***ALL*** Unicode scalar
values (not "code points", because UTFs are not supposed to support
all code points, only those that have a scalar value, i.e. excluding
all "surrogate" code points U+D800..U+DFFF for strict compatibility
with UTF-16, which cannot represent them, but still including code
points assigned to noncharacters such as U+FFFF):
For this to work effectively, you must absolutely DROP the special
non-standard handling of FS/GS/RS/US (because it is not even needed!)
and replace it either with the existing standard BOM U+FEFF or, even
better, with the encoding of a single leading page selector whose
encoded scalar value does not fall within the standard Unicode scalar
values (0..0xD7FF, 0xE000..0x10FFFF).
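For reference, the only test a corrected converter needs for valid
input is the standard one; anything at or above 0x110000 can then be
reserved for the page selectors discussed above:

/* A code point has a Unicode scalar value iff it lies in 0..0x10FFFF and
   is not a surrogate code point (U+D800..U+DFFF); noncharacters such as
   U+FFFF do have scalar values and must remain encodable. */
static int is_scalar_value(unsigned long cp)
{
    return cp <= 0x10FFFFul && (cp < 0xD800ul || cp > 0xDFFFul);
}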
And anyway it is also much simpler to understand and easier to
implement correctly (unlike the sample code given here) than SCSU, and
it remains highly compressible with standard compression algorithms
while still allowing very fast in-memory processing in its
decompressed encoded form:
- a bit faster than UTF-8, as seen in my early benchmarks, for a small
number of large texts such as pages in a wiki database,
- but a bit slower for a large number of small strings such as tabular
data, because of the higher number of conditional branches on a CPU
with a 1-way instruction pipeline (not a problem with today's
processors, which include a dozen parallel pipelines even in a single
core, provided the compiled assembly code is correctly optimized and
scheduled to use them when branch prediction cannot help much).
Philippe.
2011/2/20 suzuki toshiya <mpsuzuki@hiroshima-u.ac.jp>:
> Doug Ewell wrote:
>> <mpsuzuki at hiroshima dash u dot ac dot jp> wrote:
>>
>>> In your proposal, the maximum length of the coded character
>>> is 4, which is less than UTF-8's max length. It's an interesting
>>> idea.
>>
>> What code sequences in UTF-8 that represent the modern coding space
>> (ending at 0x10FFFF, not 0x7FFFFFFF) are more than 4 code units in length?
>
> Oh, I'm sorry. I forgot that the shrinking of the ISO/IEC 10646
> codespace reduced the max length of UTF-8.
>
> Regards,
> mpsuzuki
>
>