Re: Case Table Compression Assumptions (was: RE: Posting Links to Ballots (was: RE: Why blackletter letters?))

From: Steffen <sdaoden_at_gmail.com>
Date: Sat, 14 Sep 2013 15:44:19 +0200

Hello,

 |It is a false economy for a general Unicode library implementation
 |to be overly clever about how it compresses tables, such as casing
 |tables. That approach can get you into trouble when something else is
 |added to the standard which breaks your initial assumptions.

oh no, i'm not making any assumptions (or you would have found
a bug) – i'm only joining together ranges of the kind that i (not
a linguist) would expect from alphabets that have a notion of
upper- and lowercase, like a-z and A-Z, and in addition those
which follow an alternating uppercase/lowercase,
uppercase/lowercase order.

The problem is not such compressible alphabets, which need only
two entries to represent the mapping between upper- and lowercase
(thanks to their nice property of being gapless and having
a constant distance between the upper- and lowercase forms), but
every other character that sits by itself, because each such
character also requires an entry of its own.

I.e., looking at the proposed U+A7AB LATIN CAPITAL LETTER REVERSED
OPEN E, i look around and see the already existing U+025C LATIN
SMALL LETTER REVERSED OPEN E, and here i am with two distinct
entries, and therefore just as many as i need to represent all of
ASCII!
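
To illustrate what such an entry could look like (only a rough
sketch; the struct layout, field names and widths are made up and
not necessarily what my tables actually use):

  /* Sketch of a range-based case-mapping entry: all code points in
   * [first, first + length) map to their counterpart by adding
   * a constant delta.  Names and field widths are illustrative. */
  struct cm_range {
     unsigned int first;   /* first code point of the range */
     unsigned int length;  /* number of code points covered */
     int delta;            /* add this to get the cased counterpart */
  };

  /* a-z/A-Z compress into two entries ... */
  static struct cm_range const cm_example[] = {
     { 0x0041, 26,  32 },               /* A-Z -> a-z */
     { 0x0061, 26, -32 },               /* a-z -> A-Z */
     /* ... whereas one isolated pair costs just as many again: */
     { 0x025C, 1, 0xA7AB - 0x025C },    /* small -> capital REVERSED OPEN E */
     { 0xA7AB, 1, 0x025C - 0xA7AB }     /* capital -> small REVERSED OPEN E */
  };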

So – all the input that i can add to this discussion is the
question why we have so many free reserved blocks on the one hand,
but on the other hand the (to me infamous) spaced-far-apart
variation selectors (completely on their own, since the
neighbouring block U+E0000-U+E007F is entirely deprecated), and
such «character islands» that slowly trickle in over time.

It seems to me that somehow this complicated process nonetheless
results in decisions that end up placing items like U+025C in
a cute contiguous range without taking possible cased variants
into account; an assumption perhaps as simple as «we will never
need the uppercase variant» may well seem correct, just to be
proven wrong later, and be it only for completeness' sake.

Indeed U+A7AB is not the only example of this pattern in the
pipeline; there is also U+A7B2 LATIN CAPITAL LETTER J WITH
CROSSED-TAIL (the uppercase of U+029D), and maybe more, yet the
ones mentioned already make up four entries.

…My own project… is still a mix of a table-based and an
array-based approach, but it's early alpha.
It will most likely generate two different dump formats in the
end: one fast but purely table based, practically restricted to
systems with shared dynamic library support, because the data will
consume, well, i don't know yet, but surely over a megabyte; maybe
even more as time goes by (i hope to eventually get through to
collation support).
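
For the fast variant, think of something along the lines of the
usual two-stage lookup tables (just a sketch with assumed names;
the actual dump format is not decided yet):

  /* Sketch of a two-stage table lookup, as commonly used for fast
   * Unicode property access: stage one indexes 256-code-point
   * pages, stage two holds per-code-point case deltas.  The arrays
   * would come out of the generated dump; all names are assumed. */
  extern unsigned short const cm_stage1[];  /* 0x110000 / 256 entries */
  extern int const cm_stage2[];             /* shared pages of 256 deltas */

  static inline unsigned int
  cm_to_upper_fast(unsigned int cp)
  {
     /* identical pages are shared, so the many unmapped characters
      * cost little more than one all-zero page */
     return cp + cm_stage2[cm_stage1[cp >> 8] * 256 + (cp & 0xFF)];
  }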

And a slow but space-minimized one, a binary search array with
reduced character-class information. This one is for systems
which don't have a notion of shared dynamic libraries / objects,
where each saved byte really matters a lot (since of course the
actually used data needs to be replicated into each and every
program that uses it). There are systems of this kind out there
(and really smart ones, and i'm not saying «still»). Also, POSIX
still doesn't support building shared objects via the standard
tools, like c99(1), i.e., you have to assume blindly or be
system-specific.
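
For that variant the lookup would be roughly a binary search over
entries like the cm_range sketch above (again only a sketch, not
my actual code):

  #include <stddef.h>

  /* Sketch: binary search over a sorted array of cm_range entries
   * (see the struct sketched above).  Returns the case
   * counterpart, or cp itself if there is no mapping. */
  static unsigned int
  cm_map_small(struct cm_range const *tbl, size_t n, unsigned int cp)
  {
     size_t lo = 0, hi = n;

     while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (cp < tbl[mid].first)
           hi = mid;
        else if (cp >= tbl[mid].first + tbl[mid].length)
           lo = mid + 1;
        else
           return (unsigned int)((int)cp + tbl[mid].delta);
     }
     return cp;
  }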

Indeed i agree with you.
Thanks a lot, and ciao from where the greyness comes from,

 |--Ken

--steffen

attached mail follows:


Steffen,

FYI, Unicode 7.0, when it comes out, will have another entire
bicameral (casing) script added to it: Warang Citi. And when
Old Hungarian is finally published, at some point after Unicode 7.0,
that will be *another* bicameral script added. It is unlikely that those
two will be the last. And those are in addition to the continual trickle
of case pairs to already existing bicameral scripts like Latin and
Cyrillic.

It is a false economy for a general Unicode library implementation
to be overly clever about how it compresses tables, such as casing
tables. That approach can get you into trouble when something else is
added to the standard which breaks your initial assumptions.

If you want to do this kind of thing, my suggestion would be
instead to do a two-step process: first implement a general
table which can always be easily updated based on new additions
to UnicodeData.txt (and/or SpecialCasing.txt and CaseFolding.txt,
depending on what kind of case tables you are implementing),
and which doesn't worry too much about table size. Then
write a *separate* optimization step which can compress
your generic table format into a more compact format.
If you do it that way, your adaptation to future additions to
the standard can be much more robust, while still optimizing
for minimal table size.
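
A minimal sketch of what such a separate compression pass could
look like, reusing an entry layout like the cm_range sketch above
(the input representation and all names are merely assumed):

  /* Sketch of a separate optimization step: given a per-code-point
   * delta array built straight from UnicodeData.txt, merge runs of
   * consecutive code points that share the same non-zero delta
   * into range entries. */
  static size_t
  cm_compress(int const *delta, unsigned int ncp, struct cm_range *out)
  {
     size_t n = 0;

     for (unsigned int cp = 0; cp < ncp; ++cp) {
        if (delta[cp] == 0)
           continue;
        if (n > 0 && out[n - 1].delta == delta[cp] &&
              out[n - 1].first + out[n - 1].length == cp)
           ++out[n - 1].length;
        else {
           out[n].first = cp;
           out[n].length = 1;
           out[n].delta = delta[cp];
           ++n;
        }
     }
     return n;
  }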

--Ken

>
>
> I have been able to compress all lower-, upper- and titlecase
> mappings, simple and extended (no conditions yet) of Unicode 6.2
> into a 260 entry binary search array.
> I'm not with this project at the moment, but looking at the
> alloc/Pipeline.html it *could* be that those few characters alone
> will add maybe 10 (sorry..) more slots, if the presence of SMALL
> or CAPITAL indicates they'll be Lt/Lu/Ll or will have an entry in
> `SpecialCasing.txt'.
> I hope that this wonderful thing that is the UCS will not become
> blurred -- memory size is still a concern for some people.
> (Reading how the process works doesn't give a lot of hope, yet
> that is what came to my mind.)
> Ciao,
>
> --steffen

Received on Sat Sep 14 2013 - 08:46:20 CDT