Friday, June 19, 2009

Project: Krypton

Mathematical compression based on Pascal's Triangle, binomial coefficient, combinadic(s), etc.





stolen from (http://cas.umkc.edu/math/ExtraCaption.asp)


Basically, I've taken this as far as I can go alone... without help it will never be seen, let alone work.

The premiss is very simple... it is a mathematical, iterative algorithm that uses the binomial coefficient rather than being dictionary/symbol based.

It has several performance problems that do not allow me to test and otherwise effect it's general feasibility. Among these problems are the massive amounts of memory and computing power currently required for the algorithm. I believe intelligent optimizations in the underlying math can improve the computing speed at least.

Algorithm: combinations

The algorithm started as a non-iteractive, computive formula based on the idea of finite systems for a given data length.

In the simple-est of terms, given a fixed number of slots to occupy with the binary digits 1 or 0, there is a set number of possible combinations for those digits. The number of possibilities is easily computed using the following formula: 2^N (two raised to the exponent N), whereas N is the number of available slots.

For example a 2 slot system has 2^2 (= 4) possible data sets (00,01,10,11). A 3 slot system would have 2^3 (= 8) possible combinations (000,001,010,100,011,101,110,111).

These combinations can further be subdivided into groups based on how many 1's occupy the available slots. In the 3 slot example there are 4 groups... containing combinations of either 0, 1, 2, or 3 ones. Here are the groups broken down:

0 ones = 000

1 one = 001, 010, or 100

2 ones = 011, 101, or 110

3 ones = 111

As you can see there are 1, 3, 3, 1 combinations for 0, 1, 2, 3 ones.

This same distribution is visible in the fourth row of the graphic below (commonly known as a Pascal's Triangle, though the concept predates Pascal by centuries). Any row (and slot index) can be computed using the binomial coefficient.


Algorithm: compression, failed

So how do we get compression out of combinations?

If you look at the 3 slot example you see that the range of ones in a pattern must be between 0 and 3. The numbers 0-3 can be represented using two binary digits... 00, 01, 10, 11 represent 0, 1, 2, 3 counts for the available slots.

The goal of any compression algorithm is to represent data using smaller amounts of data. For a 3 slot system you can reperesent the number of ones with 2 digits... one less than the 3 comprising the data.

Unfortunately those 2 digits can only tell the de-compression algorithm which subgroup the original data is in. For example... a count of 2 (10 in binary) lands the decompressor in the group containing these 3 possibilities 011, 101, or 110 (each containing two 1 digits and one 0 digit).

There is no way for the decompressor to know which is the desired pattern without more information and we only have 1 bit left to help it out. The 2 problems with using that 1 bit...1.) is that it can only specify 2 choices rather than the 3 we seek an answer about... 2.) using this 1 bit brings us back to the original 3 bits therefore not saving us any data.

How do we solve these problems?

In any system it is a good idea to see how it scales. For example... if 2 digits can count a 3 slot data set, how many can 3 digits count? The answer of course is 2^3 which equals 8 and can represent a count of between 0 and 7 ones. Therefore a dataset of seven digits, such as '1010101' has a count of 4 ones (and 3 zeros) which will be represented by the binary string '100'. In this case we now have 4 of the original 7 digits to help us guess which pattern within a group is the correct one.

Unfortunately since our count is 4 there are 35 possible patterns (see 8th row in triangle graphic)... if we had a count of 3 it would also be 35 possible patterns. We have 4 bits left to help us guess (3 if we want compression). The 4 bits only help us guess within a set of 16 possibilities (2^4 = 16). If we had 5 bits we could tell the decompressor that it is one of a possible 32 patterns (2^5). Still that doesn't cover the full 35 available in a 7 slot system with 3 or 4 of the slots filled with ones. Also 3 counting bits plus the 5-6bits needed to help choose the correct pattern is a total of 8-9 bits.... which is actually more than the original 7 slots.

This is a major problem in pursuing a counting index-based algorithm.

The trend of having more guessing bits continues as you move to larger slotted data sets. Below are some examples of how many slots/bits are supported by a particula number of counting bits.


counting bits data set slots file size
16 65,535 slots 8 KB
24 16,777,215 slots 2 MB
32 4,294,967,295 slots 512 MB
35 34,359,738,367 slots 4 GB
43 8796093022208 slots 1 TB


As you can see in the chart above, even at 24 counting bits we can represent a large number of slots. The problem is when the number of filled slots is close to half the number of bits (for example 8,388,607 ones out of 16,777,215 slots... as it would be in the worse case scenario).

In this case the count would cost you 24 bits and the binary number to represent the index (of the correct pattern) within the group would probably at least a bit or 2 over 16,777,192.... lets say 16,777,200... plus the 24 counting bits.... so 16,777,224... slightly over our our original data length.

This is the same problem we encountered using 3 bits to count a 7 bit/slot data system.

Frustrating isn't it?

Algorithm: iteration

So whats the current problem... we just spent 24 bits to represent how many 1's are in these 16,777,215 slots... and we need a ton more to represent which is the correct pattern out of the many possibile ones that exist for that particular slot count. We need almost, but not quite as many digits as the original data... so really no compression is happening if we store our count and our 'answer index' bits together for the decompressor.

But what if we didn't store our 'answer index' bits?

What if we instead counted how many bits are 1's in our 'answer' bits?

But how is that going to help?... If the number of slots is almost as big as the original.... then wont the new count and the new answer also be too big? Well the key here is that its almost as big.... but its not quite as big. That and we aren't going to store the answer index this time either.... we are just going to store the 24 count bits from the original data as well as the 24 count bits from the 'answer bits' (now our new data).

Now if the original data is 16,777,215 we can do @699,050 iterations using a 24 bit counter. Of course if our final answer is more that 1 bit than we don't have compression again.

The question becomes how many iterations do we have to do before our answer is small enough that if we tack our answer onto our list of counters it is still smaller that the original. Is it even possible? Yes, it is possible... most mathematical compression algorithms work at least some of the time. We are trying to tackle the worst case scenario here as well (cases further from the center of the normal curve decay faster).

And in practice, after a certain number or iterations it will take only 23 bits to count the new data, then that will shrink to 22 bits eventually after a similar number of iterations.

I think this is enough for the first post. All ideas and concepts are mine alone and go back over many years. Don't steal them unless you wanna pay me one day. I'm going to slowly add the rest the concepts, to hopefully get some kind of open yet profitable implementaion going.

I have a few illustrative examples via code and graphics that I will post at some point.



No comments: