cmix is a lossless data compression program aimed at optimizing compression ratio at the cost of high CPU/memory usage. cmix is free software distributed under the GNU General Public License.
cmix is currently ranked first on the Large Text Compression Benchmark and the Silesia Open Source Compression Benchmark. It also has state-of-the-art results on the Calgary Corpus and Canterbury Corpus. cmix has surpassed the winning entry of the Hutter Prize, although it exceeds the memory limits of the contest.
cmix runs on Linux, Windows, and Mac OS X. At least 32 GB of RAM is recommended to run cmix. Feel free to contact me at firstname.lastname@example.org if you have any questions.
GitHub repository: https://github.com/byronknoll/cmix
|Source Code|Release Date|Windows Executable|
|---|---|---|
|cmix-v12.zip|November 7, 2016|cmix-v12-windows.zip|
|cmix-v11.zip|July 3, 2016|cmix-v11-windows.zip|
|cmix-v10.zip|May 30, 2016|cmix-v10-windows.zip|
|cmix-v9.zip|April 8, 2016|cmix-v9-windows.zip|
|cmix-v8.zip|November 10, 2015||
|cmix-v7.zip|February 4, 2015||
|cmix-v6.zip|September 2, 2014||
|cmix-v5.zip|August 13, 2014||
|cmix-v4.zip|July 23, 2014||
|cmix-v3.zip|June 27, 2014||
|cmix-v2.zip|May 29, 2014||
|cmix-v1.zip|April 13, 2014||
Compression and decompression times are symmetric.
cmix uses three main components:
- Preprocessing
- Model prediction
- Context mixing
The preprocessing stage transforms the input data into a form that is more easily compressible. This data is then compressed in a single pass, one bit at a time. cmix generates a probabilistic prediction for each bit, and the probability is encoded using arithmetic coding.
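To illustrate how a per-bit probability becomes compressed output, here is a minimal Python sketch of binary arithmetic coding. It is not cmix's coder: it uses exact fractions rather than fixed-width integers, and the `laplace` predictor is a hypothetical stand-in for the real model ensemble. The names `encode`, `decode`, and `laplace` are assumptions made for this example.

```python
from fractions import Fraction

def laplace(history):
    """Hypothetical adaptive model: P(next bit = 1) from counts seen so far."""
    return Fraction(sum(history) + 1, len(history) + 2)

def encode(bits, predict):
    """Narrow the interval [low, low + width) once per bit."""
    low, width = Fraction(0), Fraction(1)
    history = []
    for bit in bits:
        p1 = predict(history)
        split = width * (1 - p1)      # lower part of the interval means "bit 0"
        if bit:
            low, width = low + split, width - split
        else:
            width = split
        history.append(bit)
    return low, width                 # any number in the interval identifies the input

def decode(value, n_bits, predict):
    """Replay the same predictions and check which sub-interval holds value."""
    low, width = Fraction(0), Fraction(1)
    history = []
    for _ in range(n_bits):
        p1 = predict(history)
        split = width * (1 - p1)
        bit = 1 if value >= low + split else 0
        if bit:
            low, width = low + split, width - split
        else:
            width = split
        history.append(bit)
    return history

bits = [1, 1, 0, 1, 1, 1, 0, 1]
low, width = encode(bits, laplace)
restored = decode(low, len(bits), laplace)
```

The better the predictions, the wider the final interval and the fewer bits are needed to name a number inside it. In this sketch the decoder has to run the same predictor on the bits decoded so far, so both directions do the same modeling work.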
cmix uses a transformation on three types of data:
- Binary executables
- Natural language text
- Images
The preprocessor uses separate components for detecting the type of data and actually doing the transformation.
For images and binary executables, I used code for detection and transformation from the open source paq8pxd program.
I wrote my own code for detecting natural language text. For transforming the text, I used code from the open source paq8hp12any program. This uses an English dictionary and a word replacing transform. The dictionary is 465,211 bytes.
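The general idea of a word replacing transform can be sketched in a few lines of Python. This is only a toy under stated assumptions: the five-word `DICTIONARY`, the `ESCAPE` marker, and the two-character code format are all hypothetical, and the actual encoding used by paq8hp12any is different and more elaborate.

```python
import re

# Hypothetical toy dictionary; the real one is a 465,211-byte English word list.
DICTIONARY = ["the", "of", "and", "compression", "data"]
WORD_TO_INDEX = {word: i for i, word in enumerate(DICTIONARY)}
ESCAPE = "\x01"  # assumed never to occur in the input text

def transform(text):
    """Replace each dictionary word with ESCAPE plus a one-character code."""
    assert ESCAPE not in text, "toy transform needs an unused escape character"
    def replace(match):
        word = match.group(0)
        if word in WORD_TO_INDEX:
            return ESCAPE + chr(0x80 + WORD_TO_INDEX[word])
        return word  # unknown words pass through unchanged
    return re.sub(r"[A-Za-z]+", replace, text)

def inverse(encoded):
    """Expand every ESCAPE + code pair back into the original word."""
    def expand(match):
        return DICTIONARY[ord(match.group(1)) - 0x80]
    return re.sub(ESCAPE + "(.)", expand, encoded, flags=re.DOTALL)

text = "the ratio of data compression and the data"
encoded = transform(text)
decoded = inverse(encoded)
```

The transform is lossless because the inverse recovers the text exactly, and frequent words shrink to short codes that the later modeling stages see as single symbols.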
cmix v12 uses a total of 1,746 independent models. There are a variety of different types of models, some specialized for certain types of data such as text, executables, or images. For each bit of input data, each model outputs a single floating point number, representing the probability that the next bit of data will be a 1. The majority of the models come from other open source compression programs: paq8l, paq8pxd, and paq8hp12any.
cmix uses a neural network to combine the model predictions into a single probability. This probability is then refined using an algorithm called secondary symbol estimation.
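Secondary symbol estimation can be pictured as a small adaptive lookup table that maps an input probability (plus a context) to a corrected probability. The Python sketch below follows the style of the adaptive probability maps found in PAQ-family compressors. The class name `SSE`, the 33 buckets, and the update rate are assumptions for illustration, not values taken from cmix.

```python
import math

def stretch(p):
    return math.log(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

class SSE:
    """One row of 33 probabilities per context, interpolated in the stretch domain."""

    def __init__(self, n_contexts, rate=0.02):
        self.rate = rate
        # Bucket j covers stretch(p) = (j - 16) / 2, so each row starts near the identity map.
        self.table = [[squash((j - 16) / 2.0) for j in range(33)]
                      for _ in range(n_contexts)]

    def refine(self, p, ctx):
        p = min(max(p, 1e-6), 1.0 - 1e-6)
        x = min(max(stretch(p), -7.999), 7.999)
        pos = (x + 8.0) * 2.0          # position in [0, 32)
        self.j = int(pos)
        self.w = pos - self.j          # weight of the upper neighbouring bucket
        self.row = self.table[ctx]
        return (1 - self.w) * self.row[self.j] + self.w * self.row[self.j + 1]

    def update(self, bit):
        # Move the two buckets used by the last refine() toward the observed bit.
        j, w, row = self.j, self.w, self.row
        row[j] += self.rate * (1 - w) * (bit - row[j])
        row[j + 1] += self.rate * w * (bit - row[j + 1])

fresh = SSE(n_contexts=1)
initial = fresh.refine(0.5, 0)

trained = SSE(n_contexts=1)
for _ in range(200):
    trained.refine(0.5, 0)
    trained.update(1)              # the input said 0.5, but the bit was always 1
corrected = trained.refine(0.5, 0)
```

After training, the table has learned that an input of 0.5 in this context is systematically too low, and it returns a higher probability instead.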
cmix uses a neural network architecture similar to that of paq8l. cmix v12 uses three layers of connections, with 415,136 neurons and 718,369,836 weights.
- Loss function: cross entropy
- Activation function: logistic
- Optimization procedure: stochastic gradient descent
There are some differences compared to standard neural network implementations:
- cmix does not use backpropagation of gradients. Instead, every neuron in the network directly tries to minimize cross entropy.
- Instead of using a global learning rate, different modules of the network have different learning rate parameters.
- Only a small subset of neurons is activated for each prediction. The activations are based on a set of contexts (i.e. functions of the recent input history). The context-dependent activations improve prediction and reduce computational complexity.
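The points above can be sketched for a single neuron in Python. This is a minimal illustration in the style of PAQ-family logistic mixing, not cmix's actual code: the class name `GatedMixer`, the learning rate, and the two toy input models are assumptions. Each context value selects its own weight vector, and the selected weights take one gradient step on the cross entropy of that neuron's own output.

```python
import math

def stretch(p):
    return math.log(p / (1.0 - p))

def squash(x):
    return 1.0 / (1.0 + math.exp(-x))

class GatedMixer:
    """A logistic neuron with one weight vector per context value."""

    def __init__(self, n_inputs, n_contexts, learning_rate=0.02):
        self.weights = [[0.0] * n_inputs for _ in range(n_contexts)]
        self.learning_rate = learning_rate   # local to this neuron, not global

    def predict(self, probs, ctx):
        self.inputs = [stretch(p) for p in probs]
        self.active = self.weights[ctx]      # only this weight vector is used
        total = sum(w * x for w, x in zip(self.active, self.inputs))
        self.output = squash(total)
        return self.output

    def update(self, bit):
        # For cross entropy with a logistic output, dLoss/dw_i = (output - bit) * x_i.
        error = bit - self.output
        for i, x in enumerate(self.inputs):
            self.active[i] += self.learning_rate * error * x

def train_demo(steps=500):
    mixer = GatedMixer(n_inputs=2, n_contexts=2)
    for t in range(steps):
        bit = t % 2
        ctx = (t // 2) % 2
        p_a = 0.9 if bit == 1 else 0.1
        if ctx == 1:
            p_a = 1.0 - p_a                  # toy model A is reliably wrong in context 1
        mixer.predict([p_a, 0.5], ctx)       # toy model B is uninformative
        mixer.update(bit)
    return mixer

mixer = train_demo()
```

In this toy setup the weight on model A becomes positive in context 0 and negative in context 1, so the same input model is trusted in one context and inverted in the other. A layered network in this spirit would feed the outputs of many such neurons into further neurons, each trained on its own error.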