cmix is a lossless data compression program aimed at optimizing compression ratio at the cost of high CPU/memory usage. It gets state of the art results on several compression benchmarks. cmix is free software distributed under the GNU General Public License.
cmix works in Linux, Windows, and Mac OS X. At least 32GB of RAM is recommended to run cmix. Feel free to contact me at firstname.lastname@example.org if you have any questions.
GitHub repository: https://github.com/byronknoll/cmix
|Source Code||Release Date||Windows Executable|
|cmix-v18.zip||August 1, 2019||cmix-v18-windows.zip|
|cmix-v17.zip||March 24, 2019||cmix-v17-windows.zip|
|cmix-v16.zip||October 3, 2018||cmix-v16-windows.zip|
|cmix-v15.zip||May 5, 2018||cmix-v15-windows.zip|
|cmix-v14.zip||October 20, 2017||cmix-v14-windows.zip|
|cmix-v13.zip||April 24, 2017||cmix-v13-windows.zip|
|cmix-v12.zip||November 7, 2016||cmix-v12-windows.zip|
|cmix-v11.zip||July 3, 2016||cmix-v11-windows.zip|
|cmix-v10.zip||May 30, 2016||cmix-v10-windows.zip|
|cmix-v9.zip||April 8, 2016||cmix-v9-windows.zip|
|cmix-v8.zip||November 10, 2015|
|cmix-v7.zip||February 4, 2015|
|cmix-v6.zip||September 2, 2014|
|cmix-v5.zip||August 13, 2014|
|cmix-v4.zip||July 23, 2014|
|cmix-v3.zip||June 27, 2014|
|cmix-v2.zip||May 29, 2014|
|cmix-v1.zip||April 13, 2014|
Compression and decompression time are symmetric. The compressed size can vary slightly depending on the compiler settings used to build the executable.
- Large Text Compression Benchmark
- Silesia Open Source Compression Benchmark
- Lossless Photo Compression Benchmark
Silesia Open Source Compression Benchmark
cmix uses three main components:
- Model prediction
- Context mixing
The preprocessing stage transforms the input data into a form which is more easily compressible. This data is then compressed using a single pass, one bit at a time. cmix generates a probabilistic prediction for each bit and the probability is encoded using arithmetic coding.
cmix uses an ensemble of independent models to predict the probability of each bit in the input stream. The model predictions are combined into a single probability using a context mixing algorithm. The output of the context mixer is refined using an algorithm called secondary symbol estimation (SSE).
cmix uses a transformation on three types of data:
- Binary executables
- Natural language text
The preprocessor uses separate components for detecting the type of data and actually doing the transformation.
For images and binary executables, I used code for detection and transformation from the open source paq8pxd program.
I wrote my own code for detecting and transforming natural language text. It uses an English dictionary and a word replacing transform. The dictionary comes from the phda Hutter prize entry and is 415,377 bytes.
cmix v18 uses a total of 2,122 independent models. There are a variety of different types of models, some specialized for certain types of data such as text, executables, or images. For each bit of input data, each model outputs a single floating point number, representing the probability that the next bit of data will be a 1. The majority of the models come from other open source compression programs: paq8l, paq8pxd, and paq8hp12any.
The byte-level mixer uses long short-term memory (LSTM) trained using backpropagation through time. It uses Adam optimization with layer normalization and learning rate decay. The LSTM forget and input gates are coupled. I created two other projects which compress data using only LSTM: lstm-compress and tensorflow-compress. Their results are posted on the Large Text Compression Benchmark.
- Loss function: cross entropy + L2 regularizer
- Activation function: logistic
- Optimization procedure: stochastic gradient descent
- Every neuron in the network directly tries to minimize cross entropy, so there is no backpropagation of gradients between layers.
- The inputs to each neuron (values between 0 to 1) are transformed using the logit function.
- Only a small subset of neurons are activated for each prediction. The activations are based on manually defined contexts (i.e. functions of the recent input history). One neuron is activated for each context. The context-dependent activations improve prediction and reduce computational complexity.
- Instead of using a global learning rate, each context set has its own learning rate parameter. There is also learning rate decay.
Comparison to PAQ8cmix shares many similarities to the PAQ8 family of compression programs. There are many different branches of PAQ8. Here are some of the major differences between cmix and other PAQ8 variants:
- In terms of performance, cmix typically has a better compression rate but is slower and uses more memory. cmix uses more predictive models than most PAQ8 variants.
- cmix uses a larger context mixing network. There are also several implementation details which differ (e.g. floating point arithmetic, learning rate decay, number of layers, etc).
- cmix uses two separate PAQ8 branches as internal models. One of the branches is paq8hp12any (an early Hutter Prize submission). The other branch is a hybrid of several PAQ8 programs (i.e. a custom version of PAQ8 unique to cmix).
- The LSTM component is unique to cmix.
- The use of mod_ppmd. This is a PPM implementation that produces byte-level predictions. The predictions are used as input to the LSTM.
Thanks to AI Grant for funding cmix.
cmix uses ideas and source code from many people in the data compression community. Here are some of the major contributors:
- Matt Mahoney
- Alex Rhatushnyak
- Eugene Shelwien
- Márcio Pais
- Kaido Orav
- Mathieu Chartier
- Fabrice Bellard