CMIX

cmix is a lossless data compression program aimed at optimizing compression ratio at the cost of high CPU/memory usage. cmix is free software distributed under the GNU General Public License.

cmix is currently ranked first place on the Large Text Compression Benchmark and the Silesia Open Source Compression Benchmark. It also has state of the art results on the Calgary Corpus and Canterbury Corpus. cmix has surpassed the winning entry of the Hutter Prize (but exceeds the memory limits of the contest).

cmix works in Linux, Windows, and Mac OS X. At least 32GB of RAM is recommended to run cmix. Feel free to contact me at byron@byronknoll.com if you have any questions.

GitHub repository: https://github.com/byronknoll/cmix

Downloads

Source Code Release Date Windows Executable
cmix-v12.zip November 7, 2016 cmix-v12-windows.zip
cmix-v11.zip July 3, 2016 cmix-v11-windows.zip
cmix-v10.zip May 30, 2016 cmix-v10-windows.zip
cmix-v9.zip April 8, 2016 cmix-v9-windows.zip
cmix-v8.zip November 10, 2015
cmix-v7.zip February 4, 2015
cmix-v6.zip September 2, 2014
cmix-v5.zip August 13, 2014
cmix-v4.zip July 23, 2014
cmix-v3.zip June 27, 2014
cmix-v2.zip May 29, 2014
cmix-v1.zip April 13, 2014

Benchmarks

Corpus Original size
(bytes)
Compressed size
(bytes)
Compression time
(seconds)
Memory usage
(KiB)
calgary.tar 3152896 551123 2454.82 20073596
silesia 211938580 30081329
enwik6 1000000 181003 987.05 19644788
enwik8 100000000 15440186 54979.86 24611240
enwik9 1000000000 121718424 571339.93 27865428

Compression and decompression time are symmetric.

Calgary Corpus

File Original size
(bytes)
Compressed size
(bytes)
Cross entropy
BIB 111261 17694 1.272
BOOK1 768771 177967 1.852
BOOK2 610856 109133 1.429
GEO 102400 43394 3.390
NEWS 377109 79206 1.680
OBJ1 21504 7189 2.674
OBJ2 246814 41136 1.333
PAPER1 53161 11069 1.666
PAPER2 82199 17508 1.704
PIC 513216 22481 0.350
PROGC 39611 8557 1.728
PROGL 71646 9260 1.034
PROGP 49379 6479 1.050
TRANS 93695 10300 0.879

Canterbury Corpus

File Original size
(bytes)
Compressed size
(bytes)
Cross entropy
alice29.txt 152089 31748 1.670
asyoulik.txt 125179 29960 1.915
cp.html 24603 4895 1.592
fields.c 11150 2029 1.456
grammar.lsp 3721 813 1.748
kennedy.xls 1029744 8472 0.066
lcet10.txt 426754 75236 1.410
plrabn12.txt 481861 114315 1.898
ptt5 513216 22481 0.350
sum 38240 7148 1.495
xargs.1 4227 1168 2.211

Description

I started working on cmix in December 2013. Most of the ideas I implemented came from the book Data Compression Explained by Matt Mahoney.

cmix uses three main components:

  1. Preprocessing
  2. Model prediction
  3. Context mixing

The preprocessing stage transforms the input data into a form which is more easily compressible. This data is then compressed using a single pass, one bit at a time. cmix generates a probabilistic prediction for each bit and the probability is encoded using arithmetic coding.

cmix uses an ensemble of independent models to predict the probability of each bit in the input stream. The model predictions are combined into a single probability using a context mixing algorithm.

Architecture

architecture

Preprocessing

cmix uses a transformation on three types of data:

  1. Binary executables
  2. Natural language text
  3. Images

The preprocessor uses separate components for detecting the type of data and actually doing the transformation.

For images and binary executables, I used code for detection and transformation from the open source paq8pxd program.

I wrote my own code for detecting natural language text. For transforming the text, I used code from the open source paq8hp12any program. This uses an English dictionary and a word replacing transform. The dictionary is 465,211 bytes.

As seen on the Silesia benchmark, additional preprocessing using the precomp program can improve cmix compression on some files.

Model prediction

cmix v12 uses a total of 1,746 independent models. There are a variety of different types of models, some specialized for certain types of data such as text, executables, or images. For each bit of input data, each model outputs a single floating point number, representing the probability that the next bit of data will be a 1. The majority of the models come from other open source compression programs: paq8l, paq8pxd, and paq8hp12any.

Context mixing

mixer

cmix uses a neural network to combine the model predictions into a single probability. This probability is then refined using an algorithm called secondary symbol estimation.

cmix uses a similar neural network architecture to paq8l. cmix v12 uses three layers of connections, with 415,136 neurons and 718,369,836 weights.

There are some differences compared to standard neural network implementations:

  1. cmix does not use backpropagation of gradients. Instead, every neuron in the network directly tries to minimize cross entropy.
  2. Instead of using a global learning rate, different modules of the network have different learning rate parameters.
  3. Only a small subset of neurons are activated for each prediction. The activations are based on a set of contexts (i.e. functions of the recent input history). The context-dependent activations improve prediction and reduce computational complexity.