CMIX

cmix is a lossless data compression program that optimizes for compression ratio at the cost of high CPU and memory usage. It achieves state-of-the-art results on several compression benchmarks. cmix is free software distributed under the GNU General Public License.

cmix runs on Linux, Windows, and Mac OS X. At least 32GB of RAM is recommended. Feel free to contact me at byron@byronknoll.com if you have any questions.

GitHub repository: https://github.com/byronknoll/cmix

Downloads

Source Code    Release Date         Windows Executable
cmix-v16.zip   October 3, 2018      cmix-v16-windows.zip
cmix-v15.zip   May 5, 2018          cmix-v15-windows.zip
cmix-v14.zip   October 20, 2017     cmix-v14-windows.zip
cmix-v13.zip   April 24, 2017       cmix-v13-windows.zip
cmix-v12.zip   November 7, 2016     cmix-v12-windows.zip
cmix-v11.zip   July 3, 2016         cmix-v11-windows.zip
cmix-v10.zip   May 30, 2016         cmix-v10-windows.zip
cmix-v9.zip    April 8, 2016        cmix-v9-windows.zip
cmix-v8.zip    November 10, 2015
cmix-v7.zip    February 4, 2015
cmix-v6.zip    September 2, 2014
cmix-v5.zip    August 13, 2014
cmix-v4.zip    July 23, 2014
cmix-v3.zip    June 27, 2014
cmix-v2.zip    May 29, 2014
cmix-v1.zip    April 13, 2014

Benchmarks

Corpus        Original size (bytes)   Compressed size (bytes)   Compression time (seconds)   Memory usage (KiB)
calgary.tar   3152896                 541361                    2369.03                      22176952
silesia       211938580               28857460
enwik6        1000000                 178095                    678.19                       20133680
enwik8        100000000               14955482                  60039.59                     23433648
enwik9        1000000000              116912035                 613898.54                    27708568

Compression and decompression times are symmetric: decompressing a file takes about as long as compressing it. The compressed size can vary slightly depending on the compiler settings used to build the executable.

External Benchmarks

Silesia Open Source Compression Benchmark

File      Original size (bytes)   Compressed size (bytes)
dickens   10192446                1812032
mozilla   51220480                7061769
mr        9970564                 1840309
nci       33553445                799885
ooffice   6152192                 1235733
osdb      10085684                1965662
reymont   6627202                 706271
samba     21606400                1638112
sao       7251944                 3730422
webster   41458703                4321121
xml       5345280                 237384
x-ray     8474240                 3508760

Calgary Corpus

File     Original size (bytes)   Compressed size (bytes)
BIB      111261                  17506
BOOK1    768771                  174739
BOOK2    610856                  107520
GEO      102400                  42992
NEWS     377109                  78033
OBJ1     21504                   7117
OBJ2     246814                  40581
PAPER1   53161                   10981
PAPER2   82199                   17395
PIC      513216                  22153
PROGC    39611                   8436
PROGL    71646                   9009
PROGP    49379                   6307
TRANS    93695                   10161

Canterbury Corpus

File           Original size (bytes)   Compressed size (bytes)
alice29.txt    152089                  31280
asyoulik.txt   125179                  29647
cp.html        24603                   4830
fields.c       11150                   1976
grammar.lsp    3721                    794
kennedy.xls    1029744                 8133
lcet10.txt     426754                  74194
plrabn12.txt   481861                  112841
ptt5           513216                  22153
sum            38240                   6968
xargs.1        4227                    1150

enwik8

Some language modeling benchmarks use enwik8 split into three sets: the first 90% for training, the next 5% for validation, and the last 5% for testing. Models are usually trained using multiple passes over the training set. This is not a standard way of benchmarking compression programs, but the performance of cmix can still be measured using this setup:

File                        Original size (bytes)   Compressed size (bytes)   Cross entropy
enwik8                      100000000               14955482                  1.1964
training set                90000000                13548217                  1.2043
test set (no training)      5000000                 835351                    1.3366
test set (after training)   5000000                 693239                    1.1092

It was necessary to make a small change to the cmix source code in order to compute "test set (after training)". The code was modified to compress the test set after making a single pass through the training data.
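
The cross entropy column is consistent with bits per byte of input, that is, 8 times the compressed size divided by the original size. The short check below reproduces the four values from the table; the function name bits_per_char is only illustrative and is not part of cmix.

```python
def bits_per_char(compressed_bytes: int, original_bytes: int) -> float:
    """Average number of compressed bits spent per input byte."""
    return 8 * compressed_bytes / original_bytes


# (original size, compressed size) pairs copied from the table above.
rows = {
    "enwik8": (100000000, 14955482),
    "training set": (90000000, 13548217),
    "test set (no training)": (5000000, 835351),
    "test set (after training)": (5000000, 693239),
}

for name, (original, compressed) in rows.items():
    print(f"{name}: {bits_per_char(compressed, original):.4f}")
```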

Description

I started working on cmix in December 2013. Most of the ideas I implemented came from the book Data Compression Explained by Matt Mahoney.

cmix uses three main components:

  1. Preprocessing
  2. Model prediction
  3. Context mixing

The preprocessing stage transforms the input data into a form which is more easily compressible. This data is then compressed in a single pass, one bit at a time. cmix generates a probabilistic prediction for each bit, and the bit is then encoded with arithmetic coding using that probability.
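
To make the predict-then-encode loop concrete, here is a minimal sketch of a bitwise arithmetic coder driven by an adaptive model. It is an illustration under stated assumptions, not cmix's code: the 32-bit carry-less coder follows the general style used by the paq family, the single order-0 BitModel stands in for cmix's full ensemble, and all class and function names are hypothetical.

```python
class BitModel:
    """Order-0 model: one 12-bit probability per partial-byte context."""

    def __init__(self):
        self.p = [2048] * 256  # P(bit == 1), scaled to 0..4095

    def update(self, ctx, bit):
        if bit:
            self.p[ctx] += (4096 - self.p[ctx]) >> 5
        else:
            self.p[ctx] -= self.p[ctx] >> 5


class Encoder:
    def __init__(self):
        self.x1, self.x2 = 0, 0xFFFFFFFF  # current interval [x1, x2]
        self.out = bytearray()

    def encode(self, bit, p1):
        # Split the interval in proportion to P(bit == 1).
        xmid = self.x1 + ((self.x2 - self.x1) >> 12) * p1
        if bit:
            self.x2 = xmid
        else:
            self.x1 = xmid + 1
        # Emit leading bytes once both interval ends agree on them.
        while ((self.x1 ^ self.x2) & 0xFF000000) == 0:
            self.out.append(self.x2 >> 24)
            self.x1 = (self.x1 << 8) & 0xFFFFFFFF
            self.x2 = ((self.x2 << 8) & 0xFFFFFFFF) | 0xFF

    def finish(self):
        self.out += self.x1.to_bytes(4, "big")
        return bytes(self.out)


class Decoder:
    def __init__(self, data):
        self.data, self.pos = data, 0
        self.x1, self.x2, self.x = 0, 0xFFFFFFFF, 0
        for _ in range(4):
            self.x = (self.x << 8) | self._next_byte()

    def _next_byte(self):
        b = self.data[self.pos] if self.pos < len(self.data) else 0
        self.pos += 1
        return b

    def decode(self, p1):
        xmid = self.x1 + ((self.x2 - self.x1) >> 12) * p1
        bit = 1 if self.x <= xmid else 0
        if bit:
            self.x2 = xmid
        else:
            self.x1 = xmid + 1
        while ((self.x1 ^ self.x2) & 0xFF000000) == 0:
            self.x1 = (self.x1 << 8) & 0xFFFFFFFF
            self.x2 = ((self.x2 << 8) & 0xFFFFFFFF) | 0xFF
            self.x = ((self.x << 8) & 0xFFFFFFFF) | self._next_byte()
        return bit


def compress(data: bytes) -> bytes:
    enc, model = Encoder(), BitModel()
    for byte in data:
        ctx = 1  # leading 1 followed by the bits of this byte seen so far
        for i in range(7, -1, -1):
            bit = (byte >> i) & 1
            enc.encode(bit, model.p[ctx])
            model.update(ctx, bit)
            ctx = (ctx << 1) | bit
    return enc.finish()


def decompress(blob: bytes, n: int) -> bytes:
    dec, model = Decoder(blob), BitModel()
    out = bytearray()
    for _ in range(n):
        ctx = 1
        for _ in range(8):
            bit = dec.decode(model.p[ctx])
            model.update(ctx, bit)
            ctx = (ctx << 1) | bit
        out.append(ctx & 0xFF)
    return bytes(out)
```

The decoder runs the same model and makes the same updates as the encoder, which is why compression and decompression cost about the same amount of work in this kind of design.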

cmix uses an ensemble of independent models to predict the probability of each bit in the input stream. The model predictions are combined into a single probability using a context mixing algorithm. The output of the context mixer is refined using an algorithm called secondary symbol estimation (SSE).
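
As a rough illustration of the SSE step, the sketch below maps an input probability and a small context to a refined probability by interpolating between adaptive table entries, then nudges those entries toward the bit that actually occurred. The table size, learning rate, and class name are assumptions chosen for the example and are not taken from cmix.

```python
import math


def stretch(p):
    return math.log(p / (1.0 - p))


def squash(x):
    return 1.0 / (1.0 + math.exp(-x))


class SSE:
    """Illustrative secondary symbol estimation stage (not cmix's code)."""

    def __init__(self, n_contexts, buckets=33, rate=0.02):
        self.rate = rate
        self.step = 16.0 / (buckets - 1)  # buckets span stretch(p) in [-8, 8]
        # Each row starts as the identity mapping: refine(p) is close to p.
        row = [squash(-8.0 + j * self.step) for j in range(buckets)]
        self.table = [row[:] for _ in range(n_contexts)]
        self.last = None

    def refine(self, p, ctx):
        p = min(max(p, 1e-6), 1.0 - 1e-6)
        x = min(max(stretch(p), -7.999), 7.999)
        pos = (x + 8.0) / self.step
        lo = int(pos)
        w = pos - lo  # interpolation weight of the upper bucket
        self.last = (ctx, lo, w)
        row = self.table[ctx]
        return row[lo] * (1.0 - w) + row[lo + 1] * w

    def update(self, bit):
        # Move the two buckets used by the last refine() toward the bit.
        ctx, lo, w = self.last
        row = self.table[ctx]
        row[lo] += self.rate * (1.0 - w) * (bit - row[lo])
        row[lo + 1] += self.rate * w * (bit - row[lo + 1])


sse = SSE(n_contexts=1)
for _ in range(500):
    sse.refine(0.5, 0)
    sse.update(1)  # the mixer keeps saying 0.5, but the bit is always 1
print(sse.refine(0.5, 0))  # now well above 0.5
```

The point of the stage is visible in the usage lines: when the mixer's output is systematically off in some context, the table learns the correction.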

Architecture

[Figure: cmix architecture diagram]

Preprocessing

cmix applies a transformation to three types of data:

  1. Binary executables
  2. Natural language text
  3. Images

The preprocessor uses separate components for detecting the type of data and for performing the transformation.

For images and binary executables, I used code for detection and transformation from the open source paq8pxd program.

I wrote my own code for detecting natural language text. For transforming the text, I used code from the open source paq8hp12any program. This uses an English dictionary and a word replacing transform. The dictionary is 463,903 bytes.
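
The idea of a word replacing transform can be sketched as follows. This is a toy version under loud assumptions: the four-word DICTIONARY, the single-byte codes starting at 0x80, and the function names are all hypothetical, and the actual paq8hp12any transform and its dictionary format are considerably more elaborate.

```python
import re

# Hypothetical tiny dictionary; the real one is a much larger English word list.
DICTIONARY = [b"the", b"compression", b"data", b"model"]
CODE_BASE = 0x80  # codes 0x80.. are free because the input must be 7-bit ASCII
INDEX = {word: i for i, word in enumerate(DICTIONARY)}


def encode_words(text: bytes) -> bytes:
    """Replace dictionary words with one-byte codes."""
    if any(b >= CODE_BASE for b in text):
        raise ValueError("toy transform expects 7-bit ASCII input")

    def replace(match):
        i = INDEX.get(match.group(0))
        return bytes([CODE_BASE + i]) if i is not None else match.group(0)

    return re.sub(rb"[A-Za-z]+", replace, text)


def decode_words(coded: bytes) -> bytes:
    """Invert encode_words by expanding each code back into its word."""
    out = bytearray()
    for b in coded:
        if b >= CODE_BASE:
            out += DICTIONARY[b - CODE_BASE]
        else:
            out.append(b)
    return bytes(out)


text = b"the model predicts the data"
coded = encode_words(text)
print(len(text), len(coded))
print(decode_words(coded) == text)
```

The transform is lossless and shortens common words, so the models that follow see a denser, more regular stream of symbols.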

As seen on the Silesia benchmark, additional preprocessing using the precomp program can improve cmix compression on some files.

Model Prediction

cmix v16 uses a total of 2,011 independent models. There are a variety of different types of models, some specialized for certain types of data such as text, executables, or images. For each bit of input data, each model outputs a single floating point number, representing the probability that the next bit of data will be a 1. The majority of the models come from other open source compression programs: paq8l, paq8pxd, and paq8hp12any.
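
To show what a single model looks like from the outside, here is a minimal sketch of one bit-level context model. It predicts from simple counts keyed by the previous byte and the bits of the current byte seen so far. The class name and the counting rule are assumptions for illustration; the models in cmix and the paq8 family use more sophisticated state and hashing.

```python
class OrderOneBitModel:
    """Illustrative model: P(next bit = 1) given the previous byte."""

    def __init__(self):
        self.counts = {}    # (prev_byte, partial_byte) -> [zeros, ones]
        self.prev_byte = 0
        self.partial = 1    # leading 1 followed by the bits seen so far

    def predict(self) -> float:
        n0, n1 = self.counts.get((self.prev_byte, self.partial), (0, 0))
        # Smoothed estimate so the output is never exactly 0 or 1.
        return (n1 + 0.5) / (n0 + n1 + 1.0)

    def update(self, bit: int) -> None:
        entry = self.counts.setdefault((self.prev_byte, self.partial), [0, 0])
        entry[bit] += 1
        self.partial = (self.partial << 1) | bit
        if self.partial >= 256:  # a full byte has been seen
            self.prev_byte = self.partial & 0xFF
            self.partial = 1


def feed(model: OrderOneBitModel, data: bytes) -> None:
    """Run the model over data, most significant bit first."""
    for byte in data:
        for i in range(7, -1, -1):
            model.predict()
            model.update((byte >> i) & 1)


model = OrderOneBitModel()
feed(model, b"ab" * 40)
# After "...ab" the next byte has been "a" (first bit 0) every time so far.
print(model.predict())
```

Each model in an ensemble exposes the same two operations, predict and update, which is what lets a mixer combine many of them.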

LSTM Mixer

[Figure: LSTM mixer architecture diagram]

The byte-level mixer uses a long short-term memory (LSTM) network trained using backpropagation through time. It uses Adam optimization with learning rate decay. The LSTM forget and input gates are coupled. I created another project called lstm-compress which compresses data using only an LSTM. lstm-compress results are posted on the Large Text Compression Benchmark.
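
Coupling the forget and input gates usually means the forget gate is not learned separately but set to one minus the input gate. The forward step below is a minimal sketch of that common formulation, written with plain lists so it runs without dependencies. The weight layout and function names are assumptions; layer sizes, training, and any further details of the cmix LSTM are not shown here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, z):
    return [sum(w * v for w, v in zip(row, z)) for row in W]


def lstm_step(x, h_prev, c_prev, W_i, W_g, W_o):
    """One forward step of an LSTM cell with coupled gates.

    Each weight matrix has one row per hidden unit and one column per
    entry of [x, h_prev, 1.0] (the trailing 1.0 acts as the bias input).
    """
    z = list(x) + list(h_prev) + [1.0]
    i = [sigmoid(a) for a in matvec(W_i, z)]    # input gate
    g = [math.tanh(a) for a in matvec(W_g, z)]  # candidate cell value
    o = [sigmoid(a) for a in matvec(W_o, z)]    # output gate
    # Coupled gates: the forget gate is (1 - i) instead of a separate gate.
    c = [(1.0 - it) * cp + it * gt for it, cp, gt in zip(i, c_prev, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


# One hidden unit, one input, all-zero weights: every gate evaluates to 0.5.
zeros = [[0.0, 0.0, 0.0]]
h, c = lstm_step([1.0], [0.0], [1.0], zeros, zeros, zeros)
print(h, c)
```

With coupled gates the cell state is always a convex blend of its old value and the new candidate, and one fewer weight matrix has to be stored and trained.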

Context Mixing

[Figure: mixer diagram]

cmix uses a neural network architecture similar to the one in paq8. This architecture is also known as a gated linear network. cmix uses three layers of weights.
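
A single neuron of this kind of mixer can be sketched as below: the model probabilities are mapped to the logistic domain, combined with a weight vector selected by a context, squashed back to a probability, and the selected weights are updated online from the coding error. This is a one-neuron illustration under assumed names and an assumed learning rate, not the three-layer network used by cmix.

```python
import math


def stretch(p):
    return math.log(p / (1.0 - p))


def squash(x):
    x = min(max(x, -30.0), 30.0)  # avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


class Mixer:
    """Illustrative context-gated logistic mixer (one neuron)."""

    def __init__(self, n_inputs, n_contexts, lr=0.02):
        self.lr = lr
        # One weight vector per context; the context "gates" which is used.
        self.weights = [[1.0 / n_inputs] * n_inputs for _ in range(n_contexts)]
        self.x, self.ctx, self.p = [], 0, 0.5

    def mix(self, probs, ctx):
        self.x = [stretch(min(max(p, 1e-6), 1.0 - 1e-6)) for p in probs]
        self.ctx = ctx
        dot = sum(w * x for w, x in zip(self.weights[ctx], self.x))
        self.p = squash(dot)
        return self.p

    def update(self, bit):
        # Gradient step on coding cost: error times stretched input.
        err = bit - self.p
        w = self.weights[self.ctx]
        for i, x in enumerate(self.x):
            w[i] += self.lr * err * x


mixer = Mixer(n_inputs=2, n_contexts=1)
for _ in range(200):
    mixer.mix([0.9, 0.1], 0)  # the first model is right, the second is wrong
    mixer.update(1)
print(mixer.mix([0.9, 0.1], 0))
print(mixer.weights[0])
```

In the usage lines the mixer learns to trust the first model and to invert the second, which is the behavior that makes combining many imperfect models worthwhile.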

Acknowledgements

Thanks to AI Grant for funding cmix.

cmix uses ideas and source code from many people in the data compression community. Here are some of the major contributors: