I’ve been looking at backing up some data from an old hard drive recently and would like to compress it to use fewer CDs. Normally I’d just use GZIP for compression, but a friend of mine swears by BZIP2. Knowing that my Linux distro sports at least 4 different compression tools “out of the box”, I thought it was time to get some numbers. Bring on Compression Wars!
The idea’s simple. Gather together a variety of compression tools, test them head-to-head against a variety of file types and see how they perform. Several different file types need to be involved, as some files compress more easily than others. For example, text files should compress a lot more than video, because video codecs already compress their data.
The sort of things I’m backing up are: Music (mainly MP3s), Pictures (mainly JPEGs), Videos (a mixture of MPEGs and AVIs / DIVXs), and Software (both in the form of binary files and source code). I have therefore split the test into the following categories:
- Binaries: 6,092,800 bytes taken from my /usr/bin directory
- Text files: 43,100,160 bytes taken from the kernel source at /usr/src/linux-headers-2.6.28-15
- MP3s: 191,283,200 bytes, a random selection of MP3s
- JPEGs: 266,803,200 bytes, a random selection of JPEG photos
- MPEG: 432,240,640 bytes, a random MPEG encoded video
- AVI: 734,627,840 bytes, a random AVI encoded video
I have tarred each category so that each test is only performed on one file (as far as I’m aware, tarring the files will not affect the compression tests). Each test has been run from a script 10 times and an average has been taken to make the results as fair as possible. The things I’m interested in here are compression and speed: how much smaller are the compressed files, and how long do I have to wait for them to compress / decompress?
Although there are many compression tools available, I decided to use the 5 that I consider the most common: GZIP, BZIP2, ZIP, LZMA, and the Linux tool Compress. One of the reasons for this test is to find the best compression, so where there was an option, I have chosen the most aggressive compression offered by each tool.
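For anyone wanting to repeat this kind of measurement, here is a minimal Python sketch of the method described above: run each compressor several times, average the time, and report the compressed size as a percentage of the original along with the speed in MB / second. It is not the script used for these tests. It uses Python's standard zlib, bz2 and lzma modules as stand-ins for the gzip, bzip2 and lzma command-line tools, and the names build_compressors and benchmark are made up for illustration. ZIP and Compress are left out to keep it short.

```python
import bz2
import lzma
import time
import zlib


def build_compressors(level=9):
    """Library stand-ins for three of the tools; level 9 is the strongest standard setting."""
    return {
        "gzip": lambda data: zlib.compress(data, level),
        "bzip2": lambda data: bz2.compress(data, level),
        "lzma": lambda data: lzma.compress(data, preset=level),
    }


def benchmark(data, runs=10, level=9):
    """Return {name: (size as % of original, average compression speed in MB/s)}."""
    results = {}
    for name, compress in build_compressors(level).items():
        elapsed = 0.0
        for _ in range(runs):
            start = time.perf_counter()
            packed = compress(data)
            elapsed += time.perf_counter() - start
        average_seconds = elapsed / runs
        size_percent = 100.0 * len(packed) / len(data)
        # Speed is measured against the original (uncompressed) size.
        mb_per_second = (len(data) / 1_000_000) / average_seconds
        results[name] = (size_percent, mb_per_second)
    return results
```

Reading one of the tar files into memory and passing the bytes to benchmark() gives one row of results per tool. Note that lzma at preset 9 needs a lot of memory, so a lower level may be more practical on a small machine.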
Here’s what I found:
This graph shows the size of the compressed file as a percentage of the original, so the smaller, the better.
So it seems BZIP2 does outperform GZIP all round, and the lesser used LZMA does even more so! I was quite surprised how well the binaries compressed, but the results for the {MP3, JPG, MPEG, AVI} collection of files show that there is little to no point in trying to compress these formats; they are already pretty optimal. Something else interesting happened with the Compress tool (nCompress). With the {MP3, JPG, MPEG, AVI} formats, it churned away for a while trying to squash these and then, without producing an error, exited leaving only the original file. I presume that the compressed versions were larger than the originals and there is some logic stating “If output_size > input_size; return input”. The clear winner here though is LZMA.
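The “keep the original if compression doesn’t help” logic presumed above can be sketched in a few lines of Python. This is an illustration of the idea only, not nCompress’s actual code: zlib stands in for Compress’s own algorithm, random bytes stand in for already-compressed media, and compress_if_smaller is a made-up name.

```python
import os
import zlib


def compress_if_smaller(data, level=9):
    """Return (payload, was_compressed); keep the input when compression does not help."""
    packed = zlib.compress(data, level)
    if len(packed) >= len(data):
        return data, False
    return packed, True


text_like = b"the quick brown fox jumps over the lazy dog\n" * 1000
# Random bytes have no redundancy to exploit, much like MP3/JPEG/MPEG payloads.
already_dense = os.urandom(len(text_like))

for label, blob in (("text-like", text_like), ("random", already_dense)):
    payload, was_compressed = compress_if_smaller(blob)
    status = "compressed" if was_compressed else "kept original"
    print(label, len(blob), "->", len(payload), status)
```

The text-like input shrinks dramatically, while the random input comes back untouched because the “compressed” version would have been slightly larger.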
So now we know how well these tools compress, how long do they take to function?
Notice that these speeds are in MB / second, so the larger, the better. Although Compress is the worst at its job for compression, it’s great if you’ve got a lot of files and are in a rush. GZIP and ZIP are pretty much neck and neck. As good as LZMA is at its job though, it likes to take its time about things, running at just 1MB/sec or less.
What about decompression? If I’m uploading some large files to my webserver for people to download, I might not mind waiting an hour to compress them really small to save on bandwidth, but I don’t want to do this at the expense of my users. Could you imagine, “click here to download, then wait 4 hours for it to decompress…”? Well, here are the results:
Again, speeds here are in MB / second (size of the compressed file over time), so the larger, the better. My friend was right to state that BZIP2 offers better compression than GZIP but, as shown here, at the expense of the user’s time, with BZIP2 taking over 6 times longer than GZIP to decompress things. That’s the difference between waiting 10 minutes or an hour for the same files to decompress! Even the mighty LZMA is sometimes nearly twice as fast as BZIP2.
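To make the decompression measure concrete, here is a small Python sketch that times decompression and divides the compressed size by the average time, as described above. The function names are made up, the library modules stand in for the command-line tools, and the numbers in the usage note are hypothetical round figures chosen to match the “10 minutes or an hour” comparison, not measurements from these tests.

```python
import bz2
import time
import zlib


def decompression_speed(packed, decompress, runs=10):
    """Average decompression speed in MB/s, measured against the compressed size."""
    elapsed = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        decompress(packed)
        elapsed += time.perf_counter() - start
    return (len(packed) / 1_000_000) / (elapsed / runs)


def minutes_to_decompress(compressed_mb, mb_per_second):
    """Waiting time in minutes for an archive of the given compressed size."""
    return compressed_mb / mb_per_second / 60


sample = b"int main(void) { return 0; }\n" * 50_000
for name, pack, unpack in (
    ("gzip", lambda d: zlib.compress(d, 9), zlib.decompress),
    ("bzip2", lambda d: bz2.compress(d, 9), bz2.decompress),
):
    speed = decompression_speed(pack(sample), unpack, runs=3)
    print(f"{name}: {speed:.3f} MB/s of compressed input")
```

As a worked example with made-up numbers: a 600 MB archive that decompresses at 1 MB/s takes minutes_to_decompress(600, 1.0), which is 10 minutes. A tool six times slower needs about an hour for the same archive.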
Test Bed
Just to put these tests into perspective for you, the machine I ran these on has a 3.2GHz P4 with 1GB of RAM. I’m sure they would run better on newer machines, but I believe the results are a good indication of the ability of these tools.
Conclusion
The graphs speak for themselves really. If you need heavy compression and are willing to wait for it, use LZMA. If you want to squash things a little, but don’t have much time, GZIP and ZIP will work just fine. Compress (or nCompress) though, seems to be pretty useless for all but the really impatient.
Further Work
There are some further tests I’d like to perform with these tools. I think it would be interesting to see how they perform with different size files of the same type. I believe that the larger the file, the better the compression will be, and testing this would show whether it’s better to tar all of my backup before compressing it.
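A quick way to try the “tar first, then compress” idea is to compare compressing many small files one by one against compressing them as a single blob. The Python sketch below does that with zlib. It is a simplification under stated assumptions: the files are hypothetical generated stand-ins for similar source files, compressed_sizes is a made-up name, and joining the bytes directly ignores the headers and padding a real tar archive would add.

```python
import zlib


def compressed_sizes(files, level=9):
    """Return (total size with each file compressed alone, size compressed together)."""
    separate = sum(len(zlib.compress(blob, level)) for blob in files)
    together = len(zlib.compress(b"".join(files), level))
    return separate, together


# Hypothetical stand-ins for a directory of small, similar source files.
files = [
    b"/* file %d */\n#include <stdio.h>\nint main(void) { return 0; }\n" % i
    for i in range(200)
]

separate, together = compressed_sizes(files)
print("compressed one by one:", separate, "bytes")
print("compressed together:  ", together, "bytes")
```

With files this alike, the combined version comes out far smaller, because the compressor can reuse what it has already seen in earlier files. Files with little in common would show a much smaller gap.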
Some of these tools are also available in parallel versions and that will be interesting to run on my little 7 node cluster.
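The general idea behind parallel compression can be sketched in Python: split the input into chunks, compress the chunks concurrently, and concatenate the results. This is an illustration only, not how any particular parallel tool is implemented. It uses threads on one machine rather than a cluster, parallel_gzip is a made-up name, and whether it actually runs faster depends on the Python build and the number of cores. It relies on the fact that concatenated gzip members still form valid gzip data.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor


def parallel_gzip(data, chunk_size=1_000_000, workers=4, level=9):
    """Compress fixed-size chunks concurrently and concatenate the gzip members."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)] or [b""]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in input order, so the chunks stay in sequence.
        members = list(pool.map(lambda chunk: gzip.compress(chunk, level), chunks))
    return b"".join(members)


original = b"some highly repetitive backup data\n" * 100_000
packed = parallel_gzip(original)
# gzip.decompress reads all concatenated members back into one byte string.
print(len(original), "->", len(packed), "round trip ok:", gzip.decompress(packed) == original)
```

The trade-off is that each chunk is compressed without knowledge of the others, so the output is usually a little larger than compressing everything in one pass.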
There is a mass of other tools available for compression, some of which are reportedly extremely efficient, e.g. PAQ8. These would be interesting to try too, but I hope you agree that the tools I chose are probably the most common, and that the tests provide interesting results to help you make a more informed decision about which to use.
I welcome any ideas about other statistical methods for displaying the data, for those who are interested and good with stats.
Happy Compressing!



very helpful post — thank you!
Thanks, that was what I needed to know to decide which compressor I should use for backup.
Regards.
Thanks, it’s very helpful information.
Can I get your journal / paper for this post? Have you published this? I need it for my research. As you know, it’s very hard to get papers about Linux.
Very useful article.
It would be good if you also included a comparison with KGB since, although it doesn’t come out of the box, it would give an idea of aggressive compression.
Regards
What about 7z? I know it’s as slow as bzip2, but it does give smaller sizes.
As far as I know, 7z uses the LZMA algorithm.
Thanks !
Just needed this information to decide which format is best to use to compress a 200GB dataset ;)
I found NanoZip with “lzhds_parallels_extra” and “standard memory usage” settings greatly outperforms all the above mentioned algorithms in terms of compression ratio and speed (four times faster than RAR compression with 10% better ratio in my experience), even in its alpha state. Try it at http://www.nanozip.net/
Thanks for the post, it’s very useful.
A very good article. Although it did not deal with the algorithms (which is actually good), it was able to convey in a clear way the facts that we must keep in mind before choosing these tools.
Great work!!
One thing I think might be relevant to add is a plain ‘cp’ on the same system that these tests were performed on.
That should (probably) set the ‘100%’ for speed and size.
It would probably more or less take the hard disk ‘fragmentation’ into account too.
-Well, nothing is certain, because CP *could* be slower, as it’s writing more bytes to the disk than the compression programs, so a quick compression program would have a chance of outperforming CP, if the algorithm is simple.
I look forward to seeing nanozip results too. :)
…When I think about it, it might be interesting to create two simple programs in C, for comparing:
1: a program that allocates a 1MB buffer, and then writes the contents of this (random-data) buffer to a file, until the file is the same size as the total size of the ‘files to compress’.
2: a program that allocates a 1MB buffer, and then reads the contents of all the ‘files to compress’.
The second program would probably not be spending exactly the same time as if for instance doing a cp -R * /dev/null; I believe that it might be quicker.
It’s quite simple…
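#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    uint8_t *buffer;
    FILE *fp;
    size_t i;
    size_t l;   /* bytes still to write */
    size_t n;   /* bytes to write in this pass */
    size_t s;   /* buffer size */
    size_t w;   /* bytes actually written */

    s = 1024 * 1024;
    buffer = malloc(s);
    if(buffer)
    {
        for(i = 0; i < s; i++)
        {
            buffer[i] = (uint8_t) rand();   /* fill the buffer with pseudo-random data */
        }
        fp = fopen("/tmp/testfile", "wb");
        if(fp)
        {
            l = 1234567890; /* hardcoded filesize (for this example) */
            while(l)
            {
                n = (l > s) ? s : l;   /* the last pass may be shorter than the buffer */
                w = fwrite(buffer, sizeof(*buffer), n, fp);
                if(w != n)
                {
                    fclose(fp);
                    free(buffer);
                    return(EXIT_FAILURE);
                }
                l -= n;
            }
            fclose(fp);
            free(buffer);
            return(EXIT_SUCCESS);
        }
        free(buffer);
    }
    return(EXIT_FAILURE);
}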
…Probably made a few boogs, but you get the idea.
If you are looking for the best tool to archive things, then it looks like the winner is bzip2: better compression (smaller file) than gzip while doing it faster. Yes, decompression is slow, but who cares, if we just archive things.
Thanks for the helpful post!
Quick note: Zip and gzip use the same compression algorithm (Lempel-Ziv) [1], so it is not surprising that they have similar results. The biggest functional difference is that zip handles multiple files while gzip works on a single file [2]. My understanding is that zip applies its compression algorithm to each file separately, so you might expect different results between zip+tar and simply running zip on a directory tree. If there are lots of similarities between files, you can expect zip+tar (or gzip+tar) to produce a smaller archive than zip alone, since tar creates one big file, allowing the algorithm to take advantage of similarities across files. The trade-off, though, is that zip allows you to uncompress the files individually, so you could recover a single file from an archive without decompressing the entire thing.
Sources:
[1] gzip man page
[2] http://www.differencebetween.net/technology/difference-between-zip-and-gzip/