Linux Compression Comparison (GZIP vs BZIP2 vs LZMA vs ZIP vs Compress)

Sunday, August 30th, 2009

                              
I’ve been looking at backing up some data from an old hard drive recently and would like to compress it to use fewer CDs. Normally I’d just use GZIP for compression, but a friend of mine swears by BZIP2. Knowing that my Linux distro sports at least 4 different compression tools “out of the box”, I thought it was time to get some numbers. Bring on Compression Wars!

The idea’s simple. Gather together a variety of compression tools, test them head-to-head against a variety of file types and see how they perform. Several different file types need to be involved, as some files compress more easily than others. For example, text files should compress a lot more than video, because video codecs already apply their own compression.

The sort of things I’m backing up are: Music (mainly MP3s), Pictures (mainly JPEGs), Videos (a mixture of MPEGs and AVIs / DIVXs), and Software (Both in the form of binary files and source code). I have therefore split the test into the following categories:

  • Binaries
    • 6,092,800 Bytes taken from my /usr/bin directory
  • Text Files
    • 43,100,160 Bytes taken from kernel source at /usr/src/linux-headers-2.6.28-15
  • MP3s
    • 191,283,200 Bytes, a random selection of MP3s
  • JPEGs
    • 266,803,200 Bytes, a random selection of JPEG photos
  • MPEG
    • 432,240,640 Bytes, a random MPEG encoded video
  • AVI
    • 734,627,840 Bytes, a random AVI encoded video

I have tarred each category so that each test is only performed on one file (as far as I’m aware, tarring the files will not affect the compression tests). Each test has been run from a script 10 times and an average has been taken to make the results as fair as possible. The things I’m interested in here are compression and speed: how much smaller are the compressed files, and how long do I have to wait for them to compress / decompress?
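
For anyone wanting to repeat the tarring step from Python rather than the shell, here is a minimal sketch using the standard tarfile module. The function name tar_directory and its arguments are made up for illustration, and this is not the script used for these tests. Bear in mind that tar adds a 512-byte header per file and pads the archive, so the tarball comes out slightly larger than the raw data.

```python
import os
import tarfile


def tar_directory(source_dir, tar_path):
    """Bundle source_dir into an uncompressed tar archive and return its size in bytes."""
    # Store paths relative to the directory's own name, not the full path.
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(tar_path, "w") as archive:  # mode "w" means no compression
        archive.add(source_dir, arcname=top_level)
    return os.path.getsize(tar_path)
```

Calling tar_directory("/usr/bin", "binaries.tar") would produce one file per category, ready to hand to each compression tool in turn (the output file name here is just an example).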

Although there are many compression tools available, I decided to use the 5 that I consider the most common: GZIP, BZIP2, ZIP, LZMA, and the Linux tool Compress. One of the reasons for this test is to find the best compression, so where there was an option, I have chosen to use the most aggressive compression offered by each tool.
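
The method described above (highest compression level, several runs, averaged) can be sketched roughly as follows. This is an assumption-laden stand-in, not the original test script: it uses Python's gzip, bz2 and lzma modules instead of the command-line tools, leaves out ZIP and Compress, and the names COMPRESSORS and benchmark are invented for the example. Level 9 is the maximum for all three modules, though lzma at preset 9 needs a lot of memory.

```python
import bz2
import gzip
import lzma
import time

# Stand-ins for the command-line tools: each function compresses a bytes
# object at the requested level (9 is the maximum for all three modules).
COMPRESSORS = {
    "gzip": lambda data, level: gzip.compress(data, compresslevel=level),
    "bzip2": lambda data, level: bz2.compress(data, compresslevel=level),
    "lzma": lambda data, level: lzma.compress(data, preset=level),
}


def benchmark(data, level=9, runs=10):
    """Return {name: (size_percent, mb_per_second)}, timing averaged over runs."""
    results = {}
    for name, compress in COMPRESSORS.items():
        elapsed = 0.0
        for _ in range(runs):
            start = time.perf_counter()
            compressed = compress(data, level)
            elapsed += time.perf_counter() - start
        average = elapsed / runs
        # Compressed size as a percentage of the original: smaller is better.
        size_percent = 100.0 * len(compressed) / len(data)
        # Throughput measured against the original size: larger is better.
        speed = (len(data) / 1e6) / average if average > 0 else float("inf")
        results[name] = (size_percent, speed)
    return results
```

To try it, read one of the tar files into memory with open(path, "rb").read() and pass the bytes to benchmark(). For the larger categories a streaming approach would be kinder to RAM than loading the whole file at once.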

Here’s what I found:

[Graph: compression ratios]

This graph shows the size of the compressed file as a percentage of the original, so the smaller, the better.

So it seems BZIP2 does outperform GZIP all round, and the lesser-used LZMA does even better! I was quite surprised how well the binaries compressed, but the results for the {MP3, JPG, MPEG, AVI} collection of files show that there is little to no point in trying to compress these formats; they are already pretty optimal. Something else interesting happened with the Compress tool (nCompress). Given the {MP3, JPG, MPEG, AVI} formats, it churned away for a while trying to squash them and then, without producing an error, exited leaving only the original file. I presume that the compressed versions were larger than the originals and that there is some logic stating “If output_size > input_size; return input”. The clear winner here, though, is LZMA.
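
The presumed “keep the original if compression doesn't help” behaviour might look something like the sketch below. It is only a guess at the logic, not Compress's actual code: zlib stands in for Compress's own algorithm, and the function name compress_if_smaller is invented for the example.

```python
import os
import zlib


def compress_if_smaller(data):
    """Return (payload, was_compressed), keeping the input when compression does not shrink it."""
    compressed = zlib.compress(data, 9)
    if len(compressed) < len(data):
        return compressed, True
    # The "compressed" version is no smaller, so hand back the original untouched.
    return data, False


# Repetitive text shrinks; random bytes (a rough stand-in for already
# compressed media such as MP3 or JPEG) do not.
text = b"spam and eggs " * 1000
noise = os.urandom(4096)
for label, blob in (("text", text), ("noise", noise)):
    payload, was_compressed = compress_if_smaller(blob)
    print(label, len(blob), "->", len(payload), "compressed" if was_compressed else "kept original")
```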

So now we know how well these tools compress, but how long do they take to do it?

[Graph: compression speeds]

Notice that these speeds are in MB / Second, so the larger, the better. Although Compress is the worst at its job for compression, it’s great if you’ve got a lot of files and are in a rush. GZIP and ZIP compete pretty much neck and neck. As good as LZMA is at its job, though, it likes to take its time about things, running at just 1MB/Sec or less.

What about decompression? If I’m uploading some large files to my webserver for people to download, I might not mind waiting an hour to compress them really small to save on bandwidth, but I don’t want to do this at the expense of my users. Could you imagine: “click here to download, then wait 4 hours for it to decompress…”? Well, here are the results:

[Graph: decompression speeds]

Again, speeds here are in MB / Second (size of the compressed file over time), so the larger the better. My friend was right to state that BZIP2 offers better compression than GZIP but, as shown here, it comes at the expense of the user’s time, with BZIP2 taking over 6 times longer than GZIP to decompress things. That’s the difference between waiting 10 minutes or an hour for the same files to decompress! Even the mighty LZMA is sometimes nearly twice as fast as BZIP2.
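
As a rough illustration of how a decompression figure like this can be measured, the sketch below times a decompress function and divides the compressed size by the average time. It uses Python's gzip and bz2 modules rather than the command-line tools, the function name decompression_speed is made up, and the sample data is hypothetical, so the numbers it prints will not match the graph.

```python
import bz2
import gzip
import time


def decompression_speed(compressed, decompress, runs=10):
    """Average decompression speed in MB/s, relative to the compressed size."""
    elapsed = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        decompress(compressed)
        elapsed += time.perf_counter() - start
    average = elapsed / runs
    # Guard against a timer too coarse to register a very fast run.
    return (len(compressed) / 1e6) / average if average > 0 else float("inf")


# Hypothetical sample data; real archives will behave differently.
sample = b"int main(void) { return 0; }\n" * 50000
for name, module in (("gzip", gzip), ("bzip2", bz2)):
    packed = module.compress(sample, 9)
    speed = decompression_speed(packed, module.decompress, runs=3)
    print("%-5s %.2f MB/s" % (name, speed))
```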

Test Bed

Just to put these tests into perspective for you, the machine I ran these on has a 3.2GHz P4 with 1GB of RAM. I’m sure they would run better on newer machines, but I believe the results are a good indication of the ability of these tools.

Conclusion

The graphs speak for themselves really. If you need heavy compression and are willing to wait for it, use LZMA. If you want to squash things a little, but don’t have much time, GZIP and ZIP will work just fine. Compress (or nCompress) though, seems to be pretty useless for all but the really impatient.

Further Work

There are some further tests I’d like to perform with these tools. I think it would be interesting to see how they perform with different size files of the same type. I believe that the larger the file, the better the compression will be, and testing this would show whether it’s better to tar all of my backup before compressing it.
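
A quick way to get a feel for the tar-first question is to compress a set of small, similar chunks one by one and then again as a single stream. The sketch below does that with Python's bz2 module; plain concatenation stands in for tar (which would add headers and padding), and the name separate_vs_combined and the sample chunks are invented for the example.

```python
import bz2


def separate_vs_combined(chunks, compress=bz2.compress):
    """Return (total size when each chunk is compressed alone, size when compressed together)."""
    separate = sum(len(compress(chunk)) for chunk in chunks)
    combined = len(compress(b"".join(chunks)))
    return separate, combined


# Hypothetical "files": 200 small chunks of very similar source-like text.
chunks = [("/* file %d */\n" % i).encode() + b"int main(void) { return 0; }\n" * 5
          for i in range(200)]
separate, combined = separate_vs_combined(chunks)
print("compressed separately:", separate, "bytes")
print("compressed together:  ", combined, "bytes")
```

With many small, similar files, the combined stream should come out smaller because the compressor can reuse patterns across files and pays its fixed per-stream overhead only once. How much this matters for large media files is a separate question that only a real test would answer.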

Some of these tools are also available in parallel versions and that will be interesting to run on my little 7 node cluster.

There is a mass of other tools available for compression, some of which are reportedly extremely efficient, e.g. PAQ8. These would be interesting to try too but, I hope you agree, the tools I chose are probably the most common, and the tests provide interesting results to help you make a more informed decision about which to use.

I welcome any ideas about other statistical methods for displaying the data, for those who are interested and good with stats.

Happy Compressing!

Find a decent untaken domain with Python

Wednesday, August 26th, 2009

                              
Ever tried using whois? Chances are, most (if not all) the domains you looked for were already taken, either as websites or just by domain squatters hoping to rip off some innocent domain consumer.

Well if you’re looking for a new domain and have a mental block on which name to choose, give this little script a try. It’ll randomly select a word from the linux dictionary, check if that domain is taken or not, and repeat.

Edit the lines near the top for different functionality, like so:

  • pause = 2.0
    • The number of seconds to wait between each whois check (you don’t want to be DoSing the whois servers.)
  • printtaken = False
    • Set this to True to print out every domain that is tried, or False to only print out the Available domains.
  • doubleword = False
    • To make an available domain more likely, set this to True and it will concatenate 2 random words for your domain, e.g. http://blue-eggs.com
  • nomatch = “No match for”
    • The program looks through the output of the whois command to find this string, which signifies an available domain. This may need to be changed when searching through different TLDs.
  • error = "Maximum Daily connection limit reached"
    • Similar to the above, but stops the program if this error is found.
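
The way the nomatch and error strings drive the result can be pictured as a small three-way decision. The sketch below is written for Python 3 and is only an illustration: the function name classify is invented, and the whois responses in the demo are made-up text, not real server output.

```python
def classify(whois_output, nomatch="No match for",
             error="Maximum Daily connection limit reached"):
    """Return 'limit', 'available' or 'taken' for one whois response."""
    if error in whois_output:
        return "limit"      # stop querying: the server has cut us off
    if nomatch in whois_output:
        return "available"  # the registry reports no such domain
    return "taken"


# Made-up responses for illustration only.
print(classify('No match for "EXAMPLEWORD.COM".'))
print(classify("Domain Name: EXAMPLEWORD.COM\nRegistrar: ..."))
print(classify("Maximum Daily connection limit reached. Lookup refused."))
```

For another TLD, pass that registry's own “not found” wording as the nomatch argument; you would need to check a real response from that whois server to find the right string.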

Run this program either with a TLD as an argument, or just on its own for .com TLDs. To close it just use Ctrl-c

#!/usr/bin/env python
# Name    : DomCheck
# Author  : Mike Terzza http://terzza.com
# Version : 0.2
# Date    : 26-08-09

# Python 2 only: the commands module and print statements do not exist in Python 3.
import commands, string, random, time, sys

#Edit these options to change the program behaviour
###################################################
pause = 2.0
printtaken = False
doubleword = False
nomatch = "No match for"
error = "Maximum Daily connection limit reached"
###################################################

if len(sys.argv) > 1:
    tld = sys.argv[1]
else:
    tld = ".com"

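# Load the dictionary once; each entry keeps its trailing newline, which the
# character filter below strips out.
words = open("/usr/share/dict/words").readlines()
while True:
    # random.choice cannot run off the end of the list, unlike
    # random.randint(0, len(words)), whose upper bound is inclusive.
    randword = random.choice(words)
    if doubleword:
        randword += "-" + random.choice(words)
    available = False
    # Keep only lowercase letters, dots and hyphens.
    temp = ""
    for letter in randword:
        if letter.lower() in string.ascii_lowercase + ".-":
            temp += letter.lower()
    # getstatusoutput returns (exit status, output text); only the text matters.
    output = commands.getstatusoutput("whois " + temp + tld)[1]
    if error in output:
        print error
        sys.exit(0)
    if nomatch in output:
        available = True
    if available:
        print "%s AVAILABLE" % (temp + tld + ((40 - len(temp))* "-"))
    else:
        if printtaken:
            print temp + tld
    # Be polite to the whois servers between lookups.
    time.sleep(pause)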

Example outputs:


$ ./domcheck.py
phonologists.com---------------------------- AVAILABLE
recifes.com--------------------------------- AVAILABLE
axums.com----------------------------------- AVAILABLE
festivitys.com------------------------------ AVAILABLE
tillages.com-------------------------------- AVAILABLE
acetylenes.com------------------------------ AVAILABLE
bissaus.com--------------------------------- AVAILABLE
livelinesss.com----------------------------- AVAILABLE
apotheosizes.com---------------------------- AVAILABLE
dependabilitys.com-------------------------- AVAILABLE
sprightlier.com----------------------------- AVAILABLE

$ ./domcheck.py .co.uk
barentss.co.uk-------------------------------- AVAILABLE
stalenesss.co.uk------------------------------ AVAILABLE
lobbed.co.uk---------------------------------- AVAILABLE
raiments.co.uk-------------------------------- AVAILABLE
pluralize.co.uk------------------------------- AVAILABLE
outstretches.co.uk---------------------------- AVAILABLE
wickednesss.co.uk----------------------------- AVAILABLE
coincided.co.uk------------------------------- AVAILABLE
cuspid.co.uk---------------------------------- AVAILABLE
thirded.co.uk--------------------------------- AVAILABLE
haemoglobins.co.uk---------------------------- AVAILABLE
propellants.co.uk----------------------------- AVAILABLE
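
Names such as livelinesss.com and festivitys.com suggest that the word list contains possessive entries (liveliness's, festivity's) whose apostrophes get stripped by the script's character filter. Assuming that is the cause, one possible tweak is to skip any entry that contains anything other than plain letters. The sketch below is Python 3, the function name usable_words is invented, and unlike the script it does not allow dots or hyphens inside a word.

```python
import string

ALLOWED = set(string.ascii_lowercase)


def usable_words(lines):
    """Yield lowercase words made only of ASCII letters, skipping all other entries."""
    for line in lines:
        word = line.strip().lower()
        if word and all(letter in ALLOWED for letter in word):
            yield word


# Small made-up sample in the style of /usr/share/dict/words.
sample = ["liveliness\n", "liveliness's\n", "Axum\n", "lobbed\n"]
print(list(usable_words(sample)))
```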

WARNING

If you run this too often or for too long you will be temporarily (and maybe even permanently) blocked from the whois lookup database. Use sparingly and at your own risk.

This is written for Linux, but could be used on Windows if you install a command-line whois tool and acquire a text file full of words. Add a comment if you need any help with it.

Download the source file here :

domcheck.tar.gz
