Debian Package of the Day (static archived copy)

lbzip2: parallel bzip2 utility

November 12th, 2009 edited by ana

Article submitted by ERSEK Laszlo. DebADay needs you more than ever! Please submit good articles about software you like!

lbzip2 is a multi-threaded bzip2 compressor/decompressor utility that can be used on its own, in pipelines, or passed to GNU tar with the --use-compress-program option (or with the --use shorthand).

The main motivation for writing lbzip2 was that I didn’t know about any parallel bzip2 decompressor that would exercise multiple cores on a single-stream bz2 file (i.e. the output of a single bzip2 run) and/or on a file read from a non-seekable source (e.g. a pipe or socket). Thus lbzip2 started out as lbunzip2, but with time it gained multiple-workers compression and single-worker decompression features. Due to the input-bound splitter of its multiple-workers decompressor, it should scale well to many cores even when decompressing.

Target audience

Originally, the target audience for lbzip2 was experienced users and system administrators: up to version 0.15, lbzip2 deliberately worked only as a filter. Now at 0.17, lbzip2 is mostly command-line compatible with bzip2, except that it doesn’t remove or overwrite files it didn’t create. If lbzip2 gets a chance to enter the Debian alternatives system as an alternative for bzip2, I’ll add this feature. In any case, you are encouraged always to verify lbzip2’s output manually before removing its input, rather than letting the input be removed automatically, both when compressing and when decompressing. I also recommend perusing the README, installed as /usr/share/doc/lbzip2/README.gz on Debian, before switching over to lbzip2.
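The advice to verify the output before removing the input can be sketched as a small shell function. This is only an illustration under stated assumptions, not part of lbzip2: the function name compress_verified and the COMPRESSOR variable are hypothetical, and the sketch relies solely on filter-mode usage (compressing from stdin to stdout, and -d for decompression).

```shell
# Hypothetical helper: compress FILE to FILE.bz2 in filter mode, verify the
# round trip, and only then remove the input.
# COMPRESSOR is an assumed override for illustration; it defaults to lbzip2.
compress_verified() {
    in=$1
    out=$1.bz2
    comp=${COMPRESSOR:-lbzip2}

    # Never clobber an existing output file.
    if [ -e "$out" ]; then
        echo "refusing to overwrite $out" >&2
        return 1
    fi

    # Compress as a filter, so the compressor itself never touches the input.
    "$comp" < "$in" > "$out" || { rm -f -- "$out"; return 1; }

    # Decompress again and compare the result byte for byte with the original.
    if "$comp" -d < "$out" | cmp -s - "$in"; then
        rm -- "$in"
    else
        echo "verification failed, keeping $in" >&2
        return 1
    fi
}

# Usage (not executed here): compress_verified tree.tar
```

The input is removed only after the comparison succeeds; on any failure it is left in place so that it can be inspected by hand.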

Usage examples

As lbzip2 was chiefly created for speeding up decompression of single-stream bz2 files and/or for speeding up decompression from a pipe, I’ll provide examples of decompression first. Since basically all free software tarballs should be available on the net as tar.bz2 files, I’ll choose (not surprisingly) a kernel tarball.

The “traditional” method:

wget \
  http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.31.1.tar.bz2
tar --use=lbzip2 -x -f linux-2.6.31.1.tar.bz2

The overlapped method:

wget -O - \
  http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.31.1.tar.bz2 \
| tee -i linux-2.6.31.1.tar.bz2 \
| tar --use=lbzip2 -x

If wget fails to download the tarball for some reason (at which point at least tar will complain), you should remove the partially decompressed tree and fall back to the traditional method. To avoid losing the already downloaded part, pass -c to wget.

Another example might be the import of a Wikimedia Dump file, perhaps with a pipeline like this:

lbzip2 -d < enwiki-latest-pages-articles.xml.bz2 \
| php importDump.php

Finally, a compression/backup example with verification at the end:

tar --format=pax --use=lbzip2 -c -f tree.tar.bz2 tree
tar --use=lbzip2 --compare -f tree.tar.bz2 -v -v

Hypothetically, with lbzip2 as the configured bzip2 alternative, we should be able to replace --use=lbzip2 with the well-known -j GNU tar option.

Comparison with other bzip2 utilities

I posted a longish mail with feature analyses and performance measurements to the debian-mentors mailing list. To reiterate what I said there: fundamentally, lbzip2 was created to fill a performance gap left by pbzip2.

After working on lbzip2 for a while, I found out that p7zip decompresses single-stream bz2 files in parallel, but (the last time I checked) it couldn’t scale above four threads, and it refused to read bz2 files from a pipe.

Bzip2 compression and decompression performance is very sensitive to the cache size that is dedicated to a single worker thread (i.e. a single CPU core). To my limited knowledge, this implies that among commodity desktops, lbzip2 performs best on multi-core AMD processors.

lbzip2 does have shortcomings. They are either inherent in the design or I deem them unimportant. I tried to document them all. Please read the debian-mentors post linked above, the README file, and the manual page.

As said above, I didn’t originally intend lbzip2 as a drop-in replacement for bzip2. Even though it is almost there now, you should nonetheless get to know it thoroughly before deciding to switch over to it.

Availability

Various versions of lbzip2 are available for Debian (squeeze and sid) and Ubuntu (karmic and lucid).

You should be able to install lbzip2 on lenny too; it shouldn’t break anything. I used the following commands:

cat >>/etc/apt/sources.list <<EOT
deb http://security.debian.org/      testing/updates main
deb http://ftp.hu.debian.org/debian/ testing         main
EOT
apt-get update
apt-get install lbzip2

Upstream releases are announced on the project’s Freshmeat page. I distribute the upstream version to end-users from my recently moved home page, which also links to other distributions’ lbzip2 packages.

A development library version is very unlikely. You can work around this by communicating with an lbzip2 child process over pipes via select(), and by checking its exit status via waitpid() after receiving EOF. This is not an unusual method; see, for example, gpg’s many --[^-]*-fd options.

End-user stress-testing

I encourage you to test lbzip2. The upstream README describes the test method in general; let me instantiate that description here specifically for Debian.

Necessary packages, in alphabetical order:

bzip2
dash
gcc
lbzip2
perl

Recommended packages, in alphabetical order:

p7zip-full
pbzip2

Create a test directory (you will need lots of free space under that directory), and under it a well-compressible big file. For example:

mkdir -m 700 -v -- "$TMPDIR"/testdir
tar -c -v -f "$TMPDIR"/testdir/testfile.tar /usr/bin/ /usr/lib/

Then issue the following commands, utilizing the test file created above. As this could take several hours, I suggest entering a screen session first. Your machine should be otherwise unloaded during the test, both IO- and CPU-wise.

cd /usr/share/lbzip2
dash test.sh "$TMPDIR"/testdir/testfile.tar

Any errors encountered during the test should be either handled or fatally rejected. In particular, utilities refusing to decompress from a pipe are handled.

Estimated disk space usage: when writing this article, I executed the above commands with a 100 MB test file. (You should aim for at least 1 GB.) The test directory ended up being 250 MB in size. M stands for 2²⁰, G stands for 2³⁰.

Estimated time span: supposing

your machine has N cores (each with a dedicated L2 cache),
the file you use for testing lbzip2 is S GB big,
and bzip2 takes T seconds to compress a 1 GB test file with similar contents,

then the full test should take around
S * (1879 + 2098 * 2 / N) * T / 240
seconds.

Estimated peak memory usage: N * 50 MB should be a very safe bet.
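The two estimates above can be evaluated with a short shell sketch. The function names estimate_test_seconds and estimate_peak_mb are hypothetical helpers introduced here for illustration; the arithmetic is copied from the formulas as stated, with N, S and T passed in that order.

```shell
# Hypothetical helpers that evaluate the estimates from the article.
# Arguments of estimate_test_seconds:
#   $1 = N, number of cores
#   $2 = S, size of the test file in GB
#   $3 = T, seconds bzip2 needs to compress a similar 1 GB file
estimate_test_seconds() {
    awk -v n="$1" -v s="$2" -v t="$3" \
        'BEGIN { printf "%.0f\n", s * (1879 + 2098 * 2 / n) * t / 240 }'
}

# Peak memory estimate in MB: N * 50.
estimate_peak_mb() {
    echo $(( $1 * 50 ))
}

# Example: 4 cores, a 1 GB test file, bzip2 needing 240 s per GB.
estimate_test_seconds 4 1 240   # prints 2928
estimate_peak_mb 4              # prints 200
```

With these example numbers the full test would take a bit under 50 minutes, and the time grows linearly with both S and T.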

To view the test report:

less -- "$TMPDIR"/testdir/results/report

The only obscure entries in the table should be the “ws” ones. They mean “workers stalled” and give a percentage of how many times the (de)compressor worker threads tried to start munching a block but had to go to sleep because there was no block to munch. Anything above 1-2% usually implies some bottleneck and shows that lbzip2 couldn’t fully exhaust your cores. This shouldn’t occur, but if it does and lbzip2 and pbzip2 have performed similarly in the compression tests, then the bottleneck is in your system, not lbzip2.

The lbzip2 package is available in Debian squeeze and sid, and in Ubuntu karmic and lucid.

Posted in Debian, Ubuntu

4 Responses

Samat Jain Says:
November 12th, 2009 at 6:17 pm
Does lbzip2’s performance make bzip2 competitive again compared to other next-gen compressors, like lzma and xz (the latter of which has SMP support)?

Laszlo Ersek Says:
November 12th, 2009 at 10:59 pm
@ Samat Jain, November 12th, 2009 at 6:17 pm:

http://www.reddit.com/r/programming/comments/9wapf/boring_if_youre_a_casual_or_frequent_bzip2_user/c0et1ym

http://www.reddit.com/r/programming/comments/9wapf/boring_if_youre_a_casual_or_frequent_bzip2_user/c0etjq6

I recommend lbzip2 for the case when you’re bound to or want to use bzip2, for whatever reason. I’m not saying you should choose lbzip2 over other compressor families.

For your LZMA related needs, you may want to check out llzip-0.02, downloadable from my site. It is a compressor-only lzip parallelization, based on lbzip2 and lzlib. Antonio Diaz Diaz, author of lzip, goaded me to write it :)

tsakf Says:
November 17th, 2009 at 7:28 pm
I liked the article very much, and the utility is fantastic. Added it to my library
http://fosslib.tsakf.net/record/285

Jakub Says:
November 21st, 2009 at 12:05 pm
Great tool, thank you for excellent description!
