
HTTrack: Website crawler / copier

December 16th, 2007 edited by Alexey Beshenov

Article submitted by Zhao Difei. We are running out of articles! Please help DPOTD and submit good articles about software you like!

HTTrack is a powerful tool that lets you download or mirror a website to a local directory.

Basically, HTTrack follows the links of the original website and recursively downloads the pages to a local directory, rearranging the hyperlink structure so that you can simply open a downloaded HTML file and browse it on the local machine. In contrast, the recursive mirror function of Wget will not rearrange the hyper-links on the web pages you downloaded, so they might still be pointing to remote locations.
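
For comparison, a plain recursive Wget fetch might look like the following sketch (example.org is just a placeholder). It saves the pages and their requisites, but the links inside them keep pointing at the original host:

$ wget -r -p http://example.org/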

Although HTTrack is powerful, its syntax is very simple. Let’s have a look at the basic usage:

$ httrack --help

HTTrack version 3.41-3 (compiled Jul 3 2007)
usage: httrack <URLs> [-option] [+<URL_FILTER>] [-<URL_FILTER>] [+<mime:MIME_FILTER>] [-<mime:MIME_FILTER>]

A simple example that copies the debian.org website to the local “httrack” directory:

$ mkdir httrack
$ cd httrack/
$ httrack debian.org
Mirror launched on Sun, 30 Sep 2007 18:05:40 by HTTrack Website Copier/3.41-3+libhtsjava.so.2 [XR&CO'2007]
mirroring debian.org with the wizard help..
* debian.org/intro/about.ro.html (17854 bytes) - OK
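
Instead of creating the target directory by hand, you can point HTTrack at an output path with its -O option; a sketch, assuming the flag behaves as described in the manual (/tmp/debian-mirror is just an example path):

$ httrack debian.org -O /tmp/debian-mirror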

HTTrack can also apply download filters; you may have noticed the “*_FILTER” entries in the usage line above. The plus sign (+) marks a pattern to download, and the minus sign (-) marks a pattern to skip. The following examples (mirroring Slashdot) show simple uses of filters: the first will not download items from the apple.slashdot.org site, and the second will not download items with the MIME type image/jpeg. Note that you can still view the things you did not download if an Internet connection is available, because HTTrack arranges the hyperlinks for you:

$ httrack slashdot.org -apple.slashdot.org*
$ httrack slashdot.org -mime:image/jpeg
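
Several filters can also be combined in a single invocation, for instance joining the two rules above (a sketch):

$ httrack slashdot.org -apple.slashdot.org* -mime:image/jpeg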

To download two sites that share lots of common links, you can do:

$ httrack www.microsoft.com www.evil.com
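
Options can be mixed in alongside the URLs; for example, the following sketch caps the mirror depth with -r (assuming the flag behaves as listed in httrack --help):

$ httrack debian.org -r3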

There are still many options and more advanced usages; interested readers can always consult the manual. HTTrack is available in Debian from oldstable Sarge to unstable Sid, and for Ubuntu from Dapper to Gutsy.

Posted in Debian, Ubuntu

4 Responses

  1. Saeid Zebardast Says:

    Thanks.
    You can use the wget command to download a website:

    wget URL -k -c -r -p

  2. fwiffo Says:

    Thanks, this one is really useful.

  3. ton Says:

    “In contrast, the recursive mirror function of Wget will not rearrange the hyper-links on the web pages you downloaded, so they might still be pointing to remote locations.”

    This is what the -k switch of wget is for.

  4. Evan "JabberWokky" Edwards Says:

    I’m assuming that you didn’t explain the difference between wget -k and this. Or did you just not know about the -k switch?

    A bit more information would be nice… there is room for improvement with wget, as it misses css includes and such, so this might be a better option, but the only thing you cite as a difference actually isn’t different.