GNU wget: Get all the web content you like on your local machine

October 17th, 2007 edited by Patrick Murena

Article submitted by Chris Camacho. We are running out of articles! Please help DPOTD and submit good articles about software you like!

Wget is so flexible that you have probably been using it for years without knowing it. Many scripts use it because it is a standard way of grabbing files, and it will even retry automatically under certain circumstances.

Probably the best compliment I can pay it is that script writers can use it and then forget about it.

It’s one of those tools that make *nix so great: it’s simple, it does what it says on the tin, and like many other pieces of Unix “glue”, it’s robust.

Another cool thing about wget is that it’s non-interactive, which means you can start a new download, disconnect from your current session and find your downloads the next time you connect again.
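
As a minimal sketch of that workflow, the helper below starts a download in the background and sends wget's messages to a log file. It relies on wget's -b (go to background) and -o (log file) options; the function name fetch_in_background and the default log name download.log are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: start a download that keeps running in the background.
#   $1 = URL to fetch
#   $2 = log file (optional; download.log is an invented default)
fetch_in_background() {
    url=$1
    log=${2:-download.log}
    # -b sends wget to the background right after startup;
    # -o writes its progress messages to the given log file.
    wget -b -o "$log" "$url"
}
```

After calling fetch_in_background with a URL you should be able to close the session and, on your next login, check progress with something like tail download.log.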

How to use it

The simplest way to invoke wget is by typing wget URL/fileName:

$ wget http://debaday.debian.net/contribute/

If you typed in this command, you now have a file called index.html in the directory you were in when you typed it. This file contains the contribute page of the Debian Package of the Day blog. Read it, DPOTD needs you ;)

Get a directory hierarchy

To get the full content of a directory and its subdirectories, you need to tell wget to download your URL recursively. To do this, add the -r option:

$ wget -r http://debaday.debian.net/

This command will generate a local mirror of the debaday blog. Note that wget respects the robots.txt file by default, if it exists. This means it will not download directories and files excluded by the robots.txt file.
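
If you only want one directory of a site rather than everything wget can reach from it, a sketch along these lines may help. It adds the --no-parent option, which the article does not mention, so that the recursion stays at or below the starting directory; the function name fetch_directory is invented for the example.

```shell
#!/bin/sh
# Hypothetical helper: recursively fetch one directory of a site.
#   $1 = URL of the directory to start from
fetch_directory() {
    # -r follows links recursively;
    # --no-parent keeps wget from ascending above the starting directory.
    wget -r --no-parent "$1"
}
```

The robots.txt behaviour described above still applies to a download started this way.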

Multiple URLs

Wget supports multiple URLs. You can either list them in a file (one URL per line) or give them on the command line (space separated).

$ wget url1 url2 ... urlN

or point wget at the file containing the URLs with the -i option:

$ wget -i filePathAndName
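
To make the file-based variant concrete, here is a small sketch that writes a list file with one URL per line and then hands it to wget -i. The helper names write_url_list and fetch_url_list are hypothetical, and the list file can live wherever you like.

```shell
#!/bin/sh
# Hypothetical helper: write the given URLs to a list file, one per line.
#   $1 = list file to create; remaining arguments = URLs
write_url_list() {
    list=$1
    shift
    printf '%s\n' "$@" > "$list"
}

# Hypothetical helper: download every URL named in a list file.
#   $1 = list file, one URL per line
fetch_url_list() {
    wget -i "$1"
}
```

For example, write_url_list urls.txt followed by two or three URLs creates the list, and fetch_url_list urls.txt downloads each entry in turn.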

Other options

Wget has many more options. For instance, you can use:

  • -l sets how deep the recursive download should go; the default depth is 5.
  • -c is invaluable, as it lets you continue an interrupted download.
  • -O lets you specify a target output file (-O fileName).
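
The three options above could be wrapped as follows. This is only a sketch: the function names are invented, and each helper uses one option at a time, since -c and -O interact in ways the article does not go into.

```shell
#!/bin/sh
# Hypothetical helper: recursive download limited to a given depth.
#   $1 = URL, $2 = maximum recursion depth
fetch_to_depth() {
    wget -r -l "$2" "$1"
}

# Hypothetical helper: resume an interrupted download.
#   $1 = URL of the partially downloaded file
resume_download() {
    # -c continues from where the existing local file ends.
    wget -c "$1"
}

# Hypothetical helper: save a download under a name of your choice.
#   $1 = URL, $2 = output file name
fetch_as() {
    wget -O "$2" "$1"
}
```

For instance, fetch_to_depth http://debaday.debian.net/ 2 would follow links only two levels down instead of the default five.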

Wget has plenty of other options; the best way to learn them is to read its rich man page. For the Vista refugees amongst you, try typing the following into a terminal ;)

$ man wget

Availability

As wget is part of the GNU project, we assume it’s part of most Linux distributions. Nevertheless, official Debian and Ubuntu packages are available:

  • Debian: stable, old stable, testing and unstable
  • Ubuntu: dapper, edgy, feisty and gutsy.

Community & developers

GNU Wget is currently maintained by Micah Cowan. The original author of GNU Wget is Hrvoje Nikšić.

Links

Posted in Debian, Ubuntu

6 Responses

  1. Wayne S. Winch, Jr. Says:

    Great article! The only thing I would add is that if you are going to download a site for the purposes of creating a local version, use the -k option along with the recursive -r option. It will fix up absolute links for page content within the download and rebase them to work with your local version of the site. True external links will be maintained.

    Keep up the awesome work with Debian Package of the Day!

    - Wayne

  2. David Benjamin Says:

    Very well-written.

    Minor quibble: the Wikipedia link has a trailing slash that breaks it.

  3. Joshua K Says:

    This is a pretty good post. I’m kind of new to *nix, but have heard others make mention of wget. Now that I know how to use it, I’m sure I will with some regularity.

  4. tabrez Says:

    Other options that I use frequently are:
    -P : specify the prefix directory i.e. directory in which the downloaded files should be saved.

    -b : immediately switch to background mode, good for extracting entire websites

    -B : base url, prefix to the urls specified using -i

    -T : timeout value; very important when leaving wget to run overnight.

    --limit-rate : this is second in importance only to -c.

  5. Dennis Says:

    You have a small typeo… At least, I don’t know the “Ubutnu” distribution ;)

  6. Dennis Says:

    Heh, I have a small typo as well :D