GNU wget: Get all the web content you like on your local machine
October 17th, 2007, edited by Patrick Murena. Article submitted by Chris Camacho. We are running out of articles! Please help DPOTD and submit good articles about software you like!
Wget is so flexible you’ve probably been using it for years without knowing it. Many scripts rely on it because it is a dependable, boilerplate way of grabbing files; it will even retry automatically under certain circumstances…
Probably the best compliment I can pay it is that script writers can use it and then forget about it.
It’s one of those tools that make *nix so great: it’s simple, it does what it says on the tin, and like many other pieces of Unix “glue”, it’s robust.
Another cool thing about wget is that it’s non-interactive, which means you can start a download, disconnect from your current session, and find the download waiting for you the next time you connect.
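For example, the -b option asks wget to detach and keep running in the background, writing its progress to a file named wget-log in the current directory (the URL below is just a placeholder):

```shell
# Start the download in the background; wget prints its PID and detaches
$ wget -b https://example.org/large-file.iso

# Check on its progress later (output goes to wget-log by default)
$ tail -f wget-log
```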
How to use it
The simplest way to invoke wget is to type wget followed by the URL of a file or page:
$ wget https://debaday.debian.net/contribute/
If you typed in this command, you now have a file called index.html in the directory you were in when you ran it. This file contains the contribute page of the Debian package of the day blog. Read it, DPOTD needs you ;)
Get a directory hierarchy
To get the full content of a directory and its subdirectories, you need to tell wget to download your URL recursively. To do this, add the -r option:
$ wget -r https://debaday.debian.net/
This command will generate a local mirror of the debaday blog. Note that by default wget respects the site’s robots.txt file, if one exists. This means it will not download directories and files that robots.txt excludes.
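For a more usable local copy, a few extra options are often combined with recursion. This is one reasonable combination, not the only one:

```shell
# --mirror  turns on recursion with unlimited depth and timestamping
# -k        converts links in downloaded pages so they work locally
# -p        also fetches the images and stylesheets the pages need
# -np       never ascends to the parent directory
$ wget --mirror -k -p -np https://debaday.debian.net/
```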
Multiple URLs
Wget supports multiple URLs. You can either list them in a file (one URL per line) or pass them on the command line (space-separated).
$ wget url1 url2 ... urlN
or point wget at the file of URLs with the -i option:
$ wget -i filePathAndName
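As a sketch, the file is nothing more than a plain-text list of URLs, one per line (the addresses here are placeholders):

```shell
# Build a plain-text list, one URL per line
$ cat > urls.txt <<'EOF'
https://example.org/first.pdf
https://example.org/second.pdf
EOF

# Hand the list to wget
$ wget -i urls.txt
```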
Other options
Wget has a lot more options, you can for instance use:
- -l sets how deep a recursive download should go; the default depth is 5.
- -c is invaluable, as it allows you to continue an interrupted download.
- -O lets you specify a target output file (-O fileName).
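Putting a couple of these together, you can resume a large download into a file name of your choosing (the file name and URL here are made up):

```shell
# -c resumes the transfer if debian.iso already exists partially;
# -O writes the download to the given file name
$ wget -c -O debian.iso https://example.org/images/debian-netinst.iso
```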
There are plenty of other options in wget; the best way to learn them is to read its rich man page. For those Vista refugees amongst you, try typing the following into a terminal ;)
$ man wget
Availability
As wget is part of the GNU project, we assume it’s included in most Linux distributions. Nevertheless, official Debian and Ubuntu packages are available.
Community & developers
GNU wget is currently being maintained by Micah Cowan. The original author of GNU Wget is Hrvoje Nikšić.