This article is sectioned off a bit weirdly, I apologize for that. But the information is still very good:

* To find out how to download a full Normal website (or a full domain if you want), check out the middle section of this article (it's sectioned off with horizontal lines, like the one you see above this sentence). Also, here is a link to another article of mine from before on this very topic (downloading a full normal website and perhaps a full domain): wget full site

* To find out how to download a full Directory site (like the ones seen in the screenshots below), which is also known as an Indexed site or FancyIndexed site, check out the article below (just ignore the middle section which is separated with horizontal lines – that section is for downloading a normal website)


 

Imagine you stumbled across a site that looks like this:

Or like this:

By the way, these are called Indexed pages. The prettier ones are called FancyIndexed pages. Basically, if the web server doesn't have an index.html or index.php page, it will instead generate an automatic & dynamic index page for you that shows the contents of the directory like a file browser, where you can download files and folders – this is very useful for sharing content online, or in a local network.

PS: the above pictures are from sites offering free content

How would you download everything recursively?

The answer is WGET! Here is a quick guide on it (have a quick read, it's fun and you will learn a lot):

Regravity's WGET – A Noob's Guide

http://www.regravity.com/documents/Wget%20%96%20A%20Noob%92s%20guide%20-%20Regravity.com.pdf

Open up a Linux box, or Cygwin. Make a download directory and cd into it:
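For example, something like this (the folder name downloads is just a placeholder, call it whatever you like):

mkdir downloads
cd downloads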

Here is one of the best ways to run wget (more below):
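Something along these lines, with http://example.com/files/ standing in as a placeholder for the real listing URL:

# -r recurse, -np don't climb to the parent directory, -nH drop the hostname folder,
# --reject skip the auto-generated index pages
wget -r -np -nH --reject "index.html*" http://example.com/files/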

Now it will take a minute to download everything, so it's probably best to run it in screen, nohup, dtach or tmux (one of those programs that allow you to run a script/program without being tied to the PuTTY/terminal shell – so you can close the terminal window you're on and still have the download running, and continue doing other things).

A wget within a screen example would be like this from start to finish:
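A rough sketch of the whole sequence, again using placeholder names (mirror for the screen session, http://example.com/files/ for the site):

screen -S mirror                        # start a named screen session
mkdir downloads && cd downloads         # make the dump folder and step into it
wget -r -np -nH --reject "index.html*" http://example.com/files/
# press Ctrl-a then d to detach and leave the download running
screen -r mirror                        # reattach later to check on progress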

Tip (specifying & making the destination/dump folder with wget's command line arguments):

Instead of making a directory and "cding" into it, you can save some typing by using the -P argument. wget will make the folder for you and dump all the content into that directory (it will still preserve the directory tree structure of the remote website you're downloading). The -P will make the folder and dump the files there (it just won't step you into the directory).
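Assuming the same flags as before, a sketch using the /data/Downloads and www.example.com/books paths from the example described below:

wget -r -np -nH --reject "index.html*" -P /data/Downloads http://www.example.com/books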

The entire website of www.example.com/books (and every file and folder in there) will get downloaded into /data/Downloads/books.

The other alternative was:
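Which would be roughly this, i.e. making and entering the folder yourself, then running wget from inside it (same placeholder site and paths as above):

mkdir -p /data/Downloads
cd /data/Downloads
wget -r -np -nH --reject "index.html*" http://www.example.com/books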


 

Downloading a normal website / don't skip index.html files:

If you want to download a full site that is just offering normal web content, then remove the --reject "index.html*" part so that index.html pages (and any variation of index.html*) get downloaded as well.

So it will look like this:
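The same sketch as before, just without the reject rule (placeholder URL again):

wget -r -np -nH http://example.com/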

If you want to dump everything to a specific folder do it like this:
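For example, making and entering the dump folder yourself first (paths are placeholders):

mkdir -p /data/Downloads/example-site
cd /data/Downloads/example-site
wget -r -np -nH http://example.com/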

Or with one command:
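The same thing using -P, so wget creates the folder for you:

wget -r -np -nH -P /data/Downloads/example-site http://example.com/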

Note: when downloading a web page you have to consider that the site may not want other people to mirror it, so they have probably put a lot of protection against that. Some things that can help are the user agent argument and the --random-wait argument, both of which are lightly covered below. However, links that cover both in more depth are below as well.

To download a normal content website (instead of a file share / file & folder browsing / indexing / fancyindexing website), look at these notes:

After this point we are now back on the topic of downloading from an indexed site (one that's sharing files and folders like in the screenshots at the top).


 

BONUS (random wait, authenticated pages & continue):

Three more interesting arguments, followed by the winning commands.

--random-wait: to make your downloads look less robotic, a random timer waits a moment before downloading the next file.

--http-user and --http-password: for when the website is password protected and you need to log in. This doesn't work if the login is POST or REST based.

-c or --continue: resumes a download where it left off. So if a download was cancelled or failed, relaunching the same command will try to pick up where it left off. This only works if the web server (the source) supports it; if it doesn't, the download will just start over. So it's hit and miss with -c, but it doesn't hurt to have it. When you use -c you can't use -nc, because they logically go against each other: -nc states that it will not download the file if a file with the same name already exists at the destination, while -c states that it will try to download the file again and resume if it can. So you might ask, why is -nc good? Because you could have the same filename but with different content, and you don't want to overwrite it. The question is, do you always need -c? No, you only need it when you're resuming a download, but like I said it doesn't hurt to have it, so just leave it on. This site shows -c in action: http://www.cyberciti.biz/tips/wget-resume-broken-download.html
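A tiny illustration of the resume behaviour (single file, placeholder URL): if the download gets interrupted, running the exact same command again picks up where it stopped, provided the server supports it.

wget -c http://example.com/files/big.iso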

 The Winners for best commands:
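Pulling the pieces together, the kind of commands this all builds up to would look something like this (URLs, paths and credentials are all placeholders):

# indexed / fancyindexed site (skip the auto-generated index pages):
wget -c -r -np -nH --reject "index.html*" -e robots=off --random-wait -P /data/Downloads http://example.com/files/

# normal website (keep the index pages), with HTTP auth if the site asks for it:
wget -c -r -np -nH -e robots=off --random-wait --http-user=USER --http-password=PASS -P /data/Downloads http://example.com/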

The end.

More info (trying to not be a bot): check out the user agent argument, for pretending to be a browser. You see, running wget is half the time a battle against servers and their bot tracking abilities. You're not a bot, but wget comes off as a bot, so we have things like -e robots=off, --random-wait, and the user agent argument (to find out more about the user agent argument check out this link http://fosswire.com/post/2008/04/more-advanced-wget-usage/ )
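A sketch of the user agent trick with -U (the browser string here is just an example, any real browser string will do):

wget -r -np -nH -e robots=off --random-wait -U "Mozilla/5.0 (Windows NT 10.0; Win64; x64)" http://example.com/files/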

Also, more info is on the man page; this man page is really well done: http://www.gnu.org/software/wget/manual/wget.html
