Objective:

Here I will show you how to extract data from a site that requires authentication/login, using curl. As an example we will extract a favorite-TV-shows schedule from next-episode.net.

Material:

CAUTION: don't use this script to DDoS their server. I only run it once per day to get my episode listings.

This example shows how you can log in to a website and extract data from any page. For example, the page www.next-episode.net/calendar usually lists the world's top TV shows & when they are playing this month. Instead of all of those TV shows, I only want to see when the shows that I like are playing this month. next-episode has a feature for registered users called a "watchlist": anything on your watchlist shows up on the calendar (and nothing else that you don't like).

———————————————————————————————————–

Just to let you know, there are 3 ways you can log in to a site:

1. using the web server's authentication scheme via curl's --user option: curl --user USER:PASSWORD https://somesite.com

This type of authentication appears in the browser as a popup dialog asking for a username and password.
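Under the hood (for HTTP Basic authentication, the most common variant of this scheme) the server answers an unauthenticated request with a 401 response along these lines, and that is what triggers the browser's popup (the realm string here is just an example):

HTTP/1.1 401 Unauthorized
WWW-Authenticate: Basic realm="Restricted Area"

curl's --user option answers that challenge for you, so no popup is involved.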

2. by POSTing form data: curl -c COOKIE.txt --data "user=USER&password=PASSWORD" https://somesite.com

3. by passing the arguments in the URL (this is less secure, as the password is visible in the URL, so this method is rarely if ever used): curl "https://somesite.com?user=USER&password=PASSWORD" (note the quotes, so the shell doesn't interpret the &)

The most common way to log in is method 2 (POSTing via forms). Next is method 1 (it requires painful & annoying apache/web-server configuration that is usually in the hands of system administrators rather than developers), followed by method 3 (the least secure).

———————————————————————————————————–

 

Most websites store cookies when you log in, to record that you are logged in. Websites that store cookies on your computer can be accessed using the method that I will cover here. For more info read: http://stackoverflow.com/questions/12399087/curl-to-access-a-page-that-requires-a-login-from-a-different-page and also http://unix.stackexchange.com/questions/138669/login-site-using-curl

Basically we are POSTing (giving) our authentication information to the web server using curl. Here is another interesting page on POSTing, which shows all of the different ways you can POST to a server: http://superuser.com/questions/149329/what-is-the-curl-command-line-syntax-to-do-a-post-request

Other sites might have an apache-style login (where you get a popup window in which you fill out your username and password); you can supply the credentials for those using a simpler method: curl -u username:password http://example.com as discussed here: http://stackoverflow.com/questions/2594880/using-curl-with-a-username-and-password

More articles on sending data to a site: http://stackoverflow.com/questions/356705/how-to-send-a-header-using-a-http-request-through-a-curl-call

This script will log in to next-episode.net and extract this month's calendar for you from the calendar tab. It then saves it to a file and also sends it to a webserver (you can comment that part out or leave it out). It seds and greps out everything that it needs. Just make sure to edit the username and password variables as described in the comments.

What I first did is I opened up Chrome, went to https://next-episode.net, pressed F12, and went to the Network tab (where you can see the communication between the server and the client). You could also see this information with Wireshark (for HTTPS traffic you would need to be able to decrypt the SSL traffic, but for a site served over plain HTTP you wouldn't need to decrypt anything, as it is all human-readable text). Then I put my username and password into the login section and clicked on login.

At this point the Network tab gets its list populated. Near the top of the list you will see the login connection; it will be named something along the lines of "login" or "userlogin", etc. That's where the user credentials get passed to the server.
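The Form Data section of that request shows the submitted fields, something like this (placeholders shown here in place of your real credentials):

username: USERNAME123
password: PASSWORD123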

This gives us all of the variables that get passed to the server in order for a connection to be established. We will need to mimic that with curl.

But notice that the form data passes an object called "username" and an object called "password". That means we should pass the same objects. In our curl --data (or -d) argument we will pass something along the lines of --data "username=USERNAME123&password=PASSWORD123" or --data "username=USERNAME123" --data "password=PASSWORD123" (either way works).

As a side note, some logins could be more complicated, passing extra form fields besides the username and password.

In such a case the command might look like this: curl -c /tmp/COOKIE.txt --data "uid=admin&passwd=admin&submit=login&cloud=first&language=en" https://192.168.1.123/login

Anyhow, back to the next-episode.net example. I first tested my command like so:
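This is the shape of the command (the cookie-file path /tmp/nec-cookie.txt is arbitrary; the form field names username and password are the ones we saw in the Network tab):

curl -c /tmp/nec-cookie.txt --data "username=USERNAME123&password=PASSWORD123" https://next-episode.net/userlogin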

Note that the Request headers mention an Origin of https://next-episode.net and a Referer of https://next-episode.net, yet we send our data to https://next-episode.net/userlogin. Why do I put https://next-episode.net/userlogin instead of the others? After all, I went to the URL without the /userlogin suffix. If you scroll to the top of the Headers output, we see this:
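The top of the Headers pane reads something like the following (the exact fields shown vary by Chrome version):

Request URL: https://next-episode.net/userlogin
Request Method: POST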

And here we see that the data was POSTed to https://next-episode.net/userlogin. So that's where we have to log in.

So when I logged in, I looked at the Network tab to find out which items to pass in as form objects via the --data argument, and also which URL I should POST the information to. After that, we ran our command.

Next we can look into the cookie:
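A quick cat of the cookie jar shows what curl saved. The cookie name and value below are placeholders (the real ones are session-specific), but the Netscape cookie-file format that curl writes looks like this (fields are tab-separated):

cat /tmp/nec-cookie.txt
# Netscape HTTP Cookie File
next-episode.net	FALSE	/	FALSE	1735689600	PHPSESSID	abc123placeholder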

We can see that the cookie filled up with good stuff. Now we can use this cookie to look into any page of the website that requires login & get the data from that page that a logged-in user would have gotten. So for example, let's go to the calendar page:
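Assuming the same cookie file as above, the calendar fetch looks like this:

curl -b /tmp/nec-cookie.txt -c /tmp/nec-cookie.txt https://next-episode.net/calendar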

Now that will output everything that a logged-in user would get. Note that we use -c to write to a cookie and -b to use a cookie. So why did we need to write to a cookie just to access the /calendar page? Well, simply because accessing the calendar page might write information to the cookie that is important. I don't know the mechanics of the site, so it's best to let the site do with the cookie as it wants. We are basically giving the site write access to the cookie with -c and read access to the cookie with -b.

Now what if we didn't get the desired output? Here we did, but suppose on another site we didn't get the right info. My advice: go to Chrome, log in to the site with the Network tab recording data & look at the Request Headers. Look at all of the different keys and values. Try to match as many as you can. For example, the site might be picky about the User-Agent, so you can copy the User-Agent and set it. The User-Agent when I used Chrome was Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.65 Safari/537.36, so I could try setting that (again, I would only do this if the first run didn't work; in our case it did, so we don't need to add this argument to our curl command, but if we had to, it would look like this):
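curl's -A (or --user-agent) option sets the User-Agent header; here it is added to the calendar fetch from above:

curl -s -b /tmp/nec-cookie.txt -c /tmp/nec-cookie.txt -A "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.65 Safari/537.36" https://next-episode.net/calendar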

There are many other header fields that we can set; check out the curl man page to see how many you can match. Ideally, if you were able to successfully log in with Chrome, then there should be no reason you can't log in with curl, especially if you match all of the header information that Chrome sent. Also, as a final tip, sometimes there is a third parameter that is sent via -d or --data. So usually it would be like this: --data "username=USERNAME123&password=PASSWORD123", and sometimes you might need to do something like this: --data "un=USERNAME123&pw=PASSWORD123&php=" (note it's not always the case that the username & password parameters are named "username" & "password"; they could be something else, such as "un" and "pw" respectively), as seen in this mini example:
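In a case like that, the Form Data pane of the login request would show something like this (the field names un, pw, and php here are just illustrative; read the real ones off your own site's login request):

un: USERNAME123
pw: PASSWORD123
php: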

One last important point: you should always log out. Open up the Chrome developer tools and go to the Network tab. Click on the website's logout link and find the corresponding entry for logging out. Then check out the headers that were sent for logging out. Usually it's quite simple. For next-episode my logout command looked like this:
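Assuming the Network tab shows a plain GET to a /logout endpoint (the endpoint name is a guess; read the real one off your own Network tab), the command is along these lines:

curl -s -b /tmp/nec-cookie.txt https://next-episode.net/logout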

For another site it could look like this (as in the mini example above):
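Again a guess at the shape, reusing the cookie file from the mini example and assuming that site also logs out via a simple GET:

curl -b /tmp/COOKIE.txt https://192.168.1.123/logout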

Below is the next-episode script. We add an extra argument, -s, to the curl commands, which simply silences the output on the screen. We don't want to see any output from curl other than it logging in and saving the desired website's raw HTML data to a file. Then we can use our shell script (using grep/awk/sed) to parse the data for the needed information.

Next Episode Login and Extract Data from Calendar script

Here is my Next-Episode login and extract-favorite-TV-shows-for-the-month script:
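A skeleton of the script, assuming the endpoints and cookie path used throughout this article; the grep/sed parsing expressions below are placeholders, since the real ones depend on the site's HTML layout at the time of writing (see the warning below):

#!/bin/bash
# nec.sh - log in to next-episode.net and extract this month's calendar
# Edit these two variables with your own credentials:
USERNAME="USERNAME123"
PASSWORD="PASSWORD123"
COOKIE=/tmp/nec-cookie.txt
HTML=/tmp/nec-calendar.html
OUT=/tmp/nec.txt

# log in: POST the form fields seen in the Network tab, save the session cookie
curl -s -c "$COOKIE" --data "username=$USERNAME&password=$PASSWORD" https://next-episode.net/userlogin > /dev/null

# fetch the calendar as a logged-in user (-b reads the cookie, -c lets the site update it)
curl -s -b "$COOKIE" -c "$COOKIE" https://next-episode.net/calendar > "$HTML"

# parse out the show listings - PLACEHOLDER expressions; the real ones are
# layout-specific and must be rewritten whenever the site changes its HTML
grep -o '<a [^>]*>[^<]*</a>' "$HTML" | sed 's/<[^>]*>//g' > "$OUT"

# log out (endpoint name is a guess - check your own Network tab)
curl -s -b "$COOKIE" https://next-episode.net/logout > /dev/null

# optional: send the result to a webserver (comment out if not wanted)
# scp "$OUT" user@myserver:/var/www/html/nec/nec.txt

cat "$OUT"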

Example output: http://ram.kossboss.com/nec/nec.txt (if that doesn't work, scroll to the bottom)

WARNING: If the site changes its output layout, I will need to update the script (this has had to happen once already), as my awk/sed/grep expressions only work for the layout at the time when I tested/edited/wrote the script. So I have "tested to work on" dates below.

Tested to work on dates: 2015-03-09, 2015-05-24, 2015-06-15

NOTE: if someone wants an explanation of what the sed & grep regular expressions are doing, let me know and I'll append it to the article. Right now I just wanted to post the main meat of it.

Then you can use a crontab to run this once per day:
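For example, a crontab entry like this runs it every morning at 9 (the script path is whatever you saved it as):

0 9 * * * /path/to/nec.sh > /dev/null 2>&1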

Example output (As of 2015-03-09):

 
