Web Scraping for STEP past papers and solutions, Part II, a bug
There was a major bug in my first attempt at scraping!
Other posts in series
A bug
Earlier this week, I tried to open one of the files containing a solution to a past STEP paper. To my surprise, it did not open. I tried several other files, and none of them opened. A big lesson here is to actually check the files have properly downloaded! I just assumed that the files successfully downloaded because I saw the correct filenames appear in the directory. (This is somewhat ironic because one of my takeaways at the end of the previous post was to not make assumptions…)
So I had to go back to my previous attempt and try to work out what went wrong. I compared my code to various other examples online, and I could not see the error. I then tried several other variations that I found while searching, and none of them worked.
Next I tried to look at the object obtained after running r = requests.get(url)
. I first ran print(r.content)
to actually see what I was writing to the files. The output was html for a webpage - that made no sense, since the url directly goes to a jpg file. I was feeling a bit clueless here. I used dir(r)
to see if there were any functions that might help me, but none stood out. One of the methods was url
and in desperation I decided to do run print(r.url)
, expecting just to the url that was inputted into the .get
method. However, the output was:
https://2017.integralmaths.org/login/index.php
Bingo! The problem was now clear. The requests
function is distinct from the selenium objects, so you have to separately log-in to the website for requests
. After some Googling, I found out how you can enter log-in credentials using requests, and it worked! What a relief. The new bits of code are given at the end.
Lessons learnt
- As mentioned above, I really need to absorb the lesson that one should not make assumptions.
- If you’re completely stuck, it might be worthwhile to go through all the methods of the object, to see what can be uncovered.
- There is a limit to my understanding of the code and the modules. Stitching together code from random blogs and stackoverflow works, but I have no real understanding. I did try looking at the documentation for some of the modules I was using, but they are incomprehensibly dense. I am not sure what best practice is here.
- Try to have as much of your work saved, so if there is a bug, you do not need to repeat everything. In this case, I should have saved a list of the urls during my first attempt, so that for any potential future attempts, all I would have to do is loop through these urls and download the images, rather than having to navigate via a browser to find all the urls again.
Code
Below is the new bits of code I had to add into the original code. It may be incomprehensible without the context, but I think the syntax is mostly self-explanatory.
login_url = "https://2017.integralmaths.org/login/index.php"
payload = {
'username': 'mei-step',
'password': 'Stepaea1'
}
with requests.Session() as s:
s.post(login_url, data=payload)
r = s.get(url, allow_redirects=True, stream=True)
open(filename, 'wb').write(r.content)