Creating a bulk website downloader in Perl

Introduction

This tutorial shows you step by step how to create a bulk website downloader in Perl. For Red Hen projects, this is useful for downloading subtitle files or transcripts. Here, I will use the Fox News transcripts for the show Hannity as an example.

Preparation

You need a Perl environment. On Linux, Perl is usually already installed, so if you have access to a Linux machine, you can just work on that machine. If you have a Windows desktop, you may want to install Strawberry Perl. You may also want to use an editor with syntax highlighting for Perl, so you can more easily spot errors such as quotation marks left open or similar problems. I recommend Notepad++.

Researching the website

First of all, we have to get an idea of how the data is organized on the website we are looking to download from. Here, we are interested in Fox News' Hannity, so we will start at http://www.foxnews.com/on-air/hannity/transcripts.
Scrolling down the page we get to a section of transcripts and a navigation element that lets us access pages listing older transcripts:

Let us first take a look at these by clicking on the number 2. Now observe the address bar in your browser. The new URL is
http://www.foxnews.com/on-air/hannity/transcripts?page=1
That looks promising. Let us click on the number 3 to be sure. The new URL is 
http://www.foxnews.com/on-air/hannity/transcripts?page=2
Okay, we can see a pattern. To get to older shows, we can just increment the number at the end of this URL. Try that in your browser address bar: change the number to 50, then to 250. That show really has been running for some time, apparently. Let us change it to 1000.
http://www.foxnews.com/on-air/hannity/transcripts?page=1000
Okay, that page does not contain any transcripts. More interestingly, though, it still has the navigation element on it, which tells us the last page that still contains transcripts:

At the time of writing, it is page 308. When I click 308, it takes me to
http://www.foxnews.com/on-air/hannity/transcripts?page=307
True, page 1 was actually the second page. I wonder whether they just started counting at 0 (a common convention in computing). Let us try:
http://www.foxnews.com/on-air/hannity/transcripts?page=0
This actually is the same page as if the "?page=0" part were missing, i.e. the page we started with. That is even better, because it means we do not need to give special treatment to the first page.
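Since all overview pages share this structure, their URLs can be generated by simple string concatenation. Here is a minimal stand-alone sketch (only the first three pages shown):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the URLs of the first three overview pages by appending the page number:
my $starturl = "http://www.foxnews.com/on-air/hannity/transcripts?page=";
my @pageurls = map { $starturl . $_ } 0 .. 2;
print "$_\n" for @pageurls;
```

The downloader does exactly this, only for all pages from 0 up to the last one.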
Now let us take a look at how to find the actual transcripts. In order to do so, please follow these steps:
  1. Click on one of the links to a transcript and write down (or copy/paste) the address from the browser address bar. I chose the following:
     http://www.foxnews.com/transcript/2016/05/13/trump-paul-ryan-doing-good-job-uniting-gop-amazon-ceo-using-washington-post-for/
  2. Go back to the page on which you clicked the link and right-click somewhere on the background of that page. Your browser (depending on the browser version) should offer you a menu with a few options. One should read something like "View page source". When you click on that, you will get a new tab with the source code of that website. Don't worry, you do not need to understand what is written there!
  3. Now press CTRL-F to search on that page and look for the link you saved earlier. In this case we are lucky; the link is clearly visible here, twice in fact:

    You may not always be that lucky, so sometimes you have to search for parts of the URL (in this case I would have tried "trump-paul-ryan-doing-good" for a start). What we are interested in here is the context of this URL to determine how we can find it automatically. Since we have two of them and we need only one, we will take the one with more context around it, i.e. the second one. My guess is that all links of this type on this website look the same and we can just look for the surrounding elements in the same line. So please copy that entire line somewhere for later reference. Mine looks like this:
<h3 class="title"><a href="http://www.foxnews.com/transcript/2016/05/13/trump-paul-ryan-doing-good-job-uniting-gop-amazon-ceo-using-washington-post-for/">Trump: Paul Ryan doing a good job of uniting GOP, Amazon CEO using Washington Post for political power</a></h3>
This concludes the research phase. We now know everything we need to configure the downloader or even to write our own.

Configuring the downloader for Fox News shows

You can use the file I prepared. It is available from the RedHen repository at GitHub: https://github.com/RedHenLab/website_downloader. If you wish to work with Fox News, not many changes are needed.
Find the following passage in the downloader:
  1. # Start here with configuration
  2. my $starturl = "http://www.foxnews.com/on-air/hannity/transcripts?page=";
  3. my $startindex = 0;
  4. my $endindex = 307;
  5. my $targetfolder = "transcripts_fox_hannity";
  6. my $beginningoffilename = "transcript_fox_hannity_";
  7. # End of show-specific configuration. For Fox News transcripts, you should not need to change anything below here.
Most of this should be self-explanatory, but I will give you a run-through line by line:
  1. A line starting with # is a comment in Perl. This helps you understand what does what in the program.
  2. Here we set the variable $starturl to the URL of the overview pages we identified in our research above, minus the number at the end, which we want the software to increment automatically.
  3. This tells the program where to start counting - here we start at page 0, the first page.
  4. This tells the program where to stop counting - so 307 is the last page it should look at, which corresponds to page 308 on the website.
  5. This tells the program the name of the folder/directory in which we would like to store the transcripts. So once you run the downloader, a folder transcripts_fox_hannity should show up in the directory where you started the program.
  6. This tells the program the beginning of the filename. Later in the program, year, month, day and title are added, so that our filename for the show seen above looks as follows: transcript_fox_hannity_2016-05-13_trump-paul-ryan-doing-good-job-uniting-gop-amazon-ceo-using-washington-post-for.html
  7. Again, the line with the # is a comment providing orientation in the file.
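To illustrate point 6, here is a minimal sketch of how that filename is assembled, with the date and title from the example URL hard-coded (the actual program extracts them from the transcript URL automatically):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $beginningoffilename = "transcript_fox_hannity_";
# Hard-coded example values; the downloader extracts these from the transcript URL:
my ($year, $month, $day) = ("2016", "05", "13");
my $title = "trump-paul-ryan-doing-good-job-uniting-gop-amazon-ceo-using-washington-post-for";
# The dot concatenates strings in Perl:
my $filename = $beginningoffilename . $year . "-" . $month . "-" . $day . "_" . $title . ".html";
print "$filename\n";
```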
If you set these 5 values, you should be good to go, possibly also for other shows on Fox News. To start the program, depending on your system, you either double-click it or run it from the command line. This may look as follows:
C:\Daten\redhen>perl website_downloader.pl
Looking at page No. 0
Downloading transcript http://www.foxnews.com/transcript/2016/05/13/trump-paul-ryan-doing-good-job-uniting-gop-amazon-ceo-using-washington-post-for/
Downloading transcript http://www.foxnews.com/transcript/2016/05/11/newt-gingrich-endorses-trump-london-new-muslim-mayor-doubles-down-on-trump/
Downloading transcript http://www.foxnews.com/transcript/2016/05/10/corey-lewandowski-explains-key-to-trump-vp-search-huckabee-reacts-to-primary/
Downloading transcript http://www.foxnews.com/transcript/2016/05/09/rick-perry-ben-carson-explain-why-came-around-to-trump/
Downloading transcript http://www.foxnews.com/transcript/2016/05/06/jan-brewer-illegal-immigration-has-extraordinary-costs-how-would-trump-take-out/
Downloading transcript http://www.foxnews.com/transcript/2016/05/05/gingrich-speaker-ryan-made-big-mistake-priebus-trump-and-ryan-to-sit-down-talk/
Downloading transcript http://www.foxnews.com/transcript/2016/05/04/hannity-why-trump-became-presumptive-gop-nominee-trump-jr-talks-father-historic/
Downloading transcript http://www.foxnews.com/transcript/2016/05/03/carson-jindal-goolsbee-and-huckabee-react-to-trump-big-indiana-win/
Downloading transcript http://www.foxnews.com/transcript/2016/05/02/trump-it-just-all-ends-with-indiana-clinton-reservation-remark-was-derogatory/
Downloading transcript http://www.foxnews.com/transcript/2016/04/29/cruz-trump-boehner-and-clinton-part-same-corrupt-system-carly-fiorina-on/
Looking at page No. 1
Downloading transcript http://www.foxnews.com/transcript/2016/04/28/can-gop-unite-behind-any-candidate/
Downloading transcript http://www.foxnews.com/transcript/2016/04/27/will-cruz-hail-carly-pass-work/
If the program does not start and you get an error message you do not understand, make sure you have closed all quotation marks and have not omitted any of the semicolons at the ends of the lines.

Detailed system description

There is no need to understand everything in here, but it sure helps if you want to create downloaders for websites other than Fox News. This section assumes you have read about the configuration options in the previous section. I will pick out a few things to explain, but of course this tutorial cannot replace an introduction to programming in Perl or to regular expressions.

Regular expressions

Below the configuration we find two lines with regular expressions that are used to find the URL and some information from the URL we are going to use for a filename:
my $urlsearchpattern = qr/      <h3 class="title"><a href="(http:\/\/www\.foxnews\.com\/transcript\/.*?)"/; # The part in brackets defines the URL
# The URL looks like this: http://www.foxnews.com/transcript/2016/05/13/trump-paul-ryan-doing-good-job-uniting-gop-amazon-ceo-using-washington-post-for/
my $fileinformationsearchpattern = qr/transcript\/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/(.*?)\/?$/; # This needs to yield year, month, day and title (in this order!). Otherwise, changes are necessary below.
 
Regular expressions are a very powerful way of matching patterns, and there are many tutorials on the web with which to teach yourself their exact workings. Here, our regular expressions are enclosed in "qr/.../".
The first one is a copy/paste job of the common beginning of the transcript URLs, including six leading spaces. Since we are enclosing the regular expression in slashes, the program would regard an ordinary slash ("/") in our pattern as the end of the pattern, so in order to be able to use slashes in the pattern we have to "escape" them, i.e. write a backslash ("\") in front of them. Furthermore, dots have a special meaning in regular expressions - they stand for "any character" - and thus should also be escaped with a backslash if we are looking for actual dots. Towards the end we find the expression ".*?". This means "any character (".") any number of times ("*") but as few times as possible ("?"). Basically, we want to match everything that comes before the first closing quotation mark. The brackets indicate which part shall be extracted from the pattern - in this case they enclose the URL.
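To see the first pattern in action, here is a small stand-alone sketch that applies it to a copy of the line we saved during the research phase (title shortened for readability; note the six leading spaces in the test string):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same pattern as in the downloader:
my $urlsearchpattern = qr/      <h3 class="title"><a href="(http:\/\/www\.foxnews\.com\/transcript\/.*?)"/;

# A line as copied from the page source, with the six leading spaces:
my $line = '      <h3 class="title"><a href="http://www.foxnews.com/transcript/2016/05/13/trump-paul-ryan-doing-good-job-uniting-gop-amazon-ceo-using-washington-post-for/">Trump: Paul Ryan doing a good job</a></h3>';

# The part in brackets ends up in $url:
my ($url) = $line =~ /$urlsearchpattern/;
print "Found: $url\n" if defined $url;
```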
The second regular expression is a bit more complicated. Remember the URL looks like this:
http://www.foxnews.com/transcript/2016/05/13/trump-paul-ryan-doing-good-job-uniting-gop-amazon-ceo-using-washington-post-for/
In our pattern,
[0-9] stands for "any digit",
{4} stands for "4 times (of the preceding item)",
() brackets around something mean that the matched part will be stored in a variable.
So we have 4 digits for the year, 2 for the month, and 2 for the day. Plus we have something ("(.*?)") between the next two slashes; again this is an arbitrary number of characters, but the smallest possible match. This should give us the shortened title that is part of the URL. As it turns out, in 2010 there was a change in the URL format, so the final slash had to be made optional. The $-symbol stands for the end of the string.
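The second pattern can be tried out the same way. This sketch extracts year, month, day and title from the example URL:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same pattern as in the downloader:
my $fileinformationsearchpattern = qr/transcript\/([0-9]{4})\/([0-9]{2})\/([0-9]{2})\/(.*?)\/?$/;

my $url = "http://www.foxnews.com/transcript/2016/05/13/trump-paul-ryan-doing-good-job-uniting-gop-amazon-ceo-using-washington-post-for/";
# The four bracketed groups become $year, $month, $day and $title:
my ($year, $month, $day, $title) = $url =~ /$fileinformationsearchpattern/;
print "$year-$month-$day $title\n";
```

Because ".*?" matches as little as possible and the final slash is optional, the trailing slash ends up outside the title.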

The first loop

for (my $i = $startindex;$i <= $endindex ; $i++) {
 
This tells the computer to create a counter called $i which starts at the value of $startindex (in our case 0) and is incremented until it has passed the value of $endindex (in our case 307). For each value of the counter, the computer executes the code between the curly bracket following the for-statement and the matching closing curly bracket, which in our case stands at the end of the program.
The next parts should be understandable due to the comments:
# Let us print the current status so we know what the program is trying to do:
print "Looking at page No. $i\n"; # The symbol \n stands for a new line
# The following line downloads the page at $starturl with the current page number appended and stores its content in the variable $startpage.
# The function get() comes from the module LWP::Simple, which is loaded near the top of the program with "use LWP::Simple;":
my $startpage = get($starturl.$i); # The dot concatenates two strings, so in this case it appends whatever value $i currently has (from 0 to 307) to the string given as $starturl above
# Now $startpage contains the source code of the webpage. Let us now search it for the pattern we identified as pattern above.
my @transcripturls = $startpage =~ /$urlsearchpattern/g;
# Now @transcripturls is a list of all the transcript URLs that were found on the startpage. Let us run through this list and do something with each of these URLs:

The second loop

Again, the idea should become clear from the comments - we basically determine the details for the filename, download the webpage and then write it to disk as a file.
foreach my $transcripturl (@transcripturls) {
    # We set the filename variable to the beginning of the filename set above every time we look at a new transcript.
    my $filename = $beginningoffilename;
    # To complete the filename, we will need year, month, day and title. These are found by the pattern described above.
    my ($year, $month, $day, $title) = $transcripturl =~ /$fileinformationsearchpattern/;
    $filename .= $year."-".$month."-".$day."_".$title.".html"; # The operator .= appends everything on the right to the variable $filename
    # Now let us check if that file exists already. If so, we inform the user and go to the next transcript:
    if (-e $filename) {
        print "Transcript already there: $filename\n";
        next;
    }
    # Again: Print the status:
    print "Downloading transcript $transcripturl\n";
    # The following line does the actual download:
    my $transcripthtml = get($transcripturl);
    # Now let us print the transcript to file:
    open my $fh, ">", $filename or die("Could not open file. $!");
    print $fh $transcripthtml; # This means "print to the file represented by the filehandle $fh the content of the following variable: $transcripthtml"
    close $fh;
}

Final remarks

This information and the software come without any warranty. Use at your own risk. Please note that some websites do not like it when people bulk-download content and may ban you. The software is available from the RedHen GitHub repository: https://github.com/RedHenLab/website_downloader