Scraping websites with Outwit Hub: Step by step tutorial

5 July 2014

Some websites offer data that you can download as an Excel or CSV file (e.g., Eurostat), or they may offer structured data in the form of an API. Other websites contain useful information, but they don’t provide that information in a convenient form. In such cases, a webscraper may be the tool to extract that information and store it in a format that you can use for further analysis.

If you really want control over your scraper the best option is probably to write it yourself, for example in Python. If you’re not into programming, there are some apps that may help you out. One is Outwit Hub. Below I will provide some step by step examples of how you can use Outwit Hub to scrape websites and export the results as an Excel file.

But first, a remark:

The Outwit Hub app can be downloaded here (it’s also available as a Firefox plugin, but last time I checked it wasn’t compatible with the newest version of Firefox).

Scraping a single webpage

Sometimes, all the information you’re looking for will be available from one single webpage.

Strategy

Out of the box, Outwit Hub comes with a number of preset scrapers. These include scrapers for extracting links, tables and lists. In many cases, it makes sense to simply try Outwit Hub’s tables and lists scrapers to see if that will get you the results you want. It will save you some time, and often the results will be cleaner than when you create your own scraper.

Sometimes, however, you will have to create your own scraper. You do so by telling Outwit Hub which chunks of information it should look for. The output will be presented in the form of a table, so think of the information in terms of cases (units of information that should each go into one row) and, within those cases, the different types of information you want to retrieve about those cases (the values that should go into the different cells within a row).

You tell Outwit Hub what information to look for by defining the «Marker Before» and the «Marker After». For example, you may want to extract the text of a title that is represented as <h1>Chapter One</h1> in the html code. In this case the Marker Before could be <h1> and the Marker After could be </h1>. This would tell Outwit Hub to extract any text between those two markers.
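If you prefer scripting this (the DIY alternative mentioned above), the marker logic is easy to reproduce in Python. A minimal sketch, using a made-up HTML snippet:

```python
# What «Marker Before» / «Marker After» do, sketched in Python.
# The sample HTML string is made up for illustration.
html = "<h1>Chapter One</h1><p>Some text.</p>"

marker_before = "<h1>"
marker_after = "</h1>"

# everything between the two markers is the extracted chunk
start = html.find(marker_before) + len(marker_before)
end = html.find(marker_after, start)
title = html[start:end]
print(title)  # Chapter One
```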

It may take some trial and error to get the markers right. Ideally, they should meet two criteria: they should occur around every piece of information you want to extract, and they should not match anything on the page that you don’t want.

Example: Bossnappings

Some French workers have resorted to «bossnapping» as a response to mass layoffs during the crisis. If you’re interested in the phenomenon, you can find some information from a paper on the topic summarized here. From a webscraping perspective, this is pretty straightforward: all the information can be found in one table on a single webpage.

The easiest way to extract the information is to use Outwit Hub’s preset «tables» scraper:

  1. Copy the url into the address bar in Outwit Hub. As long as the option «page» is selected in the left-hand column, you’ll see the webpage pretty much the same way as you’d see it in your browser.
  2. Now select the option «tables» in the left-hand column. You’ll get to see a table with the information Outwit Hub has extracted from the webpage (if you’re curious, also try what happens when you select «lists»). The first columns of the output table contain information that you may not need.
  3. In order to export the information, select the rows you want (in this case: all rows), right click, «Export selection as», «Excel». And you’re done.
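If you’d rather script the table extraction yourself, Python’s standard library can do a rough equivalent of the preset «tables» scraper. A sketch, with a made-up two-row table:

```python
# A rough stand-in for Outwit Hub's preset «tables» scraper, using only
# Python's standard library. The sample HTML below is made up.
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects the text of every <td>/<th> cell, grouped per <tr> row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True
            self._cell = ""

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False
            self._row.append(self._cell)

    def handle_data(self, data):
        if self._in_cell:
            self._cell += data

scraper = TableScraper()
scraper.feed("<table><tr><th>Date</th><th>Company</th></tr>"
             "<tr><td>2009-03-12</td><td>Kléber</td></tr></table>")
print(scraper.rows)  # [['Date', 'Company'], ['2009-03-12', 'Kléber']]
```

From there, the rows can be written to CSV and opened in Calc or Excel.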

Of course, rather than using the preset table scraper, you may want to try to create your own scraper:

  1. If you haven’t done so yet, copy the url into the address bar in Outwit Hub.
  2. In the left-hand column, select «scrapers».
  3. You now see two windows: the top one shows the html code of the page you’re about to scrape; this is where you look for patterns that can be used as Markers Before and Markers After. The lower window contains the lines where you’ll enter those markers to define the chunks of information you want extracted.
  4. You will move back and forth between the «scrapers» and «page» options in the left-hand column: on the page you look for examples of pieces of information you want to retrieve, and in the html code (shown when you click «scrapers») you look for markers that define where these pieces of information can be found.
  5. At the bottom of the screen, click «New», give your scraper a name, and save it.
  6. First set the markers to define which information should be included in each row. Click the «page» option and, by way of example, look up what information should be included in the first row. This would be a date, the company name Kléber, and descriptions of the action and the outcome.
  7. Now return to the «scrapers» view. Enter «Kléber» in the search box below the html code. You’ll find that all the information from the first row is contained between <tr> and </tr> tags (screenshot: https://www.flickr.com/photos/dirkmjk/14599374193/).
  8. Fill out the first line of your scraper in the bottom window: Description = Row; Marker Before = <tr> and Marker After = </tr>. If you want to, you can click the «Execute» button to the right to check how your scraper is doing so far; all the information from your original table should now be in your output, separated into different rows but not yet into different columns. Click «scrapers» to return to your scraper.
  9. If you take another look at the html code, you may find that all the bits of information that should go into different cells are contained in <td valign="top"> and </td> tags.
  10. Fill out the second line of your scraper: Description = Cell; Marker Before = <td valign="top">, Marker After = </td>. Click «Execute».
  11. If you inspect the output, you’ll find that your scraper isn’t as good as Outwit Hub’s tables scraper: the column names are not where they should be and the last row contains a list of all the tags used on the blog. Of course, these are things you can easily clean up in Calc or Excel.
  12. In the output window, select the rows you want (in this case: rows 2–14), right click, «Export selection as», «Excel».
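The Row/Cell marker pattern from the steps above translates naturally into two nested extractions in Python. A sketch using made-up sample data in the same <tr> / <td valign="top"> format (the dates and second company are invented for illustration):

```python
# The Row/Cell marker pattern sketched with Python's re module.
# The sample HTML mimics the table's structure; the data is made up.
import re

html = ('<table>'
        '<tr><td valign="top">2009-03-12</td><td valign="top">Kléber</td></tr>'
        '<tr><td valign="top">2009-04-01</td><td valign="top">Caterpillar</td></tr>'
        '</table>')

# Marker Before / Marker After for rows...
rows = re.findall(r'<tr>(.*?)</tr>', html, re.DOTALL)
# ...and, within each row, for cells
table = [re.findall(r'<td valign="top">(.*?)</td>', row, re.DOTALL)
         for row in rows]
print(table)  # [['2009-03-12', 'Kléber'], ['2009-04-01', 'Caterpillar']]
```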

Example: Wikipedia Yellow Jerseys table

If you’re interested in riders who won Yellow Jerseys in the Tour de France, you can find statistics on this Wikipedia page. Again, the information is presented in a single table on a single webpage.

Again, the easy way is to use Outwit Hub’s «tables» scraper:

  1. Copy the url into the address bar in Outwit Hub.
  2. Select «tables» in the left-hand column.
  3. Note that there are actually three tables on the webpage and that the information from all three tables has been put into one single table in the Outwit Hub output. You can select the relevant rows before exporting your data.
  4. Also note that the column names are lacking in the header row; instead the column names have been included in all the cells. You can later clean this up in Calc or Excel.
  5. To export the output, select the rows you want (in this case: rows 1–276), right click, «Export selection as», «Excel».

And here’s how you create your own scraper:

  1. Copy the url into the address bar in Outwit Hub and, in the left-hand column, select «scrapers».
  2. At the bottom of the screen, click «New», give your scraper a name, and save it.
  3. The first row in the table you’re interested in contains the name Eddy Merckx (who else). Enter «Eddy Merckx» in the search box below the html code. Skip the first few instances (Eddy Merckx is mentioned a few times on the webpage before the table starts) until you find this bit of code. Again, all the information from the first row is contained between <tr> and </tr> tags.
  4. Fill out the first line of your scraper in the bottom window: Description = Row, Marker Before = <tr> and Marker After = </tr>.
  5. If you take another look at the html code, you’ll find that all the bits of information that should go into different cells are defined by <td> and </td> tags.
  6. Fill out the second line of your scraper: Description = Cell, Marker Before = <td> and Marker After = </td>. Click «Execute».
  7. Now, if you check your output, you’ll find that you only have 254 rows of results for this table: 257 minus the first 3 irrelevant ones (or perhaps a few more, if information from subsequent tours has been added to the Wikipedia page). That’s fewer than the 276 rows you got when you used the preset «tables» scraper, so apparently something’s gone wrong. If you scroll to the right in the output, you’ll find that in some cases your scraper failed to identify the beginning of a new row and instead appended the information that should have gone into a new row to the end of the previous one. The first instance where this happens is with Fabian Cancellara.
  8. Return to the «scrapers» view. Enter «Cancellara» in the search box. You’ll find this bit of html code. It turns out that this row isn’t preceded by a <tr> tag but by a <tr style="background:#CEDFF2;"> tag. A pragmatic solution is to replace the <tr> Marker Before in the first line of your scraper with <tr (that is, without the right-pointing angle bracket). Click «Execute»; you should now get the correct number of rows.
  9. In the output window, select the rows you want (in this case: rows 4–279), right click, «Export selection as», «Excel».
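The fix in step 8 (matching «<tr» without the closing angle bracket so that styled rows are caught too) corresponds, in regex terms, to allowing optional attributes in the row marker. A sketch with made-up sample rows:

```python
# Matching «<tr» without the closing bracket, so that rows like
# <tr style="background:#CEDFF2;"> are caught too. Sample HTML made up.
import re

html = ('<tr><td>Eddy Merckx</td></tr>'
        '<tr style="background:#CEDFF2;"><td>Fabian Cancellara</td></tr>')

# <tr ...> with any (or no) attributes, up to the matching </tr>
rows = re.findall(r'<tr[^>]*>(.*?)</tr>', html, re.DOTALL)
print(len(rows))  # 2 — both the plain and the styled row are matched
```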

Example: the Fall band members

Mark E. Smith of the Fall is a brilliant musician, but he does have a reputation for discarding band members. If you want to analyse the Fall’s band member turnover, you can find the data here. This time, the data is not in a table structure. The webpage does have a list structure, but the list elements are the descriptions of band members, not their names and the years in which they were band members. So Outwit Hub’s «tables» and «lists» scrapers won’t be much help in this case – you’ll have to create your own scraper.

To extract the information:

  1. Enter the url in the address bar of Outwit Hub.
  2. Click «scrapers» in the left-hand column.
  3. At the bottom of the screen, click «New», give your scraper a name, and save it.
  4. In the search box below the html window, enter Marc Riley (or any other example of a band member).
  5. You’ll find that the key data about Marc Riley are contained within <p><b> and </p> tags. Use these as Marker Before and Marker After in the first line of your scraper (also try using just <b> as Marker Before and try to figure out why that doesn’t get you the result you wanted).
  6. Click «Execute» and inspect the output. The information in the rows isn’t separated into cells, but you can clean that up in Calc or Excel: Text to columns, use first a dash and then a left parenthesis as separator (Incidentally, the paid version of Outwit Hub will let you add a separator to the scraper itself).
  7. In the output window, select the rows you want (in this case all), right click, «Export selection as», «Excel».
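The «Text to columns» cleanup from step 6 can also be scripted. The example line below mimics the band-member format (name, dash, instruments, years in parentheses); the exact wording on the site may differ:

```python
# Scripted version of the "Text to columns" cleanup: split on a dash,
# then on a left parenthesis. The sample line is an approximation of
# the site's format, not a literal quote from it.
line = "Marc Riley - bass, guitar, keyboards (1978-1983)"

name, rest = line.split(" - ", 1)          # first separator: the dash
instruments, years = rest.split(" (", 1)   # second: the left parenthesis
years = years.rstrip(")")
print([name, instruments, years])
# ['Marc Riley', 'bass, guitar, keyboards', '1978-1983']
```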

Navigating through links on a webpage

In the previous examples, all the information could be found on a single webpage. Often, the information will be spread out over a series of webpages. Hopefully, there will also be a page with links to all the pages that contain the relevant information. Let’s call the page with links the index page and the webpages it links to (where the actual information is to be found) the linked pages.

Strategy

You’ll need a strategy to follow the links on the index page and collect the information from all the linked pages. Here’s how you do it: first create a scraper for one of the linked pages; then return to the index page and collect the links to all the linked pages; finally, tell Outwit Hub to apply your scraper to each of those pages («Fast Scrape»).

Two remarks:

Example: Tour de France 2013 stages

We’ll return to the Tour de France Yellow Jersey, but this time we’ll look in more detail into the stages of the 2013 edition. Information can be found on the official webpage of le Tour.

  1. Enter the url of the webpage in the address bar of Outwit Hub.
  2. The page contains links to all the pages with results of the 21 stages, but in the «page» view these links aren’t easy to find. In the left-hand column, click «links» to inspect all the urls on this webpage. Enter «stage» in the search box at the bottom of the window to find all the rows with information about stages (screenshot). In the Page Url column, double click on the url of stage 1 to go to that page.
  3. Click on «page» in the left-hand column to see what information the page contains. We’re interested in extracting some identifying information about this stage (so that later on it’ll be clear that the row of information that came from this page is the information for stage 1), as well as the details about the winner of the yellow jersey.
  4. Click on «scrapers». In the search box, enter «stage 1» and skip a few instances until you find detailed information about this stage, looking like this (interestingly, the html code contains more information than is shown on the webpage). Fill out the first line of the scraper: Description = Stage; Marker Before = <div class="overall">, Marker After = <span></div></div><div class="buttons">.
  5. In the search box, enter the name of the yellow jersey winner, Kittel. Skip a few instances until you find the detailed information about the yellow jersey, looking like this (screenshot: https://www.flickr.com/photos/dirkmjk/14392668789/). Fill out the second line of your scraper with Description = Yellow Jersey, Marker Before = <h5>yellow jersey</h5>; Marker After = <h5>green jersey</h5>.
  6. Run the scraper. You have collected information including the name of the yellow jersey winner and his total time, as well as the stage number, the start and finish locations and the start time. The bits of information aren’t properly separated into different columns, but again you can correct that in Calc or Excel (using Text to Columns).
  7. Before you tell Outwit Hub to apply the scraper to all the relevant links, change the catch/empty settings with the buttons at the bottom of the output window. Normally, Outwit Hub discards old output each time the scraper runs again (to prevent the output window filling up with junk), but we now want to keep the output from all the runs of the scraper. Select the settings Auto-Catch and Empty-on-Demand (you’ll get a warning that you can ignore).
  8. Click the «Go back one page» button to the top left of the screen (black left-pointing triangle) to return to the index page.
  9. In the left-hand column, click «links».
  10. Again, use the search box at the bottom of the window to find rows about «stage». Select all 21 rows that contain a link to a page with stage results.
  11. Right click, «Auto Explore Pages», «Fast Scrape», and select the scraper you just created.
  12. You should now have 21 rows with details about all the stages.
  13. In the output window, select the rows you want, right click, «Export selection as», «Excel».
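The index-page strategy above (collect the links, then run the scraper on every linked page) looks roughly like this in Python. The pages are stubbed as strings here; in practice you’d fetch each url, for example with urllib.request:

```python
# Index-page strategy: collect links from the index page, then apply the
# same extraction to every linked page. Pages are stubbed as strings here;
# the urls and page contents are made up for illustration.
import re

index_html = ('<a href="/stage-1.html">Stage 1</a>'
              '<a href="/stage-2.html">Stage 2</a>')

# stand-ins for the downloaded linked pages
pages = {
    '/stage-1.html': '<h5>yellow jersey</h5>KITTEL Marcel<h5>green jersey</h5>',
    '/stage-2.html': '<h5>yellow jersey</h5>KITTEL Marcel<h5>green jersey</h5>',
}

links = re.findall(r'href="(/stage-\d+\.html)"', index_html)
results = []
for url in links:
    html = pages[url]  # replace with a real HTTP request
    winner = re.search(r'<h5>yellow jersey</h5>(.*?)<h5>green jersey</h5>', html)
    results.append((url, winner.group(1)))
print(results)
```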

Navigating through multiple pages with links

Same as above, but now the links to the linked pages are not to be found on a single index page, but a series of index pages.

Strategy

First create a web scraper for one of the linked pages, then collect the links from the index page so you can tell Outwit Hub to apply your scraper to all the linked pages. However, you’ll need one more step before you can tell Outwit Hub to apply the scraper: you’ll need to collect the links from all the index pages, not just the first one. In many cases, Outwit Hub will be able to find out by itself how to move through all the index pages.

Example: Proceedings of Parliament

Suppose you want to analyse how critically Dutch Members of Parliament have been following the Dutch intelligence service AIVD over the past 15 years or so. You can search the questions they have asked with a search query like this, which gives you 206 results, and their urls can be found on a series of 21 index pages (perhaps new questions have been asked since, in which case you’ll get a higher number of results). So the challenge is to create a scraper for one of the linked pages and then get Outwit Hub to apply this scraper to all the links from all 21 index pages.

  1. Enter the url of the webpage in the address bar of Outwit Hub.
  2. Click on the first search result (if the newest result has «Onopgemaakt» in the description, meaning the question hasn’t been published in html form yet, use an older search result).
  3. Click «scrapers» in the left-hand column.
  4. At the bottom of the screen, click «New», give your scraper a name, and save it.
  5. Fill out the first line of the scraper: Description = Date, Marker Before = OVERHEIDop.datumIndiening" content=", Marker After = ".
  6. Fill out the second line of the scraper: Description = Text, Marker Before = <div id="broodtekst">, Marker After = #VolledigeInhoudsOpgave.
  7. Click the «Go back one page» button to the top left of the screen to return to the index page.
  8. In the left-hand column, click «links».
  9. Below the output window, select the settings Auto-Catch and Empty-on-Demand (ignore the warning).
  10. To the top left of the screen, click the «Auto browse through series of pages» button (a double right-pointing blueish triangle).
  11. Outwit Hub will now navigate through all the index pages and repeat the action it just performed, i.e. extracting the links. This may take a few minutes. When it’s done, the output window should show all the links from all the index pages – or if you have the free version of Outwit Hub, a maximum of 100 links. (Note that in this case 100 links doesn’t equal 100 result pages, for the index pages also contain a lot of irrelevant links.)
  12. From all the links in the output window, you now want to select the relevant ones. It so happens that links to results all have a filename starting with «kv», so you can select the relevant links by sorting the Filename column.
  13. Once you have selected the relevant links, right click, «Auto Explore Pages», «Fast Scrape», and select the scraper you just created.
  14. You may end up with more results than you expected. Check for incomplete results and filter out duplicates by selecting the option Deduplicate at the bottom of the output window (but be warned that deselecting that option may remove all the output, so you should probably export a backup of all your results before using the Deduplicate option).
  15. In the output window, select the rows you want, right click, «Export selection as», «Excel».
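Scripted, the «Auto browse» step is just a loop over the series of index pages, collecting the «kv» links from each before scraping them. A sketch with two stubbed index pages; the real site’s pagination scheme will differ:

```python
# Looping over a series of index pages and collecting only the links whose
# filename starts with "kv" (as noted in step 12). The index pages are
# stubbed as strings; the real site's pagination scheme is an assumption.
import re

index_pages = {
    1: '<a href="kv-001.html">question 1</a><a href="next">volgende</a>',
    2: '<a href="kv-002.html">question 2</a>',
}

links = []
page = 1
while page in index_pages:
    html = index_pages[page]  # in practice: fetch the url for index page N
    links += re.findall(r'href="(kv-[^"]+)"', html)  # only the "kv" links
    page += 1
print(links)  # ['kv-001.html', 'kv-002.html']
```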
