Some websites offer data that you can download as an Excel or CSV file (e.g., Eurostat), or they may offer structured data in the form of an API. Other websites contain useful information, but they don’t provide that information in a convenient form. In such cases, a webscraper may be the tool to extract that information and store it in a format that you can use for further analysis.
If you really want control over your scraper, the best option is probably to write it yourself, for example in Python. If you’re not into programming, there are some apps that may help you out. One is Outwit Hub. Below I will provide some step-by-step examples of how you can use Outwit Hub to scrape websites and export the results as an Excel file.
But first a few remarks:
- The Outwit Hub app can be downloaded here (it’s also available as a Firefox plugin, but last time I checked it wasn’t compatible with the newest version of Firefox).
- Outwit Hub comes in a free and a paid version; the paid version has more options. As far as I can tell, the most important limitation of the free version is that it will only let you extract 100 records per query. In the examples below, I’ll try to stick to functionality available in the free version.
- Information on websites may be copyrighted. Using that information for purposes other than personal use (e.g. publishing it) may be a violation of copyright.
- Webscraping is a messy process. The data you extract may need some cleaning up. More importantly, always do some checks to make sure the scraper is functioning properly. For example, is the number of results you got consistent with what you expected? Check some examples to see if the numbers you get are correct and if they have ended up in the right row and column.
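If you end up with your scraped records in Python, the checks from the last remark can be partly automated. The snippet below is only a minimal sketch: the function name `check_records`, the field names and the sample rows are made up for illustration, and it assumes each record is a dictionary.

```python
def check_records(records, expected_count, required_fields):
    """Return a list of human-readable problems found in scraped records."""
    problems = []
    # Is the number of results consistent with what you expected?
    if len(records) != expected_count:
        problems.append(f"expected {expected_count} records, got {len(records)}")
    # Did every value end up in the right place, or are some cells empty?
    for i, record in enumerate(records):
        for field in required_fields:
            value = record.get(field)
            if value is None or str(value).strip() == "":
                problems.append(f"record {i}: missing value for '{field}'")
    return problems


# Hypothetical sample data standing in for scraper output.
rows = [
    {"rider": "Rider A", "jerseys": "10"},
    {"rider": "", "jerseys": "7"},
]
problems = check_records(rows, expected_count=3, required_fields=["rider", "jerseys"])
for problem in problems:
    print(problem)
```

A check like this won’t tell you whether the numbers themselves are correct, so you still need to compare a few rows with the original webpage by hand.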
Scraping a single webpage
Sometimes, all the information you’re looking for will be available from one single webpage.
Out of the box, Outwit Hub comes with a number of preset scrapers. These include scrapers for extracting links, tables and lists. In many cases, it makes sense to simply try Outwit Hub’s tables and lists scrapers to see if that will get you the results you want. It will save you some time, and often the results will be cleaner than when you create your own scraper.
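For readers who would rather write their own scraper in Python, a tables scraper can be approximated with the standard library alone. This is a rough sketch of the idea, not a description of how Outwit Hub works internally. The class name `TableParser` and the sample HTML are assumptions, and it ignores complications such as nested tables.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cell text of every <tr> row as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


# Hypothetical sample page.
page = """
<table>
  <tr><th>Rider</th><th>Jerseys</th></tr>
  <tr><td>Rider A</td><td>10</td></tr>
</table>
"""
parser = TableParser()
parser.feed(page)
print(parser.rows)
```

Each inner list corresponds to one row of the output table, which makes it easy to write the result to a CSV file with Python’s `csv` module.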
Sometimes, however, you will have to create your own scraper. You do so by telling Outwit Hub which chunks of information it should look for. The output will be presented in the form of a table, so think of the information as cases (units of information that should go into one row) and within those cases, the different types of information you want to retrieve about those cases (the information that should go into the different cells within a row).
You tell Outwit Hub what information to look for by defining the «Marker Before» and the «Marker After». For example, you may want to extract the text of a title that is represented as <h1>Chapter One</h1> in the HTML code. In this case the Marker Before could be <h1> and the Marker After could be </h1>. This would tell Outwit Hub to extract any text between those two markers.
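To make the marker idea concrete, here is a minimal Python sketch of extracting everything between a Marker Before and a Marker After. The function name `extract_between` and the sample page are my own; this illustrates the principle and is not Outwit Hub’s actual implementation.

```python
def extract_between(html, marker_before, marker_after):
    """Return every piece of text found between the two markers."""
    results = []
    position = 0
    while True:
        start = html.find(marker_before, position)
        if start == -1:
            break  # no further Marker Before
        start += len(marker_before)
        end = html.find(marker_after, start)
        if end == -1:
            break  # Marker Before without a matching Marker After
        results.append(html[start:end].strip())
        position = end + len(marker_after)
    return results


# Hypothetical sample page.
page = "<h1>Chapter One</h1><p>Some text.</p><h1>Chapter Two</h1>"
titles = extract_between(page, "<h1>", "</h1>")
print(titles)
```

Every match becomes one row of output, which is why the choice of markers determines both how complete and how clean your results are.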
It may take some trial and error to get the markers right. Ideally, they should meet two criteria:
- They should capture all the instances you want included. For example, if some of the titles you want to extract aren’t h1 titles but h2 titles, the <h1> and </h1> markers will give you incomplete results. Perhaps you could use <h and </h as markers.
- They should capture as few irrelevant pieces of information as possible. For example, you may find that an interesting piece of information is located between <p> and </p> tags. However, p tags (used to define paragraphs in a text) may occur a lot on a webpage and you may end up with a lot of irrelevant results. So you may want to try to find markers that more precisely define what you’re looking for.
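In a self-written scraper, one way to balance these two criteria is a regular expression that accepts any heading level but nothing else. The sketch below assumes the titles you want are headings (h1 to h6). The pattern, the function name `extract_headings` and the sample page are illustrative only.

```python
import re

# Matches <h1>...</h1> up to <h6>...</h6>, with optional attributes on the
# opening tag. The \1 makes sure the closing tag has the same level.
HEADING = re.compile(r"<h([1-6])[^>]*>(.*?)</h\1>", re.IGNORECASE | re.DOTALL)


def extract_headings(html):
    """Return (level, text) pairs for every heading found in the HTML."""
    return [(int(level), text.strip()) for level, text in HEADING.findall(html)]


# Hypothetical sample page mixing h1 and h2 titles with a paragraph.
page = '<h1>Chapter One</h1><p>Intro</p><h2 class="sub">Section A</h2>'
headings = extract_headings(page)
print(headings)
```

The paragraph is skipped while both title levels are captured. Regular expressions are fragile on messy HTML, so treat this as a starting point and check the output against the page.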
Some French workers have resorted to «bossnapping» as a response to mass layoffs during the crisis. If you’re interested in the phenomenon, you can find some information from a paper on the topic summarized here. From a webscraping perspective, this is pretty straightforward: all the information can be found in one table on a single webpage.
The easiest way to extract the information is to use Outwit Hub’s preset «tables» scraper:
Of course, rather than using the preset table scraper, you may want to try to create your own scraper:
Example: Wikipedia Yellow Jerseys table
If you’re interested in riders who won Yellow Jerseys in the Tour de France, you can find statistics on this Wikipedia page. Again, the information is presented in a single table on a single webpage.
Again, the easy way is to use Outwit Hub’s «tables» scraper:
And here’s how you create your own scraper:
Example: the Fall band members
Mark E. Smith of the Fall is a brilliant musician, but he does have a reputation for discarding band members. If you want to analyse the Fall band member turnover, you can find the data here. This time, the data is not in a table structure. The webpage does have a list structure, but the list elements are the descriptions of band members, not their names and the years in which they were band members. So Outwit Hub’s «tables» and «lists» scrapers won’t be much help in this case – you’ll have to create your own scraper.
To extract the information:
Navigating through links on a webpage
In the previous examples, all the information could be found on a single webpage. Often, the information will be spread out over a series of webpages. Hopefully, there will also be a page with links to all the pages that contain the relevant information. Let’s call the page with links the index page and the webpages it links to (where the actual information is to be found) the linked pages.
You’ll need a strategy to follow the links on the index page and collect the information from all the linked pages. Here’s how you do it:
- First visit one of the linked pages and create a scraper to retrieve the information you need from that page.
- Return to the index page and tell Outwit Hub to extract all the links from that page.
- Try to filter these links as well as you can to exclude irrelevant links (most webpages contain large numbers of links and most of them are probably irrelevant for your purposes).
- Tell Outwit Hub to apply the scraper (the one you created for one of the linked pages) to all the linked pages.
- Hopefully, all the linked pages have the same structure, but don’t count on it. You’ll need to check if your scraper works properly for all the linked pages.
- In the output window, make sure to set the catch / empty settings correctly because otherwise Outwit Hub will discard the output collected so far before moving to the next linked page.
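If you write the scraper yourself in Python, the same strategy could look roughly like the sketch below. All names (`collect_links`, `scrape_linked_pages`, `scrape_title`), the example.org URLs and the sample pages are hypothetical. To keep the example self-contained, the pages are served from a dictionary instead of being downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def collect_links(index_html, base_url, must_contain):
    """Extract links from the index page, keeping only the relevant ones."""
    parser = LinkParser()
    parser.feed(index_html)
    links = []
    for href in parser.hrefs:
        url = urljoin(base_url, href)  # make relative links absolute
        if must_contain in url and url not in links:
            links.append(url)
    return links


def scrape_linked_pages(links, fetch, scrape):
    """Apply one scraper to all linked pages, accumulating the output."""
    results, failures = [], []
    for url in links:
        record = scrape(fetch(url))
        if record:
            results.append({"url": url, **record})
        else:
            failures.append(url)  # the page probably has another structure
    return results, failures


def scrape_title(html):
    """Hypothetical scraper for one linked page: text between <h1> and </h1>."""
    start, end = html.find("<h1>"), html.find("</h1>")
    if start == -1 or end == -1:
        return None
    return {"title": html[start + len("<h1>"):end]}


# Hypothetical pages standing in for a real website.
pages = {
    "https://example.org/stage-1.html": "<h1>Stage 1</h1>",
    "https://example.org/stage-2.html": "<h2>Stage 2</h2>",  # different structure
}
index = ('<a href="/about.html">About</a>'
         '<a href="/stage-1.html">1</a><a href="/stage-2.html">2</a>')

links = collect_links(index, "https://example.org/", must_contain="stage-")
results, failures = scrape_linked_pages(links, pages.get, scrape_title)
print(results)
print(failures)
```

The results are collected in one list that survives from page to page, and pages where the scraper found nothing are reported separately so you can check them. For a real website you would replace `pages.get` with a function that downloads a URL, for instance one built on `urllib.request.urlopen`.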
Example: Tour de France 2013 stages
We’ll return to the Tour de France Yellow Jersey, but this time we’ll look in more detail into the stages of the 2013 edition. Information can be found on the official webpage of le Tour.
Navigating through multiple pages with links
This works the same as above, but now the links to the linked pages are spread over a series of index pages rather than being on a single index page.
First create a web scraper for one of the linked pages, then collect the links from the index page so you can tell Outwit Hub to apply your scraper to all the linked pages. However, you’ll need one more step before you can tell Outwit Hub to apply the scraper: you’ll need to collect the links from all the index pages, not just the first one. In many cases, Outwit Hub will be able to find out by itself how to move through all the index pages.
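In a self-written scraper, moving through the index pages is something you have to spell out. The sketch below assumes each index page contains a link to the next one. The function names, the `class="result"` and `rel="next"` attributes and the sample pages are all assumptions made for illustration, since real websites mark up their pagination in many different ways.

```python
import re


def collect_from_index_pages(first_url, fetch, extract_links, find_next, max_pages=100):
    """Walk through a chain of index pages and gather the links from each."""
    all_links = []
    seen_pages = set()
    url = first_url
    # Stop when there is no next page, when a page repeats, or at max_pages.
    while url and url not in seen_pages and len(seen_pages) < max_pages:
        seen_pages.add(url)
        html = fetch(url)
        for link in extract_links(html):
            if link not in all_links:
                all_links.append(link)
        url = find_next(html)
    return all_links


def extract_links(html):
    """Hypothetical: result links are marked with class="result"."""
    return re.findall(r'<a class="result" href="([^"]+)"', html)


def find_next(html):
    """Hypothetical: the link to the next index page is marked rel="next"."""
    match = re.search(r'<a rel="next" href="([^"]+)"', html)
    return match.group(1) if match else None


# Hypothetical index pages standing in for a real website.
pages = {
    "index?page=1": '<a class="result" href="q1.html">Q1</a>'
                    '<a rel="next" href="index?page=2">next</a>',
    "index?page=2": '<a class="result" href="q2.html">Q2</a>',
}

links = collect_from_index_pages("index?page=1", pages.get, extract_links, find_next)
print(links)
```

The `seen_pages` set and the `max_pages` limit are there to prevent an endless loop if the last index page links back to an earlier one.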
Example: Proceedings of Parliament
Suppose you want to analyse how critically Dutch Members of Parliament have been following the Dutch intelligence service AIVD over the past 15 years or so. You can search the questions they have asked with a search query like this, which gives you 206 results. Their URLs can be found on a series of 21 index pages (new questions may have been asked since, in which case you’ll get a higher number of results). So the challenge is to create a scraper for one of the linked pages and then get Outwit Hub to apply this scraper to all the links from all 21 index pages.