champagne anarchist | armchair activist

How to automate extracting tables from PDFs, using Tabula

One of my colleagues needs tables extracted from a few hundred PDFs. There’s an excellent tool called Tabula that I frequently use, but you have to process each PDF manually. However, it turns out you can also automate the process. For those like me who didn’t know, here’s how it works.

Command line tool

You can download tabula-java’s jar here (I had no idea what a jar is, but apparently it’s a format to bundle Java files). You also need a recent version of Java. Note that on a Mac, Terminal may still use an old version of Java even if you have a newer version installed. The problem and how to solve it are discussed here.

For this example, create a project folder and store the jar in a subfolder script. Store the PDFs you want to process in a subfolder data/pdf and create an empty subfolder data/csv.

On a Mac, open Terminal, use cd to navigate to your project folder and run the following code (make sure the version number of the tabula jar is correct):

for i in data/pdf/*.pdf; do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o ${i//pdf/csv} $i; done

On Windows, open the command prompt, use cd to navigate to your project folder and run the following code (again, make sure the version number of the tabula jar is correct):

for %i in (data\pdf\*.pdf) do java -jar script\tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o data\csv\%~ni.csv %i

The settings you can use are described here. The examples above use the following settings:

  • -n: stands for nospreadsheet; use this if the tables in the PDF don’t have gridlines.
  • -p all: look for tables in all pages of the document. Alternatively, you can specify specific pages.
  • -a (area): the portion of the page to analyse; the default is the entire page. You can omit this setting, which may be a good idea when the location or size of tables varies. On the other hand, I’ve had a file where tables on one specific page were not extracted unless I set the area variable. The area is defined by coordinates, which you can obtain by analysing one PDF manually with the Tabula app and exporting the result not as csv, but as script.
  • -o: the name of the file to write the csv to.

In my experience, you may need to tinker a bit with the settings to get the results right. Even so, Tabula will sometimes get the rows right but incorrectly or inconsistently identify cells within a row. You may be able to solve this using regex.
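
For example, here’s a minimal sketch of the kind of regex clean-up I mean; the row contents and the pattern are made up for illustration and would need to be adapted to your own data:

import re

# Hypothetical example: Tabula merged a label and a number into one cell,
# e.g. 'Amsterdam 821356' instead of 'Amsterdam' and '821356'
row = ['Amsterdam 821356', '2015']
fixed = []
for cell in row:
    match = re.match(r'^(.*\D)\s+([\d.,]+)$', cell)
    if match:
        # Split the cell into its text part and its number part
        fixed.extend([match.group(1), match.group(2)])
    else:
        fixed.append(cell)
print(fixed)  # ['Amsterdam', '821356', '2015']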

Python (and R)

There’s a Python wrapper, tabula-py, that will turn PDF tables into pandas dataframes. As with tabula-java, you need a recent version of Java. Here’s an example of how you can use tabula-py:

import tabula
import os
import pandas as pd

folder = 'data/pdf/'
paths = [folder + fn for fn in os.listdir(folder) if fn.endswith('.pdf')]
for path in paths:
    df = tabula.read_pdf(path, encoding = 'latin1', pages = 'all', area = [29.75,43.509,819.613,464.472], nospreadsheet = True)
    path = path.replace('pdf', 'csv')
    df.to_csv(path, index = False)

Using the Python wrapper, I needed to specify the encoding. I ran into a problem when I tried to extract tables with varying sizes from multi-page PDFs. I think it’s the same problem as reported here. From the response, I gather the problem may be addressed in future versions of tabula-py.

For those who use R, there’s also an R wrapper for tabula, tabulizer. I haven’t tried it myself.

Call tabula-java from Python

[Update 2 May 2017] - I realised there’s another way, which is to call tabula-java from Python. Here’s an example:

import os

pdf_folder = 'data/pdf'
csv_folder = 'data/csv'

base_command = 'java -jar tabula-0.9.2-jar-with-dependencies.jar -n -p all -f TSV -o {} {}'

for filename in os.listdir(pdf_folder):
    pdf_path = os.path.join(pdf_folder, filename)
    csv_path = os.path.join(csv_folder, filename.replace('.pdf', '.csv'))
    command = base_command.format(csv_path, pdf_path)
    os.system(command)

This solves tabula-py’s problem with multi-page PDFs containing tables with varying sizes.
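
For what it’s worth, if your file names contain spaces you can pass the arguments as a list to subprocess instead of building a shell string. Here’s a sketch along the same lines (same jar and folders as above):

import os
import subprocess

pdf_folder = 'data/pdf'
csv_folder = 'data/csv'
jar = 'tabula-0.9.2-jar-with-dependencies.jar'

for filename in os.listdir(pdf_folder):
    if not filename.endswith('.pdf'):
        continue
    pdf_path = os.path.join(pdf_folder, filename)
    csv_path = os.path.join(csv_folder, filename.replace('.pdf', '.csv'))
    # Arguments are passed as a list, so paths with spaces need no quoting
    subprocess.call(['java', '-jar', jar, '-n', '-p', 'all',
                     '-f', 'TSV', '-o', csv_path, pdf_path])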

Embedding D3.js charts in a responsive website

UPDATE - better approach here.

For a number of reasons, I like to use D3.js for my charts. However, I’ve been struggling for a while to get them to behave properly on my blog which has a responsive theme. I’ve tried quite a few solutions from Stack Overflow and elsewhere but none seemed to work.

I want to embed the chart using an iframe. The width of the iframe should adapt to the column width, and its height should follow from that width so as to maintain the aspect ratio of the chart. The chart itself should fill up the iframe. Preferably, when people rotate their phone, the size of the iframe and its contents should update without the need to reload the entire page.

Styling the iframe

Smashing Magazine has described a solution for embedding videos. You enclose the iframe in a div and use css to give that div a bottom padding of, say, 40% (the percentage depending on the aspect ratio you want). You can then set both the width and height of the iframe itself to 100%. Here’s an adapted version of the code:

<style>
.container_chart_1 {
    position: relative;
    padding-bottom: 40%;
    height: 0;
    overflow: hidden;
}

.container_chart_1 iframe {
    position: absolute;
    top: 0;
    left: 0;
    width: 100%;
    height: 100%;
}
</style>

<div class='container_chart_1'>
<iframe src='http://dirkmjk.nl/2016/embed_d3/chart_1.html' frameborder='0' scrolling='no' id='iframe_chart_1'>
</iframe>
</div>

Making the chart adapt to the iframe size

The next question is how to make the D3 chart adapt to the dimensions of the iframe. Here’s what I thought might work but didn’t: in the chart, obtain the dimensions of the iframe using window.innerWidth and window.innerHeight (minus 16px - something to do with scrollbars apparently?) and use those to define the size of your chart.

Using innerWidth and innerHeight seemed to work - until I tested it on my iPhone. Upon loading a page it starts out OK, but then the update function increases the size of the chart until only a small detail is visible in the iframe (rotate your phone to replicate this). Apparently, iOS returns not the dimensions of the iframe but something else when innerWidth and innerHeight are used. I didn’t have that problem when I tested on an Android phone.

Adapt to the iframe size: Alternative solution

Here’s an alternative approach for making the D3 chart adapt to the dimensions of the iframe. Set width to the width of the div that the chart is appended to (or to the width of the body) and set height to width * aspect ratio. Here’s the relevant code:

var aspect_ratio = 0.4;
var frame_width = $('#chart_2').width();
var frame_height = aspect_ratio * frame_width;

The disadvantage of this approach is that you’ll have to set the aspect ratio in two places: in the css for the div containing the iframe and in the HTML page that is loaded into the iframe. So if you decide to change the aspect ratio, you’ll have to change it in both places. Other than that, it appears to work.

Reloading the chart upon window resize

Then write a function that reloads the iframe content upon window resize, so as to adapt the size of the chart when people rotate their phone. Note that on mobile devices, scrolling may trigger the window resize. You don’t want to reload the contents of the iframe each time someone scrolls the page. To prevent this, you may add a check whether the window width has changed (a trick I picked up here). Also note that with Drupal, you need to use jQuery instead of $.

width = jQuery(window).width();
jQuery(window).resize(function(){
    if(jQuery(window).width() != width){
        document.getElementById('iframe_chart_1').src = document.getElementById('iframe_chart_1').src;
        width = jQuery(window).width();
    }
});

In case you know a better way - do let me know!

FYI, here’s the chart used as illustration in its original context.

How to export cycling junctions (fietsknooppunten) to your Garmin

Cycling junctions are handy. You work out in advance which junctions you want to cycle past, write the numbers on a slip of paper and stick it to the top tube of your frame with transparent tape. But sometimes signs are missing and you lose your way. In the worst case, you end up hopelessly lost in a dreary suburb of Almere.

The solution is simple: export the route to your Garmin (which does need to have navigation). It’s clearly explained here. The author switches off the maps on the Garmin; I don’t, which makes it even simpler. Here are the steps:

  • Go to the Fietsersbond route planner.
  • Click the «LF en knooppunten» button and zoom in until the junctions become visible.
  • Click the junction where you want to start and click «Van» in the popup.
  • Click every junction you want to cycle past and click «Via» in the popup.
  • Click the junction where you want to finish and click «Naar» in the popup.
  • Click the green «Plan route» button.
  • Click «GPS» in the left-hand sidebar and then «GPX bestand» in the next screen.
  • The file is saved to your computer. Connect the Garmin to your computer via USB, drop the file into the «Garmin/NewFiles» folder and disconnect the Garmin.

Your Garmin automatically turns it into a course. This is on a Mac; it may work differently on Windows. And for the geeks: here is where I read how to display your GPX file on a Leaflet map.

Step by step: creating an R package

With the help of posts by Hillary Parker and trestletech I managed to create my first R package in RStudio (here’s why). It wasn’t as difficult as I thought and it seems to work. Below is a basic step-by-step description of how I did it (this assumes you have one or more R functions to include in your package, preferably in separate R-script files):

If you want, you can upload the package to Github. Other people will then be able to install it:

library(devtools)
install_github('username/package-name')

Scraping websites with Outwit Hub: Step by step tutorial

Some websites offer data that you can download as an Excel or CSV file (e.g., Eurostat), or they may offer structured data in the form of an API. Other websites contain useful information, but they don’t provide that information in a convenient form. In such cases, a webscraper may be the tool to extract that information and store it in a format that you can use for further analysis.

If you really want control over your scraper, the best option is probably to write it yourself, for example in Python. If you’re not into programming, there are some apps that may help you out. One is Outwit Hub. Below, I’ll provide some step-by-step examples of how you can use Outwit Hub to scrape websites and export the results as an Excel file.
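
Just to give an idea, a bare-bones do-it-yourself scraper in Python might look like the sketch below. The url and the h1 tag are placeholders, and requests and BeautifulSoup are merely one possible choice of libraries; the rest of this post requires no programming at all.

import csv
import requests
from bs4 import BeautifulSoup

# Placeholder url and tag: replace with the page and the elements you're after
url = 'http://example.com/some-page'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# One row per extracted element
rows = [[heading.get_text(strip=True)] for heading in soup.find_all('h1')]

with open('results.csv', 'w') as f:
    csv.writer(f).writerows(rows)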

But first a few remarks:

  • Outwit Hub comes in a free and a paid version; the paid version has more options. As far as I can tell, the most important limitation of the free version is that it will only let you extract 100 records per query. In the examples below, I’ll try to stick to functionality available in the free version.
  • Information on websites may be copyrighted. Using that information for purposes other than personal use (e.g. publishing it) may be a violation of copyright.
  • Webscraping is a messy process. The data you extract may need some cleaning up. More importantly, always do some checks to make sure the scraper is functioning properly. For example, is the number of results you got consistent with what you expected? Check some examples to see if the numbers you get are correct and if they have ended up in the right row and column.

The Outwit Hub app can be downloaded here (it’s also available as a Firefox plugin, but last time I checked it wasn’t compatible with the newest version of Firefox).

Scraping a single webpage

Sometimes, all the information you’re looking for will be available from one single webpage.

Strategy

Out of the box, Outwit Hub comes with a number of preset scrapers. These include scrapers for extracting links, tables and lists. In many cases, it makes sense to simply try Outwit Hub’s tables and lists scrapers to see if that will get you the results you want. It will save you some time, and often the results will be cleaner than when you create your own scraper.

Sometimes, however, you will have to create your own scraper. You do so by telling Outwit Hub which chunks of information it should look for. The output will be presented in the form of a table, so think of the information as cases (units that should each go into one row) and, within those cases, the different types of information you want to retrieve about them (what should go into the different cells within a row).

You tell Outwit Hub what information to look for by defining the «Marker Before» and the «Marker After». For example, you may want to extract the text of a title that is represented as <h1>Chapter One</h1> in the html code. In this case the Marker Before could be <h1> and the Marker After could be </h1>. This would tell Outwit Hub to extract any text between those two markers.
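
Conceptually, this is like grabbing everything between two markers with a regular expression. A rough Python illustration of the idea (this is just an analogy, not how Outwit Hub works internally):

import re

html = '<h1>Chapter One</h1> some text <h1>Chapter Two</h1>'
# Everything between the marker before (<h1>) and the marker after (</h1>)
titles = re.findall(r'<h1>(.*?)</h1>', html)
print(titles)  # ['Chapter One', 'Chapter Two']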

It may take some trial and error to get the markers right. Ideally, they should meet two criteria:

  • They should capture all the instances you want included. For example, if some of the titles you want to extract aren’t h1 titles but h2 titles, the <h1> and </h1> markers will give you incomplete results. Perhaps you could use <h and </h as markers.
  • They should capture as few irrelevant pieces of information as possible. For example, you may find that an interesting piece of information is located between <p> and </p> tags. However, p-tags (used to define paragraphs in a text) may occur a lot on a webpage and you may end up with a lot of irrelevant results. So you may want to try to find markers that more precisely define what you’re looking for.

Example: Bossnappings

Some French workers have resorted to «bossnapping» as a response to mass layoffs during the crisis. If you’re interested in the phenomenon, you can find some information from a paper on the topic summarized here. From a webscraping perspective, this is pretty straightforward: all the information can be found in one table on a single webpage.

The easiest way to extract the information is to use Outwit Hub’s preset «tables» scraper:

Of course, rather than using the preset table scraper, you may want to try to create your own scraper:

Example: Wikipedia Yellow Jerseys table

If you’re interested in riders who won Yellow Jerseys in the Tour de France, you can find statistics on this Wikipedia page. Again, the information is presented in a single table on a single webpage.

Again, the easy way is to use Outwit Hub’s «tables» scraper:

And here’s how you create your own scraper:

Example: the Fall band members

Mark E. Smith of the Fall is a brilliant musician, but he does have a reputation for discarding band members. If you want to analyse the Fall band member turnover, you can find the data here. This time, the data is not in a table structure. The webpage does have a list structure, but the list elements are the descriptions of band members, not their names and the years in which they were band members. So Outwit Hub’s «tables» and «lists» scrapers won’t be much help in this case – you’ll have to create your own scraper.

To extract the information:

Navigating through links on a webpage

In the previous examples, all the information could be found on a single webpage. Often, the information will be spread out over a series of webpages. Hopefully, there will also be a page with links to all the pages that contain the relevant information. Let’s call the page with links the index page and the webpages it links to (where the actual information is to be found) the linked pages.

Strategy

You’ll need a strategy to follow the links on the index page and collect the information from all the linked pages. Here’s how you do it:

  • First visit one of the linked pages and create a scraper to retrieve the information you need from that page.
  • Return to the index page and tell Outwit Hub to extract all the links from that page.
  • Try to filter these links as well as you can to exclude irrelevant links (most webpages contain large numbers of links and most of them are probably irrelevant for your purposes).
  • Tell Outwit Hub to apply the scraper (the one you created for one of the linked pages) to all the linked pages.

Two remarks:

  • Hopefully, all the linked pages have the same structure, but don’t count on it. You’ll need to check if your scraper works properly for all the linked pages.
  • In the output window, make sure to set the catch / empty settings correctly because otherwise Outwit Hub will discard the output collected so far before moving to the next linked page.

Example: Tour de France 2013 stages

We’ll return to the Tour de France Yellow Jersey, but this time we’ll look in more detail into the stages of the 2013 edition. Information can be found on the official webpage of le Tour.

Navigating through multiple pages with links

Same as above, but now the links to the linked pages are not to be found on a single index page, but on a series of index pages.

Strategy

First create a web scraper for one of the linked pages, then collect the links from the index page so you can tell Outwit Hub to apply your scraper to all the linked pages. However, you’ll need one more step before you can tell Outwit Hub to apply the scraper: you’ll need to collect the links from all the index pages, not just the first one. In many cases, Outwit Hub will be able to find out by itself how to move through all the index pages.

Example: Proceedings of Parliament

Suppose you want to analyse how critically Dutch Members of Parliament have been following the Dutch intelligence service AIVD over the past 15 years or so. You can search the questions they have asked with a search query like this, which gives you 206 results, and their urls can be found on a series of 21 index pages (perhaps new questions have been asked since, in which case you’ll get a higher number of results). So the challenge is to create a scraper for one of the linked pages and then get Outwit Hub to apply this scraper to all the links from all 21 index pages.
