champagne anarchist | armchair activist

Python

How to automate extracting tables from PDFs, using Tabula

One of my colleagues needs tables extracted from a few hundred PDFs. There’s an excellent tool called Tabula that I frequently use, but you have to process each PDF manually. However, it turns out you can also automate the process. For those like me who didn’t know, here’s how it works.

Command line tool

You can download tabula-java’s jar here (I had no idea what a jar is, but apparently it’s a format to bundle Java files). You also need a recent version of Java. Note that on a Mac, Terminal may still use an old version of Java even if you have a newer version installed. The problem and how to solve it are discussed here.

For this example, create a project folder and store the jar in a subfolder script. Store the PDFs you want to process in a subfolder data/pdf and create an empty subfolder data/csv.

On a Mac, open Terminal, use cd to navigate to your project folder and run the following code (make sure the version number of the tabula jar is correct):

for i in data/pdf/*.pdf; do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o "${i//pdf/csv}" "$i"; done

On Windows, open the command prompt, use cd to navigate to your project folder and run the following code (again, make sure the version number of the tabula jar is correct):

for %i in (data\pdf\*.pdf) do java -jar script\tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o "data\csv\%~ni.csv" "data\pdf\%~nxi"

The settings you can use are described here. The examples above use the following settings:

  • -n: stands for nospreadsheet; use this if the tables in the PDF don’t have gridlines.
  • -p all: look for tables on all pages of the document. Alternatively, you can specify specific pages.
  • -a (area): the portion of the page to analyse; default is the entire page. You can choose to omit this setting, which may be a good idea when the location or size of tables varies. On the other hand, I’ve had a file where tables from one specific page were not extracted unless I set the area variable. The area is defined by coordinates that you can obtain by analysing one PDF manually with the Tabula app and exporting the result not as csv, but as script.
  • -o: the name of the file to write the csv to.
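
If you build the command from a script, the area value is easy to get wrong. Here’s a minimal Python sketch that formats it, assuming the order top, left, bottom, right (in points) that tabula-java’s help text gives for this setting. The function name area_argument is my own invention, not part of Tabula.

```python
def area_argument(top, left, bottom, right):
    """Format page coordinates (in points) as the value for tabula's -a setting."""
    if bottom <= top or right <= left:
        raise ValueError('expected top < bottom and left < right')
    # Assumed order: top,left,bottom,right, separated by commas.
    return ','.join(str(value) for value in (top, left, bottom, right))


# The coordinates used in the examples above
print(area_argument(29.75, 43.509, 819.613, 464.472))
```

This prints the same string that follows -a in the commands above, so you can paste it in or pass it on to a script.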

In my experience, you may need to tinker a bit with the settings to get the results right. Even so, Tabula will sometimes get the rows right but incorrectly or inconsistently identify cells within a row. You may be able to solve this using regex.
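
To give an idea of what such a regex fix could look like, here’s a small sketch in Python. The rows are made up (they’re not from my colleague’s PDFs) and the pattern assumes each row should contain a text label followed by two numbers, so you’d have to adapt it to your own tables.

```python
import re

# Hypothetical rows: the same three values per row,
# but split into cells differently from row to row.
rows = [
    ['Alpha', '1,200', '950'],
    ['Beta 3,400', '', '3,100'],
    ['Gamma Delta', '2,050 1,990', ''],
]

# A label without digits, followed by two numbers that may
# contain commas as thousands separators.
pattern = re.compile(r'^(\D+?)\s+([\d,]+)\s+([\d,]+)$')


def repair(row):
    """Glue the cells of a row together and split them again with a regex."""
    text = ' '.join(cell for cell in row if cell).strip()
    match = pattern.match(text)
    if match is None:
        return row  # leave rows that don't fit the pattern untouched
    label, first, second = match.groups()
    return [label, int(first.replace(',', '')), int(second.replace(',', ''))]


cleaned = [repair(row) for row in rows]
for row in cleaned:
    print(row)
```

Rows that don’t match the pattern are returned unchanged, so you can inspect those by hand afterwards.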

Python (and R)

There’s a Python wrapper, tabula-py, that will turn PDF tables into Pandas dataframes. As with tabula-java, you need a recent version of Java. Here’s an example of how you can use tabula-py:

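import os
import tabula

pdf_folder = 'data/pdf'
csv_folder = 'data/csv'

# Written for tabula-py as it was in 2017; newer releases may use
# different option names and may return a list of tables.
for fn in os.listdir(pdf_folder):
    if not fn.endswith('.pdf'):
        continue
    pdf_path = os.path.join(pdf_folder, fn)
    # Build the output path from the file name, so a 'pdf' elsewhere in the name is left alone
    csv_path = os.path.join(csv_folder, os.path.splitext(fn)[0] + '.csv')
    df = tabula.read_pdf(pdf_path, encoding='latin1', pages='all', area=[29.75, 43.509, 819.613, 464.472], nospreadsheet=True)
    df.to_csv(csv_path, index=False)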

Using the Python wrapper, I needed to specify the encoding. I ran into a problem when I tried to extract tables with varying sizes from multi-page PDFs. I think it’s the same problem as reported here. From the response, I gather the problem may be addressed in future versions of tabula-py.

For those who use R, there’s also an R wrapper for tabula, tabulizer. I haven’t tried it myself.

Call tabula-java from Python

[Update 2 May 2017] - I realised there’s another way, which is to call tabula-java from Python. Here’s an example:

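import os
import subprocess

pdf_folder = 'data/pdf'
csv_folder = 'data/csv'
# The jar is stored in the subfolder script, as in the command line examples
jar = 'script/tabula-0.9.2-jar-with-dependencies.jar'

for filename in os.listdir(pdf_folder):
    if not filename.endswith('.pdf'):
        continue
    pdf_path = os.path.join(pdf_folder, filename)
    csv_path = os.path.join(csv_folder, os.path.splitext(filename)[0] + '.csv')
    # Passing the arguments as a list avoids quoting problems with spaces in file names.
    # Note that -f TSV produces tab-separated output; leave it out if you want comma-separated files.
    subprocess.run(['java', '-jar', jar, '-n', '-p', 'all', '-f', 'TSV', '-o', csv_path, pdf_path])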

This solves tabula-py’s problem with multi-page PDFs containing tables with varying sizes.

New Python package for downloading and analysing street networks

[Image: stationsplein]

The image above shows square mile diagrams of cyclable routes in the area around the Stationsplein in Amsterdam, The Hague, Rotterdam and Utrecht. I made the maps with OSMnx, a Python package created by Geoff Boeing, a PhD candidate in urban planning at UC Berkeley (via).

Square mile diagrams are a nice gimmick (with practical uses), but they’re just the tip of the iceberg of what OSMnx can do. You can use it to download administrative boundaries (e.g. the outline of Amsterdam) as well as street networks from OpenStreetMap. And you can analyse these networks, for example: assess their density, find out which streets are connections between separate clusters in the network, or show which parts of the city have long or short blocks (I haven’t tried doing network measure calculations yet).

Boeing boasts that his package not only offers functionality that wasn’t (easily) available yet, but also that many tasks can be performed with a single line of code. From what I’ve seen so far, it’s true: the package is amazingly easy to use. All in all, I think this is a great tool.

DuckDuckGo shows code examples

Because of Google’s new privacy warning, I finally changed my default search engine to DuckDuckGo.[1] So far, I’m quite happy with it. I was especially pleased when I noticed they sometimes show code snippets or excerpts from documentation on the results page.

Apparently, DDG has decided that it wants to be «the best search engine for programmers». One feature they’re using is the instant answers that are sometimes shown in addition to the ‘normal’ search results. These instant answers may get their contents from DDG’s own databases (examples include cheat sheets created for the purpose) or they may use external APIs, such as the Stack Overflow API. Currently, volunteers are working to improve search results for the top 15 programming languages, including JavaScript, Python and R.

One could argue that instant answers promote the wrong kind of laziness: copying code from the search results page rather than visiting the original post on Stack Overflow. But for quickly looking up trivial stuff, I think this is perfect.


  1. I assume the contents of the privacy warning could have been reason enough to switch search engines, but what triggered me was the intrusive warning that Google shows in each new browser session, basically punishing you for having your browser throw away cookies.

Amsterdam has room for another 2.1 million bicycle racks

[Image: map]

Amsterdam has a persistent shortage of bicycle racks. Bicycle professor Marco te Brömmelstroet argues that this is really a matter of making choices: the space occupied by four parked cars could easily accommodate 30 bicycle racks.

Amsterdam is a compact city where space is limited. An important goal of the city administration is to create more room for pedestrians and cyclists, but also for green areas.

It so happens that the city of Amsterdam has recently published open data on on-street parking spaces. The data confirms what we already knew: parking spaces for cars occupy a huge amount of public space. The streets of Amsterdam are littered with as many as 265,225 parking spaces. If you exclude the ones with signs (spaces for charging car batteries; car sharing; etcetera), there are still 260,834 of them.

Assuming that each of them could accommodate at least 8 bicycle racks, there’s room for another 2.1 million bicycle racks. Now you probably wouldn’t want to remove all parking spaces and replace them with bicycle racks, but it does illustrate some of the choices that are available regarding the use of public space.

Map detail here.

Method

The open data on on-street parking spaces is available in WFS format which is meant for creating maps but can also be used for downloading data - here’s a Python script that will do the job. I set the location of the parking spaces to the centre of the surrounding envelope.
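
For those wondering what “centre of the surrounding envelope” means in practice, here’s a minimal sketch in plain Python. It’s not the script linked above; the function name and the coordinates are made up for illustration, and it assumes each parking space comes as a list of (x, y) points.

```python
def envelope_centre(coordinates):
    """Return the centre of the bounding box (envelope) around a list of (x, y) points."""
    xs = [x for x, y in coordinates]
    ys = [y for x, y in coordinates]
    return ((min(xs) + max(xs)) / 2, (min(ys) + max(ys)) / 2)


# Hypothetical outline of one parking space, 5 by 2.5 metres
parking_space = [(0.0, 0.0), (5.0, 0.0), (5.0, 2.5), (0.0, 2.5), (0.0, 0.0)]
print(envelope_centre(parking_space))  # (2.5, 1.25)
```

For a rectangular parking space this is the same as the middle of the rectangle; for odd shapes it’s only an approximation, which is good enough for putting a dot on a map.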

I would have liked to display the data on an interactive map using Leaflet and D3.js, but I’m afraid the quarter million data points would crash the browser. Instead I used OSM map data in combination with QGIS to display the parking spaces. Unfortunately, this means you can’t zoom in.

As for the parking space to bicycle rack ratio: I’m assuming a typical parking space takes up 12 to 14 m². Cyclists’ organisation Fietsersbond has calculated that regular bicycle racks take up between 0.84 and 1.18 m² per bicycle. The city of Amsterdam is a bit more conservative and estimates that a bicycle rack takes up about 1.5 m², including the room needed to remove the bicycle. This suggests that the number of bicycle racks that could be created per parking space lies somewhere between 8 and 9.3.
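
The arithmetic can be checked in a few lines of Python. This sketch uses the city’s estimate of 1.5 m² per bicycle and the numbers mentioned in this post; the variable names are mine.

```python
parking_spaces = 260834    # on-street parking spaces without a sign
space_area = (12, 14)      # assumed m² per parking space (low, high)
rack_area = 1.5            # m² per bicycle, estimate by the city of Amsterdam

# Bicycle racks per parking space, low and high estimate
low = space_area[0] / rack_area
high = space_area[1] / rack_area
print(round(low, 1), round(high, 1))  # 8.0 9.3

# Total number of extra racks, using the low estimate of 8 per parking space
extra_racks = parking_spaces * 8
print(extra_racks)  # 2086672
```

Rounded, that’s the 2.1 million from the title.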

Update 3 July 2016 - The city of Nijmegen reckons it can fit as many as 10 bicycle racks on a parking space.
Update 31 January 2017 - And the city of New York needs about nine parking spaces to accommodate 69 city bikes.
And a follow-up (in Dutch).
