champagne anarchist | armchair activist

How to automate extracting tables from PDFs, using Tabula

One of my colleagues needs tables extracted from a few hundred PDFs. There’s an excellent tool called Tabula that I frequently use, but you have to process each PDF manually. However, it turns out you can also automate the process. For those like me who didn’t know, here’s how it works.

Command line tool

You can download tabula-java’s jar here (I had no idea what a jar is, but apparently it’s a format to bundle Java files). You also need a recent version of Java. Note that on a Mac, Terminal may still use an old version of Java even if you have a newer version installed. The problem and how to solve it are discussed here.

For this example, create a project folder and store the jar in a subfolder script. Store the PDFs you want to process in a subfolder data/pdf and create an empty subfolder data/csv.

On a Mac, open Terminal, use cd to navigate to your project folder and run the following code (make sure the version number of the tabula jar is correct):

for i in data/pdf/*.pdf; do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o ${i//pdf/csv} $i; done

On Windows, open the command prompt, use cd to navigate to your project folder and run the following code (again, make sure the version number of the tabula jar is correct):

for %i in (data/pdf/*.pdf) do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o data/csv/%~ni.csv data/pdf/%i

The settings you can use are described here. The examples above use the following settings:

  • -n: stands for nospreadsheet; use this if the tables in the PDF don’t have gridlines.
  • -p all: look for tables in all pages of the document. Alternatively, you can specify specific pages.
  • -a (area): the portion of the page to analyse; default is the entire page. You can choose to omit this setting, which may be a good idea when the location or size of tables varies. On the other hand, I‘ve had a file where tables from one specific page were not extracted unless I set the area variable. The area is defined by coordinates that you can obtain by analysing one PDF manually with the Tabula app and exporting the result not as csv, but as script.
  • -o: the name of the file to write the csv to.

In my experience, you may need to tinker a bit with the settings to get the results right. Even so, Tabula will sometimes get the rows right but incorrectly or inconsistently identify cells within a row. You may be able to solve this using regex.

Python (and R)

There’s a Python wrapper, tabula-py that will turn PDF tables into Pandas dataframes. As with tabula-java, you need a recent version of Java. Here’s an example of how you can use tabula-py:

import tabula
import os
import pandas as pd

folder = 'data/pdf/'
paths = [folder + fn for fn in os.listdir(folder) if fn.endswith('.pdf')]
for path in paths:
    df = tabula.read_pdf(path, encoding = 'latin1', pages = 'all', area = [29.75,43.509,819.613,464.472], nospreadsheet = True)
    path = path.replace('pdf', 'csv')
    df.to_csv(path, index = False)

Using the Python wrapper, I needed to specify the encoding. I ran into a problem when I tried to extract tables with varying sizes from multi-page PDFs. I think it’s the same problem as reported here. From the response, I gather the problem may be addressed in future versions of tabula-py.

For those who use R, there’s also an R wrapper for tabula, tabulizer. I haven’t tried it myself.

Call tabula-java from Python

[Update 2 May 2017] - I realised there’s another way, which is to call tabula-java from Python. Here’s an example:

import os

pdf_folder = 'data/pdf'
csv_folder = 'data/csv'

base_command = 'java -jar tabula-0.9.2-jar-with-dependencies.jar -n -p all -f TSV -o {} {}'

for filename in os.listdir(pdf_folder):
    pdf_path = os.path.join(pdf_folder, filename)
    csv_path = os.path.join(csv_folder, filename.replace('.pdf', '.csv'))
    command = base_command.format(csv_path, pdf_path)
    os.system(command)

This solves tabula-py’s problem with multipage pdf’s containing tables with varying sizes.

Tags: 

Follow this blog:
Twitter (English) | Twitter (Nederlands) | RSS data blog (English) | RSS dirkmjk (Nederlands)