How to automate extracting tables from PDFs, using Tabula

11 April 2017

One of my colleagues needs tables extracted from a few hundred PDFs. There’s an excellent tool called Tabula that I frequently use, but you have to process each PDF manually. However, it turns out you can also automate the process. For those like me who didn’t know, here’s how it works.

Command line tool

You can download tabula-java’s jar here (I had no idea what a jar is, but apparently it’s a format to bundle Java files). You also need a recent version of Java. Note that on a Mac, Terminal may still use an old version of Java even if you have a newer version installed. The problem and how to solve it are discussed here.

For this example, create a project folder and store the jar in a subfolder script. Store the PDFs you want to process in a subfolder data/pdf and create an empty subfolder data/csv.

On a Mac, open Terminal, use cd to navigate to your project folder and run the following code (make sure the version number of the tabula jar is correct):

for i in data/pdf/*.pdf; do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o ${i//pdf/csv} $i; done

On Windows, open the command prompt, use cd to navigate to your project folder and run the following code (again, make sure the version number of the tabula jar is correct):

for %i in (data/pdf/*.pdf) do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o data/csv/%~ni.csv data/pdf/%i

The settings you can use are described here. The examples above use the following settings:

In my experience, you may need to tinker a bit with the settings to get the results right. Even so, Tabula will sometimes get the rows right but incorrectly or inconsistently identify cells within a row. You may be able to solve this using regex.

Python (and R)

There’s a Python wrapper, tabula-py that will turn PDF tables into Pandas dataframes. As with tabula-java, you need a recent version of Java. Here’s an example of how you can use tabula-py:

import tabula
import os
import pandas as pd
folder = 'data/pdf/'
paths = [folder + fn for fn in os.listdir(folder) if fn.endswith('.pdf')]
for path in paths:
    df = tabula.read_pdf(path, encoding = 'latin1', pages = 'all', area = [29.75,43.509,819.613,464.472], nospreadsheet = True)
    path = path.replace('pdf', 'csv')
    df.to_csv(path, index = False)
});

Using the Python wrapper, I needed to specify the encoding. I ran into a problem when I tried to extract tables with varying sizes from multi-page PDFs. I think it’s the same problem as reported here. From the response, I gather the problem may be addressed in future versions of tabula-py.

For those who use R, there’s also an R wrapper for tabula, tabulizer. I haven’t tried it myself.

Call tabula-java from Python

[Update 2 May 2017] - I realised there’s another way, which is to call tabula-java from Python. Here’s an example:

import os
pdf_folder = 'data/pdf'
csv_folder = 'data/csv'
base_command = 'java -jar tabula-0.9.2-jar-with-dependencies.jar -n -p all -f TSV -o {} {}'
for filename in os.listdir(pdf_folder):
    pdf_path = os.path.join(pdf_folder, filename)
    csv_path = os.path.join(csv_folder, filename.replace('.pdf', '.csv'))
    command = base_command.format(csv_path, pdf_path)
    os.system(command)

This solves tabula-py’s problem with multipage pdf’s containing tables with varying sizes.

11 April 2017 | Categories: data, howto, python