How to use Python and Selenium for scraping election results

A while ago, I needed the results of last year’s Lower House election in the Netherlands, by municipality. Dutch election data is available from the website of the Kiesraad (Electoral Board). However, it doesn’t contain a table of results per municipality. You’ll have to collect this information from almost 400 different web pages. This calls for a webscraper.

The Kiesraad website is partly generated using JavaScript (I think) and therefore not easy to scrape, which made this seem like a perfect project to explore Selenium.

What’s Selenium? «Selenium automates browsers. That’s it!» Selenium is primarily a tool for testing web applications. However, as a tutorial by Thiago Marzagão explains, it can also be used for webscraping:

[S]ome websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work.

Selenium can be used with Python. Instructions to install Selenium are here. You also have to download chromedriver or another driver; you may store it in /usr/local/bin/.
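
By the way, if you'd rather not put the driver on your PATH, you can also point Selenium at its location when you create the browser. A minimal sketch for the Selenium 3 API, assuming chromedriver was indeed stored in /usr/local/bin/:

from selenium import webdriver

# Point Selenium to the chromedriver binary explicitly
# (only needed if the driver isn't on your PATH)
browser = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')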

Once you have everything in place, this is how you launch the driver and load a page:

from selenium import webdriver
 
URL = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315'
 
browser = webdriver.Chrome()
browser.get(URL)

This will open a new browser window. You can use either XPath or CSS selectors to find elements and then interact with them. For example, find a dropdown menu, identify the options in the menu and select the second one:

XPATH_PROVINCES = '//*[@id="search"]/div/div[1]/div'
element = browser.find_element_by_xpath(XPATH_PROVINCES)
options = element.find_elements_by_tag_name('option')
options[1].click()

If you check the page source of the web page, you won't find the options of the dropdown menu; they're added afterwards. With Selenium, you needn't worry about that - it will load the options for you.

Well, actually, there's a bit more to it: you can't find and select the options until they've actually loaded. The options probably won't be in place initially, so you'll need to wait a bit and retry.

Selenium comes with functions that specify what it should wait for, and how long it should wait and retry before it throws an error. But this isn’t always straightforward, as Marzagão explains:

Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. […] This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.

I ran into pretty similar problems when I tried to scrape the Kiesraad website. I tried many variations of the built-in wait parameters, but without any success. In the end I decided to write a few custom functions for the purpose.
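
For reference, those built-in explicit waits look roughly like this (a minimal sketch using WebDriverWait; the timeout is arbitrary and the xpath is the provinces dropdown from the earlier example):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the provinces dropdown to be present;
# raises a TimeoutException if it never shows up
wait = WebDriverWait(browser, 10)
element = wait.until(EC.presence_of_element_located((By.XPATH, XPATH_PROVINCES)))

A condition like this only checks that the dropdown element is present, not that its options have finished loading - which is roughly the problem the custom function below works around.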

The example below looks up the options of a dropdown menu. As long as the number of options isn’t greater than 1 (the page initially loads with only one option, a dash, and other options are loaded subsequently), it will wait a few seconds and try again - until more options are found or until a maximum number of tries has been reached.

import time

MAX_TRIES = 15

def count_options(xpath, browser):
    # Wait for the page to load, then keep checking the dropdown
    # until it contains more than one option or MAX_TRIES is reached
    time.sleep(3)
    count = 0
    tries = 0
    while tries < MAX_TRIES:
        try:
            element = browser.find_element_by_xpath(xpath)
            count = len(element.find_elements_by_tag_name('option'))
            if count > 1:
                return count
        except Exception:
            # Element not available yet; wait and retry
            pass
        time.sleep(1)
        tries += 1
    return count
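
To show how this fits in with the earlier snippet, you could use it along these lines (the xpath is the provinces dropdown from before):

# Wait until the provinces dropdown has been populated, then select an option
if count_options(XPATH_PROVINCES, browser) > 1:
    element = browser.find_element_by_xpath(XPATH_PROVINCES)
    options = element.find_elements_by_tag_name('option')
    options[1].click()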

Here's a script that will download and save the result pages of all municipalities for the March 2017 Lower House election, parse the html, and store the results as a csv file. Run it from a subfolder in your project folder.

Notes

Dutch election results are provided by the Kiesraad as open data. The Kiesraad website used to provide a csv with the results of all municipalities, but this option is no longer available. Alternatively, you can download a dataset for each municipality, but at least for 2017, municipalities use different formats.

Scraping the Kiesraad website appears to be the only way to get uniform data per municipality.

Since I originally wrote the scraper, the Kiesraad website has been changed. As a result, it would now be possible to scrape the site in a much easier way, and there would be no need to use Selenium. The source code of the landing page for an election contains a dictionary with id numbers for all the municipalities. With those id numbers, you can create urls for their result pages. No clicking required.

How to do fuzzy matching in Python

Statistics Netherlands (CBS) has an interesting dataset containing data at the city, district and neighbourhood levels. However, some names of neighbourhoods have changed, specifically between 2010 and 2011 for Amsterdam. For example, Bijlmer-Centrum D, F en H was renamed Bijlmer-Centrum (D, F, H).

In some of those cases the neighbourhood codes have changed as well, and CBS doesn’t have conversion tables. So this is one of those cases where you need fuzzy string matching.

There’s a good Python library for that job: Fuzzywuzzy. It was developed by SeatGeek, a company that scrapes event data from a variety of websites and needed a way to figure out which titles refer to the same event, even if the names have typos and other inconsistencies.

Fuzzywuzzy will compare two strings and compute a score between 0 and 100 reflecting how similar they are. It can use different methods to calculate that score (e.g. fuzz.ratio(string_1, string_2) or fuzz.partial_ratio(string_1, string_2)). Some of those methods are described in this article, which is worth a read.
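
For example, using the renamed neighbourhood mentioned above (the exact scores will depend on the Fuzzywuzzy version):

from fuzzywuzzy import fuzz

# Both calls return an integer score between 0 and 100
print(fuzz.ratio('Bijlmer-Centrum D, F en H', 'Bijlmer-Centrum (D, F, H)'))
print(fuzz.partial_ratio('Bijlmer-Centrum D, F en H', 'Bijlmer-Centrum (D, F, H)'))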

Alternatively, you can take a string and have Fuzzywuzzy pick the best match(es) from a list of options (e.g., process.extract(string, list_of_strings, limit=3) or process.extractOne(string, list_of_strings)). Here, too, you could specify the method to calculate the score, but you may want to first try the default option (WRatio), which will figure out which method to use. The default option seems to work pretty well.

Here’s the code I used to match the 2010 CBS Amsterdam neighbourhood names to those for 2011:

import pandas as pd
from fuzzywuzzy import process
 
# Prepare data
 
colnames = ['name', 'level', 'code']
 
data_2010 = pd.read_excel('../data/Kerncijfers_wijken_e_131017211256.xlsx', skiprows=4)
data_2010.columns = colnames
data_2010 = data_2010[data_2010.level == 'Buurt']
names_2010 = data_2010['name']
 
data_2011 = pd.read_excel('../data/Kerncijfers_wijken_e_131017211359.xlsx', skiprows=4)
data_2011.columns = colnames
data_2011 = data_2011[data_2011.level == 'Buurt']
names_2011 = data_2011['name']
 
# Actual matching
 
recode = {}
for name in names_2010:
    best_match = process.extractOne(name, names_2011)
    if best_match[1] < 100:
        print(name, best_match)
    recode[name] = best_match[0]
 

It prints all matches with a score below 100 so you can inspect them in case there are any incorrect matches (with larger datasets this may not be feasible). With the process option I didn’t get any incorrect matches, but with fuzz.partial_ratio, IJplein en Vogelbuurt was matched with Vondelbuurt instead of Ijplein/Vogelbuurt.

PS In case you’re actually going to work with the local CBS data, you should know that Amsterdam’s neighbourhoods (buurten) were reclassified as districts (wijken) in 2016, when a more detailed set of neighbourhoods was introduced. You can translate 2015 neighbourhood codes to 2016 district codes:

def convert_code(x):
    # Replace the 'BU' prefix of the neighbourhood code with 'WK'
    x = 'WK' + x[2:]
    # Keep the four-digit municipality part plus the last two digits
    x = x[:6] + x[-2:]
    return x
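
For instance (the code below is a made-up example of the usual 'BU' plus eight digits format, just to show the transformation):

print(convert_code('BU03630000'))
# 'WK036300'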

Python script to import .sps files

In a post about voting locations (in Dutch), I grumbled a bit about inconsistencies in how Statistics Netherlands (CBS) spells the names of municipalities, and about the fact that they don't include municipality codes in their data exports. This afternoon, someone who works at CBS responded on Twitter. She had asked around and found a workaround: download the data as SPSS. Thanks!

CBS offers the option to download data as an SPSS syntax file (.sps). I wasn't familiar with the filetype, I don't have SPSS, and I couldn't immediately find a package to import it. But it turns out that .sps files are just text files, so I wrote a little script that does the job.

Note that it’s not super fast; there may be more efficient ways to do the job. Also, I’ve only tested it on a few CBS data files. I’m not sure it’ll work correctly if all variables have labels or if the file contains not just data but also statistical analysis.

That said, you can find the script here.
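
To give a rough idea of the approach: an SPSS syntax file with inline data keeps its values between BEGIN DATA and END DATA lines, so a minimal version could look something like this (a sketch, not the actual script linked above; it assumes whitespace-separated values and ignores the variable definitions):

import pandas as pd

def read_sps(path):
    rows = []
    in_data = False
    with open(path) as f:
        for line in f:
            stripped = line.strip()
            if stripped.upper().startswith('END DATA'):
                break
            if in_data and stripped:
                rows.append(stripped.split())
            elif stripped.upper().startswith('BEGIN DATA'):
                in_data = True
    return pd.DataFrame(rows)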

How to automate extracting tables from PDFs, using Tabula

One of my colleagues needs tables extracted from a few hundred PDFs. There’s an excellent tool called Tabula that I frequently use, but you have to process each PDF manually. However, it turns out you can also automate the process. For those like me who didn’t know, here’s how it works.

Command line tool

You can download tabula-java’s jar here (I had no idea what a jar is, but apparently it’s a format to bundle Java files). You also need a recent version of Java. Note that on a Mac, Terminal may still use an old version of Java even if you have a newer version installed. The problem and how to solve it are discussed here.

For this example, create a project folder and store the jar in a subfolder script. Store the PDFs you want to process in a subfolder data/pdf and create an empty subfolder data/csv.

On a Mac, open Terminal, use cd to navigate to your project folder and run the following code (make sure the version number of the tabula jar is correct):

for i in data/pdf/*.pdf; do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o "${i//pdf/csv}" "$i"; done

On Windows, open the command prompt, use cd to navigate to your project folder and run the following code (again, make sure the version number of the tabula jar is correct):

for %i in (data/pdf/*.pdf) do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o data/csv/%~ni.csv data/pdf/%i

The settings you can use are described here. The examples above use the following settings:

  • -n: stands for nospreadsheet; use this if the tables in the PDF don’t have gridlines.
  • -p all: look for tables in all pages of the document. Alternatively, you can specify specific pages.
  • -a (area): the portion of the page to analyse; the default is the entire page. You can choose to omit this setting, which may be a good idea when the location or size of tables varies. On the other hand, I’ve had a file where tables from one specific page were not extracted unless I set the area option. The area is defined by coordinates that you can obtain by analysing one PDF manually with the Tabula app and exporting the result not as csv, but as script.
  • -o: the name of the file to write the csv to.

In my experience, you may need to tinker a bit with the settings to get the results right. Even so, Tabula will sometimes get the rows right but incorrectly or inconsistently identify cells within a row. You may be able to solve this using regex.
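
For example, if Tabula returns a row where several columns ended up in a single field, something along these lines may help (the sample string is made up, just to illustrate the idea):

import re

# Made-up example of a row where Tabula merged several cells into one field;
# split on runs of two or more whitespace characters
merged = 'Amsterdam   12345   6789'
cells = re.split(r'\s{2,}', merged)
# ['Amsterdam', '12345', '6789']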

Python (and R)

There’s a Python wrapper, tabula-py, that will turn PDF tables into Pandas dataframes. As with tabula-java, you need a recent version of Java. Here’s an example of how you can use tabula-py:

import os

import tabula

folder = 'data/pdf/'
paths = [folder + fn for fn in os.listdir(folder) if fn.endswith('.pdf')]
for path in paths:
    # Extract the tables from the pdf into a dataframe
    df = tabula.read_pdf(path, encoding='latin1', pages='all',
                         area=[29.75, 43.509, 819.613, 464.472], nospreadsheet=True)
    # Write the result to data/csv/ with a .csv extension
    path = path.replace('pdf', 'csv')
    df.to_csv(path, index=False)

Using the Python wrapper, I needed to specify the encoding. I ran into a problem when I tried to extract tables with varying sizes from multi-page PDFs. I think it’s the same problem as reported here. From the response, I gather the problem may be addressed in future versions of tabula-py.

For those who use R, there’s also an R wrapper for tabula, tabulizer. I haven’t tried it myself.

Call tabula-java from Python

[Update 2 May 2017] - I realised there’s another way, which is to call tabula-java from Python. Here’s an example:

import os

pdf_folder = 'data/pdf'
csv_folder = 'data/csv'

base_command = 'java -jar tabula-0.9.2-jar-with-dependencies.jar -n -p all -f TSV -o {} {}'

for filename in os.listdir(pdf_folder):
    if not filename.endswith('.pdf'):
        continue
    pdf_path = os.path.join(pdf_folder, filename)
    csv_path = os.path.join(csv_folder, filename.replace('.pdf', '.csv'))
    # Run tabula-java on each pdf and write the extracted table
    command = base_command.format(csv_path, pdf_path)
    os.system(command)

This solves tabula-py’s problem with multipage PDFs containing tables with varying sizes.

New Python package for downloading and analysing street networks

[Image: square mile diagrams of cyclable routes around the Stationsplein in Amsterdam, the Hague, Rotterdam and Utrecht]

The image above shows square mile diagrams of cyclable routes in the area around the Stationsplein in Amsterdam, the Hague, Rotterdam and Utrecht. I made the maps with OSMnx, a Python package created by Geoff Boeing, a PhD candidate in urban planning at UC Berkeley (via).

Square mile diagrams are a nice gimmick (with practical uses), but they’re just the tip of the iceberg of what OSMnx can do. You can use it to download administrative boundaries (e.g. the outline of Amsterdam) as well as street networks from Open Street Map. And you can analyse these networks, for example: assess their density, find out which streets are connections between separate clusters in the network, or show which parts of the city have long or short blocks (I haven’t tried doing network measure calculations yet).
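
A minimal sketch of what that looks like in practice (the place name and network type are just examples, and function names may differ slightly between OSMnx versions):

import osmnx as ox

# Download the cyclable street network of a place from OpenStreetMap
G = ox.graph_from_place('Amsterdam, Netherlands', network_type='bike')

# Plot the network and calculate basic measures,
# such as node/edge counts and average street length
ox.plot_graph(G)
stats = ox.basic_stats(G)
print(stats)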

Boeing boasts that his package not only offers functionality that wasn’t (easily) available before, but also that many tasks can be performed with a single line of code. From what I’ve seen so far, that’s true: the package is amazingly easy to use. All in all, I think this is a great tool.
