champagne anarchist | armchair activist

howTo

How to use Python and Selenium for scraping election results

A while ago, I needed the results of last year’s Lower House election in the Netherlands, by municipality. Dutch election data is available from the website of the Kiesraad (Electoral Board). However, it doesn’t contain a table of results per municipality. You’ll have to collect this information from almost 400 different web pages. This calls for a webscraper.

The Kiesraad website is partly generated using javascript (I think) and therefore not easy to scrape. For this reason, this seemed like a perfect project to explore Selenium.

What’s Selenium? «Selenium automates browsers. That’s it!» Selenium is primarily a tool for testing web applications. However, as a tutorial by Thiago Marzagão explains, it can also be used for webscraping:

[S]ome websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work.

Selenium can be used with Python. Instructions to install Selenium are here. You also have to download chromedriver or another driver; you may store it in /usr/local/bin/.

Once you have everything in place, this is how you launch the driver and load a page:

from selenium import webdriver
 
URL = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315'
 
browser = webdriver.Chrome()
browser.get(URL)

This will open a new browser window. You can use either xpath or css selectors to find elements and then interact with them. For example, find a dropdown menu, identify the options from the menu and select the second one:

XPATH_PROVINCES = '//*[@id="search"]/div/div[1]/div'
element = browser.find_element_by_xpath(XPATH_PROVINCES)
options = element.find_elements_by_tag_name('option')
options[1].click()

If you’d check the page source of the web page, you wouldn’t find the options of the dropdown menu; they’re added afterwards. With Selenium, you needn’t worry about that - it will load the options for you.

Well, actually, there’s a bit more to it: you can’t find and select the options until they’ve actually loaded. Likely, the options won’t be in place initially, so you’ll need to wait a bit and retry.

Selenium comes with functions that specify what it should wait for, and how long it should wait and retry before it throws an error. But this isn’t always straightforward, as Marzagão explains:

Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. […] This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.

I ran into pretty similar problems when I tried to scrape the Kiesraad website. I tried many variations of the built-in wait parameters, but without any success. In the end I decided to write a few custom functions for the purpose.

The example below looks up the options of a dropdown menu. As long as the number of options isn’t greater than 1 (the page initially loads with only one option, a dash, and other options are loaded subsequently), it will wait a few seconds and try again - until more options are found or until a maximum number of tries has been reached.

MAX_TRIES = 15
 
def count_options(xpath, browser):
 
    time.sleep(3)
    tries = 0
    while tries < MAX_TRIES:
 
        try:
            element = browser.find_element_by_xpath(xpath)
            count = len(element.find_elements_by_tag_name('option'))
            if count > 1:
                return count
        except:
            pass
 
        time.sleep(1)
        tries += 1
    return count

Here’s a script that will download and save the result pages of all cities for the March 2017 Lower House election, parse the html, and store the results as a csv file. Run it from a subfolder in your project folder.

Notes

Dutch election results are provided by the Kiesraad as open data. In the past, the Kiesraad website used to provide a csv with the results of all the municipalities, but this option is no longer available. Alternatively, a download is available of datasets for each municipality, but at least for 2017, municipalities use different formats.

Scraping the Kiesraad website appears to be the only way to get uniform data per municipality.

Since I originally wrote the scraper, the Kiesraad website has been changed. As a result, it would now be possible to scrape the site in a much easier way, and there would be no need to use Selenium. The source code of the landing page for an election contains a dictionary with id numbers for all the municipalities. With those id numbers, you can create urls for their result pages. No clicking required.

Embedding tweets in Leaflet popups

I just created a map showing where so-called Biro’s (small cars) are parked on the pavement and annoying people. Twitter has quite a few photos of the phenomenon. In some cases, finding their location took a bit of detective work.

First you’ll need the embed code for the tweet. You can get it manually from the Twitter website, but if you want to automate your workflow, use a url like the one below. It’ll download a bit of json containing the embed code:

https://publish.twitter.com/oembed?url=https://twitter.com/nieuwsamsterdam/status/958761072214896640

When trying to embed the tweets in Leaflet popups, I ran into a few problems:

  • When popups open, the markers didn’t properly move down. As a result, most of the popup would be outside the screen. The problem and how to solve it are described here.
  • Twitter embed code contains a script tag to load a widget. Apparently you can’t execute javascript by adding it directly to the html for the popup content, but you can add it using a selector (cf here).

Here’s the code that’ll solve both problems:

map.on('popupopen', function(e) {
    $.getScript("https://platform.twitter.com/widgets.js");
    var px = map.project(e.popup._latlng); 
    px.y -= e.popup._container.clientHeight;
    map.panTo(map.unproject(px),{animate: true});
});

You may also want to do something about the width of the popups, because otherwise they will obscure most of the map on mobile screens and it will be difficult to close a popup (which you can normally do by clicking outside of the popup). You can change the width of embedded tweets, but this will not change the width of the popup itself. A simple solution is to give popups a maxWidth of 215 (.bindPopup(html, {maxWidth: 215})).

Of course, you could also vary maxWidth depending on screen width, but I think 215px works well on all screens. Further, embedded tweets appear to have a minimum width of about 200px, so if you want popups narrower than 215px you’ll have to figure out a way to fix that.

If you embed tweets, Twitter can track people who visit your webpage. Add <meta name="twitter:dnt" content="on"> to your page and Twitter promises they won’t track your visitors. I wasn’t sure whether this should be put in the web page itself or in the html content of the popups (I opted for both).

If the popups have a somewhat spartan look and do not contain photos: Good for you! You’re probably using something like Firefox with tracking protection enabled. This blocks sites which have been identified as ‘engaging in cross-site tracking of users’ - including, apparently, platform.twitter.com.

How to make a d3js force layout stay within the chart area - even with multiple components

For a post on analysing networks of corporate control, I wanted to create some network graphs with d3.js. The new edition of Scott Murray’s great book on d3.js, which is updated to version 4, contains a good example to get you started. However, I was still struggling with some practical issues, as the chart below illustrates (reload the page to see the problem develop).

A large part of the graph drifts out of the chart area, and the problem only gets worse on a mobile screen. But I figured out some sort of solution.

As Murray explains, you can vary the strength value of the force layout. Positive values attract, negative values repel. The default value is –30. You could set d3.forceManyBody().strength(-3) to create a more compact graph.

Of course, the ideal setting will depend on screen size. You could vary the strength value according to screen width. While you’re at it, you may also want to vary the radius of nodes and the stroke-width of edges. For example with something like this:

if(w > 380){
    var strength = -3;
    var r = 3;
    var sw = 0.3;
}
else{
    var strength = -1;
    var r = 3;
    var sw = 0.15;
}

Now this may make the graph more compact, but it doesn’t solve one specific problem: components not connected to the rest of the chart will still drift out of the chart area. In my example, there are four components: a large one, and three pairs of nodes that are only connected to each other and not to the rest of the graph.

The way in which I dealt with this was to create four different graphs and attach the small components to a forceCenter at the margin of the chart area. For example, d3.forceCenter().x(0.1 * w).y(0.9 * h)) will put one of them in the bottom left corner. Here’s the result:

It’s still a lot of code - I can’t help feeling there should be a more efficient way to do this. Also, it’s slightly weird that the small components immediately freeze, whereas the large one takes its time to develop into its final shape. And the text labels could be improved. But at least it seems to work.

How to automate extracting tables from PDFs, using Tabula

One of my colleagues needs tables extracted from a few hundred PDFs. There’s an excellent tool called Tabula that I frequently use, but you have to process each PDF manually. However, it turns out you can also automate the process. For those like me who didn’t know, here’s how it works.

Command line tool

You can download tabula-java’s jar here (I had no idea what a jar is, but apparently it’s a format to bundle Java files). You also need a recent version of Java. Note that on a Mac, Terminal may still use an old version of Java even if you have a newer version installed. The problem and how to solve it are discussed here.

For this example, create a project folder and store the jar in a subfolder script. Store the PDFs you want to process in a subfolder data/pdf and create an empty subfolder data/csv.

On a Mac, open Terminal, use cd to navigate to your project folder and run the following code (make sure the version number of the tabula jar is correct):

for i in data/pdf/*.pdf; do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o ${i//pdf/csv} $i; done

On Windows, open the command prompt, use cd to navigate to your project folder and run the following code (again, make sure the version number of the tabula jar is correct):

for %i in (data/pdf/*.pdf) do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o data/csv/%~ni.csv data/pdf/%i

The settings you can use are described here. The examples above use the following settings:

  • -n: stands for nospreadsheet; use this if the tables in the PDF don’t have gridlines.
  • -p all: look for tables in all pages of the document. Alternatively, you can specify specific pages.
  • -a (area): the portion of the page to analyse; default is the entire page. You can choose to omit this setting, which may be a good idea when the location or size of tables varies. On the other hand, I‘ve had a file where tables from one specific page were not extracted unless I set the area variable. The area is defined by coordinates that you can obtain by analysing one PDF manually with the Tabula app and exporting the result not as csv, but as script.
  • -o: the name of the file to write the csv to.

In my experience, you may need to tinker a bit with the settings to get the results right. Even so, Tabula will sometimes get the rows right but incorrectly or inconsistently identify cells within a row. You may be able to solve this using regex.

Python (and R)

There’s a Python wrapper, tabula-py that will turn PDF tables into Pandas dataframes. As with tabula-java, you need a recent version of Java. Here’s an example of how you can use tabula-py:

import tabula
import os
import pandas as pd
 
folder = 'data/pdf/'
paths = [folder + fn for fn in os.listdir(folder) if fn.endswith('.pdf')]
for path in paths:
    df = tabula.read_pdf(path, encoding = 'latin1', pages = 'all', area = [29.75,43.509,819.613,464.472], nospreadsheet = True)
    path = path.replace('pdf', 'csv')
    df.to_csv(path, index = False)

Using the Python wrapper, I needed to specify the encoding. I ran into a problem when I tried to extract tables with varying sizes from multi-page PDFs. I think it’s the same problem as reported here. From the response, I gather the problem may be addressed in future versions of tabula-py.

For those who use R, there’s also an R wrapper for tabula, tabulizer. I haven’t tried it myself.

Call tabula-java from Python

[Update 2 May 2017] - I realised there’s another way, which is to call tabula-java from Python. Here’s an example:

import os
 
pdf_folder = 'data/pdf'
csv_folder = 'data/csv'
 
base_command = 'java -jar tabula-0.9.2-jar-with-dependencies.jar -n -p all -f TSV -o {} {}'
 
for filename in os.listdir(pdf_folder):
    pdf_path = os.path.join(pdf_folder, filename)
    csv_path = os.path.join(csv_folder, filename.replace('.pdf', '.csv'))
    command = base_command.format(csv_path, pdf_path)
    os.system(command)

This solves tabula-py’s problem with multipage pdf’s containing tables with varying sizes.

Embedding D3.js charts in a responsive website

UPDATE - better approach here.

For a number of reasons, I like to use D3.js for my charts. However, I’ve been struggling for a while to get them to behave properly on my blog which has a responsive theme. I’ve tried quite a few solutions from Stack Overflow and elsewhere but none seemed to work.

I want to embed the chart using an iframe. The width of the iframe should adapt to the column width and the height to the width of the iframe, maintaining the aspect ratio of the chart. The chart itself should fill up the iframe. Preferably, when people rotate their phone, the size of the iframe and its contents should update without the need to reload the entire page.

Styling the iframe

Smashing Magazine has described a solution for embedding videos. You enclose the iframe in a div and use css to add a padding of, say, 40% to that div (the percentage depending on the aspect ratio you want). You can then set both width and height of the iframe itself to 100%. Here’s an adapted version of the code:

<style>
.container_chart_1 {
    position: relative;
    padding-bottom: 40%;
    height: 0;
    overflow: hidden;
}
 
.container_chart_1 iframe {
    position: absolute;
    top:0;
    left: 0;
    width: 100%;
    height: 100%;
}
</style>
 
<div class ='container_chart_1'>
<iframe src='https://dirkmjk.nl/2016/embed_d3/chart_1.html' frameborder='0' scrolling = 'no' id = 'iframe_chart_1'>
</iframe>
</div>

Making the chart adapt to the iframe size

The next question is how to make the D3 chart adapt to the dimensions of the iframe. Here’s what I thought might work but didn’t: in the chart, obtain the dimensions of the iframe using window.innerWidth and window.innerHeight (minus 16px - something to do with scrollbars apparently?) and use those to define the size of your chart.

Using innerWidth and innerHeight seemed to work - until I tested it on my iPhone. Upon loading a page it starts out OK, but then the update function increases the size of the chart until only a small detail is visible in the iframe (rotate your phone to replicate this). Apparently, iOS returns not the dimensions of the iframe but something else when innerWidth and innerHeight are used. I didn’t have that problem when I tested on an Android phone.

Adapt to the iframe size: Alternative solution

Here’s an alternative approach for making the D3 chart adapt to the dimensions of the iframe. Set width to the width of the div that the chart is appended to (or to the width of the body) and set height to width * aspect ratio. Here’s the relevant code:

var aspect_ratio = 0.4;
var frame_width = $('#chart_2').width();
var frame_height = aspect_ratio * frame_width;

The disadvantage of this approach is that you’ll have to set the aspect ratio in two places: both in the css for the div containing the iframe and in the html-page that is loaded in the iframe. So if you decide to change the aspect ratio, you’ll have to change it in both places. Other than that, it appears to work.

Reloading the chart upon window resize

Then write a function that reloads the iframe content upon window resize, so as to adapt the size of the chart when people rotate their phone. Note that on mobile devices, scrolling may trigger the window resize. You don’t want to reload the contents of the iframe each time someone scrolls the page. To prevent this, you may add a check whether the window width has changed (a trick I picked up here). Also note that with Drupal, you need to use jQuery instead of $.

width = jQuery(window).width;
jQuery(window).resize(function(){
    if(jQuery(window).width() != width){
        document.getElementById('iframe_chart_1').src = document.getElementById('iframe_chart_1').src;
        width = jQuery(window).width;
    }
});

In case you know a better way - do let me know!

FYI, here’s the chart used as illustration in its original context.

Pages