How to use Python and Selenium for scraping election results
A while ago, I needed the results of last year’s Lower House election in the Netherlands, by municipality. Dutch election data is available from the website of the Kiesraad (Electoral Board). However, it doesn’t contain a table of results per municipality. You’ll have to collect this information from almost 400 different web pages. This calls for a webscraper.
The Kiesraad website is partly generated using javascript (I think) and therefore not easy to scrape. For this reason, this seemed like a perfect project to explore Selenium.
What’s Selenium? «Selenium automates browsers. That’s it!» Selenium is primarily a tool for testing web applications. However, as a tutorial by Thiago Marzagão explains, it can also be used for webscraping:
[S]ome websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work.
Selenium can be used with Python. Instructions to install Selenium are here. You also have to download chromedriver or another driver; you may store it in /usr/local/bin/.
Once you have everything in place, this is how you launch the driver and load a page:
from selenium import webdriver
URL = '<a href="https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315">https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315</a>'
browser = webdriver.Chrome()
browser.get(URL)
XPATH_PROVINCES = '//*[@id="search"]/div/div[1]/div'
element = browser.find_element_by_xpath(XPATH_PROVINCES)
options = element.find_elements_by_tag_name('option')
options[1].click()
Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. […] This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.I ran into pretty similar problems when I tried to scrape the Kiesraad website. I tried many variations of the built-in wait parameters, but without any success. In the end I decided to write a few custom functions for the purpose. The example below looks up the options of a dropdown menu. As long as the number of options isn’t greater than 1 (the page initially loads with only one option, a dash, and other options are loaded subsequently), it will wait a few seconds and try again - until more options are found or until a maximum number of tries has been reached.
MAX_TRIES = 15
def count_options(xpath, browser):
time.sleep(3)
tries = 0
while tries < MAX_TRIES
try:
element = browser.find_element_by_xpath(xpath)
count = len(element.find_elements_by_tag_name('option'))
if count > 1:
return count
except:
pass
time.sleep(1)
tries += 1
return count
Here’s a script that will download and save the result pages of all cities for the March 2017 Lower House election, parse the html, and store the results as a csv file. Run it from a subfolder in your project folder.
UPDATE 23 February 2019 - improved version of the script here.
Notes
Dutch election results are provided by the Kiesraad as open data. In the past, the Kiesraad website used to provide a csv with the results of all the municipalities, but this option is no longer available. Alternatively, a download is available of datasets for each municipality, but at least for 2017, municipalities use different formats.
Scraping the Kiesraad website appears to be the only way to get uniform data per municipality.
Since I originally wrote the scraper, the Kiesraad website has been changed. As a result, it would now be possible to scrape the site in a much easier way, and there would be no need to use Selenium. The source code of the landing page for an election contains a dictionary with id numbers for all the municipalities. With those id numbers, you can create urls for their result pages. No clicking required.