
Converting Election Markup Language (EML) to csv

(This post originally opened with a map, but the map isn’t really a good illustration here: I used a different data source to create it.)

Getting results of Dutch elections at the municipality level can be complicated, but what if you want to dig a little deeper and look at results per polling station? Or even per candidate, per polling station? For elections since 2009, that information is available from the data portal of the Dutch government.

Challenges

The data is in Election Markup Language (EML), an international standard for election data. I didn’t know the format, and processing the data posed a bit of a challenge. I couldn’t find a simple explanation of the data structure, and the Electoral Board states that it doesn’t provide support on the format.

For example, how do you connect a candidate ID to their name and other details? I think you need to identify the Kieskring (district) by the contest name of the results file. Then, find the candidate list for the Kieskring and look up the candidate’s details using their candidate ID and affiliation. But with municipal elections, you have to look up candidates in the city’s candidate list (which doesn’t seem to have a contest name).
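To make that lookup concrete, here’s a minimal sketch using Python’s standard library. It assumes simplified, namespace-free tag names; real Dutch EML files use XML namespaces and somewhat different structures, so treat the element names (and the filename) below as assumptions to verify against the actual files.

import xml.etree.ElementTree as ET

# hypothetical filename; build a lookup from (affiliation id, candidate id)
# to candidate name from a candidate-list file
tree = ET.parse('kandidatenlijst.xml')
candidates = {}
for affiliation in tree.iter('Affiliation'):
    aff_id = affiliation.find('AffiliationIdentifier').get('Id')
    for candidate in affiliation.iter('Candidate'):
        cand_id = candidate.find('CandidateIdentifier').get('Id')
        # names can be split over several child elements, so join all text
        name = ' '.join(candidate.find('CandidateFullName').itertext())
        candidates[(aff_id, cand_id)] = ' '.join(name.split())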

Practical tips

If you plan to use the data, here are some practical tips:

  • Keep in mind that locations and names of polling stations may change between elections.
  • If you want to geocode the polling stations, the easiest way is to use the postcode, which is often appended to the polling station name (for recent elections only; see the sketch after this list). If the postcode is not available, or if you need a more precise location, the lists of polling station names and locations provided by Open State (2017, 2018) may be of use. Use fuzzy matching on the polling station name, or perhaps match on the postcode if available. Of course, such an approach is not entirely error-free.
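Extracting the postcode is straightforward, since Dutch postcodes always consist of four digits plus two letters. A minimal sketch (the station name in the example is made up):

import re

# Dutch postcodes: four digits, an optional space, two capital letters
POSTCODE_RE = re.compile(r'\b(\d{4})\s?([A-Z]{2})\b')

def extract_postcode(station_name):
    match = POSTCODE_RE.search(station_name)
    return f'{match.group(1)} {match.group(2)}' if match else None

extract_postcode('Stembureau Het Anker (1011 AB)')  # -> '1011 AB'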

Further, note that the data for the 2017 Lower House election is only available in EML format for some of the municipalities. I suspect this has to do with the fact that, prior to the election, vulnerabilities had been discovered in the vote-counting software, so the votes had to be counted manually.

Python script

Here’s a Python script that converts EML files to csv; see the caveats described there.

The orientation of Amsterdam’s streets

Eight days from now, Amsterdam will have a new metro line traversing the city from north to south. But what about the orientation of the city’s streets?

Geoff Boeing - who created a Python package for analysing street networks using data from OpenStreetMap - just published a series of polar histograms of American and ‘world’ cities. Amsterdam isn’t among them, but Boeing made his code available, so I used that to create charts for the largest cities in the Netherlands.

While the pattern isn’t nearly as monotonous as in most American cities, I’m still surprised how many streets in Amsterdam run from north to south or from east to west. The Hague has a strong diagonal orientation; Rotterdam doesn’t seem to have a dominant orientation, and Utrecht is a bit in between.

With Boeing’s code, you can also do the analysis specifically for roads that are accessible to cyclists, but for Amsterdam that doesn’t make much difference since most roads are.
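If you want to reproduce this for another city, the gist of the method fits in a few lines of osmnx and matplotlib. A minimal sketch (function names as in the osmnx versions of that period; newer releases have moved some of them into submodules):

import matplotlib.pyplot as plt
import numpy as np
import osmnx as ox

# download the street network and add a compass bearing to each edge
G = ox.graph_from_place('Amsterdam, Netherlands', network_type='drive')
G = ox.add_edge_bearings(G)
bearings = [d['bearing'] for _, _, d in G.edges(data=True) if 'bearing' in d]

# bin the bearings into 36 sectors of 10 degrees and plot a polar histogram
counts, bins = np.histogram(bearings, bins=np.arange(0, 361, 10))
ax = plt.subplot(projection='polar')
ax.set_theta_zero_location('N')  # north at the top
ax.set_theta_direction(-1)       # clockwise, like a compass
ax.bar(np.radians(bins[:-1]), counts, width=np.radians(10), align='edge')
plt.show()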

Discussion

15 July 2018 - There was some really interesting discussion on Twitter in response to my post from last Friday (I use Twitter names to refer to people; most sources are in Dutch).

Curved streets

Both Sanne and Egon Willighagen asked how the chart treats curved streets. I have to admit I hadn’t checked, but the docstring of the add_edge_bearings function explains that it calculates the compass bearing of edges from origin node to destination node, which implies that streets are treated as if they were straight lines.

Is that a problem? Probably not for many US cities, for they seem to have few curved streets. As for Amsterdam: most people’s mental image of the city is probably dominated by the curved canals of the city centre. However, many neighbourhoods consist of grids of more or less straight streets. So perhaps curved streets have little impact on the analysis after all.

Length versus surface

Hans Wisbrun argues that the chart type is nice, but also deceptive. The number of streets is represented by the length of the wedges, but readers may intuitively judge their area, which increases with the square of the length: a wedge twice as long covers four times the area. In a post from 2013 (based on a tip from Ionica Smeets), he used a chart by Florence Nightingale to discuss the problem.

Rogier Brussee agrees, but argues that a polar chart is still the right choice here, because what you want to show is the angle of streets.

In a more general sense, I think the charts are an exploratory tool that’ll give you an idea of how street patterns differ between cities. If you really want to understand what the wedges represent, you’ll have to look at a map.

Beach ridges

That’s what Stephan Okhuijsen did. He noted that the chart for The Hague appears to reflect the orientation of the city’s coastline. Not quite, Christiaan Jacobs replied. The orientation of the city’s streets is not determined by the current coastline, but by the original beach ridges.

I don’t know much about geography (or about The Hague for that matter), but a bit of googling suggests Jacobs is right. See for example this map (from this detailed analysis of one of The Hague’s streets), with the old sand dunes shown in dark yellow.

See also links to previous similar work in this post by Nathan Yau (FlowingData).

How to use Python and Selenium for scraping election results

A while ago, I needed the results of last year’s Lower House election in the Netherlands, by municipality. Dutch election data is available from the website of the Kiesraad (Electoral Board). However, it doesn’t contain a table of results per municipality. You’ll have to collect this information from almost 400 different web pages. This calls for a webscraper.

The Kiesraad website is partly generated using JavaScript (I think) and therefore not easy to scrape. That made it seem like a perfect project to explore Selenium.

What’s Selenium? «Selenium automates browsers. That’s it!» Selenium is primarily a tool for testing web applications. However, as a tutorial by Thiago Marzagão explains, it can also be used for webscraping:

[S]ome websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work.

Selenium can be used with Python. Instructions to install Selenium are here. You also have to download chromedriver or another driver; you may store it in /usr/local/bin/.

Once you have everything in place, this is how you launch the driver and load a page:

from selenium import webdriver
 
URL = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315'
 
browser = webdriver.Chrome()
browser.get(URL)

This will open a new browser window. You can use either xpath or css selectors to find elements and then interact with them. For example, find a dropdown menu, identify the options from the menu and select the second one:

XPATH_PROVINCES = '//*[@id="search"]/div/div[1]/div'
element = browser.find_element_by_xpath(XPATH_PROVINCES)
options = element.find_elements_by_tag_name('option')
options[1].click()

If you check the page source of the web page, you won’t find the options of the dropdown menu; they’re added afterwards. With Selenium, you needn’t worry about that: it will load the options for you.

Well, actually, there’s a bit more to it: you can’t find and select the options until they’ve actually loaded. Likely, the options won’t be in place initially, so you’ll need to wait a bit and retry.

Selenium comes with functions that specify what it should wait for, and how long it should wait and retry before it throws an error. But this isn’t always straightforward, as Marzagão explains:

Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. […] This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.
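For reference, the built-in mechanism looks like this. A minimal sketch using Selenium’s WebDriverWait and expected_conditions, which poll until a condition is met or a timeout expires:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# wait up to 10 seconds for the dropdown to appear;
# raises a TimeoutException if it never shows up
wait = WebDriverWait(browser, 10)
element = wait.until(EC.presence_of_element_located((By.XPATH, XPATH_PROVINCES)))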

I ran into pretty similar problems when I tried to scrape the Kiesraad website. I tried many variations of the built-in wait parameters, but without any success. In the end I decided to write a few custom functions for the purpose.

The example below looks up the options of a dropdown menu. As long as the number of options isn’t greater than 1 (the page initially loads with only one option, a dash, and other options are loaded subsequently), it will wait a few seconds and try again - until more options are found or until a maximum number of tries has been reached.

import time

MAX_TRIES = 15

def count_options(xpath, browser):
    # the page initially loads with a single placeholder option (a dash),
    # so keep retrying until more options appear or MAX_TRIES is reached
    time.sleep(3)
    count = 0
    tries = 0
    while tries < MAX_TRIES:
        try:
            element = browser.find_element_by_xpath(xpath)
            count = len(element.find_elements_by_tag_name('option'))
            if count > 1:
                return count
        except Exception:
            pass
        time.sleep(1)
        tries += 1
    return count
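You’d then call it with the xpath of the dropdown, reusing the example from above:

count = count_options(XPATH_PROVINCES, browser)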

Here’s a script that will download and save the result pages of all cities for the March 2017 Lower House election, parse the html, and store the results as a csv file. Run it from a subfolder in your project folder.

Notes

Dutch election results are provided by the Kiesraad as open data. The Kiesraad website used to provide a csv with the results of all the municipalities, but this option is no longer available. Alternatively, a download is available of datasets for each municipality, but at least for 2017, municipalities use different formats.

Scraping the Kiesraad website appears to be the only way to get uniform data per municipality.

Since I originally wrote the scraper, the Kiesraad website has been changed. As a result, it’s now possible to scrape the site in a much easier way, without any need for Selenium. The source code of the landing page for an election contains a dictionary with id numbers for all the municipalities. With those id numbers, you can create the urls for their result pages. No clicking required.
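A minimal sketch of that approach, using requests instead of Selenium. Note that the variable name in the page source and the url pattern below are assumptions; you’d have to adapt them after inspecting the actual page:

import json
import re

import requests

URL = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315'
html = requests.get(URL).text

# hypothetical: assume the page embeds a JSON dictionary that maps
# municipality ids to names, e.g. var municipalities = {"363": "Amsterdam", ...};
match = re.search(r'var municipalities = (\{.*?\});', html, re.DOTALL)
municipalities = json.loads(match.group(1))

# build a result-page url per municipality (the url pattern is an assumption)
urls = {name: f'{URL}/{id_}' for id_, name in municipalities.items()}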

How to do fuzzy matching in Python

Statistics Netherlands (CBS) has an interesting dataset containing data at the city, district and neighbourhood levels. However, some names of neighbourhoods have changed, specifically between 2010 and 2011 for Amsterdam. For example, Bijlmer-Centrum D, F en H was renamed Bijlmer-Centrum (D, F, H).

In some of those cases the neighbourhood codes have changed as well, and CBS doesn’t have conversion tables. So this is one of those cases where you need fuzzy string matching.

There’s a good Python library for that job: Fuzzywuzzy. It was developed by SeatGeek, a company that scrapes event data from a variety of websites and needed a way to figure out which titles refer to the same event, even if the names have typos and other inconsistencies.

Fuzzywuzzy will compare two strings and compute a score between 0 and 100 reflecting how similar they are. It can use different methods to calculate that score (e.g. fuzz.ratio(string_1, string_2) or fuzz.partial_ratio(string_1, string_2)). Some of those methods are described in this article, which is worth a read.

Alternatively, you can take a string and have Fuzzywuzzy pick the best match(es) from a list of options (e.g., process.extract(string, list_of_strings, limit=3) or process.extractOne(string, list_of_strings)). Here, too, you could specify the method to calculate the score, but you may want to first try the default option (WRatio), which will figure out which method to use. The default option seems to work pretty well.
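A quick illustration with the neighbourhood names from above (exact scores depend on the Fuzzywuzzy version, so treat them as indicative):

from fuzzywuzzy import fuzz, process

# a high score, though below 100 because of the punctuation differences
fuzz.ratio('Bijlmer-Centrum D, F en H', 'Bijlmer-Centrum (D, F, H)')

# returns the best match as a (string, score) tuple
process.extractOne('Bijlmer-Centrum D, F en H',
                   ['Bijlmer-Centrum (D, F, H)', 'Bijlmer-Oost (E, G, K)'])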

Here’s the code I used to match the 2010 CBS Amsterdam neighbourhood names to those for 2011:

import pandas as pd
from fuzzywuzzy import process
 
# Prepare data
 
colnames = ['name', 'level', 'code']
 
data_2010 = pd.read_excel('../data/Kerncijfers_wijken_e_131017211256.xlsx', skiprows=4)
data_2010.columns = colnames
data_2010 = data_2010[data_2010.level == 'Buurt']
names_2010 = data_2010['name']
 
data_2011 = pd.read_excel('../data/Kerncijfers_wijken_e_131017211359.xlsx', skiprows=4)
data_2011.columns = colnames
data_2011 = data_2011[data_2011.level == 'Buurt']
names_2011 = data_2011['name']
 
# Actual matching
 
recode = {}
for name in names_2010:
    best_match = process.extractOne(name, names_2011)
    if best_match[1] < 100:
        print(name, best_match)
    recode[name] = best_match[0]
 

It prints all matches with a score below 100, so you can inspect them in case there are any incorrect matches (with larger datasets this may not be feasible). With the process option I didn’t get any incorrect matches, but with fuzz.partial_ratio, IJplein en Vogelbuurt was matched with Vondelbuurt instead of IJplein/Vogelbuurt.

PS In case you’re actually going to work with the local CBS data, you should know that Amsterdam’s neighbourhoods (buurten) were reclassified as districts (wijken) in 2016, when a more detailed set of neighbourhoods was introduced. You can translate 2015 neighbourhood codes to 2016 district codes:

def convert_code(x):
    # turn a 2015 buurt code like 'BU0363xxyy' into the corresponding
    # 2016 wijk code 'WK0363yy': swap the 'BU' prefix for 'WK', then keep
    # the four-digit municipality code plus the final two digits
    x = 'WK' + x[2:]
    x = x[:6] + x[-2:]
    return x
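For example, with a (hypothetical) Amsterdam buurt code:

convert_code('BU03630001')  # -> 'WK036301'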

Python script to import .sps files

In a post about voting locations (in Dutch), I grumbled a bit about inconsistencies in how Statistics Netherlands (CBS) spells the names of municipalities, and about the fact that it doesn’t include municipality codes in its data exports. This afternoon, someone who works at CBS responded on Twitter. She had asked around and found a workaround: download the data as SPSS. Thanks!

CBS offers the option to download data as an SPSS syntax file (.sps). I wasn’t familiar with this file type, I don’t have SPSS, and I couldn’t immediately find a package to import it. But it turns out that .sps files are just text files, so I wrote a little script that does the job.

Note that it’s not super fast; there may be more efficient ways to do the job. Also, I’ve only tested it on a few CBS data files. I’m not sure it’ll work correctly if all variables have labels or if the file contains not just data but also statistical analysis.

That said, you can find the script here.
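To give an idea of the approach: an SPSS syntax file with inline data holds its rows between BEGIN DATA and END DATA commands. A minimal sketch that assumes that layout and whitespace-separated values (the actual CBS files may need more handling):

import pandas as pd

def read_sps(path):
    # extract the rows between BEGIN DATA and END DATA; variable names,
    # labels and separators are ignored in this simplified version
    with open(path) as f:
        lines = [line.strip() for line in f]
    start = next(i for i, l in enumerate(lines) if l.upper().startswith('BEGIN DATA')) + 1
    end = next(i for i, l in enumerate(lines) if l.upper().startswith('END DATA'))
    return pd.DataFrame([l.split() for l in lines[start:end]])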
