Data

Dutch governments consider using Strava data

Strava is a popular app for recording bicycle rides. For some years, the company has been trying to sell its data to local governments for traffic planning. NDW, a platform of Dutch governments including the city of Amsterdam, has bought six months’ worth of Strava data to give it a try.

The switch to Strava may mean the end of the Fietstelweek, an annual one-week effort to collect bicycle data from thousands of volunteers. In the past, I’ve used Fietstelweek data to analyse waiting times at traffic lights. The Fietstelweek received funding from the same governments that are now experimenting with Strava data.

One reason why they are looking for alternatives is that the number of Fietstelweek participants is lower than they’d like. They seem to have a point. Consider for example the map below, which shows bicycle routes to and from Amsterdam Central Station.

In itself, it’s an interesting map. Unsurprisingly, intensity seems to be highest near the bicycle parking facilities. Main access routes appear to be the Geldersekade (with its sometimes chaotic crossing with Prins Hendrikkade) and the Piet Heinkade. People cycling to and from Central Station seem somewhat more likely to live in the eastern part of the city.

There’s one caveat though: the numbers are small. Even the busiest segments represent at most 40 rides. One loyal Fietstelweek participant recording her commute during the entire week could literally change the map.

Strava has far larger numbers, but its data raises different kinds of questions. Strava calls itself ‘the social network for athletes’ and wants to know if you use a road bike, a mountain bike, a TT bike or a cyclocross bike (no option ‘other’ available). So how representative is Strava data of people who use their city bike for commutes and other practical purposes?

Strava’s response to such questions is that they’re trying to make the app less competition-focused and more social, with Facebook-like features. This should help them collect data about ‘normal’ bike rides. They have also argued that «especially in cities, those with the app tended to ride the same routes as everyone else».

But is that really true? Strava’s heatmap (choose red and rides) for Amsterdam could perhaps be interpreted as a combination of recreational rides (Vondelpark, Amstel) and cyclists trying to get in or out of the city as quickly as possible (plus quite a few people who recorded their laps at the Jaap Eden ice skating rink as bicycle rides).

Perhaps you could find a way to filter out ‘lycra’ rides and end up with a sufficient number of ‘normal’ rides. Then again, almost three-quarters of bicycle rides in the Netherlands are under 3.7 km, and I suspect very few of those short rides end up on Strava.
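As a rough illustration of what such a filter might look like, here’s a minimal sketch, assuming ride records with hypothetical distance_km and average_speed_kmh columns; the thresholds are arbitrary guesses for illustration, not validated criteria for separating ‘lycra’ from ‘normal’ rides.

import pandas as pd

# Hypothetical ride records; the column names and values are made up for illustration.
rides = pd.DataFrame({
    'distance_km': [2.5, 45.0, 3.1, 80.2, 6.0],
    'average_speed_kmh': [14.0, 29.5, 16.0, 31.0, 18.0],
})

# Crude heuristic: treat long and/or fast rides as 'lycra' rides,
# and keep the rest as candidate 'normal' utility rides.
normal = rides[(rides['distance_km'] < 10) & (rides['average_speed_kmh'] < 22)]
print(normal)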

There’s also a socio-economic aspect. It has been argued that Strava is used most by people living in wealthier neighbourhoods, which aren’t necessarily the neighbourhoods most in need of better cycling infrastructure.

Of course, bicycle use is unequal in the first place, which is also reflected in Fietstelweek data. The map below shows the start and end points of rides for Amsterdam.

Density is highest in the area within the ring road and south of the IJ. The number of trips per 1,000 residents also correlates with house values: more bicycle trips start or end in affluent neighbourhoods. As noted, this probably reflects actual patterns in bicycle use rather than a problem with the data.

To summarise, Fietstelweek has smaller numbers than one would like, while Strava data raises questions about representativeness. One way for Strava to help answer these questions would be to make a subset of its Amsterdam data available as open data.

This Python script shows how the analysis was done.

The Digital City

Amsterdam has a new coalition agreement. The paragraph on democratisation and the Digital City was well received - a London-based researcher from Amsterdam liked the plans so much she decided to translate them into English.

The new coalition wants to create a democratic version of the smart city. Citizens should be in control of their data. The city will support co-operatives that provide an alternative to platform monopolists. An information commissioner will see to it that the principles ‘open by default’ and ‘privacy by design’ are implemented.

The agreement also lists a number of issues the city is (or was) already working on:

  • City council information will be opened up. In 2015, the city council asked for documents such as council meeting reports, motions and written questions to be made available as open data. Since then, some of that data has been made available through Open Raadsinformatie, but as yet no solution has been found for offering all council information in a machine-readable form.
  • Freedom of information requests (Wob requests) will be published. Amsterdam started publishing decisions on Wob requests earlier this year; so far, three have been published.
  • Interestingly, Amsterdam wants to use open source software whenever possible. Over ten years ago, Amsterdam planned a similar move, and its plans were sufficiently serious to needle Microsoft. In 2010, the plans foundered on an uncooperative IT department.

All in all, a nice combination of new ambitions and implementation of ‘old’ plans.

The digital city

Amsterdam has a new coalition agreement. The passage on democratisation and the Digital City has been well received - an Amsterdam-British researcher was so enthusiastic that she made an English translation.

The new coalition wants to give the smart city a democratic interpretation. Citizens should get more control over their data, no more data will be collected than necessary, and there will be support for co-operatives that want to offer an alternative to platform monopolists. An information commissioner will see to it that the principles ‘open unless’ and ‘privacy by design’ are put into practice.

The agreement also mentions a number of topics the city is (or was) already working on:

  • Council information will be made accessible. The city council already asked for this in 2015. Since then, part of the council information has become available through Open Raadsinformatie, but there is not yet a good solution for making all council information available in a form that can be properly searched and analysed.
  • Wob requests and documents will be made accessible. A start has already been made with this; so far, three Wob decisions have been published.
  • Also interesting: Amsterdam will work with open source software as much as possible. Over ten years ago, Amsterdam already planned to do so, and those plans were serious enough to needle Microsoft. In 2010, the plans foundered on resistance within the civil service.

All in all, a nice mix of new ambitions and implementation of ‘old’ plans.

Dutch government drops 3D pie charts

The Dutch government has replaced pie charts with bar charts in its annual reports, someone noted on Twitter (via @bokami). Pie charts aren’t always a bad choice - contrary to the view of some adherents of the stricter school of data visualisation. But 3D pie charts are really hard to justify, and it’s a bit awkward that they were still being used a year ago.

The charts are from the 2016 and 2017 annual reports of the Ministry of Social Affairs and Employment.

How to use Python and Selenium for scraping election results

A while ago, I needed the results of last year’s Lower House election in the Netherlands, by municipality. Dutch election data is available from the website of the Kiesraad (Electoral Board). However, it doesn’t contain a table of results per municipality. You’ll have to collect this information from almost 400 different web pages. This calls for a webscraper.

The Kiesraad website is partly generated using javascript (I think) and therefore not easy to scrape. For this reason, this seemed like a perfect project to explore Selenium.

What’s Selenium? «Selenium automates browsers. That’s it!» Selenium is primarily a tool for testing web applications. However, as a tutorial by Thiago Marzagão explains, it can also be used for webscraping:

[S]ome websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work.

Selenium can be used with Python. Instructions to install Selenium are here. You also have to download chromedriver or another driver; you may store it in /usr/local/bin/.
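As a minimal sketch of that setup - using Selenium 3-era syntax, to match the snippets below, and the /usr/local/bin/ location suggested above - you can point Selenium at the driver explicitly:

# Install Selenium with pip (pip install selenium) and download chromedriver separately.
from selenium import webdriver

# If chromedriver is not on your PATH, pass its location explicitly;
# the path below is just the suggested location from above.
browser = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')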

Once you have everything in place, this is how you launch the driver and load a page:

from selenium import webdriver
 
URL = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315'
 
browser = webdriver.Chrome()
browser.get(URL)

This will open a new browser window. You can use either xpath or css selectors to find elements and then interact with them. For example, find a dropdown menu, identify the options from the menu and select the second one:

XPATH_PROVINCES = '//*[@id="search"]/div/div[1]/div'
element = browser.find_element_by_xpath(XPATH_PROVINCES)
options = element.find_elements_by_tag_name('option')
options[1].click()

If you were to check the page source of the web page, you wouldn’t find the options of the dropdown menu; they’re added afterwards. With Selenium, you needn’t worry about that - it will load the options for you.

Well, actually, there’s a bit more to it: you can’t find and select the options until they’ve actually loaded. Likely, the options won’t be in place initially, so you’ll need to wait a bit and retry.

Selenium comes with functions that specify what it should wait for, and how long it should wait and retry before it throws an error. But this isn’t always straightforward, as Marzagão explains:

Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. […] This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.
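To give an idea of what these built-in explicit waits look like, here’s a minimal sketch; it reuses the browser object and XPATH_PROVINCES selector from the snippets above, and the ten-second timeout is an arbitrary choice.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the element to be present in the DOM;
# if it never appears, Selenium raises a TimeoutException.
wait = WebDriverWait(browser, 10)
element = wait.until(EC.presence_of_element_located((By.XPATH, XPATH_PROVINCES)))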

I ran into pretty similar problems when I tried to scrape the Kiesraad website. I tried many variations of the built-in wait parameters, but without any success. In the end I decided to write a few custom functions for the purpose.

The example below looks up the options of a dropdown menu. As long as the number of options isn’t greater than 1 (the page initially loads with only one option, a dash, and other options are loaded subsequently), it will wait a few seconds and try again - until more options are found or until a maximum number of tries has been reached.

import time

MAX_TRIES = 15

def count_options(xpath, browser):
    """Count the options of a dropdown, retrying until more than one has loaded."""

    time.sleep(3)
    count = 0
    tries = 0
    while tries < MAX_TRIES:

        try:
            element = browser.find_element_by_xpath(xpath)
            count = len(element.find_elements_by_tag_name('option'))
            # The page initially loads with a single placeholder option (a dash);
            # only return once the real options have been added.
            if count > 1:
                return count
        except Exception:
            # The dropdown may not be present yet; wait and try again.
            pass

        time.sleep(1)
        tries += 1

    return count

Here’s a script that will download and save the result pages of all cities for the March 2017 Lower House election, parse the html, and store the results as a csv file. Run it from a subfolder in your project folder.

UPDATE 23 February 2019 - improved version of the script here.

Notes

Dutch election results are provided by the Kiesraad as open data. In the past, the Kiesraad website provided a csv with the results for all municipalities, but this option is no longer available. Alternatively, datasets can be downloaded for each municipality, but at least for 2017, municipalities use different formats.

Scraping the Kiesraad website appears to be the only way to get uniform data per municipality.

Since I originally wrote the scraper, the Kiesraad website has been changed. As a result, it would now be possible to scrape the site in a much easier way, and there would be no need to use Selenium. The source code of the landing page for an election contains a dictionary with id numbers for all the municipalities. With those id numbers, you can create urls for their result pages. No clicking required.
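A minimal sketch of what that simpler approach could look like, assuming the dictionary is embedded in the landing page as a JSON-like object and that result-page urls can be built from the ids; the regular expression and url pattern below are illustrative guesses, not the actual Kiesraad markup:

import json
import re

import requests

LANDING_URL = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315'

# Fetch the landing page; the municipality ids are assumed to be embedded
# in the html as a JSON object (the exact markup may differ).
html = requests.get(LANDING_URL).text
match = re.search(r'var municipalities = (\{.*?\});', html, re.DOTALL)
municipalities = json.loads(match.group(1)) if match else {}

# Build a result-page url per municipality id; the url pattern is hypothetical.
urls = [f'{LANDING_URL}/gemeente/{muni_id}' for muni_id in municipalities]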
