A while ago, I needed the results of last year’s Lower House election in the Netherlands, by municipality. Dutch election data is available from the website of the Kiesraad (Electoral Board). However, it doesn’t contain a table of results per municipality. You’ll have to collect this information from almost 400 different web pages. This calls for a webscraper.
What’s Selenium? «Selenium automates browsers. That’s it!» Selenium is primarily a tool for testing web applications. However, as a tutorial by Thiago Marzagão explains, it can also be used for webscraping:
[S]ome websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work.
Once you have everything in place, this is how you launch the driver and load a page:
from selenium import webdriver URL = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315' browser = webdriver.Chrome() browser.get(URL)
This will open a new browser window. You can use either xpath or css selectors to find elements and then interact with them. For example, find a dropdown menu, identify the options from the menu and select the second one:
XPATH_PROVINCES = '//*[@id="search"]/div/div/div' element = browser.find_element_by_xpath(XPATH_PROVINCES) options = element.find_elements_by_tag_name('option') options.click()
If you’d check the page source of the web page, you wouldn’t find the options of the dropdown menu; they’re added afterwards. With Selenium, you needn’t worry about that - it will load the options for you.
Well, actually, there’s a bit more to it: you can’t find and select the options until they’ve actually loaded. Likely, the options won’t be in place initially, so you’ll need to wait a bit and retry.
Selenium comes with functions that specify what it should wait for, and how long it should wait and retry before it throws an error. But this isn’t always straightforward, as Marzagão explains:
Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. […] This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.
I ran into pretty similar problems when I tried to scrape the Kiesraad website. I tried many variations of the built-in wait parameters, but without any success. In the end I decided to write a few custom functions for the purpose.
The example below looks up the options of a dropdown menu. As long as the number of options isn’t greater than 1 (the page initially loads with only one option, a dash, and other options are loaded subsequently), it will wait a few seconds and try again - until more options are found or until a maximum number of tries has been reached.
MAX_TRIES = 15 def count_options(xpath, browser): time.sleep(3) tries = 0 while tries < MAX_TRIES: try: element = browser.find_element_by_xpath(xpath) count = len(element.find_elements_by_tag_name('option')) if count > 1: return count except: pass time.sleep(1) tries += 1 return count
Here’s a script that will download and save the result pages of all cities for the March 2017 Lower House election, parse the html, and store the results as a csv file. Run it from a subfolder in your project folder.
Dutch election results are provided by the Kiesraad as open data. In the past, the Kiesraad website used to provide a csv with the results of all the municipalities, but this option is no longer available. Alternatively, a download is available of datasets for each municipality, but at least for 2017, municipalities use different formats.
Scraping the Kiesraad website appears to be the only way to get uniform data per municipality.
Since I originally wrote the scraper, the Kiesraad website has been changed. As a result, it would now be possible to scrape the site in a much easier way, and there would be no need to use Selenium. The source code of the landing page for an election contains a dictionary with id numbers for all the municipalities. With those id numbers, you can create urls for their result pages. No clicking required.