champagne anarchist | armchair activist

How to use Python and Selenium for scraping election results

A while ago, I needed the results of last year’s Lower House election in the Netherlands, by municipality. Dutch election data is available from the website of the Kiesraad (Electoral Board). However, it doesn’t contain a table of results per municipality. You’ll have to collect this information from almost 400 different web pages. This calls for a webscraper.

The Kiesraad website is partly generated using JavaScript (I think) and is therefore not easy to scrape. That made it a perfect project for exploring Selenium.

What’s Selenium? «Selenium automates browsers. That’s it!» Selenium is primarily a tool for testing web applications. However, as a tutorial by Thiago Marzagão explains, it can also be used for webscraping:

[S]ome websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work.

Selenium can be used with Python. Instructions to install Selenium are here. You also have to download chromedriver or another driver; you may store it in /usr/local/bin/.

Once you have everything in place, this is how you launch the driver and load a page:

from selenium import webdriver
 
URL = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315'
 
browser = webdriver.Chrome()
browser.get(URL)

This will open a new browser window. You can use either xpath or css selectors to find elements and then interact with them. For example, find a dropdown menu, identify the options from the menu and select the second one:

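from selenium.webdriver.common.by import By

XPATH_PROVINCES = '//*[@id="search"]/div/div[1]/div'
# The find_element_by_* helpers were removed in Selenium 4; use find_element(By..., ...)
element = browser.find_element(By.XPATH, XPATH_PROVINCES)
options = element.find_elements(By.TAG_NAME, 'option')
options[1].click()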

If you checked the page source of the web page, you wouldn’t find the options of the dropdown menu; they’re added afterwards. With Selenium, you needn’t worry about that - it will load the options for you.

Well, actually, there’s a bit more to it: you can’t find and select the options until they’ve actually loaded. Likely, the options won’t be in place initially, so you’ll need to wait a bit and retry.

Selenium comes with functions that specify what it should wait for, and how long it should wait and retry before it throws an error. But this isn’t always straightforward, as Marzagão explains:

Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. […] This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.

I ran into pretty similar problems when I tried to scrape the Kiesraad website. I tried many variations of the built-in wait parameters, but without any success. In the end I decided to write a few custom functions for the purpose.

The example below looks up the options of a dropdown menu. As long as the number of options isn’t greater than 1 (the page initially loads with only one option, a dash, and other options are loaded subsequently), it will wait a few seconds and try again - until more options are found or until a maximum number of tries has been reached.

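import time

from selenium.webdriver.common.by import By

MAX_TRIES = 15

def count_options(xpath, browser):
    """Count the options of a dropdown, retrying until more than one has loaded."""

    time.sleep(3)  # give the page a moment to start loading
    count = 0      # defined up front so the final return can't fail
    tries = 0
    while tries < MAX_TRIES:

        try:
            element = browser.find_element(By.XPATH, xpath)
            count = len(element.find_elements(By.TAG_NAME, 'option'))
            if count > 1:
                return count
        except Exception:
            # The dropdown may not exist yet; wait and try again
            pass

        time.sleep(1)
        tries += 1
    return count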
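
The same wait-and-retry idea can be written as a small general helper. The sketch below is not part of the original scraper: `wait_until` and its parameters are names made up for illustration. It takes any function that checks whether the page is ready and keeps calling it until it returns something truthy or the maximum number of tries is reached.

```python
import time


def wait_until(condition, max_tries=15, delay=1.0):
    """Call `condition` until it returns a truthy value or tries run out.

    Returns the truthy value, or None if the condition was never met.
    """
    for _ in range(max_tries):
        try:
            result = condition()
            if result:
                return result
        except Exception:
            # Treat any error (e.g. an element that isn't there yet)
            # as "not ready yet" and try again.
            pass
        time.sleep(delay)
    return None
```

With Selenium, `condition` could be a small function that looks up the dropdown and returns its options once there is more than one.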

Here’s a script that will download and save the result pages of all cities for the March 2017 Lower House election, parse the html, and store the results as a csv file. Run it from a subfolder in your project folder.

Notes

Dutch election results are provided by the Kiesraad as open data. In the past, the Kiesraad website used to provide a csv with the results of all the municipalities, but this option is no longer available. Alternatively, a download is available of datasets for each municipality, but at least for 2017, municipalities use different formats.

Scraping the Kiesraad website appears to be the only way to get uniform data per municipality.

Since I originally wrote the scraper, the Kiesraad website has been changed. As a result, it would now be possible to scrape the site in a much easier way, and there would be no need to use Selenium. The source code of the landing page for an election contains a dictionary with id numbers for all the municipalities. With those id numbers, you can create urls for their result pages. No clicking required.
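
A rough sketch of what that simpler approach might look like is below. I haven't checked it against the current site, so everything specific in it is an assumption: the variable name `municipalities`, the shape of the dictionary and the url template (a placeholder on example.org) are all hypothetical and would have to be replaced with what the landing page actually contains.

```python
import json
import re

# Hypothetical: the real page may use a different variable name and layout.
PATTERN = re.compile(r'var\s+municipalities\s*=\s*(\{.*?\});', re.DOTALL)

# Hypothetical placeholder, not the real url pattern of the Kiesraad site.
URL_TEMPLATE = 'https://example.org/results/{id}'


def result_urls(html):
    """Extract a flat {name: id} dictionary from the page source
    and return a {name: url} dictionary of result pages."""
    match = PATTERN.search(html)
    if match is None:
        return {}
    ids = json.loads(match.group(1))
    return {name: URL_TEMPLATE.format(id=id_) for name, id_ in ids.items()}


# Made-up page source for illustration
sample = '<script>var municipalities = {"Municipality A": 1, "Municipality B": 2};</script>'
print(result_urls(sample))
```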


The pay gap between CEO and workers

The Dutch Corporate Governance Code has been revised. As a result, Dutch listed companies have started to report their internal pay ratios. These pay ratios are often thought of as the gap between CEO pay and what other workers in the firm get paid.

How can you use these ratios? One way is to create a Fat Cat Calendar. The inspiration here is the British phenomenon of Fat Cat Day: the day when CEOs have earned as much as their employees will earn in an entire year.

The calendar shows that by 2 January at quarter to five in the afternoon, the Heineken CEO had earned more than his workers over all of 2017.

Some firms are missing from the calendar. For example, Euronext says it’s complicated to calculate a pay ratio, because they operate in different countries (but then, this would apply to most listed companies). Shell has calculated a pay ratio, but they don’t report how high it is - only how it compares to the pay ratios of a selection of other companies.

Note that a pay ratio is not a simple and straightforward figure. Below is an exploratory analysis of some of the issues involved (see Method section below for caveats).

Heineken

Heineken, which has by far the highest reported pay ratio in my sample (215), argues this is a consequence of their business model.

First, Heineken does a lot of business in ‘emerging markets with widely different pay levels and structures compared to the Netherlands and Europe’. In other words, many workers are in low-wage countries. The underlying suggestion seems to be that you don’t have to pay these workers the same wages as their colleagues in Europe.

Second, Heineken has ‘a large number of breweries and sales forces in-house worldwide, which adds to the variety of pay within the Company’ (the treatment of Heineken sales staff in Africa is controversial, but that’s a different matter). Aside from the question whether Heineken are paying their brewery and sales workers enough, they do have a point here.

Many companies have simply outsourced their low-paid workers. It’s a bit arbitrary to compare CEO pay to employees of the firm but not to the outsourced workers who clean their offices, serve their lunches, fix their computers, and manufacture and sell their products.

Third, Heineken says that pay ratios can be very volatile because of the substantial bonuses that go up and down. One might argue that this is not a problem of the pay ratio, but a problem of the composition of CEO pay.

Anyone can calculate a ratio

According to the Corporate Governance Code, firms should report the ‘ratio between the remuneration of the management board members and that of a representative reference group determined by the company’. This leaves ample room for firms to decide how to calculate the ratio. For that reason, comparing ratios between firms is a bit tricky.

In practice, many firms calculate the pay ratio as the ratio between total CEO pay and total staff costs per FTE. This is information that is normally included in the annual report, so you can calculate the pay ratio yourself.
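
As a minimal sketch, this is the calculation described above. The function name is mine and the figures in the example are made up for illustration; they don’t refer to any actual company.

```python
def pay_ratio(ceo_pay, staff_costs, fte):
    """Total CEO pay divided by average staff costs per FTE."""
    if fte <= 0:
        raise ValueError('fte must be positive')
    return ceo_pay / (staff_costs / fte)


# Illustrative figures only: CEO pay of 5 million, staff costs of
# 500 million and 10,000 FTE, i.e. 50,000 per FTE on average.
print(pay_ratio(5_000_000, 500_000_000, 10_000))
```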

I calculated pay ratios for a sample of companies listed in Amsterdam and compared these to the pay ratios these companies reported in their annual report. The chart below shows how my calculated pay ratios compare to the reported pay ratios.

In many cases, my calculated pay ratios are quite similar to the pay ratios reported by the companies. It’s interesting to look into some of the examples where this is not the case, and why this might be:

  • Some CEOs were appointed during 2017 and therefore weren’t paid an entire year’s remuneration. I didn’t correct for this, so my calculated pay ratio is too low in these cases. An example is AkzoNobel, which correctly reported a higher pay ratio than the one I calculated.
  • Randstad probably used only corporate employees and not ‘candidates’ (temp workers) to calculate their pay ratio. This would explain why they arrived at a smaller gap between CEO and workers than the one I calculated.
  • Other companies also use a subset of their employees for their calculation. Assuming these are relatively well-paid employees, the gap between CEO and workers will appear smaller that way. An example is OCI, which used only employees in Europe and North-America as a reference group. Unilever takes this approach one step further by comparing CEO pay to their various UK and Dutch management work levels. This resulted in a range of pay ratios, each far smaller than the one I calculated.
  • A few companies use the average remuneration of the entire executive board, rather than CEO remuneration, to calculate the pay ratio. Assuming the CEO earns more than the other board members, this will also make the gap with workers appear smaller. An example is AMG, which argues that using average board remuneration is appropriate ‘given the collective management responsibility of the Management Board members’ (this is somewhat ironic, for ‘collective responsibility’ was apparently not a key consideration when they decided to pay the CEO 50% more than the other board members).

Discussion

I think it would be fair to say that the requirement to report pay ratios is a failed attempt at transparency. Since firms can use whatever method they want to calculate the ratio, comparisons across firms are problematic. Meanwhile, anyone can calculate ratios from data that is already available. While this isn’t entirely unproblematic (see Method section below), the ratios you calculate will probably be more consistent across firms than the reported ratios.

Instead of requiring firms to report a pay ratio, it would make more sense to require them to report CEO pay, staff costs and staff numbers in a more consistent and transparent way.

Further, it’s somewhat arbitrary to compare CEO pay only to employees of the firm, and not outsourced workers such as cleaners. A fair measure of inequality should look beyond the employees of the firm. One option is to compare CEO pay to the median income in a country; another is the norm which states that CEO pay should not be higher than 20 times the legal minimum wage. Of course, a limitation of these approaches is that they don’t capture international inequality.

The highest CEO-to-minimum wage ratio for the firms in my sample is 577 (Unilever). By 1 January 2017 at quarter past three in the afternoon, the CEO of Unilever had earned the minimum wage for an entire year. If you want to narrow this gap, there are broadly two ways to do it: show more restraint in CEO remuneration, and raise the minimum wage.

Method

I calculated ‘Fat Cat Day’ by simply adding one year, divided by the pay ratio, to 1 January 2017:

import datetime as dt
from collections import defaultdict

import pandas as pd

# df: one row per company, with columns `company` and `ratio_reported_17`
d1 = dt.datetime.strptime('2017-01-01', '%Y-%m-%d')
d2 = dt.datetime.strptime('2018-01-01', '%Y-%m-%d')
year = d2 - d1

dates = defaultdict(list)
for i, row in df.iterrows():
    if pd.notnull(row.ratio_reported_17):
        ratio = row.ratio_reported_17
        # Fat Cat Day: one year divided by the pay ratio, counted from 1 January
        fcd = d1 + year / ratio
        date = fcd.strftime('%Y-%m-%d')
        time = fcd.strftime('%H:%M')
        company = row.company
        item = {
            'time': time,
            'company': company
        }
        dates[date].append(item)
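
As a quick check, here is the same calculation for a single ratio, without pandas. The function name `fat_cat_day` is just for illustration. The example uses the pay ratio of 215 reported by Heineken and the CEO-to-minimum wage ratio of 577 for Unilever.

```python
import datetime as dt


def fat_cat_day(ratio, year=2017):
    """Moment in `year` at which 1/ratio of the year has passed."""
    start = dt.datetime(year, 1, 1)
    end = dt.datetime(year + 1, 1, 1)
    return start + (end - start) / ratio


print(fat_cat_day(215).strftime('%Y-%m-%d %H:%M'))  # 2017-01-02 16:44
print(fat_cat_day(577).strftime('%Y-%m-%d %H:%M'))  # 2017-01-01 15:10
```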

Note that the British High Pay Centre has a far more elaborate method to calculate Fat Cat Day, which considers how many hours CEOs work, whether or not they work weekends, etcetera. While this is very conscientious, I don’t think it’s necessary. CEOs are paid on an annual basis, and if they work less than a full year, they’ll generally receive a (more or less) proportional share of their annual pay, regardless of the actual number of hours worked.

I googled for annual reports of companies listed on the Amsterdam stock exchange (in a few cases, I also looked up a separate remuneration report). I didn’t always find one, which can be for a number of reasons: perhaps I didn’t look hard enough; perhaps companies hadn’t filed their report yet; or perhaps they filed it with the company register but didn’t publish it online. As a quick filter, I disregarded reports that don’t contain the term pay ratio. All in all, this is a rather pragmatic sample, which may not be representative of all Amsterdam-listed companies (for example, it’s conceivable that firms with stronger roots in the Netherlands are more inclined to comply with the Corporate Governance Code). For exploratory purposes, I think that’s ok.

I calculated the pay ratio by dividing total CEO pay by the average total staff costs per FTE. This may sound straightforward, but it isn’t:

  • I had to manually copy data from pdfs, so errors can’t be excluded;
  • Different methods may be used to calculate CEO pay (I used the total amount as reported by the company, without checking the method they used to calculate it);
  • It’s not always clear which categories of workers are included in staff costs and staff numbers;
  • If possible I used FTE for staff numbers, but sometimes only headcount is reported and some reports don’t specify what unit they used;
  • If possible I used the average number of staff, but sometimes only the number of staff at the end of the year is reported;
  • I didn’t annualise CEO pay for CEOs who were appointed during 2017. It might seem simple to do so (total pay * 365 / number of days worked), but in some cases CEO pay appears to contain elements that are not dependent on the number of days worked (e.g., the annual incentive at Philips Lighting).
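
For what it’s worth, the simple annualisation mentioned in the last point could be sketched as follows. I didn’t apply this to the data. The `fixed_part` parameter is my own addition, to set aside pay elements that don’t depend on the number of days worked; the amounts in the example are made up.

```python
def annualise(total_pay, days_worked, fixed_part=0.0, days_in_year=365):
    """Scale the time-dependent part of CEO pay to a full year.

    `fixed_part` is the portion of total_pay assumed not to depend
    on the number of days worked; it is left unscaled.
    """
    if days_worked <= 0:
        raise ValueError('days_worked must be positive')
    variable_part = total_pay - fixed_part
    return variable_part * days_in_year / days_worked + fixed_part


# Made-up example: 1 million paid for 146 days of work
print(annualise(1_000_000, 146))
print(annualise(1_000_000, 146, fixed_part=200_000))
```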

All but two companies in my sample use EUR as presentation currency. I pragmatically used the average exchange rate for 2017 as reported by OCI to convert USD to EUR.

For the minimum wage, I used the average of the rates per 1 January and 1 July 2017, and added 8% holiday pay.
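
In code, the minimum wage calculation might look like the sketch below. The function names are mine, and the monthly rates in the example are round placeholder numbers, not the actual 2017 rates.

```python
def annual_minimum_wage(monthly_jan, monthly_jul, holiday_pay=0.08):
    """Average of the two monthly rates, times 12, plus holiday pay."""
    average_monthly = (monthly_jan + monthly_jul) / 2
    return average_monthly * 12 * (1 + holiday_pay)


def minimum_wage_ratio(ceo_pay, monthly_jan, monthly_jul):
    """CEO pay as a multiple of the annual minimum wage."""
    return ceo_pay / annual_minimum_wage(monthly_jan, monthly_jul)


# Placeholder rates of 1,500 and 1,600 per month (not the real rates)
print(round(annual_minimum_wage(1500, 1600), 2))
print(round(minimum_wage_ratio(2_008_800, 1500, 1600), 1))
```

A ratio above 20 would exceed the ‘20 times minimum wage’ norm mentioned in the discussion.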

The ‘20 times minimum wage’ norm has been ascribed to trade union FNV, but that’s not technically correct.

The data I used can be found in this csv file. Please let me know if you find any errors.


Voter revolt in Amsterdam

In the 21 March municipal election, many Amsterdammers voted for new parties. The map below shows the effect this had on parties that already had seats on the city council. Red circles represent polling stations where the established parties lost; the rare green circles show where they won. The size of the circles corresponds to their gain or loss in percentage points.
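
The gain or loss per polling station is a difference in percentage points between two vote shares. A minimal sketch, with a made-up function name and made-up vote counts (the actual data sources and method are described in the Dutch version of this article):

```python
def percentage_point_change(votes_prev, total_prev, votes_now, total_now):
    """Change in vote share, in percentage points, between two elections."""
    share_prev = 100 * votes_prev / total_prev
    share_now = 100 * votes_now / total_now
    return share_now - share_prev


# Made-up polling station: established parties go from 600 of 1,000
# votes to 450 of 1,000 votes, a loss of 15 percentage points.
print(percentage_point_change(600, 1000, 450, 1000))
```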

Established parties lost across the city, but less so in Centrum and Zuid. The voter revolt was felt most in the peripheral districts Nieuw-West, Noord and Zuidoost, followed by parts of West and Oost. At some polling stations, support for the established parties declined by 15 to over 30 percentage points, with a peak of 43 percentage points.

The success of the new parties has been explained by ethnic background (especially DENK gained substantial ethnic minority support), but there’s also a socio-economic component. The chart below, showing results at the neighbourhood level, illustrates this. The share of votes for the new parties DENK, FvD, BIJ1 plus ChristenUnie is larger than the loss of the established parties, because they also won votes from parties that didn’t make the city council four years ago.

The new parties got their votes mainly in the less affluent neighbourhoods of Amsterdam, as measured by the average value of houses. This doesn’t have much impact on pro-market parties like VVD and D66, which get most of their votes in the richer parts of the city. For the social-democrat PvdA and socialist SP, things are different. The chart below shows what happened to their voters. Grey circles represent the 2014 election; red ones the situation in 2017/2018 (the scale on the y-axis is slightly different from the one above).

First, it should be noted that most circles have moved to the right: the value of houses has increased significantly over the past years. This effect tends to be somewhat stronger in richer neighbourhoods. As a result, inequality has increased.

In 2014, PvdA and SP had considerable support in the less affluent neighbourhoods, but those are also the neighbourhoods where they lost the most on 21 March. Their support there is now hardly larger than in the richer neighbourhoods. This effect is strongest for the PvdA (note that it doesn’t apply to other left-wing parties like GroenLinks).

Over the past years, concern has grown over Amsterdam’s social divide. The 21 March election outcome can be seen as a reflection of this inequality. In the less affluent peripheral neighbourhoods, established parties lost votes, as new parties grew.

The winner of the election, green party GroenLinks, has opted not to invite these new parties to the negotiations for a coalition agreement. In itself, there’s nothing wrong with that choice. Meanwhile, the new city government will need to come up with a credible answer to the city’s social divide. GroenLinks has often identified this as one of the key issues that need to be addressed.

For data sources and method, see the Dutch version of this article.


The impact of #deletefacebook

This is turning into a bit of a series: in previous posts, I showed how there’s a yearly peak in people googling “delete facebook” around New Year, the time for New Year’s resolutions. The peak is even higher than for “quit smoking”.

Against the backdrop of the latest Facebook controversy, Whatsapp co-founder Brian Acton helped launch a #deletefacebook campaign. Below is an update of my previous chart, which gives a preliminary impression of the impact of this campaign.

Some caution is in order, for recent Google Trends data can sometimes be a bit unstable. Also, it’s possible that currently some people are googling “delete facebook” out of curiosity, without actually intending to delete their account. That said, the impact of the current campaign may well be substantially larger than the annual New Year’s peak.


Search Facebook by date

Henk van Ess and Daniel Endresz have created a tool to search Facebook by date or date range. The tool creates a url containing the search criteria (as with Facebook Graph). It uses JavaScript to generate the search urls. For example, this is how the date range url is generated:

function generate_url_timerange() {
 
    var keyword = $('#input-timerange-keyword').val();
 
    var day1 = $('#select-timerange-day1').val();
    var month1 = $('#select-timerange-month1').val();
    var year1 = $('#select-timerange-year1').val();
 
    var day2 = $('#select-timerange-day2').val();
    var month2 = $('#select-timerange-month2').val();
    var year2 = $('#select-timerange-year2').val();
 
    var url = 'https://www.facebook.com/search/str/'+keyword+'/stories-keyword/'+day1+'/'+month1+'/'+year1+'/date-3/'+day2+'/'+month2+'/'+year2+'/date-3/stories-2/intersect'
 
    $('#btn-search-timerange').attr('href', url);
}
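
For those who’d rather build such urls in a script, here’s a rough Python rendering of the same pattern. It simply mirrors the url structure of the JavaScript above. The function name is mine, and url-encoding the keyword is my own addition (the original code concatenates it as-is). I haven’t verified whether Facebook still accepts urls of this form.

```python
from urllib.parse import quote

BASE = 'https://www.facebook.com/search/str/'


def timerange_url(keyword, start, end):
    """Build a date range search url.

    `start` and `end` are (day, month, year) tuples.
    """
    day1, month1, year1 = start
    day2, month2, year2 = end
    parts = [
        quote(keyword, safe=''),  # url-encoding is an addition of mine
        'stories-keyword',
        str(day1), str(month1), str(year1), 'date-3',
        str(day2), str(month2), str(year2), 'date-3',
        'stories-2', 'intersect',
    ]
    return BASE + '/'.join(parts)


print(timerange_url('election results', (1, 3, 2018), (21, 3, 2018)))
```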

The tool has been published with an open source license. The creators indicate that they «respect your privacy and the cases you are working on, so we are not storing any searches you will make» - which is nice, even if it would seem to be of little consequence since you need to be logged into Facebook to use the tool.
