
How to investigate assets: lessons from The Wire

I’m rewatching The Wire. It’s a great series in any case, but for researchers, episode 9 of the first season (2002) is especially interesting: it features detective Lester Freamon showing detectives Roland Pryzbylewski and Leander Sydnor how to investigate the assets of drug kingpin Avon Barksdale.

They use microfilm instead of the Internet. They don’t have databases like Orbis, Companyinfo or OpenCorporates, and they don’t seem to calculate social network metrics. Yet the general principles behind Freamon’s methodology still make perfect sense today:

Start with the nightclub that Barksdale owns. Look up Orlando’s, by address, you match it, and you see it’s owned by - who?

Turns out it’s owned by D & B Enterprises. Freamon tells Prez to take that information to the state office buildings on Preston Street.

Preston Street?

Corporate charter office.

Corporate who?

They have the paperwork on every corporation and LLC licensed to do business in the state. You look up D & B Enterprises on the computer. You’re going to get a little reel of microfilm. Pull the corporate charter papers that way. Write down every name you see. Corporate officers, shareholders or, more importantly, the resident agent on the filing who is usually a lawyer. While they use front names as corporate officers, they usually use the same lawyer to do the charter filing. Find that agent’s name, run it through the computer, find out what other corporations he’s done the filing for, and that way we find other front companies.

This is pretty much the same approach you’d take when investigating shady temp agencies: trace connections via (former) shareholders, board members, company addresses and related party transactions. And, of course, try to figure out where the profits go.
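
If the filings are available in a structured form, this step lends itself to a simple script. Below is a minimal sketch in Python, assuming a hypothetical csv of filings with company and resident agent columns (the file and column names are made up):

import pandas as pd

# Hypothetical input: one row per charter filing, with the company name
# and the resident agent who did the filing (file and column names are assumptions)
filings = pd.read_csv('filings.csv')  # columns: company, resident_agent

# Group companies by the agent who filed their charter papers
by_agent = filings.groupby('resident_agent')['company'].apply(list)

# Agents who filed for more than one company may point to other front companies
for agent, companies in by_agent.items():
    if len(companies) > 1:
        print(agent, '->', ', '.join(sorted(set(companies))))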

On the question of where the profits go, Freamon also has some wisdom to share:

And here’s the rub. You follow drugs, you get drug addicts and drug dealers. But you start to follow the money, and you don’t know where the fuck it’s gonna take you.

Linking addresses to districts and neighbourhoods

Statistics Netherlands (CBS) has extensive data (select the theme Nederland regionaal) at the district and neighbourhood level. Sometimes it can be useful to link addresses to the corresponding district or neighbourhood. That takes a bit of work: geocode the addresses (or look up coordinates at the Kadaster) and then link those locations to the shapefile of the Kernkaart wijken en buurten.

But this can now be done faster and more simply: CBS has released a table that links every combination of postcode and house number to the corresponding neighbourhood and district.
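
As a minimal sketch of how you might use that table with pandas: the example below assumes an address list with postcode and house number columns and a local copy of the CBS table; all file and column names are assumptions, so check them against the actual downloads.

import pandas as pd

# Hypothetical file and column names: adjust to the actual CBS download
addresses = pd.read_csv('addresses.csv')        # columns: postcode, huisnummer, ...
pc6huisnr = pd.read_csv('pc6huisnr_buurt.csv')  # columns: postcode, huisnummer, buurtcode, wijkcode

# Normalise postcodes (e.g. '1234 AB' -> '1234AB') before merging
for df in (addresses, pc6huisnr):
    df['postcode'] = df['postcode'].str.replace(' ', '').str.upper()

# Attach neighbourhood and district codes to each address
merged = addresses.merge(pc6huisnr, on=['postcode', 'huisnummer'], how='left')
merged.to_csv('addresses_with_buurt.csv', index=False)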

Converting Election Markup Language (EML) to csv

Note that the map above isn’t really a good illustration here because I used a different data source to create it.

Getting results of Dutch elections at the municipality level can be complicated, but what if you want to dig a little deeper and look at results per polling station? Or even per candidate, per polling station? For elections since 2009, that information is available from the data portal of the Dutch government.

Challenges

The data is in Election Markup Language, an international standard for election data. I wasn’t familiar with the format, and processing the data posed a bit of a challenge. I couldn’t find a simple explanation of the data structure, and the Electoral Board states that it doesn’t provide support for the format.

For example, how do you connect a candidate ID to their name and other details? I think you need to identify the Kieskring (electoral district) from the contest name in the results file, then find the candidate list for that Kieskring and look up the candidate’s details using their candidate ID and affiliation. With municipal elections, however, you have to look up candidates in the city’s candidate list (which doesn’t seem to have a contest name).
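
To make that concrete, here is a rough sketch of the kind of lookup I mean, using Python’s standard ElementTree. The element and attribute names (Candidate, CandidateIdentifier, Id, FirstName, NamePrefix, LastName) reflect my reading of the Dutch files and may need adjusting; a full version would also key on the affiliation, since candidate ids are assigned per party list.

import xml.etree.ElementTree as ET

def local(tag):
    # Strip the XML namespace so we can match on local element names
    return tag.split('}')[-1]

def candidate_names(candidate_list_file):
    """Map candidate ids to names, based on a candidate list file.

    Element and attribute names are assumptions; verify them against the XML.
    """
    names = {}
    tree = ET.parse(candidate_list_file)
    for cand in tree.iter():
        if local(cand.tag) != 'Candidate':
            continue
        cand_id = None
        parts = []
        for child in cand.iter():
            tag = local(child.tag)
            if tag == 'CandidateIdentifier':
                cand_id = child.get('Id')
            elif tag in ('FirstName', 'NamePrefix', 'LastName') and child.text:
                parts.append(child.text.strip())
        if cand_id is not None:
            names[cand_id] = ' '.join(parts)
    return names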

Practical tips

If you plan to use the data, here are some practical tips:

  • Keep in mind that locations and names of polling stations may change between elections.
  • If you want to geocode the polling stations, the easiest way is to use the postcode, which is often appended to the polling station name (though only for recent elections). If no postcode is available, or if you need a more precise location, the lists of polling station names and locations provided by Open State (2017, 2018) may be of use. Use fuzzy matching on polling station names (see the sketch below), or perhaps match on postcode where available. Of course, such an approach is not entirely error-free.
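
For the fuzzy matching mentioned above, Python’s standard difflib module can get you quite far. A minimal sketch (the variable names are placeholders for the two lists of names you want to match):

import difflib

def best_match(name, candidates, cutoff=0.6):
    """Return the closest candidate name, or None if nothing comes close."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Hypothetical usage: eml_names from the results files, open_state_names
# from the Open State lists
# for name in eml_names:
#     print(name, '->', best_match(name, open_state_names))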

Further, note that the data for the 2017 Lower House election is only available in EML format for some of the municipalities. I suspect this has something to do with the fact that, prior to the election, vulnerabilities had been discovered in the vote-counting software, so the votes had to be counted manually.

Python script

Here’s a Python script that converts EML files to csv. See caveats there.

UPDATE 23 February 2019 - improved version of the script here.

How to use Python and Selenium for scraping election results

A while ago, I needed the results of last year’s Lower House election in the Netherlands, by municipality. Dutch election data is available from the website of the Kiesraad (Electoral Board). However, it doesn’t contain a table of results per municipality. You’ll have to collect this information from almost 400 different web pages. This calls for a webscraper.

The Kiesraad website is partly generated using javascript (I think) and therefore not easy to scrape, which made it seem like a perfect project for exploring Selenium.

What’s Selenium? «Selenium automates browsers. That’s it!» Selenium is primarily a tool for testing web applications. However, as a tutorial by Thiago Marzagão explains, it can also be used for webscraping:

[S]ome websites don’t like to be webscraped. In these cases you may need to disguise your webscraping bot as a human being. Selenium is just the tool for that. Selenium is a webdriver: it takes control of your browser, which then does all the work.

Selenium can be used with Python. Instructions to install Selenium are here. You also have to download chromedriver or another driver; you may store it in /usr/local/bin/.

Once you have everything in place, this is how you launch the driver and load a page:

from selenium import webdriver
 
URL = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315'
 
browser = webdriver.Chrome()
browser.get(URL)

This will open a new browser window. You can use either xpath or css selectors to find elements and then interact with them. For example, find a dropdown menu, identify the options from the menu and select the second one:

XPATH_PROVINCES = '//*[@id="search"]/div/div[1]/div'
# Find the province dropdown, collect its options and click the second one
element = browser.find_element_by_xpath(XPATH_PROVINCES)
options = element.find_elements_by_tag_name('option')
options[1].click()

If you check the page source of the web page, you won’t find the options of the dropdown menu; they’re added afterwards. With Selenium, you needn’t worry about that: it will load the options for you.

Well, actually, there’s a bit more to it: you can’t find and select the options until they’ve actually loaded. Often they won’t be in place right away, so you’ll need to wait a bit and retry.

Selenium comes with functions that specify what it should wait for, and how long it should wait and retry before it throws an error. But this isn’t always straightforward, as Marzagão explains:

Deciding what elements to (explicitly) wait for, with what conditions, and for how long is a trial-and-error process. […] This is often a frustrating process and you’ll need patience. You think that you’ve covered all the possibilities and your code runs for an entire week and you are all happy and celebratory and then on day #8 the damn thing crashes. The servers went down for a millisecond or your Netflix streaming clogged your internet connection or whatnot. It happens.
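
For reference, a built-in explicit wait looks roughly like this (a minimal sketch: the 20-second timeout is arbitrary, and XPATH_PROVINCES is the xpath defined above):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 20 seconds for the dropdown element to be present in the DOM,
# then raise a TimeoutException if it never shows up
wait = WebDriverWait(browser, 20)
element = wait.until(EC.presence_of_element_located((By.XPATH, XPATH_PROVINCES)))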

I ran into pretty similar problems when I tried to scrape the Kiesraad website. I tried many variations of the built-in wait parameters, but without any success. In the end I decided to write a few custom functions for the purpose.

The example below looks up the options of a dropdown menu. As long as the number of options isn’t greater than 1 (the page initially loads with only one option, a dash, and other options are loaded subsequently), it will wait a few seconds and try again - until more options are found or until a maximum number of tries has been reached.

import time

MAX_TRIES = 15

def count_options(xpath, browser):
    """Return the number of options in the dropdown at xpath, retrying until
    more than one option has loaded or MAX_TRIES is reached."""
    count = 0
    time.sleep(3)
    tries = 0
    while tries < MAX_TRIES:
        try:
            element = browser.find_element_by_xpath(xpath)
            count = len(element.find_elements_by_tag_name('option'))
            if count > 1:
                return count
        except Exception:
            # The element isn't there yet; wait and try again
            pass

        time.sleep(1)
        tries += 1
    return count

Here’s a script that will download and save the result pages of all cities for the March 2017 Lower House election, parse the html, and store the results as a csv file. Run it from a subfolder in your project folder.

UPDATE 23 February 2019 - improved version of the script here.

Notes

Dutch election results are provided by the Kiesraad as open data. The Kiesraad website used to provide a csv with the results for all municipalities, but this option is no longer available. Alternatively, datasets can be downloaded for each municipality, but at least for 2017, municipalities use different formats.

Scraping the Kiesraad website appears to be the only way to get uniform data per municipality.

Since I originally wrote the scraper, the Kiesraad website has changed. As a result, the site can now be scraped in a much simpler way, without Selenium: the source code of the landing page for an election contains a dictionary with id numbers for all the municipalities, and with those id numbers you can create the urls for their result pages. No clicking required.
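
As an illustration only (I haven’t rebuilt the scraper this way), the approach would be something like the sketch below. The regular expression and the url template are hypothetical placeholders; the actual patterns depend on how the current site embeds the municipality dictionary.

import json
import re
import requests

# Hypothetical: fetch the election landing page and pull out the embedded
# dictionary of municipality ids (the regex is a placeholder, not the real pattern)
landing = requests.get('https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315')
match = re.search(r'var gemeenten = (\{.*?\});', landing.text, re.DOTALL)
gemeenten = json.loads(match.group(1)) if match else {}

# Build result-page urls from the ids (the url template is also a placeholder)
for name, gemeente_id in gemeenten.items():
    url = 'https://www.verkiezingsuitslagen.nl/verkiezingen/detail/TK20170315/{}'.format(gemeente_id)
    # ...fetch each page with requests and parse the html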

Embedding tweets in Leaflet popups

I just created a map showing where so-called Biro’s (small cars) are parked on the pavement, annoying people. Twitter has quite a few photos of the phenomenon. In some cases, finding their location took a bit of detective work.

First you’ll need the embed code for the tweet. You can get it manually from the Twitter website, but if you want to automate your workflow, use a url like the one below. It’ll download a bit of json containing the embed code:

https://publish.twitter.com/oembed?url=https://twitter.com/nieuwsamsterdam/status/958761072214896640
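
In Python, for example, fetching the embed code could look like this (a minimal sketch using requests; the embed code sits in the html field of the json response):

import requests

OEMBED = 'https://publish.twitter.com/oembed'
tweet_url = 'https://twitter.com/nieuwsamsterdam/status/958761072214896640'

# The oembed endpoint returns json; the embed code is in the 'html' field
response = requests.get(OEMBED, params={'url': tweet_url})
embed_html = response.json()['html']
print(embed_html)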

When trying to embed the tweets in Leaflet popups, I ran into a few problems:

  • When a popup opened, the marker didn’t properly move down; as a result, most of the popup would be off-screen. The problem and how to solve it are described here.
  • Twitter embed code contains a script tag to load a widget. Apparently you can’t execute javascript by adding it directly to the html of the popup content, but you can add it using a selector (cf here).

Here’s the code that’ll solve both problems:

map.on('popupopen', function(e) {
    // Load the Twitter widgets script so the embedded tweet renders
    $.getScript("https://platform.twitter.com/widgets.js");
    // Move the map so the full popup is visible above the marker
    var px = map.project(e.popup._latlng);
    px.y -= e.popup._container.clientHeight;
    map.panTo(map.unproject(px), {animate: true});
});

You may also want to do something about the width of the popups: otherwise they will obscure most of the map on mobile screens, making it difficult to close a popup (which you normally do by clicking outside it). You can change the width of embedded tweets, but this will not change the width of the popup itself. A simple solution is to give popups a maxWidth of 215 (.bindPopup(html, {maxWidth: 215})).

Of course, you could also vary maxWidth depending on screen width, but I think 215px works well on all screens. Further, embedded tweets appear to have a minimum width of about 200px, so if you want popups narrower than 215px you’ll have to figure out a way to fix that.

If you embed tweets, Twitter can track people who visit your webpage. Add <meta name="twitter:dnt" content="on"> to your page and Twitter promises they won’t track your visitors. I wasn’t sure whether this should be put in the web page itself or in the html content of the popups (I opted for both).

If the popups have a somewhat spartan look and do not contain photos: Good for you! You’re probably using something like Firefox with tracking protection enabled. This blocks sites which have been identified as ‘engaging in cross-site tracking of users’ - including, apparently, platform.twitter.com.
