Academic support for Mélenchon, mapped

On Sunday, the first round of the French presidential election will be held. Left-wing candidate Jean-Luc Mélenchon has surged in the polls, and right-wingers have called his programme devastating. On the other hand, over a hundred economists have said he offers a serious and credible alternative to the destructive austerity policies of the past decades.

Given Mélenchon’s criticism of Germany’s economic policy and his support for Greece, one might expect academic support for his programme to be concentrated in the south of Europe. However, the map shows that his academic supporters are also found in countries like the UK and Germany.

Read more about Mélenchon’s programme here and here.

Method

I geocoded the affiliations of the list of supporters using this tool and Bing’s Maps API. Sometimes Bing gets the location of the institution right, sometimes it returns the location of the city where the institution is located, and sometimes it fails altogether. I’ve corrected a few coordinates manually, but I can’t rule out that I missed some errors.
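For illustration, a direct call to the Bing Maps REST Locations API might look like the sketch below (I used a separate geocoding tool, so this is not the exact method; BING_KEY is a placeholder for your own key):

import requests

BING_KEY = 'your-bing-maps-key'  # placeholder

def geocode(affiliation):
    """Return [latitude, longitude] for an affiliation, or None if geocoding fails."""
    url = 'http://dev.virtualearth.net/REST/v1/Locations'
    params = {'query': affiliation, 'key': BING_KEY}
    resources = requests.get(url, params=params).json()['resourceSets'][0]['resources']
    return resources[0]['point']['coordinates'] if resources else None

print(geocode('Sciences Po, Paris'))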


Reject all evidence: How George Orwell’s 1984 went viral last January

On Sunday 22 January 2017, Trump adviser Kellyanne Conway introduced the term alternative facts to justify disputed White House claims about how many people had attended Trump’s inauguration. The term was quickly associated with the newspeak and doublethink of George Orwell’s novel Nineteen Eighty-Four. Sales of the book became ‘hyperactive’ during the following week.

I looked up some 150,000 tweets about Orwell’s ‘1984’ to see how interest in the novel developed during that week (note that analysing tweets is a somewhat messy business - see Method below for caveats).

But first, a basic timeline. On Friday 20 January, the inauguration took place. Afterwards, people started tweeting photos showing empty spots in the audience. On Saturday, the White House claimed the photos were misleading and that the inauguration had drawn the «largest audience to ever witness an inauguration». On Sunday, Conway appeared on NBC’s Meet the Press and defended the White House claim as alternative facts.

Alternative facts

The chart below shows tweets about Orwell’s 1984 and how many of those tweets specifically mention alternative facts. Immediately after Conway’s Meet the Press interview, the first tweets appeared that made the connection between alternative facts and 1984 (the green line in the chart). The real peak occurred on Tuesday, when major media started to discuss the connection.

The alternative facts quote can explain some of the interest in ‘1984’, but there was also a peak in Orwell 1984 tweets even before the interview with Conway took place.

Amazon sales

Meanwhile, sales of the book ‘1984’ on Amazon started to rise. On Sunday, the day of the interview, the book reached the top 20. On Tuesday, the Guardian reported that it had reached number 6, and in the evening of that same day it became the number 1 best-selling book on Amazon.

At some point, people started to discuss the rising book sales on Twitter, as the chart below shows.

Tweets about sales of ‘1984’ didn’t really take off until Tuesday, and largely coincided with talk about the alternative facts quote.

Reject all evidence

That still leaves the question of what the earlier Orwell 1984 tweets were about. Interestingly, almost all of these earlier tweets contain the following quote from ‘1984’, which describes how the authorities redefine truth:

The Party told you to reject all evidence of your eyes and ears. It was their final, most essential command.

The chart below shows tweets containing this quote.

On Saturday evening, the White House had held its press conference at which it claimed a record number of people had attended the inauguration. The first reject all evidence tweet I could find was posted before that press conference, but the quote didn’t catch on until afterwards. Within days, it had been tweeted over 50,000 times.

In short, Conway’s remark on Sunday about alternative facts boosted interest in ‘1984’, but didn’t start it.

Meanwhile, the 1984 tweets probably reflect a broader phenomenon. Various media have discussed how dystopian novels like ‘1984’ are ‘chiming with people’ (get your reading list here).

Method

I used Python and the Tweepy library to search the Twitter API for orwell 1984. This method has limitations. Twitter provides a sample of all tweets, and no-one knows exactly how much is missing from that sample. Further, searching for orwell 1984 may overlook tweets mentioning only orwell or only 1984, or nineteen eighty-four, as in the official book title.
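For those curious, collecting the tweets might have looked roughly like the sketch below. It uses the Tweepy 3.x search method (renamed in later versions); the credentials are placeholders.

import tweepy

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Collect the timestamp and text of each matching tweet
tweets = [(status.created_at, status.text)
          for status in tweepy.Cursor(api.search, q='orwell 1984').items(1000)]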

The search for orwell 1984 yielded some 150,000 tweets. If the text contained both alternative and facts (this includes tweets containing #alternativefacts), I classified the tweet as being about alternative facts; if it contained amazon, sales, bestseller or best-seller, I classified it as being about sales. If it contained reject, evidence and eyes, I classified it as containing the quote «The Party told you to reject all evidence of your eyes and ears. It was their final, most essential command».
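For illustration, here’s a minimal sketch of these keyword rules in Python (the function name and structure are mine, not necessarily how the original script was organised):

def classify(text):
    """Label a tweet according to the keyword rules described above."""
    t = text.lower()
    labels = []
    if 'alternative' in t and 'facts' in t:  # also matches #alternativefacts
        labels.append('alternative facts')
    if any(term in t for term in ('amazon', 'sales', 'bestseller', 'best-seller')):
        labels.append('sales')
    if all(term in t for term in ('reject', 'evidence', 'eyes')):
        labels.append('reject all evidence quote')
    return labels

print(classify('Orwell 1984: reject all evidence of your eyes and ears'))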

I used 9 am as the time at which Meet the Press was aired. For the time of the original White House claim about attendance at the inauguration, I used this recorded live feed, which was announced to start at 4:30 pm; the actual press conference starts after about 1.5 hrs, i.e. around 6 pm.


Python script to import .sps files

In a post about voting locations (in Dutch) I grumbled a bit about inconsistencies in how Statistics Netherlands (CBS) spells the names of municipalities, and about the fact that they don’t include municipality codes in their data exports. This afternoon, someone who works at CBS responded on Twitter. She had asked around and found a workaround: download the data as SPSS. Thanks!

CBS offers the option to download data as an SPSS syntax file (.sps). I wasn’t familiar with this file type, I don’t have SPSS, and I couldn’t immediately find a package to import it. But it turns out that .sps files are just text files, so I wrote a little script that does the job.

Note that it’s not super fast; there may be more efficient ways to do the job. Also, I’ve only tested it on a few CBS data files. I’m not sure it’ll work correctly if all variables have labels or if the file contains not just data but also statistical analysis.

That said, you can find the script here.
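For a flavour of the approach, a stripped-down version might look like the sketch below. It assumes the data sit between BEGIN DATA and END DATA markers, are whitespace-separated, and use latin1 encoding; real CBS files may need more careful parsing.

import pandas as pd

def read_sps(path):
    """Minimal reader for an SPSS syntax file with inline data."""
    with open(path, encoding='latin1') as f:  # encoding is an assumption
        lines = f.read().splitlines()
    start = next(i for i, line in enumerate(lines)
                 if line.strip().upper().startswith('BEGIN DATA')) + 1
    end = next(i for i, line in enumerate(lines)
               if line.strip().upper().startswith('END DATA'))
    return pd.DataFrame([line.split() for line in lines[start:end]])

df = read_sps('data.sps')  # hypothetical filename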


How to automate extracting tables from PDFs, using Tabula

One of my colleagues needs tables extracted from a few hundred PDFs. There’s an excellent tool called Tabula that I frequently use, but you have to process each PDF manually. However, it turns out you can also automate the process. For those like me who didn’t know, here’s how it works.

Command line tool

You can download tabula-java’s jar here (I had no idea what a jar is, but apparently it’s a format to bundle Java files). You also need a recent version of Java. Note that on a Mac, Terminal may still use an old version of Java even if you have a newer version installed. The problem and how to solve it are discussed here.

For this example, create a project folder and store the jar in a subfolder script. Store the PDFs you want to process in a subfolder data/pdf and create an empty subfolder data/csv.

On a Mac, open Terminal, use cd to navigate to your project folder and run the following code (make sure the version number of the tabula jar is correct):

for i in data/pdf/*.pdf; do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o "${i//pdf/csv}" "$i"; done

On Windows, open the command prompt, use cd to navigate to your project folder and run the following code (again, make sure the version number of the tabula jar is correct):

for %i in (data\pdf\*.pdf) do java -jar script/tabula-0.9.2-jar-with-dependencies.jar -n -p all -a 29.75,43.509,819.613,464.472 -o data\csv\%~ni.csv %i

The settings you can use are described here. The examples above use the following settings:

  • -n: stands for nospreadsheet; use this if the tables in the PDF don’t have gridlines.
  • -p all: look for tables in all pages of the document. Alternatively, you can specify specific pages.
  • -a (area): the portion of the page to analyse; the default is the entire page. You can omit this setting, which may be a good idea when the location or size of tables varies. On the other hand, I’ve had a file where tables from one specific page were not extracted unless I set the area variable. The area is defined by coordinates that you can obtain by analysing one PDF manually with the Tabula app and exporting the result not as csv, but as script.
  • -o: the name of the file to write the csv to.

In my experience, you may need to tinker a bit with the settings to get the results right. Even so, Tabula will sometimes get the rows right but incorrectly or inconsistently identify cells within a row. You may be able to solve this using regex.
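As a hypothetical example of such a fix-up: if Tabula merges two values into a single cell, you could split cells on runs of whitespace:

import re

row = ['Amsterdam', '12.3   45.6']  # second cell holds two merged values
fixed = [value for cell in row for value in re.split(r'\s{2,}', cell)]
print(fixed)  # ['Amsterdam', '12.3', '45.6']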

Python (and R)

There’s a Python wrapper, tabula-py, that will turn PDF tables into pandas dataframes. As with tabula-java, you need a recent version of Java. Here’s an example of how you can use tabula-py:

import os

import tabula

folder = 'data/pdf/'
paths = [folder + fn for fn in os.listdir(folder) if fn.endswith('.pdf')]
for path in paths:
    # Extract tables from all pages within the specified area
    df = tabula.read_pdf(path, encoding='latin1', pages='all',
                         area=[29.75, 43.509, 819.613, 464.472],
                         nospreadsheet=True)
    # Write the result to the csv folder, mirroring the pdf filename
    path = path.replace('pdf', 'csv')
    df.to_csv(path, index=False)

Using the Python wrapper, I needed to specify the encoding. I ran into a problem when I tried to extract tables with varying sizes from multi-page PDFs. I think it’s the same problem as reported here. From the response, I gather the problem may be addressed in future versions of tabula-py.

For those who use R, there’s also an R wrapper for tabula, tabulizer. I haven’t tried it myself.

Call tabula-java from Python

[Update 2 May 2017] - I realised there’s another way, which is to call tabula-java from Python. Here’s an example:

import os

pdf_folder = 'data/pdf'
csv_folder = 'data/csv'

# Template for the tabula-java call: the first placeholder is the output
# file, the second the input PDF (-f TSV writes tab-separated values)
base_command = 'java -jar tabula-0.9.2-jar-with-dependencies.jar -n -p all -f TSV -o {} {}'

for filename in os.listdir(pdf_folder):
    pdf_path = os.path.join(pdf_folder, filename)
    csv_path = os.path.join(csv_folder, filename.replace('.pdf', '.csv'))
    command = base_command.format(csv_path, pdf_path)
    os.system(command)  # run tabula-java on each PDF

This solves tabula-py’s problem with multi-page PDFs containing tables of varying sizes.


Trick the trackers with a flood of meaningless data

A couple of years ago, Apple obtained a patent for an intriguing idea: create a fake doppelgänger that shares some characteristics with you, say birth date and hair colour, but has other interests - say basket weaving. A cloning service would visit and interact with websites in your name, messing up the profile that companies like Google and Facebook are keeping of you.

I don’t think anyone has implemented it. But now I read on Mathbabe’s blog about a similar idea that actually has been implemented. It’s called Noiszy, and it is

a free browser plugin that runs in the background on Jane’s computer (or yours!) and creates real-but-meaningless web data – digital «noise». It visits and navigates around websites from within the user’s browser, leaving your misleading digital footprints wherever it goes.

Cool project. However, it has been argued that the organisations that are tracking you can easily filter out the random noise created by Noiszy.


My failed attempt to build an interesting twitter bot

In 2013, I created @dataanddataviz: a Twitter account that retweets tweets about data analysis, charts, maps and programming. Over time, I’ve made a few changes to improve it. And @dataanddataviz did improve, but I’m still not satisfied with it, so I decided to retire it.

There are all sorts of Twitter bots. Often, their success is measured by how many followers they gain or how much interaction they provoke. My aim was different: I wanted to create an account so interesting that I’d want to follow it myself. Which, of course, is a very subjective criterion (not to mention an ambitious one).

Timing

First, a practical matter: it has been suggested that you can detect Twitter bots by the timing of their tweets. The chart below (inspired by this one) shows the timing of posts by @dataanddataviz.

I randomized the time at which @dataanddataviz posts. The median time between tweets is about 100 minutes (I lowered the frequency last January, as shown by the dark green dots). There is no day/night pattern. If tweets were posted manually, you’d expect a day/night pattern.
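One simple way to get such a pattern (a sketch, not necessarily the exact scheme I used) is to draw each interval from an exponential distribution, which is memoryless and so produces no day/night rhythm:

import math
import random
import time

median_seconds = 100 * 60  # median time between tweets: about 100 minutes
# For an exponential distribution, median = ln(2) / rate
delay = random.expovariate(math.log(2) / median_seconds)
time.sleep(delay)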

Selecting tweets

Initially, I collected tweets using search terms such as dataviz, data analysis and open data. From those tweets, I tried selecting the most interesting ones by looking at how often they had been retweeted or liked. @dataanddataviz would retweet some of the more popular recent tweets.

This was not a success. For example, there are quite a few people who tweet conspiracy theories and include references to data and charts as «proof». Sometimes their tweets got quite a few likes and retweets, and @dataanddataviz ended up retweeting some of them. Awkward.

I decided to try a different approach: follow people who I trust, and use their retweets as recommendations. If someone I trust thinks a tweet is interesting enough to retweet, then it may well be interesting enough for @dataanddataviz to retweet.

The people who I follow tweet about topics like data and charts, but sometimes they tweet about other topics too. To make sure tweets were relevant, I added a condition that the text of the tweet contained at least one «mandatory» term (e.g. python, d3, or regex). I also added a condition that none of a series of «banned» terms appeared in the text of the tweet. I used banned terms for two purposes: to filter out tweets about job openings and meetings (hiring, meetup) and to filter out hypes (bigdata, data science).
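In code, the filter might look roughly like this (a sketch; the term lists shown are just the examples mentioned above):

MANDATORY = ('python', 'd3', 'regex')
BANNED = ('hiring', 'meetup', 'bigdata', 'data science')

def is_candidate(text):
    """Check whether a recommended tweet qualifies for retweeting."""
    t = text.lower()
    has_mandatory = any(term in t for term in MANDATORY)
    has_banned = any(term in t for term in BANNED)
    return has_mandatory and not has_banned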

This approach was a considerable improvement, but I still wasn’t happy. Sure, most of @dataanddataviz’ retweets now were relevant and retweets of embarrassing tweets became rare. But too few retweets were really good.

Predict quality?

I tested whether I could predict the quality of tweets. I created an interface that would let me rate tweets that met the criteria described above: retweeted by someone I follow, containing at least one of the mandatory terms and containing none of the banned terms.

The interface shows the text of the tweet and, if applicable, the image included in it, but not the name of the person who originally posted the tweet or of the recommender who had retweeted it. This way, I forced myself to focus on the content of the tweet rather than the person who posted it. Rating consisted of deciding whether I would retweet that tweet.[1]

I rated 1095 tweets that met the basic criteria. Only 130 were good enough for me to want to retweet them. That’s not much.

I looked at whether any characteristics could predict whether a tweet is - in my eyes - good enough to retweet. For example: text length; whether the text contains a url, mention or hashtag; and characteristics of the person who originally posted the tweet, such as account age, favourites count, followers count, friends count, friends / followers ratio, statuses count and listed count. None of these characteristics could differentiate between OK tweets and good tweets.

I also looked at whether specific words were more likely to appear in good tweets - or vice versa. This was the case, but most examples are unlikely to generalise (e.g., good tweets were more likely to contain the word air or the hashtag #demography).

Conclusion

I didn’t succeed in creating a retweet bot I’d want to follow myself. @dataanddataviz’ retweets are generally OK but only occasionally really good.

Also, I couldn’t predict tweet quality. Perhaps it would make a difference if I used a larger sample, or more advanced analytical techniques, but I doubt it. Subjective quality appears to be difficult to predict - which shouldn’t come as a big surprise (in fact, Twitter itself isn’t very good at predicting which tweets I’ll like, judging by their You might like suggestions).

Meanwhile, I found that since November, more of the tweets retweeted by @dataanddataviz have tended to have political content. Retweeting political statements isn’t something I want to delegate to a bot, so that’s another reason to retire @dataanddataviz.


  1. Obviously, what is being measured is a bit complicated. Whether I’d want to retweet a tweet depends not only on its quality, but also on its subject. For example, I’m now less inclined to retweet tweets about R than I was a couple years ago, because I started using Python instead of R.  ↩


Quitting Facebook

Last month, data scientist Vicki Boykis posted an interesting article about the kind of data Facebook collects about you. It’s one of those articles that make you think: I really should delete my Facebook account - and then you don’t.

One could argue that Google search data illustrates how people relate to Facebook. People know Facebook isn’t good for them, but they can’t bring themselves to quit. However, when it’s time for New Year’s resolutions, they start googling how to delete their account.

UPDATE - Vicki Boykis just suggested labelling major news events. In the past, Google Trends had a feature that did just that, but I think they killed it. Of course, you can still do Google or Google News searches for a particular period. As a start, I added two stories that may have contributed to the mid-2014 peak. Let’s see if other people come up with more.

Method

Note that the Google search data is per week, so each data point really refers to the week starting at that date.

I wanted to do a chart like this in December last year, which would perhaps have been a more appropriate moment. However, I didn’t get consistent data out of Google Trends using search terms like quit facebook. The other day, after deleting my own Facebook account, I realised I had probably used the wrong search term. People don’t search for quit facebook but more likely for delete facebook - they’re looking for technical advice on how to delete their account.
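For those who want to reproduce the chart, the unofficial pytrends package can fetch a comparable series (a sketch; this isn’t necessarily how the data behind the chart were retrieved):

from pytrends.request import TrendReq

pytrends = TrendReq(hl='en-US')
pytrends.build_payload(['delete facebook'], timeframe='all')
interest = pytrends.interest_over_time()  # search interest index, 0-100
print(interest.tail())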


New Python package for downloading and analysing street networks

[Image: square mile diagrams around the Stationsplein]

The image above shows square mile diagrams of cyclable routes in the area around the Stationsplein in Amsterdam, the Hague, Rotterdam and Utrecht. I made the maps with OSMnx, a Python package created by Geoff Boeing, a PhD candidate in urban planning at UC Berkeley (via).

Square mile diagrams are a nice gimmick (with practical uses), but they’re just the tip of the iceberg of what OSMnx can do. You can use it to download administrative boundaries (e.g. the outline of Amsterdam) as well as street networks from OpenStreetMap. And you can analyse these networks, for example: assess their density, find out which streets connect separate clusters in the network, or show which parts of the city have long or short blocks (I haven’t tried doing network measure calculations yet).
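To give a flavour, here’s a minimal sketch based on a recent OSMnx release (function and parameter names have changed between versions):

import osmnx as ox

# Administrative boundary of Amsterdam
boundary = ox.geocode_to_gdf('Amsterdam, Netherlands')

# Cyclable street network within roughly half a mile of a point
point = (52.379, 4.900)  # illustrative coordinates near Amsterdam Centraal
G = ox.graph_from_point(point, dist=805, network_type='bike')
fig, ax = ox.plot_graph(G)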

Boeing boasts that his package not only offers functionality that wasn’t (easily) available before, but also that many tasks can be performed with a single line of code. From what I’ve seen so far, it’s true: the package is amazingly easy to use. All in all, I think this is a great tool.

Amsterdam’s most irritating traffic light is at the Middenweg

Amsterdam’s most irritating traffic light is at the crossing of Middenweg and Wembleylaan, according to a poll among cyclists. The Amsterdam branch of cyclists’ organisation Fietsersbond says the top 10 most irritating traffic lights are well-known problem sites.

Comments made by participants in the poll show that cyclists are not just annoyed about long delays; they are also concerned about safety, especially at locations where many (school) children cross the street. Some cyclists nevertheless keep their spirits up: Plenty of time for an espresso there!!

Red and orange dots show locations of irritating traffic lights. If any comments have been submitted, the dot is red. Click on a red dot, or type a few letters below, to see comments about a particular crossing (comments are mostly in Dutch).

Here are the ten most irritating traffic lights:

  1. Middenweg / Wembleylaan
  2. Amstelveenseweg / Zeilstraat
  3. Middenweg / Veeteeltstraat
  4. Rozengracht / Marnixstraat
  5. Meer en Vaart / Cornelis Lelylaan Nz
  6. IJburglaan / Zuiderzeeweg
  7. mr Treublaan / Weesperzijde
  8. Frederiksplein / Westeinde
  9. Nassauplein / Haarlemmerweg
  10. Van Eesterenlaan / Fred Petterbaan

Some are on routes where the city gives priority to car circulation, at the expense of cyclists and pedestrians. However, cyclists say they frequently have to wait at red lights even when the crossing is empty. This could be a result of budget cuts to the maintenance of the systems that detect waiting cyclists.

Quite a few cyclists complained about cars running red lights (perilous!) or blocking the crossing. Further, not everybody is happy with crossings where all cyclists simultaneously get a green light. Such a set-up is nice if you have to make a left turn, for it will spare you having to wait twice, but it may result in chaos.

The Fietsersbond wants traffic lights adjusted to create shorter waiting times for cyclists and pedestrians. Research by DTV Consultants found that adjusting traffic lights is a simple and cheap way to improve the circulation of cyclists, and that it also improves safety.

An analysis of location data from cyclists’ smartphones found that there are traffic lights in Amsterdam where the average time lost exceeds 30 seconds.

Thank you to the Fietsersbond and to Eric Plankeel for their input; and to all cyclists who participated in the poll.


How much delay for cyclists is caused by traffic lights

[Image: road segments near traffic lights]

The other day I posted an article on how much time cyclists lose at traffic lights in Amsterdam. Someone asked whether I could calculate what percentage of the total time lost by cyclists is caused by traffic lights. Keep in mind that delays can be caused by traffic lights, but also by crossings without traffic lights, crowded routes and the road surface.

Here’s an attempt to answer the question, although I must say it’s a bit tricky. Again, I’m using data from the Fietstelweek (Bicycle Counting Week), during which over 40,000 cyclists shared their location data. This time I’m using the data about links (road segments). For each link, the dataset provides the number of observations, the average speed and the relative speed.

With this data, it should be possible to estimate what share of total delays occurs near traffic lights. But what is near? It’s to be expected that the effect of traffic lights is observable at some distance: people slow down while approaching a traffic light, and it takes a while to pick up speed again afterwards. But what threshold should you use to decide which segments are near a traffic light?

One way to address this is to look at the data. I created a large number of subsets of road segments that are within increasing distances from traffic lights, and calculated their average speed. For example, segments that are within 50m from a traffic light have an average speed of about 16 km/h. The larger group of segments that are within 150m have an average speed of about 17 km/h.
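In pandas terms, the calculation might look like the sketch below (the dataframe and column names are illustrative, and the distance from each segment to the nearest traffic light is assumed to have been computed already):

import pandas as pd

segments = pd.read_csv('fietstelweek_links.csv')  # hypothetical filename

# Average speed of segments within increasing distances of a traffic light
for threshold in range(50, 501, 50):
    subset = segments[segments['dist_to_light'] <= threshold]
    print(threshold, round(subset['speed'].mean(), 1))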

Judging by the chart, it appears that the effect of traffic lights diminishes beyond, say, 150m. You could use this as a threshold and then calculate that delays near traffic lights constitute nearly 60% of all delays.

However, there’s a problem. Even if a delay occurs within 150m of a traffic light, the traffic light will not always be the cause of that delay. I tried to deal with this by estimating a net delay, which takes into account how much delay normally occurs when cyclists are not near a traffic light (in fact, I used two methods, which have quite similar outcomes). Using this method, it would appear that over 20% of delays are caused by traffic lights.

Now, I wouldn’t want to make any bold claims based on this: these are estimates based on assumptions and simplifications (in fact, if you think there’s a better way to do this, I’d be interested). That said, I think it’s fair to say that average bicycle speeds appear to be considerably lower near traffic lights, and that traffic lights may plausibly cause a substantial share of delays for cyclists.

UPDATE - I realise that the way I wrote this down sort of implies that you could reduce delay for cyclists by perhaps 20% just by removing traffic lights, but that would of course be a simplification.

Method

I used QGIS to process the Fietstelweek data. I used the clip tool to select only road segments in Amsterdam. I had QGIS calculate the distance of each segment and extract the nodes, which I needed to get the coordinates of the start and end points. Further processing was done with Python.

The dataset contains a relative speed variable (it is capped at 1, which means that it only reflects people cycling slower than normal, not faster). A relative speed of 0.8 would mean that people cycle at 80% of their normal speed. I calculated total delay at segments this way:

number of observations * (1 - relative speed) * distance / speed

You can then calculate the delay at segments near traffic lights as a percentage of the sum of all delays.
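In pandas (continuing the sketch above, with illustrative column names), the delay per segment is the actual travel time minus the time it would have taken at normal speed, summed over all observed cyclists:

# Delay per segment, following the formula above
segments['delay'] = (segments['observations']
                     * (1 - segments['relative_speed'])
                     * segments['distance'] / segments['speed'])

# Share of total delay occurring within 150m of a traffic light
near = segments['dist_to_light'] <= 150
share_near = segments.loc[near, 'delay'].sum() / segments['delay'].sum()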

I tried to get an idea of how much of the delay is actually caused by traffic lights by estimating net delay. For this, I needed net relative speed. I used two methods to estimate this: 1. divide the relative speed of a segment by the median relative speed of all segments that are not near a traffic light; and 2. divide the speed of a segment by the median speed of all segments that are not near a traffic light.

Python code here.
