Data analysis course

Cool, I’ve got my statement of accomplishment for the Data Analysis course. It was a lot of work and I learned a great deal, but there’s still much more to learn, and I might actually re-enroll if they offer this course again.

US Congress’ interest in the world: the role of elections, trade and oil

Number of times countries are mentioned by year

Codeyear offers a course on how to use the API of the Sunlight Foundation to search transcripts of the US Congress. I used this approach to find out how often foreign countries are discussed in Congress. A simple inspection of the total frequencies suggests two conclusions:

  • Interest in foreign countries rose under the ‘Bush doctrine’ and has fallen since the start of the current economic crisis;
  • There are often peaks during odd years. Plausibly, Congress focuses more on domestic issues in even years, when there are elections for Congress.
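The odd-year pattern can be checked by comparing average totals for odd and even years. A minimal sketch in Python, assuming the yearly totals are available as a dictionary; the function name and the numbers below are made up for illustration and are not the real counts:

```python
from statistics import mean


def mean_by_year_parity(totals_by_year):
    """Return (mean total for odd years, mean total for even years)."""
    odd = [n for year, n in totals_by_year.items() if year % 2 == 1]
    even = [n for year, n in totals_by_year.items() if year % 2 == 0]
    return mean(odd), mean(even)


# Hypothetical totals, for illustration only (not the actual data).
example_totals = {2005: 120, 2006: 80, 2007: 130, 2008: 90}
odd_mean, even_mean = mean_by_year_parity(example_totals)
print("odd years:", odd_mean, "even years:", even_mean)
```

A higher mean for odd years would be consistent with the election-year explanation, although it would not prove it.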

Of course, the pattern may be different for individual countries (use the selector under the line graph to see data on individual countries). For example, interest in Afghanistan took off after 9/11; interest in Hong Kong peaked in 1997 (transfer of sovereignty); in Serbia, Kosovo and Albania in 1999 (NATO bombing campaign); and in Tunisia in 2011 (Arab Spring). Interest in Austria seems to take off in 2009, but that’s actually a mistake: in 2009, somebody named Steve Austria joined the US House of Representatives, boosting the number of times the term ‘Austria’ appears in transcripts.
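The Austria problem is a false positive caused by a surname. One way to reduce it, sketched here in Python, is to count whole-word matches and skip those directly preceded by a known first name. The function, its parameters and the sample sentence are hypothetical; real transcripts may write names differently (for example in capitals or with a title), so this is an assumption rather than the method used for the graphs:

```python
import re


def count_country_mentions(text, country, excluded_first_names=()):
    """Count whole-word mentions of `country`, skipping any mention that
    directly follows one of `excluded_first_names` (e.g. 'Steve Austria')."""
    pattern = re.compile(r"\b" + re.escape(country) + r"\b")
    count = 0
    for match in pattern.finditer(text):
        preceding_words = text[:match.start()].split()
        previous = preceding_words[-1] if preceding_words else ""
        if previous in excluded_first_names:
            continue  # looks like a person's name, not the country
        count += 1
    return count


# Made-up sentence for illustration.
sample = "Mr. Steve Austria spoke about trade with Austria and Germany."
print(count_country_mentions(sample, "Austria"))
print(count_country_mentions(sample, "Austria", excluded_first_names=("Steve",)))
```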

Number of times countries are mentioned by population, GDP and trade

The scatterplots illustrate how the total number of times a country has been discussed in Congress over the period 1996-2012 is associated with population size, GDP and the amount of trade between that country and the US (note that the axes use log scales, a feature of D3.js; unfortunately I didn’t manage to get readable values on the x axis). Population, GDP and trade are correlated with each other, so figuring out which of them actually drives US Congress interest in a country remains an interesting challenge.
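Because the scatterplots use log scales, a matching numeric summary is the correlation between the logarithms of the two variables. A minimal sketch in Python using only the standard library; the function name and the example values are invented for illustration:

```python
import math


def log_correlation(xs, ys):
    """Pearson correlation between log10(xs) and log10(ys).

    All values must be positive, since the logarithm of zero or a
    negative number is undefined.
    """
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x = sum(lx) / n
    mean_y = sum(ly) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    var_x = sum((a - mean_x) ** 2 for a in lx)
    var_y = sum((b - mean_y) ** 2 for b in ly)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: each tenfold rise in population goes with a
# tenfold rise in mentions, so the points lie on a line in log-log space.
population = [1e6, 1e7, 1e8]
mentions = [10, 100, 1000]
print(log_correlation(population, mentions))
```

Computing the same figure for GDP and trade would give a first impression of how strongly the three predictors overlap.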

Interest in countries is also related to the presence of natural resources: for countries without oil, the median number of times they were discussed in Congress is 331; for countries with oil it is 900.
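The comparison above is a median per group. A minimal sketch in Python, assuming each country is a record with a mention count and an oil flag; the field names and the five example records are hypothetical and do not reproduce the 331 and 900 reported here:

```python
from statistics import median


def median_mentions_by_oil(countries):
    """Return (median mentions with oil, median mentions without oil)."""
    with_oil = [c["mentions"] for c in countries if c["has_oil"]]
    without_oil = [c["mentions"] for c in countries if not c["has_oil"]]
    return median(with_oil), median(without_oil)


# Made-up records for illustration only.
example_countries = [
    {"name": "A", "mentions": 100, "has_oil": False},
    {"name": "B", "mentions": 300, "has_oil": False},
    {"name": "C", "mentions": 500, "has_oil": True},
    {"name": "D", "mentions": 900, "has_oil": True},
    {"name": "E", "mentions": 1500, "has_oil": True},
]
print(median_mentions_by_oil(example_countries))
```

The median is a sensible choice here because a few heavily discussed countries would pull a mean upwards.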

Of course this is just an exploratory analysis. An analysis at country/year level might yield more specific conclusions. If you want to do your own analysis, download the data here (country/year) and here (country).


I searched transcripts of the US Congress with the Capitol Words API of the Sunlight Foundation, using country names as search terms. Of course, this method isn’t perfect. I had to remove country names that can’t be distinguished from names of US states (Georgia, and Mexico because of New Mexico). Afterwards, I realised that I should also have removed Austria, because of confusion with a representative with that name.
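Filtering out country names that collide with state names can be automated. A minimal sketch in Python, assuming a list of state names is at hand; the function name is hypothetical and the state list below is deliberately partial:

```python
import re

# Partial list for illustration; a real run would need all state names.
US_STATES_SAMPLE = ["Georgia", "New Mexico", "Texas"]


def remove_ambiguous(country_names, state_names):
    """Drop country names that appear as a whole word or phrase
    inside any of the given state names."""
    kept = []
    for country in country_names:
        pattern = re.compile(r"\b" + re.escape(country) + r"\b")
        if any(pattern.search(state) for state in state_names):
            continue  # e.g. 'Mexico' matches 'New Mexico'
        kept.append(country)
    return kept


print(remove_ambiguous(["Georgia", "Mexico", "France"], US_STATES_SAMPLE))
```

A check like this would not catch the Austria case, because that collision is with a person’s name rather than a state.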

Because this is just an exploratory analysis, I took a rather pragmatic approach to selecting background information on countries. For GDP, I used data from the World Bank; data on population, trade (2009) and oil reserves are from Wikipedia. For the scatterplots, I removed countries with incomplete data.
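Dropping countries with incomplete data amounts to keeping only the records in which every required field is present. A minimal sketch in Python; the field names and example records are hypothetical:

```python
REQUIRED_FIELDS = ("population", "gdp", "trade")


def complete_records(countries, required=REQUIRED_FIELDS):
    """Keep only records where every required field is present and not None."""
    return [
        c for c in countries
        if all(c.get(field) is not None for field in required)
    ]


# Made-up records for illustration only.
example_records = [
    {"name": "A", "population": 5_000_000, "gdp": 2.0e10, "trade": 1.5e9},
    {"name": "B", "population": 800_000, "gdp": None, "trade": 3.0e8},
    {"name": "C", "population": 12_000_000, "gdp": 9.0e10},  # trade missing
]
print([c["name"] for c in complete_records(example_records)])
```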



The paste0 function in R is one of the things that we learned about in this week’s videos for the Data Analysis course. I didn’t think much of it at the time, but I was wrong! I just learned that it is statistical computing’s most influential contribution of the 21st century!

Blind followers on Twitter

On 30 September, I posted the last article on Nieuws uit Amsterdam (News from Amsterdam). The website has been inactive since, apart from a message on 28 October formally announcing that the site is no longer active. As expected, the number of new followers of @nieuwsamsterdam on Twitter dropped in October. Intriguingly, it started to rise again after that.

The list of new followers has been compiled from 'You have new followers' emails and may be incomplete.
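Turning a list of notification emails into monthly totals is a simple tally by year and month. A minimal sketch in Python, assuming the date of each email has already been extracted; the function name and the dates below are invented for illustration:

```python
from collections import Counter
from datetime import date


def followers_per_month(email_dates):
    """Count notification emails per (year, month), in chronological order."""
    counts = Counter((d.year, d.month) for d in email_dates)
    return dict(sorted(counts.items()))


# Hypothetical dates, one per 'You have new followers' email.
example_dates = [
    date(2012, 9, 3),
    date(2012, 9, 17),
    date(2012, 10, 5),
    date(2012, 11, 2),
    date(2012, 11, 20),
    date(2012, 11, 28),
]
print(followers_per_month(example_dates))
```

If one email can announce several followers, the tally would have to sum the followers per email instead of counting emails.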