champagne anarchist | armchair activist

Python

Identifying «communists» at the New York Times, by 1955 US Army criteria

A while ago, Open Culture wrote about a 1955 US Army manual entitled How to spot a communist. According to the manual, communists have a preference for long sentences and tend to use expressions like:

integrative thinking, vanguard, comrade, hootenanny, chauvinism, book-burning, syncretistic faith, bourgeois-nationalism, jingoism, colonialism, hooliganism, ruling class, progressive, demagogy, dialectical, witch-hunt, reactionary, exploitation, oppressive, materialist.

What happened in the 1950s is pretty terrible, but that doesn’t mean we can’t have a bit of fun with the manual. I used the New York Times Article Search API to look up which of its writers actually use terms like hootenanny, book-burning and jingoism. The results are summarised below.

Interestingly, many of the users of «communist» terms are either foreign correspondents or art, music and film critics. While it’s possible that people who have an affinity with the arts tend to sympathise with communism, an alternative explanation would be that critics have more freedom than «regular» journalists to use somewhat exotic and expressive terms like the ones the US Army associated with communism.

Also of interest is that one of the current writers on the list is Ross Douthat, the main conservative columnist of the New York Times. In his articles, he uses terms like materialist, oppressive, reactionary, exploitation, vanguard, ruling class, progressive and chauvinism. Surely he wouldn’t be a reformed communist, would he?

Method

The New York Times Article Search API is a great tool, but you have to keep in mind that digitising the archive isn’t an entirely error-free process. For example, sometimes bits of information end up in the lastname field that don’t belong there (e.g. "lastname": "DURANTYMOSCOW"). While it’s possible to correct some of these issues, it’s likely that search results will in some way be incomplete.

To get a manageable dataset, I looked up all articles containing any combination of two terms from the manual. I then calculated a score for each author by simply counting the number of unique terms they have used.
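As a rough illustration of these two steps (not the code from the repository), the sketch below first builds every pair of terms from the manual, which is what would be fed to the search, and then counts the unique terms per author. The author names and the hits list are made up for the example.

```python
from collections import defaultdict
from itertools import combinations

# The twenty terms listed in the manual.
TERMS = [
    "integrative thinking", "vanguard", "comrade", "hootenanny", "chauvinism",
    "book-burning", "syncretistic faith", "bourgeois-nationalism", "jingoism",
    "colonialism", "hooliganism", "ruling class", "progressive", "demagogy",
    "dialectical", "witch-hunt", "reactionary", "exploitation", "oppressive",
    "materialist",
]

# Every unordered combination of two terms: one search per pair.
term_pairs = list(combinations(TERMS, 2))


def unique_term_scores(hits):
    """Return {author: number of distinct terms found in their articles}."""
    terms_by_author = defaultdict(set)
    for author, term in hits:
        terms_by_author[author].add(term)
    return {author: len(terms) for author, terms in terms_by_author.items()}


# Hypothetical search results as (author, term) records.
hits = [
    ("Author A", "vanguard"),
    ("Author A", "jingoism"),
    ("Author A", "vanguard"),  # a repeated term only counts once
    ("Author B", "hootenanny"),
]
scores = unique_term_scores(hits)
```

Using a set per author means that an author who uses the same term in a hundred articles still gets one point for it, which matches the «unique terms» score described above.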

An alternative would have been to correct for the total number of articles per author in the NYT archive. That requires searching by author, and it took me a while to figure out how to do this with the NYT API. It turns out you can search for terms appearing in the byline using ?fq=byline:("firstname middlename lastname"), even though this option isn’t mentioned in the documentation. I’m not entirely sure whether such a search will return articles where the byline/original field is empty.
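A minimal sketch of how such a byline query could be put together in Python. It only builds the URL and does not call the API. The endpoint and the api-key and page parameter names are assumptions based on version 2 of the Article Search API and may differ from what was available when this was written; the key is a placeholder.

```python
from urllib.parse import urlencode

# Assumed endpoint (Article Search API v2); check the current documentation.
BASE_URL = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def byline_query_url(full_name, api_key, page=0):
    """Build a search URL that filters on the byline field."""
    params = {
        # Filter query in the form described above: byline:("first last")
        "fq": 'byline:("{}")'.format(full_name),
        "page": page,
        "api-key": api_key,  # placeholder; parameter name is an assumption
    }
    # urlencode takes care of escaping the colon, quotes and parentheses.
    return BASE_URL + "?" + urlencode(params)


url = byline_query_url("Ross Douthat", "YOUR_API_KEY")
```

Letting urlencode do the escaping avoids hand-written percent-encoding, which is easy to get wrong for names with spaces or accents.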

As you might expect, there’s a correlation between the number of articles per author and the number of unique terms this author has used.

All in all, it would be possible to calculate a relative score, for example number of terms used per 1,000 articles, but this may have unintended consequences. To take an extreme example: an author who has written one article which happened to contain three terms would get a score of 3,000 using this method, whereas an author who has thousands of articles and consistently uses a broad range of terms but not at a rate of three per article would get a (considerably) lower score.
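The arithmetic behind this example can be checked in a few lines. The figures for the second author (15 unique terms in 5,000 articles) are invented purely for illustration.

```python
def terms_per_1000_articles(unique_terms, n_articles):
    """Relative score: unique terms used per 1,000 articles written."""
    return 1000 * unique_terms / n_articles


# One article that happens to contain three terms.
one_off = terms_per_1000_articles(3, 1)

# Hypothetical prolific author: 15 unique terms across 5,000 articles.
prolific = terms_per_1000_articles(15, 5000)
```

Here one_off comes out at 3000.0 and prolific at 3.0, so the one-off author would outrank the prolific one by a factor of a thousand.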

I decided to stick with the absolute number of unique terms per author. This has the disadvantage that authors who have written few articles are unlikely to show up in the analysis, but I’m not sure that this problem can be adequately solved by calculating a relative score.

The Python and R code used to collect and analyse the data is available on GitHub.

Connections between businesses and politics: banks and Shell dominate

Website Follow the Money has analysed the «revolving door» between politics and businesses in the Netherlands, adding that the examples discussed are far from exhaustive. I’ve expanded the list of connections between businesses and politics by checking the resumes of close to 700 politicians – government members and members of parliament – who have been active in Dutch politics after 2001.

The list is headed by Rabobank: 32 politicians have (had) a position there. This score can perhaps partly be explained by the fact that Rabobank is a cooperative of local banks, each with its own advisory board, so many people hold positions there. Number two is Royal Dutch Shell, the largest Dutch company (although, of course, it’s partly British).

From the list, it can be concluded that financial institutions play a central role in the connections between businesses and politics. The phenomenon is not politically neutral: almost three-quarters of the politicians who have (had) positions with the three largest banks are (or have been) affiliated to the conservative parties CDA and VVD.

One of them is former finance minister Gerrit Zalm (VVD). After his political career, he first moved to DSB Bank and then became chairman of the board of ABN Amro (for controversies, see the FTM article as well as this analysis by de Correspondent). Another example is Joop Wijn (CDA), who started at ABN Amro and subsequently served as state secretary and minister at the finance and economic affairs departments. After that, he held a management position at Rabobank, and he is currently on the executive board of ABN Amro.

Financial institutions aside, an interesting case is airline KLM, now part of Air France-KLM, which appears to have played a bit of an emancipatory role. Over the past years, as many as four former KLM stewardesses have obtained a position in national politics: Fransje Roscam Abbing-Bos (VVD, Senate); Gonny van Oudenallen (various parties, Lower House); Ing Yoe Tan (PvdA, Senate) and Kathleen Ferrier (CDA, Lower House).

Method

I’ve created a list of Dutch companies using information from Wikipedia and Elsevier / Bureau van Dijk. I’ve checked these companies against resumes from the (very useful) website Parlement.com. Here’s the Python script I used to download the resumes and to analyse them. The results had to be cleaned up manually. For example, former MP Wijnand Duyvendak, who’s been in charge of the Friends of the Earth Schiphol campaign, should not be counted as having had a position with Schiphol. To be on the safe side, I also didn’t count positions on the pension board or the board of a foundation of a company.
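The matching step might look roughly like the sketch below. This is not the linked script: the function name, the company list and the resume text are invented for illustration. It also shows why manual cleaning is needed, because a plain name match cannot tell a position at a company from a job that merely mentions it.

```python
import re


def companies_in_resume(resume_text, companies):
    """Return the companies whose name occurs as a whole word in the resume."""
    found = []
    for company in companies:
        # Word boundaries prevent matches inside longer words.
        pattern = r"\b" + re.escape(company) + r"\b"
        if re.search(pattern, resume_text, flags=re.IGNORECASE):
            found.append(company)
    return found


# Invented resume fragment, loosely modelled on the Schiphol example above.
resume = (
    "Campaign leader Schiphol at Friends of the Earth; "
    "member of the advisory board of Rabobank"
)
matches = companies_in_resume(resume, ["Rabobank", "Schiphol", "KLM"])
```

Schiphol shows up in matches even though the job was with Friends of the Earth, which is exactly the kind of false positive that has to be removed by hand.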

Ties between politics and business: banks and Shell lead the way

The website Follow the Money has analysed the «revolving door» between politics and business, adding that the examples discussed are only the tip of the iceberg. I’ve extended the list of connections between companies and politics by looking up the resumes of almost 700 politicians who have served in a government or in parliament since 2001.

The list is headed by Rabobank: no fewer than 32 politicians have (had) a position there. This score can perhaps partly be explained by the fact that Rabobank is a cooperative of local banks, each with its own supervisory board, so there are many people with a position at Rabobank. In second place is Shell, the largest Dutch company (it is, of course, partly British).

The list shows that financial institutions lead the way when it comes to ties between business and politics. Conservative parties in particular are involved: almost three-quarters of the politicians who have (had) a position with the three big banks are (or have been) members of CDA or VVD.

One of them is former finance minister Gerrit Zalm (VVD), who moved to DSB Bank after his time as minister and later became chairman of the board of ABN Amro (for controversies, see the FTM article and also this analysis by de Correspondent). Another example is Joop Wijn (CDA), who started at ABN Amro, then served as, among other things, state secretary and minister at the ministries of finance and economic affairs, and afterwards held senior positions at Rabobank and ABN Amro.

Financial institutions are not the only ones on the list. An interesting case is KLM, which appears to have played something of an emancipatory role. In recent years no fewer than four former KLM stewardesses have reached a position in national politics: Fransje Roscam Abbing-Bos (VVD, Senate); Gonny van Oudenallen (various parties, Lower House); Ing Yoe Tan (PvdA, Senate) and Kathleen Ferrier (CDA, Lower House).

Method

I compiled a list of Dutch companies based on data from Wikipedia and Elsevier / Bureau van Dijk. I then looked up how often these companies occur in the resumes of politicians on the (very useful) site Parlement.com. Here is the Python code I used to download and analyse the resumes. The results had to be cleaned up manually. To give an example: former MP Wijnand Duyvendak was campaign leader for Schiphol at Milieudefensie; that job should not be counted as a position at Schiphol. To be on the safe side, I also did not count positions with pension funds and foundations of companies.

Spamming after all? Revisiting the repost ratios of Vox, Upshot and 538

Recently I wrote about people who share their URLs on Twitter, and then post them again, hoping to draw even more people to their site. I said that FiveThirtyEight reposts its URLs on average 0.3 times. I was wrong: it reposts its URLs far more often. And so do voxdotcom and UpshotNYT, who didn’t even make the top 5 in my original analysis. The Upshot reposts its URLs on average as many as 0.8 times.

The reason I underestimated the repost ratios in my original analysis has to do with the fact that tweets tend to contain shortened URLs. http://nyti.ms/1rFwue2 and http://nyti.ms/1iIujpo look like different URLs. However, they point to the same article, so one should be treated as a repost of the other (or perhaps both are a repost of yet another one, who knows). If you don’t take this into account and treat them as different URLs, you’ll underestimate the number of reposts (red bar in the graph).
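To make the effect concrete, here is a small sketch. It assumes the repost ratio is defined as the number of reposts divided by the number of unique URLs, which is one plausible reading of «reposts its URLs on average x times». The resolved article URL and the third tweet are hypothetical.

```python
def repost_ratio(urls, resolve=lambda u: u):
    """Average number of reposts per unique URL.

    `resolve` maps a (possibly shortened) URL to its final destination;
    by default URLs are compared as they appear in the tweets.
    """
    resolved = [resolve(u) for u in urls]
    unique = set(resolved)
    if not unique:
        return 0.0
    return (len(resolved) - len(unique)) / len(unique)


tweeted = [
    "http://nyti.ms/1rFwue2",
    "http://nyti.ms/1iIujpo",
    "http://example.com/other-story",  # hypothetical third tweet
]

# Hypothetical lookup table standing in for a real redirect lookup.
destinations = {
    "http://nyti.ms/1rFwue2": "http://example.com/same-article",
    "http://nyti.ms/1iIujpo": "http://example.com/same-article",
}

naive = repost_ratio(tweeted)
corrected = repost_ratio(tweeted, resolve=lambda u: destinations.get(u, u))
```

Comparing the shortened URLs as they are gives a ratio of 0.0; once both nyti.ms links are mapped to the same article, the ratio becomes 0.5.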

It’s not that I wasn’t aware of this problem when I did the first analysis. I first tried to account for this by looking up the non-shortened URLs, using the Python urllib2 module. It turned out this was very time-consuming, which was a problem since I wanted to look up quite a few URLs. Pragmatically, I decided instead to use the ‘expanded URL’ provided by the Twitter API. This method does yield higher repost ratios for 538 and the Upshot (grey bars in the graph). Still, it doesn’t really solve the problem, because the expanded URL provided by the Twitter API will sometimes be yet another shortened URL. That’s the reason I still underestimated how often people recycle their content on Twitter.

When I realised the ratios I had originally calculated were still rather low given how many reposts there appeared to be in my timeline, I decided to recalculate repost ratios using urllib2 after all. Because this method is so time-consuming, I did this for just three accounts: Vox, 538 and Upshot NYT. This resulted in repost ratios that are substantially higher (light blue bars in the graph). The new Python script is here.
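For readers on Python 3, where urllib2 has been folded into urllib.request, the lookup could be sketched as follows. This is not the linked script. The expand_url name, the cache and the opener parameter (there so the function can be tried without network access) are assumptions made for this example.

```python
import urllib.request

# Results are cached so that each shortened URL is looked up only once.
_cache = {}


def expand_url(url, opener=urllib.request.urlopen, timeout=10):
    """Follow redirects and return the final URL.

    Falls back to the original URL if the request fails.
    """
    if url not in _cache:
        try:
            # urlopen follows HTTP redirects; geturl() gives the end point.
            with opener(url, timeout=timeout) as response:
                _cache[url] = response.geturl()
        except OSError:
            # URLError and socket timeouts are subclasses of OSError.
            _cache[url] = url
    return _cache[url]


# Usage (performs a real HTTP request, so it is left commented out):
# final_url = expand_url("http://nyti.ms/1rFwue2")
```

Each lookup costs a full HTTP round trip, which is why this approach is slow for large numbers of tweets. Caching helps a little when the same shortened URL turns up more than once.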

Note that the ratios are snapshots calculated on a sample of the 200 most recent tweets (that is, about one to two weeks of tweets).

Not ditching R for Python just yet

As a result of the whole controversy over using Python vs R for statistical analysis and graphs, I thought I’d switch to Python. Mostly because I think it’s more practical to use the same language for different tasks, but also because it seems easier to make decent-looking graphs with Python (I’m sure some people will thoroughly disagree). And, of course, because googling for solutions using «Python» as a search term simply works better than searching for «R».

But now Brian Caffo, Roger Peng and Jeff Leek’s Data Science Specialization Course has started on Coursera and they use R. I guess I’ll have to postpone my decision.
