RStats

Kilts or inequality

On 18 September, Scotland may vote for independence. My understanding is that the referendum isn’t necessarily about kilts and haggis, but rather about left-leaning Scots who are fed up with London’s neoliberal policies, which have caused, among other things, a widening gap between the rich and the rest of society. In fact, the Scottish referendum has been called the «world’s first vote on economic inequality».

One way in which inequality manifests itself is geographically. An interesting question is whether income and political power coincide. In some countries, such as Germany and the Netherlands, the seat of government is in a region with a GDP comparable to the rest of the country. More often, governments are located in high-income regions. For example, France’s richest region is Hauts-de-Seine (home to the La Défense business district), followed closely by Paris itself. Both have a GDP per inhabitant almost three times the national figure.

But the widest gap is to be found in the UK. Across Europe, only three out of 1,357 regions have a GDP per inhabitant that is more than three times as high as their national figure. For the Polish boomtown Warsaw, the ratio is just above 3. For the German region of Wolfsburg, where VW has its headquarters, it is 3.4. But the list is headed by the UK, where the «Inner London - West» region has a GDP per inhabitant as much as 5.8 times as high as the national figure.

All in all, Scots who are dissatisfied with the distribution of income in the UK clearly have a point. Should the No camp find itself looking for someone to blame on 19 September, then perhaps Ms. Thatcher might qualify.

Map of all of Europe here.

Method

I used Eurostat data on gross domestic product per inhabitant by NUTS 3 region in 2011. NUTS 3 regions are the smallest regions used by Eurostat, with populations ranging from 150,000 to 800,000; 2011 is the most recent year for which data are available. The map is from EuroGeographics. The R code for the analysis is available here.
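As an illustration, here’s roughly what the ratio calculation could look like in R. This is a minimal sketch rather than the actual script linked above: the file name and the geo and gdp_per_inhabitant column names are assumptions, and a real Eurostat extract would need some reshaping before it looks like this.

```r
# Sketch: ratio of regional to national GDP per inhabitant (column names assumed)
gdp <- read.csv("eurostat_nuts3_gdp_2011.csv", stringsAsFactors = FALSE)  # hypothetical file
gdp$country <- substr(gdp$geo, 1, 2)  # NUTS codes start with the two-letter country code

# Take national figures from the country-level rows (geo code = country code)
national <- gdp[nchar(gdp$geo) == 2, c("geo", "gdp_per_inhabitant")]
names(national) <- c("country", "national_gdp")

# Keep the NUTS 3 rows and compute the ratio per region
gdp <- merge(gdp[nchar(gdp$geo) > 2, ], national, by = "country")
gdp$ratio <- gdp$gdp_per_inhabitant / gdp$national_gdp

head(gdp[order(-gdp$ratio), c("geo", "ratio")], 10)  # regions with the highest ratios
sum(gdp$ratio > 3)                                   # how many exceed three times the national figure
```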

Of course, comparing regional GDP to national GDP is just one way of measuring inequality; other measures may produce somewhat different outcomes. It would be interesting to use wealth rather than income data, but I doubt that wealth data are available for regions.

Identifying «communists» at the New York Times, by 1955 US Army criteria

A while ago, Open Culture wrote about a 1955 US Army manual entitled How to spot a communist. According to the manual, communists have a preference for long sentences and tend to use expressions like:

integrative thinking, vanguard, comrade, hootenanny, chauvinism, book-burning, syncretistic faith, bourgeois-nationalism, jingoism, colonialism, hooliganism, ruling class, progressive, demagogy, dialectical, witch-hunt, reactionary, exploitation, oppressive, materialist.

What happened in the 1950s is pretty terrible, but that doesn’t mean we can’t have a bit of fun with the manual. I used the New York Times Article Search API to look up which of its writers actually use terms like hootenanny, book-burning and jingoism. The results are summarised below.

Interestingly, many of the users of «communist» terms are either foreign correspondents or art, music and film critics. While it’s possible that people who have an affinity with the arts tend to sympathise with communism, an alternative explanation would be that critics have more freedom than «regular» journalists to use somewhat exotic and expressive terms like the ones the US Army associated with communism.

Also of interest is that one of the current writers on the list is Ross Douthat, the main conservative columnist of the New York Times. In his articles, he uses terms like materialist, oppressive, reactionary, exploitation, vanguard, ruling class, progressive and chauvinism. Surely he wouldn’t be a reformed communist - would he?

Method

The New York Times Article Search API is a great tool, but you have to keep in mind that digitising the archive isn’t an entirely error-free process. For example, sometimes bits of information end up in the lastname field that don’t belong there (e.g. "lastname": "DURANTYMOSCOW"). While it’s possible to correct some of these issues, it’s likely that search results will in some way be incomplete.

To get a manageable dataset, I looked up all articles containing any combination of two terms from the manual. I then calculated a score for each author by simply counting the number of unique terms they have used.
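The actual code is on GitHub (see below), but roughly, the search-and-score step could look like the sketch below. The endpoint and response fields (response$docs, byline.original) are assumptions about the Article Search API, pagination and error handling are omitted, and only a handful of the manual’s terms are included.

```r
library(httr)
library(jsonlite)

terms <- c("hootenanny", "book-burning", "jingoism", "vanguard", "ruling class")  # subset of the manual's terms
nyt_key <- Sys.getenv("NYT_API_KEY")  # hypothetical environment variable holding the API key

# Look up articles containing a given pair of terms and return the bylines
search_pair <- function(term1, term2) {
  res <- GET("https://api.nytimes.com/svc/search/v2/articlesearch.json",
             query = list(q = paste0('"', term1, '" "', term2, '"'),
                          `api-key` = nyt_key))
  docs <- fromJSON(content(res, as = "text", encoding = "UTF-8"), flatten = TRUE)$response$docs
  if (length(docs) == 0) return(NULL)
  data.frame(author = docs$byline.original, term1 = term1, term2 = term2,
             stringsAsFactors = FALSE)
}

pairs <- combn(terms, 2)  # all combinations of two terms
hits <- do.call(rbind, lapply(seq_len(ncol(pairs)), function(i) {
  Sys.sleep(6)  # stay well within the API's rate limit
  search_pair(pairs[1, i], pairs[2, i])
}))

# Score per author: number of unique terms they have used
long <- rbind(hits[, c("author", "term1")],
              setNames(hits[, c("author", "term2")], c("author", "term1")))
names(long) <- c("author", "term")
scores <- aggregate(term ~ author, data = unique(long), FUN = length)
names(scores)[2] <- "unique_terms"
scores[order(-scores$unique_terms), ]
```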

An alternative would have been to correct for the total number of articles per author in the NYT archive. It took me a while to figure out how to search by author using the NYT API. It turns out you can search for terms appearing in the byline using ?fq=byline:("firstname middlename lastname") - even though this option isn’t mentioned in the documentation. I’m not entirely sure such a search will return articles where the byline/original field is empty.
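Reusing the setup from the previous sketch, counting an author’s total number of articles via the byline filter might look like this; where the API reports the total number of matches (response$meta$hits) is an assumption.

```r
# Sketch: total number of NYT articles per author, using the byline filter described above
count_articles <- function(name) {
  res <- GET("https://api.nytimes.com/svc/search/v2/articlesearch.json",
             query = list(fq = sprintf('byline:("%s")', name),
                          `api-key` = nyt_key))
  fromJSON(content(res, as = "text", encoding = "UTF-8"))$response$meta$hits
}

count_articles("Ross Douthat")  # total articles carrying this byline
```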

As you might expect, there’s a correlation between the number of articles per author and the number of unique terms this author has used.

In principle, it would be possible to calculate a relative score, for example the number of terms used per 1,000 articles, but this may have unintended consequences. To take an extreme example: an author who has written one article that happened to contain three terms would get a score of 3,000, whereas an author with thousands of articles who consistently uses a broad range of terms, but not at a rate of three per article, would get a (considerably) lower score.
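To make the arithmetic concrete (the numbers are made up):

```r
# Toy illustration of why a per-1,000-articles score can mislead
relative_score <- function(unique_terms, articles) 1000 * unique_terms / articles

relative_score(3, 1)      # 3000: a single article that happens to contain three terms
relative_score(15, 4000)  # 3.75: a prolific author who uses many different terms
```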

I decided to stick with the absolute number of unique terms per author. This has the disadvantage that authors who have written few articles are unlikely to show up in the analysis, but I’m not sure that this problem can be adequately solved by calculating a relative score.

The Python and R code used to collect and analyse the data is available on GitHub.

King’s Day associations lose tax exempt status

Don’t ask me why, but Oranjeverenigingen (Orange Associations, most of which focus on organising festivities on King’s Day) seem to be struggling with the new transparency rules of the tax authority.

Recently, new rules have been introduced for organisations that want to receive tax-exempt donations. Among other things, they must have a website and publish the compensation their board members receive. As a consequence of these new rules, over two thousand organisations have had their «anbi status» withdrawn, broadcaster NOS reported.

The tax authority has published a dataset on organisations that have, or used to have, the anbi status. It appears that Oranjeverenigingen in particular have been affected: the anbi status was withdrawn for 6% of all organisations, but for 75% of organisations with «oranje» in their name. Obviously, it’s a bit risky to draw conclusions from this as long as the explanation of the phenomenon is unclear.

Method

Data from the tax authority are here, and here’s the R script I analysed the data with. I also checked this for other terms that occur frequently (organisations with the Dutch word for «first aid», «christian», «jehova», «education», «amsterdam», «third world aid shop» or «museum» in their names), but they don’t show the same pattern.
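A minimal sketch of what that comparison could look like in R, not the published script: the file format and the column names (naam for the organisation’s name, einddatum for the withdrawal date) are assumptions, and the search terms are the Dutch equivalents of the terms listed above.

```r
# Sketch: share of organisations that lost the anbi status, overall and by name pattern
anbi <- read.csv("anbi.csv", stringsAsFactors = FALSE)  # hypothetical export of the register

withdrawn <- !is.na(anbi$einddatum) & anbi$einddatum != ""  # status withdrawn if an end date is filled in

share_withdrawn <- function(pattern) {
  sel <- grepl(pattern, anbi$naam, ignore.case = TRUE)
  mean(withdrawn[sel])
}

mean(withdrawn)            # share of all organisations that lost the status (reported: 6%)
share_withdrawn("oranje")  # share among organisations with 'oranje' in their name (reported: 75%)

# The same check for other frequently occurring terms
sapply(c("ehbo", "christelijk", "jehova", "onderwijs", "amsterdam",
         "wereldwinkel", "museum"), share_withdrawn)
```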


Not ditching R for Python just yet

As a result of the whole controversy over using Python vs R for statistical analysis and graphs, I thought I’d switch to Python. Mostly because I think it’s more practical to use the same language for different tasks, but also because it seems easier to make decent-looking graphs with Python (I’m sure some people will thoroughly disagree). And, of course, because googling for solutions using «Python» as a search term simply works better than searching for «R».

But now Brian Caffo, Roger Peng and Jeff Leek’s Data Science Specialization Course has started on Coursera and they use R. I guess I’ll have to postpone my decision.
