Salonanarchist | Leunstoelactivist

Counting unofficial retweets

One way of finding out who’s influential on Twitter is to count how often people are retweeted. I did so when analysing the Twitter discussion on the election of the new president of the Dutch trade union confederation FNV.

I counted both ‘official’ retweets – retweets acknowledged by Twitter – and ‘unofficial’ retweets. Unofficial retweets may have been generated by unofficial Twitter apps (I think) or users may have typed them manually. They may have the pattern RT@username:text (which is also the pattern of official retweets), the pattern "@username:text", or the pattern text via @username (this pattern wasn’t in my original analysis). Perhaps there are more flavours around that I don’t know of.

When looking for background information, I came across a comment by an SEO analyst explaining why they don’t count unofficial retweets:

To try to count non-official RTs is a messy business, as it would require a lot more Twitter API calls for possibly negligible benefit. Why negligible? We make an assumption that non-official RTs correlate strongly with official RTs. We can then use the latter as a proxy for the former. This assumption may not be true, of course. That is, by not using non-official RTs, we may ignore pockets of users who generate many more unofficial RTs... perhaps those who ask a question, or invite a response? (comment by Pete Bray on this article)

Below is some information on how often users in my FNV sample were retweeted within that same sample.

Prevalence of types of retweets
       Official retweet   RT@username:text   "@username:text"   text via @username
Sum    3,544              113                98                 60

At least within this sample, unofficial retweets are not very common: they make up about seven percent of all retweets. And here’s some information on how official and unofficial retweets are correlated:
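The seven percent figure follows from the sums in the table, taking the counts as 3,544, 113, 98 and 60. A quick check in Python (the variable names are only illustrative and not taken from the original analysis code):

```python
# Retweet counts from the table above
official = 3544
unofficial = {
    "RT@username:text": 113,
    '"@username:text"': 98,
    "text via @username": 60,
}

unofficial_total = sum(unofficial.values())
share = unofficial_total / (official + unofficial_total)

# 271 unofficial retweets out of 3,815 retweets in total
print(f"{unofficial_total} unofficial retweets, {share:.1%} of all retweets")
```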

Correlations between types of retweets (Spearman)
                     Official retweet   RT@username:text   "@username:text"   text via @username
Official retweet     1                  0.28               0.25               0.13
RT@username:text     0.28               1                  0.14               0.10
"@username:text"     0.25               0.14               1                  0.13
text via @username   0.13               0.10               0.13               1

Users who generate more official retweets also tend to generate more unofficial retweets, but the correlation is not particularly strong. So based on this sample, it would seem conceivable that there are indeed ‘pockets of users who generate many more unofficial RTs’ – as suggested by Bray.

Method

The sample contains close to 11,000 tweets containing the string FNV, collected between 26 April and 16 May. For background see this article; the analysis of retweets in the FNV debate is here. The code I used for the analysis above is here.

If you have a sample of tweets and you want to know how often users in that sample have been retweeted, you can only find that out for retweets that are also in the same sample. In my case that wasn’t a problem, since I was interested in who was influential within a specific discussion. However, if you were interested in constructing a general measure of how influential Twitter users are, you’d probably need a pretty large sample of tweets.

The messiest type of retweet is probably text via @username. Often these aren’t real retweets but added by services like sharethis or AddThis or by news websites that have their own share service (I only included users if they were already in the sample, i.e. had tweeted texts containing FNV; this eliminates sharethis and AddThis tweets). I looked for the pattern via @ followed by any number of non-whitespace characters at the end of the line, or followed by any number of non-whitespace characters before the first whitespace. This method may not be 100% accurate, but I think it’ll do. The regex patterns used to find the different types of retweets are in the code.
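To give an idea of what such patterns could look like, here is a minimal sketch in Python. These are not the regex patterns from the original code: the three expressions, the function name `count_unofficial_retweets` and the example tweets are all assumptions made for illustration. Usernames are matched with `\w+`, which is a simplification.

```python
import re
from collections import Counter

# Hypothetical patterns, one per flavour of unofficial retweet
PATTERNS = {
    "rt": re.compile(r"\bRT\s*@(\w+)"),        # RT@username:text
    "quote": re.compile(r'["“]@(\w+)\s*:'),    # "@username:text"
    "via": re.compile(r"\bvia @(\w+)"),        # text via @username
}


def count_unofficial_retweets(tweets):
    """Count, per pattern, how often each in-sample user is credited.

    `tweets` is a list of (author, text) pairs. Users who did not
    tweet themselves are ignored, which drops share services such
    as sharethis.
    """
    authors = {author.lower() for author, _ in tweets}
    counts = {name: Counter() for name in PATTERNS}
    for _, text in tweets:
        for name, pattern in PATTERNS.items():
            for user in pattern.findall(text):
                user = user.lower()
                if user in authors:
                    counts[name][user] += 1
    return counts


# Made-up example tweets
tweets = [
    ("anna", "FNV kiest nieuwe voorzitter"),
    ("bob", "RT @anna: FNV kiest nieuwe voorzitter"),
    ("carl", '"@anna: FNV kiest nieuwe voorzitter" eens'),
    ("dora", "FNV kiest nieuwe voorzitter via @anna"),
    ("eve", "Nieuws over FNV via @sharethis"),
]
counts = count_unofficial_retweets(tweets)
for name, counter in counts.items():
    print(name, dict(counter))
```

Note that text matching alone cannot tell an official retweet from a manually typed RT@username:text; for that you would need the tweet metadata. A tweet that combines several patterns is counted once for each of them.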

Because the retweet counts are not normally distributed (many have a value of 0), I used Spearman’s rank correlation; Pearson’s correlation would have yielded stronger, but still not particularly strong, correlations of up to 0.5.
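The difference between the two measures can be illustrated with a small pure-Python sketch. The original analysis was done in R; the functions below are only an illustration of the technique (Spearman’s correlation computed as Pearson’s correlation on ranks, with tied values receiving their average rank), and the example counts are made up.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up retweet counts for six users: many zeros, one heavy hitter
official = [0, 0, 0, 1, 2, 40]
unofficial = [0, 1, 0, 0, 1, 5]
print("Pearson: ", round(pearson(official, unofficial), 2))
print("Spearman:", round(spearman(official, unofficial), 2))
```

With skewed counts like these, a single heavily retweeted user can dominate Pearson’s correlation, whereas the rank-based measure only looks at the ordering of users.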


Key players in the Twitter discussion on the FNV president

I’ve visualised the key players in the Twitter debate on the election of the new president of the Dutch trade union confederation FNV. Since the graph doesn’t fit in this column, I’ve created a separate page for it. Best viewed in Chrome.


Can Twitter predict the new Dutch trade union president?

Number of tweets in which candidates are mentioned


According to an American study, you can predict the outcome of elections by simply counting how often the names of the candidates are mentioned on Twitter. Members of the Dutch union confederation FNV are currently voting for their new president (it has been claimed this is the first time in the world union members get to directly elect their confederation president). Would it be possible to predict who will be the new FNV president using Twitter?

Since last Friday, I’ve been collecting the tweets containing the term ‘FNV’; so far, there are over 2,500. In those tweets, the incumbent Ton Heerts is mentioned 204 times, whereas his challenger Corrie van Brenk is mentioned 146 times. In short, if Twitter is a good predictor (which of course is a matter for debate), the contest is tighter than one might have expected.
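Counting mentions along these lines could be sketched as follows. This is an assumption about how such a count might be done, not the code from the analysis: the matching rules (whole-word, case-insensitive matches on the surname, restricted to tweets containing ‘fnv’) and the function names are hypothetical.

```python
import re

# Hypothetical matching rules: surname as a whole word, case-insensitive
CANDIDATES = {
    "Heerts": re.compile(r"\bheerts\b", re.IGNORECASE),
    "Van Brenk": re.compile(r"\bvan brenk\b", re.IGNORECASE),
}


def count_mentions(tweets):
    """Count the tweets containing 'fnv' in which each candidate is named."""
    counts = {name: 0 for name in CANDIDATES}
    for text in tweets:
        if "fnv" not in text.lower():
            continue
        for name, pattern in CANDIDATES.items():
            if pattern.search(text):
                counts[name] += 1
    return counts


def tweet_share(counts):
    """Each candidate's share of all candidate mentions."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


# Shares based on the counts reported above (204 versus 146 mentions)
share = tweet_share({"Heerts": 204, "Van Brenk": 146})
for name, value in share.items():
    print(f"{name}: {value:.1%}")
```

With the counts reported above, Heerts gets about 58% of the mentions and Van Brenk about 42%. Rules like these would miss hashtags such as #vanbrenk and first-name references, so they are a lower bound at best.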

The graph above shows the results for the days for which complete data is available. On Saturday, Van Brenk got some attention because something she had said had been fact checked (and found to be correct). On Sunday, Heerts was mentioned because he appeared on a TV show hosted by Eva Jinek. On 1 May, it was officially announced who the candidates are and they had a debate.

Update - Updated to include 13 May, the final voting day. In sum, Van Brenk was mentioned 497 times and Heerts 631. It has since been announced that Heerts has won the election (of course, this doesn’t necessarily mean that the method is sound; in order to make such claims one would need to evaluate a fair number of predictions).
Influences reflected in the graph include: Factcheck confirms Van Brenk statement (27 April); Heerts in Eva Jinek TV show (28 April); candidates officially announced (1 May); debate in Buitenhof TV show (5 May); problems at tax authorities that Van Brenk’s Abvakabo FNV had warned about (6 May); Van Brenk interview at Nu.nl (9 May); Van Brenk in radio show (10 May); Heerts at presentation of initiative to train technical staff (13 May); EenVandaag TV show poll predicts Heerts will win (13 May).
The graph may not be visible in older versions of Internet Explorer.

Method

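I collected tweets using the Twitter Streaming API, filtering on a search term (strictly speaking this is not the full ‘firehose’, which Twitter only offers to selected partners), in the way described here. I prepared the data using Python and analysed it using R (find the code on Github). The graph was created with D3.js.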
I looked into how influential twitterers are (how many followers, how often listed) and into their backgrounds (e.g., do they mention ‘fnv’ in their profile). The most important finding is that twitterers who mention Van Brenk, more often mention ‘abva’ or ‘akf’ in their profile - not surprising since Van Brenk is currently president of Abvakabo FNV, the public sector union affiliated to the FNV.
The American study on Twitter as a predictor of election outcomes was done by DiGrazia et al. and can be found here. Some remarks on their study:

  • Yes, twitterers are only a small part of the population and no, they’re not representative of the entire population. Likely, Twitter is dominated by a small, active in-crowd. It’s also correct that tweets mentioning a candidate need not endorse them; they may just as well be critical. Despite all this, DiGrazia et al. found that mentions on Twitter consistently predict election outcomes. Perhaps they are an indicator of something else - e.g. media attention or how actively people are campaigning for a candidate.
  • Of course, this method doesn’t provide any certainty on who will win. It’s possible for a candidate to get almost 100% of the tweet share and still lose (at least, that’s what the scatterplots of DiGrazia et al. suggest).
  • It’s unclear to what extent the conclusions of the American study can be generalised to other situations. It’s therefore a bit of a gamble to use this method to predict who will be the next president of the FNV.

Comments

Submitted by Karissa McKelvey on

I am second author on the study, and I wanted to clarify - we only looked at names, such as "John Boehner," and did not also restrict to other strings like "FNV" in your case. The more parameters you add, it is possible you are eliminating larger portions of the sample.

Submitted by DIRKMJK on

Thanks for clarifying, Karissa. It’s a bit of a puzzle, how to include messages like ‘we want ton’ (a slogan used by Ton Heerts supporters) yet exclude all irrelevant tweets containing the string ‘ton’ (e.g., retweets of ‘@transportonline’). So I guess you’re right, filtering by ‘fnv’ is practical but not necessarily the optimal approach. Incidentally, I know you used a huge sample of tweets collected over a much longer period; I was wondering what the range was of the number of times candidates were mentioned in your study?

German coalition parties hardly ask any questions

Nice: der Spiegel has launched a data blog, Datenlese. One of the first posts analyses questions asked by members of the lower house, the Bundestag. I thought it might be interesting to compare these findings to the Dutch situation. Unfortunately, der Spiegel doesn’t appear to publish the actual dataset they use in their analysis (unlike, for example, the Guardian Data Blog, which usually provides a spreadsheet with all the relevant data). [Update: the author kindly provides a link to the dataset here]

However, the Bundestag does publish statistics of parliamentary initiatives, as does the Dutch Tweede Kamer. A few conclusions:

  • The Bundestag asks about 75 questions per month. The Tweede Kamer asks more than three times as many, even though the Bundestag has four times as many members.
  • Written questions are primarily a tool for the opposition, but more so in Germany than in the Netherlands. In Germany, only 1% of questions are asked by members of coalition parties; in the Netherlands, the figure is 17% (or even 33% if former quasi-coalition party PVV is included).
  • In Germany, most questions are asked by the left-wing party die Linke and by the green party. Far fewer questions are asked by the social-democrats. A spokesperson told der Spiegel that the party knows from its experience as a former government party that questions ‘can paralyse the entire apparatus’. In the Netherlands, the social-democrats asked the largest number of questions in 2011. This hasn’t always been the case: when the social-democrats were still in government, they asked fewer questions and the left-wing SP headed the list.
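The first point implies a large difference in questions per member. A rough calculation, using only the rounded ratios mentioned above (so the results are lower bounds, not exact figures):

```python
# Rounded figures from the list above
bundestag_questions_per_month = 75
question_ratio = 3   # Tweede Kamer asks "more than three times as many" questions
member_ratio = 4     # Bundestag has "four times as many members"

# Lower bound for the monthly number of questions in the Tweede Kamer
tweede_kamer_questions = bundestag_questions_per_month * question_ratio

# Per member, the difference is the product of both ratios
per_member_ratio = question_ratio * member_ratio

print(f"Tweede Kamer: more than {tweede_kamer_questions} questions per month")
print(f"Per member: more than {per_member_ratio} times as many as in the Bundestag")
```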

Data

Statistics of the current session of the Bundestag can be found here (pdf). Apparently, there is a distinction between ‘small’ and ‘large’ questions, the latter resulting in a debate. The number of large questions is very small; like der Spiegel, I focused on the small questions. The Tweede Kamer is quite a bit slower than the Bundestag in publishing its statistics; I used the figures for 2011 published here.


US Congress’ interest in the world: the role of elections, trade and oil

Graphs may not be visible in older versions of Internet Explorer.

Number of times countries are mentioned by year



Codeyear offers a course on how to use the API of the Sunlight Foundation to search transcripts of the US Congress. I used this approach to find out how often foreign countries are discussed in Congress. A simple inspection of the total frequencies suggests two conclusions:

  • Interest in foreign countries rose under the ‘Bush doctrine’ and has fallen since the start of the current economic crisis;
  • There are often peaks during odd years. Plausibly, Congress focuses more on domestic issues in even years, when there are elections for Congress.

Of course, the pattern may be different for individual countries (use the selector under the line graph to see data on individual countries). For example, interest in Afghanistan took off after 9/11; for Hong Kong it peaked in 1997 (transfer of sovereignty); for Serbia, Kosovo and Albania in 1999 (NATO bombing campaign); for Tunisia in 2011 (Arab Spring) and interest in Austria took off in 2009 (well that’s actually a mistake: in 2009, somebody named Steve Austria joined the US House of Representatives, boosting the number of times the term ‘Austria’ appears in transcripts).

Number of times countries are mentioned by population, GDP and trade

The scatterplots illustrate how the total number of times a country has been discussed in Congress over the period 1996-2012 is associated with population size, GDP and the amount of trade between that country and the US (note that the scales are log scales, a feature of D3.js; unfortunately I didn’t manage to get readable values on the x axis). Population, GDP and trade are correlated, so figuring out what exactly drives US Congress interest in a country remains an interesting challenge.

Interest in countries is also related to the presence of natural resources: for countries without oil, the median number of times they were discussed in Congress is 331; for countries with oil it is 900.
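A comparison of medians like this one can be sketched in a few lines of Python. The rows below are invented purely for illustration; they are not the country data used in the analysis, and the function name is hypothetical.

```python
from statistics import median

# Invented example rows: (country, mentions in Congress, has oil)
rows = [
    ("Country A", 50, False),
    ("Country B", 300, False),
    ("Country C", 4000, False),
    ("Country D", 200, True),
    ("Country E", 1000, True),
    ("Country F", 20000, True),
]


def median_by_group(rows):
    """Median number of mentions for countries with and without oil."""
    groups = {}
    for _, mentions, has_oil in rows:
        groups.setdefault(has_oil, []).append(mentions)
    return {has_oil: median(values) for has_oil, values in groups.items()}


result = median_by_group(rows)
print("Without oil:", result[False])
print("With oil:   ", result[True])
```

The median is a sensible choice here because a few countries are mentioned far more often than the rest, which would inflate the mean.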

Of course this is just an exploratory analysis. An analysis at country/year level might yield more specific conclusions. If you want to do your own analysis, download the data here (country/year) and here (country).

Method

I searched transcripts of the US Congress using the Capitol Words API of the Sunlight Foundation, using country names as search terms. Of course, this method isn’t perfect. I had to remove country names that can’t be distinguished from names of US states (Georgia, Mexico). Afterwards, I realised that I should also have removed Austria, because of confusion with a representative with that name.

Because this is just an exploratory analysis, I took a rather pragmatic approach to selecting background information on countries. For GDP, I used data from the World Bank; data on population, trade (2009) and oil reserves are from Wikipedia. For the scatterplots, I removed countries with incomplete data.


Paste0

Paste0 in R is one of the things that we learned about in this week’s videos for the Data Analysis course. I didn’t think much of it at the time, but I was wrong! I just learned about statistical computing’s most influential contribution of the 21st century!


Blind followers on Twitter




On 30 September, I posted the last article on Nieuws uit Amsterdam (News from Amsterdam). The website has been inactive since, apart from a message on 28 October formally announcing that the site is no longer active. As expected, the number of new followers of @nieuwsamsterdam on Twitter dropped in October. Intriguingly, it started to rise again after that.

The list of new followers has been compiled from ‘You have new followers’ emails and may be incomplete. Graph may not work in older versions of Internet Explorer.

‘Trade unions should take a much tougher stance’

Dutch trade unions have a reputation for constructive dialogue, but that’s not necessarily what people expect of them. In the LISS Political Values study, some 6,000 panel members have been asked a number of times whether they agree with the statement ‘Trade unions should take a much tougher political stance, if they wish to promote the workers’ interests’. In the latest edition of the study, those who agree with this statement outnumber those who disagree by 2.6 to 1. This support for tougher unions holds for most subgroups (but not the self-employed and people earning more than 4,500 euros per month).

Support for tougher unions over time

Percentage of respondents who agree or disagree with the statement ‘Trade unions should take a much tougher political stance, if they wish to promote the workers’ interests’. Graph may not work with older versions of Internet Explorer. Source LISS, graph dirkmjk.


Support for tougher unions, by group


Values higher than 1 mean that within that group, those in favour of tougher unions outnumber those who disagree. For example, among people with paid employment, the number of respondents in favour of tougher unions is 3.5 times as high as the number who disagree. Hover mouse over bar to see percentages. Graph may not work with older versions of Internet Explorer. Source LISS, results for December 2011, graph dirkmjk.
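The ratio is simply the percentage who agree divided by the percentage who disagree. A minimal sketch, with made-up percentages chosen only to reproduce a ratio of 3.5 (they are not the LISS figures, and the function name is hypothetical):

```python
def support_ratio(pct_agree, pct_disagree):
    """Number of respondents who agree for every respondent who disagrees."""
    return pct_agree / pct_disagree


# Hypothetical example: 70% agree, 20% disagree, the rest are neutral
ratio = support_ratio(70, 20)
print(f"Those in favour outnumber those against by {ratio:.1f} to 1")
```

Respondents who neither agree nor disagree do not affect the ratio.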
