champagne anarchist | armchair activist

Data

Why make us use Tableau

I’ve almost finished the course Introduction to Data Science at Coursera and it’s a great course - especially the assignments. That said, I’m slightly disappointed they made us use Tableau for the final visualisation assignment. What’s wrong with Tableau:

  • It’s not open source;
  • It doesn’t run on a Mac;
  • You have to register and provide all kinds of personal details that are none of their business.

For those who are not familiar with it: Tableau is a tool to create (interactive) maps, graphs and other visualisations. It’s practical. The visualisations it produces don’t look really bad and they don’t look really good either. I wouldn’t mind using it for quick and dirty data exploration - the same way you sometimes do some quick Excel graphs to explore your data.

That is, if it weren’t for the combined drawbacks listed above. Then again, I’m probably not in their target market, given their focus on business intelligence.

Counting unofficial retweets

One way of finding out who’s influential on Twitter is to count how often people are retweeted. I did so when analysing the Twitter discussion on the election of the new president of the Dutch trade union confederation FNV.

I counted both ‘official’ retweets – retweets acknowledged by Twitter – and ‘unofficial’ retweets. Unofficial retweets may have been generated by unofficial Twitter apps (I think) or users may have typed them manually. They may have the pattern RT@username:text (which is also the pattern of official retweets), the pattern "@username:text", or the pattern text via @username (this pattern wasn’t in my original analysis). Perhaps there are more flavours around that I don’t know of.
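To make the three patterns concrete, here is a minimal Python sketch that labels a tweet text by the retweet flavour it matches. The regexes and the function name `classify_unofficial_retweet` are assumptions made for illustration, not the patterns from the original analysis: the sketch allows optional whitespace after RT and captures only word characters as the username. Since official retweets share the RT@username:text form, the text alone cannot tell those two apart.

```python
import re

# Assumed patterns for illustration only; not taken from the original code.
RT_PREFIX = re.compile(r'^RT\s*@(\w+)\s*:')   # RT@username:text
QUOTED = re.compile(r'^["\u201c]@(\w+)\s*:')  # "@username:text" (straight or curly quote)
VIA = re.compile(r'\bvia @(\w+)')             # text via @username


def classify_unofficial_retweet(text):
    """Return (kind, username) for the first pattern that matches, else None."""
    text = text.strip()
    for kind, pattern in (("rt", RT_PREFIX), ("quoted", QUOTED), ("via", VIA)):
        match = pattern.search(text)
        if match:
            # Twitter usernames are case-insensitive, so normalise to lower case.
            return kind, match.group(1).lower()
    return None


for example in ('RT @Alice: FNV kiest voorzitter',
                '"@bob: FNV debat" eens',
                'Interview FNV via @Carol',
                'FNV nieuws'):
    print(example, '->', classify_unofficial_retweet(example))
```

The order of the checks matters: a text that starts with RT and also ends in via @username is counted as the RT flavour here, which is a choice of this sketch rather than something the analysis specifies.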

When looking for background information, I came across a comment by an SEO analyst explaining why they don’t count unofficial retweets:

To try to count non-official RTs is a messy business, as it would require a lot more Twitter API calls for possibly negligible benefit. Why negligible? We make an assumption that non-official RTs correlate strongly with official RTs. We can then use the latter as a proxy for the former. This assumption may not be true, of course. That is, by not using non-official RTs, we may ignore pockets of users who generate many more unofficial RTs... perhaps those who ask a question, or invite a response? (comment by Pete Bray on this article)

Below is some information on how often users in my FNV sample were retweeted within that same sample.

Prevalence of types of retweets

       Official retweet   RT@username:text   "@username:text"   text via @username
Sum    3,544              113                98                 60
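As a quick check on these sums, the share of unofficial retweets can be worked out in a few lines of Python. This assumes the four sums are 3,544, 113, 98 and 60; the dictionary keys are just labels chosen for this sketch.

```python
# Sums per retweet type, assuming the values 3,544 / 113 / 98 / 60.
sums = {
    "official": 3544,
    "RT@username:text": 113,
    '"@username:text"': 98,
    "text via @username": 60,
}

# Everything except the official retweets counts as unofficial.
unofficial = sum(count for kind, count in sums.items() if kind != "official")
total = sum(sums.values())
share = unofficial / total

print(f"{unofficial} of {total} retweets are unofficial ({share:.1%})")
```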

At least within this sample, unofficial retweets are not very common: they make up about seven percent of all retweets. And here’s some information on how official and unofficial retweets are correlated:

Correlations between types of retweets (Spearman)

                     Official retweet   RT@username:text   "@username:text"   text via @username
Official retweet     1                  0.28               0.25               0.13
RT@username:text     0.28               1                  0.14               0.10
"@username:text"     0.25               0.14               1                  0.13
text via @username   0.13               0.10               0.13               1

Users who generate more official retweets also tend to generate more unofficial retweets, but the correlation is not particularly strong. So based on this sample, it would seem conceivable that there are indeed ‘pockets of users who generate many more unofficial RTs’ – as suggested by Bray.

Method

The sample contains close to 11,000 tweets containing the string FNV, collected between 26 April and 16 May. For background see this article; the analysis of retweets in the FNV debate is here. The code I used for the analysis above is here.

If you have a sample of tweets and you want to know how often users in that sample have been retweeted, you can only find that out for retweets that are also in the same sample. In my case that wasn’t a problem, for I was interested in who was influential within a specific discussion. However, if you were interested in constructing a general measure of how influential Twitter users are, you’d probably need a pretty large sample of tweets.
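A rough sketch of what counting within the sample could look like, assuming the retweeted usernames have already been extracted from the tweet texts. The function name `count_retweets_within_sample` and the shape of its inputs are hypothetical; the usernames in the example are made up.

```python
from collections import Counter


def count_retweets_within_sample(authors, retweets):
    """Count how often each sample author is retweeted by tweets in the sample.

    authors:  usernames that posted at least one tweet in the sample
    retweets: (retweeter, retweeted_user) pairs found in the sample's tweets
    """
    sample = {name.lower() for name in authors}
    # Start every author at 0 so users who were never retweeted still show up.
    counts = Counter(dict.fromkeys(sample, 0))
    for _retweeter, retweeted in retweets:
        retweeted = retweeted.lower()
        # Mentions of accounts outside the sample are ignored.
        if retweeted in sample:
            counts[retweeted] += 1
    return counts


counts = count_retweets_within_sample(
    authors=["alice", "bob", "carol"],
    retweets=[("bob", "alice"), ("carol", "Alice"),
              ("alice", "sharethis"), ("bob", "carol")],
)
print(counts.most_common())
```

Keeping the zero counts matters for any later correlation: dropping users who were never retweeted would change the distribution being compared.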

The messiest type of retweet is probably text via @username. Often these aren’t real retweets but are added by services like ShareThis or AddThis, or by news websites that have their own share service. I only included users if they were already in the sample, i.e. had tweeted texts containing FNV; this eliminates ShareThis and AddThis tweets. I looked for the pattern via @ followed by any number of non-whitespace characters at the end of the line, or followed by any number of non-whitespace characters before the first whitespace. This method may not be 100% accurate, but I think it’ll do. The regex patterns used to find the different types of retweets are in the code.

Because the retweet counts are not normally distributed (many users have a count of 0), I used Spearman’s rank correlation; Pearson’s correlation would have yielded stronger, but still not particularly strong, correlations of up to 0.5.
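To illustrate why the choice of correlation matters for counts like these, here is a small pure-Python sketch of both measures. The data are made up (many zeros, one heavily retweeted user) and are not the FNV sample; the helper names are chosen for this sketch. Spearman’s correlation is computed as Pearson’s correlation on the ranks, with tied values sharing their average rank.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values get the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up retweet counts per user, not the FNV data.
official = [0, 0, 0, 1, 2, 3, 50]
unofficial = [0, 1, 0, 0, 1, 0, 9]

print("Pearson: ", round(pearson(official, unofficial), 2))
print("Spearman:", round(spearman(official, unofficial), 2))
```

In this toy example the single heavily retweeted user dominates Pearson’s correlation, while the rank-based measure is far less affected by that one extreme value.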

Key players in the Twitter discussion on the FNV president

I’ve visualised the key players in the debate on Twitter on the election of the new president of the Dutch trade union confederation FNV. Since the graph doesn’t fit in this column, I’ve created a separate page for it. Best viewed in Chrome.
