My failed attempt to build an interesting twitter bot

26 March 2017

In 2013, I created @dataanddataviz: a twitter account that retweets tweets about data analysis, charts, maps and programming. Over time, I’ve made a few changes to improve it. And @dataanddataviz did improve, but I’m still not satisfied with it, so I decided to retire it.

There are all sorts of twitter bots. Often, their success is measured by how many followers they gain or how much interaction they provoke. My aim was different. I wanted to create an account that is so interesting that I’d want to follow it myself. Which, of course, is a very subjective criterion (not to mention ambitious).

Timing

First, a practical matter: it has been suggested that you can detect twitter bots by the timing of their tweets. The chart below (inspired by this one) shows the timing of posts by @dataanddataviz.

I randomized the time at which @dataanddataviz posts. The median time between tweets is about 100 minutes (I lowered the frequency last January, as shown by the dark green dots). There is no day/night pattern. If tweets were posted manually, you’d expect a day/night pattern.
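One way to get randomized posting times with a median gap of about 100 minutes is sketched below. This is an illustration and not necessarily how @dataanddataviz was scheduled: the exponential distribution and the names `next_gap_minutes` and `MEDIAN_GAP_MINUTES` are assumptions made for the example.

```python
import math
import random
import statistics

# Median gap taken from the post; the choice of distribution is an assumption.
MEDIAN_GAP_MINUTES = 100


def next_gap_minutes(rng, median=MEDIAN_GAP_MINUTES):
    """Draw a random waiting time in minutes with the given median."""
    # For an exponential distribution, median = ln(2) / rate.
    rate = math.log(2) / median
    return rng.expovariate(rate)


rng = random.Random(42)
gaps = [next_gap_minutes(rng) for _ in range(10_000)]
print("median gap in minutes:", round(statistics.median(gaps)))
```

Because the gaps are drawn independently of the clock, the resulting schedule has no day/night pattern, which is exactly what gives a bot away.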

Selecting tweets

Initially, I collected tweets using search terms such as dataviz, data analysis and open data. From those tweets, I tried to select the most interesting ones by looking at how often they had been retweeted or liked. @dataanddataviz would retweet some of the more popular recent tweets.

This was not a success. For example, there are quite a few people who tweet conspiracy theories and include references to data and charts as «proof». Sometimes, their tweets get quite a few likes and retweets, and @dataanddataviz ended up retweeting some of those tweets. Awkward.

I decided to try a different approach: follow people who I trust, and use their retweets as recommendations. If someone I trust thinks a tweet is interesting enough to retweet, then it may well be interesting enough for @dataanddataviz to retweet.

The people I follow tweet about topics like data and charts, but sometimes they tweet about other topics too. To make sure tweets are relevant, I added a condition that the text of the tweet contains at least one «mandatory» term (e.g. python, d3, or regex). I also added a condition that the text contains none of a series of «banned» terms. I used banned terms for two purposes: to filter out tweets about job openings and meetings (hiring, meetup) and to filter out hype (bigdata, data science).
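The two conditions can be sketched as a small filter. The term lists below contain only the examples mentioned above, not the full lists; the function names and the choice to match whole words case-insensitively are assumptions made for the example.

```python
import re

# Only the example terms from the post; the real lists were longer.
MANDATORY_TERMS = {"python", "d3", "regex"}
BANNED_TERMS = {"hiring", "meetup", "bigdata", "data science"}


def contains_term(text, term):
    """True if term occurs in text as a whole word or phrase, ignoring case."""
    # The lookarounds prevent matches inside longer words; '#python' still matches.
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def is_candidate(text):
    """At least one mandatory term and no banned term."""
    has_mandatory = any(contains_term(text, t) for t in MANDATORY_TERMS)
    has_banned = any(contains_term(text, t) for t in BANNED_TERMS)
    return has_mandatory and not has_banned


print(is_candidate("New d3 chart of rainfall"))
print(is_candidate("We are hiring a Python developer"))
```

Matching whole words is stricter than a plain substring check; a substring check would, for instance, also treat «hiring» inside a longer word as a hit.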

This approach was a considerable improvement, but I still wasn’t happy. Sure, most of @dataanddataviz’ retweets were now relevant, and retweets of embarrassing tweets became rare. But too few retweets were really good.

Predict quality?

I tried to find out whether I could predict the quality of tweets. I created an interface that would let me rate tweets that met the criteria described above: retweeted by someone I follow, containing at least one of the mandatory terms, and containing none of the banned terms.

The interface shows the text of the tweet and, if applicable, the image included in it, but not the name of the person who originally posted the tweet or of the recommender who had retweeted it. This way, I forced myself to focus on the content of the tweet rather than the person who posted it. Rating consisted of deciding whether I would retweet that tweet.[1]

I rated 1095 tweets that met the basic criteria. Only 130 (about 12%) were good enough for me to want to retweet them. That’s not much.

I looked at whether any characteristics could predict whether a tweet is - in my eyes - good enough to retweet. For example: text length; whether the text contains a url, mention or hashtag; and characteristics of the person who originally posted the tweet, such as account age, favourites count, followers count, friends count, friends / followers ratio, statuses count and listed count. None of these characteristics could differentiate between OK tweets and good tweets.
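One simple way to check whether a single numeric characteristic separates good tweets from OK tweets is to estimate how often a randomly chosen good tweet has a higher value than a randomly chosen OK tweet. The sketch below is an illustration of that idea, not the analysis actually used for this post; the function name `separation` is an assumption.

```python
def separation(good_values, ok_values):
    """Share of (good, OK) pairs in which the good tweet has the higher value.

    Ties count as half. A result near 0.5 means the characteristic does not
    distinguish the two groups; values near 0 or 1 mean it does.
    """
    if not good_values or not ok_values:
        raise ValueError("both groups need at least one value")
    wins = 0.0
    for g in good_values:
        for o in ok_values:
            if g > o:
                wins += 1.0
            elif g == o:
                wins += 0.5
    return wins / (len(good_values) * len(ok_values))
```

Applied to, say, the text lengths of the good tweets and of the remaining tweets, a result close to 0.5 would correspond to the finding above that the characteristic could not differentiate between the groups.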

I also looked at whether specific words are more likely to appear in good tweets than in OK tweets, or vice versa. This was the case, but most examples are unlikely to generalise (e.g., good tweets were more likely to contain the word air or #demography).
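A comparison of this kind can be sketched by computing, for each word, the share of good tweets and the share of OK tweets that contain it, and then looking at the difference. The function names and the simple tokenisation are assumptions made for the example.

```python
import re
from collections import Counter


def word_rates(texts):
    """Share of texts in which each word or hashtag appears at least once."""
    if not texts:
        return {}
    counts = Counter()
    for text in texts:
        # set() so that a word repeated within one tweet is counted once
        counts.update(set(re.findall(r"#?\w+", text.lower())))
    return {word: n / len(texts) for word, n in counts.items()}


def rate_differences(good_texts, ok_texts):
    """Words sorted from most typical of good tweets to most typical of OK tweets."""
    good = word_rates(good_texts)
    ok = word_rates(ok_texts)
    diffs = {w: good.get(w, 0.0) - ok.get(w, 0.0) for w in set(good) | set(ok)}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)
```

With only 130 good tweets, large differences for rare words are easy to get by chance, which fits the observation that most of the examples are unlikely to generalise.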

Conclusion

I didn’t succeed in creating a retweet bot I’d want to follow myself. @dataanddataviz’ retweets are generally OK but only occasionally really good.

Also, I couldn’t predict tweet quality. Perhaps it would make a difference if I used a larger sample, or more advanced analytical techniques, but I doubt it. Subjective quality appears to be difficult to predict - which shouldn’t come as a big surprise (in fact, Twitter itself isn’t very good at predicting which tweets I’ll like, judging by its «You might like» suggestions).

Meanwhile, I found that since November, more of the tweets retweeted by @dataanddataviz have political content. Retweeting political statements isn’t something I want to delegate to a bot, so that’s another reason to retire @dataanddataviz.


  1. Obviously, what is being measured is a bit complicated. Whether I’d want to retweet a tweet depends not only on its quality, but also on its subject. For example, I’m now less inclined to retweet tweets about R than I was a couple of years ago, because I started using Python instead of R.

26 March 2017 | Categories: data