Spamming after all? Revisiting the repost ratios of Vox, Upshot and 538
Recently I wrote about people who share their URLs on Twitter, and then post them again, hoping to draw even more people to their site. I said that FiveThirtyEight reposts its URLs on average 0.3 times. I was wrong: it reposts its URLs far more often. And so do voxdotcom and UpshotNYT, who didn’t even make the top 5 in my original analysis. The Upshot reposts its URLs on average as many as 0.8 times.
I underestimated the repost ratios in my original analysis because tweets tend to contain shortened URLs. http://nyti.ms/1rFwue2 and http://nyti.ms/1iIujpo look like different URLs. However, they point to the same article, so one should be treated as a repost of the other (or perhaps both are reposts of yet another one, who knows). If you don’t take this into account and treat them as different URLs, you’ll underestimate the number of reposts (red bar in the graph).
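To make the effect concrete, here is a minimal Python sketch. It assumes the repost ratio is the number of repeat postings divided by the number of distinct URLs, which is one reading of "reposts its URLs on average N times". The function name repost_ratio, the third short URL and the example.com targets are made up for illustration; they are not the real destinations of the nyti.ms links.

```python
from collections import Counter


def repost_ratio(urls):
    """Average number of reposts per distinct URL (assumed definition)."""
    counts = Counter(urls)
    if not counts:
        return 0.0
    # Every posting of a URL beyond the first counts as one repost.
    reposts = sum(n - 1 for n in counts.values())
    return reposts / len(counts)


# Three tweeted URLs, all of which look different.
tweeted = [
    "http://nyti.ms/1rFwue2",
    "http://nyti.ms/1iIujpo",
    "http://example.com/s/abc",  # hypothetical third short URL
]

# Hypothetical lookup table from short URL to final URL.
resolved = {
    "http://nyti.ms/1rFwue2": "http://example.com/article-1",
    "http://nyti.ms/1iIujpo": "http://example.com/article-1",
    "http://example.com/s/abc": "http://example.com/article-2",
}

raw_ratio = repost_ratio(tweeted)
resolved_ratio = repost_ratio([resolved[u] for u in tweeted])
print(raw_ratio, resolved_ratio)
```

On the raw short URLs the ratio is 0.0, because no URL appears twice. Once the two nyti.ms links are mapped to the same article, one of the three postings counts as a repost of two distinct URLs, giving 0.5.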
I was aware of this problem when I did the first analysis. I first tried to account for it by looking up the non-shortened URLs, using the Python urllib2 module. This turned out to be very time-consuming, which was a problem since I wanted to look up quite a few URLs. Pragmatically, I decided instead to use the ‘expanded URL’ provided by the Twitter API. This method does yield higher repost ratios for 538 and the Upshot (grey bars in the graph). Still, it doesn’t really solve the problem, because the expanded URL provided by the Twitter API will sometimes be yet another shortened URL. That’s the reason I still underestimated how often people recycle their content on Twitter.
When I realised the ratios I had originally calculated were still rather low given how many reposts there appeared to be in my timeline, I decided to recalculate repost ratios using urllib2 after all. Because this method is so time-consuming, I did this for just three accounts: Vox, 538 and Upshot NYT. This resulted in repost ratios that are substantially higher (light blue bars in the graph). The new Python script is here.
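Looking up a non-shortened URL boils down to requesting it and seeing where the redirects end. The sketch below is not the script linked above; it is a minimal illustration using urllib.request, the Python 3 successor to urllib2. The function name resolve_url, the cache dictionary and the open_url parameter are assumptions made for this example. The cache is there because every lookup is a network round trip, which is what makes this approach slow.

```python
import http.client
import urllib.request


def resolve_url(url, cache, open_url=urllib.request.urlopen, timeout=10):
    """Follow redirects and return the final URL.

    Falls back to the input URL if the lookup fails, so a dead link
    is simply counted under its original (short) form.
    """
    if url in cache:
        return cache[url]
    try:
        # urlopen follows HTTP redirects by default; geturl() reports
        # the URL of the resource that was finally retrieved.
        with open_url(url, timeout=timeout) as response:
            final_url = response.geturl()
    except (OSError, ValueError, http.client.HTTPException):
        # URLError and HTTPError are subclasses of OSError;
        # ValueError covers malformed URLs.
        final_url = url
    cache[url] = final_url
    return final_url
```

A call such as resolve_url("http://nyti.ms/1rFwue2", {}) would then return whatever address the shortener redirects to at that moment. Passing the same cache dictionary to every call avoids requesting a short URL that has already been resolved.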
Note that the ratios are snapshots calculated on a sample of the 200 most recent tweets (that is, about one to two weeks of tweets).