champagne anarchist | armchair activist

Python

Apparently, it’s still possible to fool Google

Researchers have found that men are almost six times more likely than women to be shown ads on news websites for a career coaching service for $200k+ executive positions. The findings suggest some of the algorithms involved in tracking internet users have discriminatory outcomes. They might lead to «deeper investigations by either the companies themselves or by regulatory bodies», the authors add (via WP).

The findings are interesting, and so is the research method. The researchers created AdFisher, basically a smart web scraper built with Python. AdFisher can create large numbers of «agents», have them visit certain websites or alter their profile via the Google Ad Settings, and then see what ads they get shown on websites like the Times of India or the Guardian. Further, it organises these activities in such a way that experimental and control conditions can be compared, and it even analyses the results, using machine learning to figure out what may have triggered differences in the ads shown.

Somehow this reminded me of the patent Apple (!) obtained for a cloning service to fool the companies that are tracking you. The service would mimic some of your normal online behaviour, but also do other things, such as faking an interest in basket weaving. This way it would contaminate the profile these companies keep of you, perhaps to the point of making it useless.

So would you be able to get away with that? If you open a bunch of browser windows with Google searches, Google will ask you to fill out a captcha to make sure you’re human («Our systems have detected unusual traffic from your computer network. This page checks to see if it’s really you sending the requests, and not a robot»). This is a very simple example, but given the fast-developing ability to analyse patterns in online behaviour, you’d expect that companies like Google and Facebook would have become eerily accurate at identifying (real) internet users and telling them from bots.

Against that background, it’s somehow reassuring that it’s apparently still possible to fool Google by creating a fake profile.

P.s. I’ve never been shown ads for a career coaching service for $200k+ executive positions, but if they do turn up I’ll just tell Google I’m a woman.

Tweeting #oxi

The responses of European leaders to the outcome of last Sunday’s referendum in Greece were pretty unanimous. Germany’s vice-chancellor Sigmar Gabriel (a social-democrat) said Tsipras had torn down the bridges between Greece and the rest of Europe. Spanish PM Mariano Rajoy said Greece must follow Europe’s rules. And Dutch PM Mark Rutte somewhat pedantically said he was «really angry» about the referendum and that the Greeks had better not come up with a «lame story» (flutverhaal).

For a different perspective, I turned to Twitter. The hashtag #oxi, associated with a ‘no’ vote in the referendum, has become a bit of a symbol of opposition to EU-imposed austerity. I collected some 110,000 tweets containing #oxi (and not #nai) from around last Sunday. The #oxi tweets that are geotagged are shown on the map. It appears that quite a few tweets came from Spain and Italy, but also from the UK and Ireland, and - who’d have thought - the Netherlands (one wonders if bar de Druif in Amsterdam is an #oxi stronghold). In Spain, #oxi territory seems to overlap with areas where progressive party Podemos won in the mayoral elections earlier this year.

Note that only a small proportion of tweets are geotagged, so one shouldn’t rush to conclusions based just on the map. An alternative approach is to look at the language of tweets.

To interpret these findings properly one should take various factors into account, including the number of people who speak a language and how many are on Twitter. But whichever way you look at it, the number of Spanish-language #oxi tweets is impressive. There may well be a connection with the popularity of Podemos.
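The language count can be sketched as follows. This is a minimal illustration, assuming each tweet is a dict with a `lang` field like the one the Twitter api returns; the function name and the sample tweets are made up.

```python
from collections import Counter

def count_languages(tweets):
    """Count tweets per language code, using each tweet's `lang` field."""
    # Tweets without a language code are grouped under "und" (undetermined).
    return Counter(tweet.get("lang") or "und" for tweet in tweets)

# Made-up sample data, for illustration only.
sample_tweets = [
    {"text": "#oxi uno", "lang": "es"},
    {"text": "#oxi dos", "lang": "es"},
    {"text": "#oxi one", "lang": "en"},
    {"text": "#oxi"},
]

language_counts = count_languages(sample_tweets)
for lang, n in language_counts.most_common():
    print(lang, n)
```

Raw counts like these still need to be read against the number of speakers and Twitter users per language.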

To get an idea of the contents of the #oxi-tweets I looked up the most-favourited tweets in some of the key languages. A few examples:

The joy of losing fear. Long live Greece! (es)

Today I’m going to eat a Greek tortilla. And what’s that? The same as the Spanish one, but with more huevos [eggs / balls]. (es)

Threats. Blackmail. Fear. Propaganda. The courageous Greek people defied it all. But now they desperately need our help. (en)

Tonight I feel truly European. As if Greece had voted for me against the technocrats and austerity. (fr)

A small, proud nation can change Europe. We should help them (it)

What if we take #oxi as an opportunity to rigorously curtail the world of banks, speculators and finance across the EU? (de)

Method

I searched the Twitter api for tweets using the search terms #oxi and #nai. I analysed tweets containing either #oxi or #nai (not both). Some have argued that ochi would be more appropriate than oxi; in French sometimes oki is used and of course the Greeks have their own alphabet. That said, #oxi appears to be a pretty universal symbol for a no vote in the Greek referendum and for opposition to austerity.
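The «either #oxi or #nai, not both» rule could look like this in Python. It is a sketch rather than the script actually used; the function names are made up, and hashtags are matched case-insensitively as an assumption.

```python
import re

HASHTAG = re.compile(r"#(\w+)")

def hashtags(text):
    """Return the set of lower-cased hashtags found in a tweet text."""
    return {tag.lower() for tag in HASHTAG.findall(text)}

def classify(text):
    """Return 'oxi' or 'nai' if exactly one of the two hashtags occurs, else None."""
    tags = hashtags(text)
    has_oxi = "oxi" in tags
    has_nai = "nai" in tags
    if has_oxi == has_nai:  # both hashtags present, or neither
        return None
    return "oxi" if has_oxi else "nai"

print(classify("Long live Greece! #OXI"))
print(classify("#oxi or #nai?"))
```

Matching whole hashtags rather than substrings means a tag that merely starts with the letters oxi is not counted.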

The number of #nai tweets was very small (fewer than two thousand). Locations of tweets were derived from the location data provided by the Twitter api. As indicated, only a small number of tweets contain this information; further, there may be cultural differences in the extent to which people allow their device to send location data with their tweets. Twitter also provides language data, which appears to be pretty accurate (although it occasionally mistakes Catalan for French). Note that language data cannot simply be linked to countries: for example, quite a few tweets in Dutch will be from Belgium, while on the other hand Dutch twitterers frequently tweet in English.
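Pulling the location out of a tweet might look like the sketch below. It assumes tweets are dicts whose `coordinates` field, when present, is a GeoJSON point with longitude first, as in the Twitter api's tweet objects; the function name and the sample tweets are made up.

```python
def tweet_location(tweet):
    """Return (lat, lon) for a geotagged tweet, or None if no point is attached."""
    point = tweet.get("coordinates")
    if not point or point.get("type") != "Point":
        return None
    lon, lat = point["coordinates"]  # GeoJSON order: longitude first
    return lat, lon

# Made-up sample data: one geotagged tweet, one without location.
sample_tweets = [
    {"text": "#oxi", "coordinates": {"type": "Point", "coordinates": [4.9, 52.37]}},
    {"text": "#oxi", "coordinates": None},
]

locations = [loc for loc in map(tweet_location, sample_tweets) if loc is not None]
print(len(locations), "of", len(sample_tweets), "tweets geotagged")
```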

I used Python to collect and process the data, R for analysis and d3.js and Leaflet for visualisation.

GroenLinks, a «darling of journalists»?

The media have paid a lot of attention to the departure of Bram van Ojik as leader of GroenLinks and to his successor, Jesse Klaver. This didn’t go down well with Telegraaf columnist Paul Jansen (paywall):

It underlines what everyone at the Binnenhof has long known: GroenLinks is a darling of journalists.

Is that true? You can answer that question in various ways. I counted how often members of the Tweede Kamer (the Dutch lower house) are cited in articles on the NRC website. The chart shows the results.

The red dots show the average number of mentions per parliamentary party. It appears that the NRC pays relatively much attention to parties that play a key role in creating majorities for government policy. In addition, there’s a lot of attention for Geert Wilders (PVV) and Henk Krol (50PLUS). GroenLinks isn’t really a favourite among NRC journalists; the newspaper more often mentions MPs of 50PLUS, PVV and D66.

The grey dots show the scores of individual MPs. The inequality between backbenchers and media politicians is largest in the PvdA and the VVD: there, the highest score is 50 times the median. Inequality is also fairly large in the PVV and D66.

On a different note: the NRC mentions male MPs on average almost three times as often as their female colleagues. Among the male MPs there are a few with extremely high scores that pull up the average, but even if you look at the median, men are mentioned almost twice as often as women. The NRC has some explaining to do here.
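To see why the mean and the median tell different stories when a few scores are extreme, here is a small sketch using Python's statistics module. The mention counts are made up for illustration; they are not the NRC data.

```python
from statistics import mean, median

def summarise(mentions):
    """Summarise mention counts; assumes the median is not zero."""
    mid = median(mentions)
    return {
        "mean": mean(mentions),
        "median": mid,
        "max_to_median": max(mentions) / mid,
    }

# Made-up counts: four backbenchers and one media politician.
summary = summarise([2, 4, 6, 8, 200])
print(summary)
```

One outlier lifts the mean far above the median, which is why the comparison between men and women is worth making with both measures.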

Method

I focused on the NRC because its website is relatively easy to search. I constructed search terms as "first name infix surname" party. For double surnames separated by a hyphen, I omitted the last part (for example Magda Berndsen instead of Magda Berndsen-Jansen). As the start date I took 20 September 2012, the date on which the current Tweede Kamer was installed. For MPs who have been in parliament for a shorter period, I applied a correction. For the sake of clarity, I left out breakaway groups in the analysis per party. The scripts are available here.
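Building the search terms could be sketched like this. The function name and argument layout are assumptions for illustration and not necessarily how the linked scripts do it; the correction for shorter terms is left out because it isn't specified here.

```python
def build_search_term(first_name, infix, surname, party):
    """Build a search term of the form '"first name infix surname" party'."""
    # For hyphenated double surnames, keep only the first part.
    surname = surname.split("-")[0]
    # Skip the infix when an MP has none.
    name = " ".join(part for part in (first_name, infix, surname) if part)
    return f'"{name}" {party}'

print(build_search_term("Bram", "van", "Ojik", "GroenLinks"))
print(build_search_term("Magda", "", "Berndsen-Jansen", "D66"))
```

Quoting the full name keeps the search from matching articles that mention only a common surname.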

Strava tweets II: after dinner rides and Sunday morning rides

The other day I posted an article about using Strava tweets to analyse road cycling patterns. I plan to do some more analysis on this but first I wanted to take another look at the time at which tweets are posted. Below is a chart that shows the number of Strava tweets per hour of the day.

Two things stand out: on weekdays, there’s an after-dinner peak, and on Sundays, many trips are finished before lunch. The pattern suggests that people tend to tweet pretty quickly after they finish their ride. This in turn seems to suggest that post times may well be a meaningful indicator of the time at which rides take place.
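Counting tweets per hour for each day of the week takes only a few lines. This is a sketch, assuming the post times have already been parsed into datetime objects in local time; the timestamps below are made up.

```python
from collections import Counter
from datetime import datetime

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def hourly_counts(timestamps):
    """Count posts per (day of week, hour of day)."""
    # datetime.weekday() returns 0 for Monday up to 6 for Sunday.
    return Counter((DAY_NAMES[ts.weekday()], ts.hour) for ts in timestamps)

# Made-up post times: two on a Sunday morning, one on a Tuesday evening.
posts = [
    datetime(2015, 3, 29, 11, 30),
    datetime(2015, 3, 29, 11, 45),
    datetime(2015, 3, 31, 20, 15),
]

counts = hourly_counts(posts)
for (day, hour), n in sorted(counts.items()):
    print(day, hour, n)
```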

Gender

I used a variant of this script to determine the gender of people who tweeted their Strava rides, based on the first name of their Twitter screen name. According to the results, 9.7% are women. This is more than the 5.5% women in the SWOV survey among Dutch road cyclists, but then again people who use Strava (and tweet about it) are probably more likely to be young and young road cyclists more likely to be women.

For women the median distance of rides is 48km; for men 54km. The difference doesn’t appear very large.

In the chart above, you can select to see data for women instead of all riders (note that the scale changes). The main difference seems to be that for women, there’s much less of an after-dinner peak on weekdays. Perhaps something to do with the fact that women are less likely to have full-time jobs. But the numbers are relatively small so perhaps one shouldn’t read too much into it.

Using strava tweets to analyse cycling patterns

A recent report by traffic research institute SWOV analyses accidents reported by cyclists on racing bikes in the Netherlands. Among other things, the data show an early summer dip in accidents: 53 in May, 38 in June and 51 in August. A bit of googling revealed this is a common phenomenon, although the dip appears to occur earlier in the Netherlands than elsewhere (cf this analysis of cycling accidents in Montréal).

Below, I discuss a number of possible explanations for the pattern.

Statistical noise

Given the relatively small number of reported crashes in the SWOV study, the pattern could be due to random variation. Also, respondents were asked in 2014 about crashes they had had in 2013, so memory effects may have had an influence on the reported month in which accidents took place. On the other hand, the fact that similar patterns have been found elsewhere suggests it may well be a real phenomenon.

Holidays

An OECD report says the summer accident dip is specific to countries with «a high level of daily utilitarian cycling» such as Belgium, Denmark and the Netherlands. The report argues the drop is «most likely linked to a lower number of work-cycling trips due to annual holidays».

If you look at the data presented by the OECD, this explanation seems plausible. However, holidays can’t really explain the data reported by SWOV. Summer holidays started between 29 June and 20 July (there’s regional variation), so the dip should have occurred in August instead of June.

Further, you’d expect a drop in bicycle commuting during the summer, but surely not in riding racing bikes? I guess the best way to find out would be to analyse Strava data, but unfortunately Strava isn’t as forthcoming with its data as one might wish (in terms of open data, it would rank somewhere between Twitter and Facebook).

A possible way around this is to count tweets of people boasting about their Strava achievements. Of course, there are several limitations to this approach (I discuss some in the Method section below). Despite these limitations, I think Strava tweets could serve as a rough indicator of road cycling patterns. An added bonus is that the length of the ride is often included in tweets.
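Getting the ride length out of a tweet could work along these lines. This is a sketch: the exact wording of Strava tweets isn't reproduced here, so the sample texts are made up, and the pattern simply looks for a number followed by km, with either a comma or a point as decimal separator.

```python
import re

# A number, optionally with a decimal part, followed by "km".
DISTANCE = re.compile(r"(\d+(?:[.,]\d+)?)\s*km\b", re.IGNORECASE)

def ride_distance_km(text):
    """Return the first distance in km mentioned in the text, or None."""
    match = DISTANCE.search(text)
    if match is None:
        return None
    # Dutch-language tweets may use a decimal comma.
    return float(match.group(1).replace(",", "."))

# Made-up tweet texts, for illustration only.
print(ride_distance_km("Rit van 54,3 km met strava"))
print(ride_distance_km("Mooie rit met strava"))
```

Numbers with a thousands separator are not handled; for ride distances that is unlikely to matter.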

The chart above shows Dutch-language Strava tweets for the period April 2014 - March 2015. Whether you look at the number of rides or the total distance, there’s no early summer drop in cycling. There’s a peak in May, but none in August - September.

Sunset

According to the respondents of the SWOV study, 96% of accidents happened in daylight. Of course this doesn’t rule out that some accidents may have happened at dusk, and there may be a seasonal pattern to this.

Many tweets contain the time at which they were tweeted. This is a somewhat problematic indicator of the time at which trips took place, if only because it’s unclear how much time elapsed between the ride and the moment it was tweeted. But let’s take a look at the data anyway.

I think tweets tend to be posted rather early in the day. Also, the effect of switches between summer and winter time is missing in the median post time (perhaps Twitter converts the times to the current local time).

That said, the data suggests that rides take place closer to sunset during the winter, not during the months of May and August which show a rise in accidents. So, while no firm conclusions should be drawn on the basis of this data, there are no indications that daylight patterns can explain accident patterns.

Weather

Perhaps more accidents happen when many people cycle and there’s a lot of rain. In 2013, there was a lot of rain in May; subsequently the amount of rain declined, and there was a peak again in September (pdf). So at first sight, it seems that the weather could explain the accident peak in May, but not the one in August.

Conclusion

None of the explanations for the early summer drop in cycling accidents seem particularly convincing. It’s not so difficult to find possible explanations for the peak in May, but it’s unclear why this is followed by a decline and a second peak in August. This remains a bit of a mystery.

Method

Unfortunately, the Twitter API won’t let you access old tweets, so you have to use the advanced search option (sample url) and then scroll down (or hit CMD and the down arrow) until all tweets have been loaded. This takes some time. I used rit (ride) and strava as search terms; this appears to be a pretty robust way to collect Dutch-language Strava tweets.

It seems that Strava started offering a standard way to tweet rides as of April 2014. Before that date, the number of Strava tweets was much smaller and the wording of the tweets wasn’t uniform. So there’s probably little use in analysing tweets from before April 2014.

I removed tweets containing terms suggesting they are about running (even though I searched for tweets containing the term rit there were still some that were obviously about running) and tweets containing references to mountainbiking. I ended up with 9,950 tweets posted by 2,258 accounts. 1,153 people only tweeted once about a Strava ride. Perhaps the analysis could be improved by removing these.
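The cleaning step can be sketched as below. The exclusion terms are hypothetical, since the actual list isn't given here, and tweets are represented as made-up (account, text) pairs.

```python
import re
from collections import Counter

# Hypothetical terms suggesting running or mountain biking.
EXCLUDE_TERMS = {"hardlopen", "hardloop", "run", "mtb",
                 "mountainbike", "mountainbiken"}

def is_road_ride(text):
    """True if none of the exclusion terms occurs as a whole word."""
    words = set(re.findall(r"\w+", text.lower()))
    return not (words & EXCLUDE_TERMS)

def clean(tweets):
    """Keep only the (account, text) pairs that look like road rides."""
    return [(account, text) for account, text in tweets if is_road_ride(text)]

def single_tweet_accounts(tweets):
    """Return the accounts that appear exactly once."""
    per_account = Counter(account for account, _ in tweets)
    return {account for account, n in per_account.items() if n == 1}

# Made-up sample data.
sample = [
    ("anna", "Rit van 60 km met strava"),
    ("anna", "Rit van 45 km met strava"),
    ("bas", "MTB rit van 30 km met strava"),
    ("carl", "Rit van 80 km met strava"),
]

kept = clean(sample)
print(len(kept), "tweets kept;", sorted(single_tweet_accounts(kept)), "tweeted once")
```

Matching whole words avoids dropping tweets in which a term only occurs as part of a longer word.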

I had to add 9 hrs to the tweet time, probably because I had been using a VPN when I downloaded the data.
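The time correction itself is a one-liner with timedelta. In this sketch the nine-hour offset is taken from the text above, while the function name and the example timestamp are made up.

```python
from datetime import datetime, timedelta

# Offset between the scraped post time and local time.
OFFSET = timedelta(hours=9)

def correct_time(scraped):
    """Shift a scraped post time by the fixed offset."""
    return scraped + OFFSET

corrected = correct_time(datetime(2014, 6, 1, 18, 30))
print(corrected)
```

Adding a timedelta also rolls the date over, so a late post ends up on the next day, which matters when splitting tweets by day of the week.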

A relevant question is how representative Strava tweets are of the amount of road cycling. According to the SWOV report, about two in three Dutch cyclists on racing bikes almost never use apps like Strava or Runkeeper; the percentage is similar for men and women. The average distance in Strava tweets is 65km; in the SWOV report most respondents report their average ride distance is 60 - 90km.

In any case, not all road cyclists use Strava and not all who use Strava consistently post their rides on Twitter (fortunately, one might add). Perhaps people who tweet their Strava rides are a bit more hardcore and perhaps more impressive rides are more likely to get tweeted.

Edit - the numbers reported above are for tweets containing the time they were posted; this information is missing in about one-third of the tweets.

Here’s the script I used to clean the twitter data.
