Opening Surveymonkey files in R

Many people use Surveymonkey to conduct online surveys. You can get standard PDF reports of your data, but often you’ll want to do some more analysis or have more control over the design of the charts. An obvious option is to read the data into R. But there’s a practical problem: Surveymonkey uses the second row of its output file for answer categories and puts some other information in that row as well. As a side effect, R will treat numerical variables as factors.
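To make the two-row header problem concrete, here is a minimal sketch of one way to handle it. It is written in Python purely for illustration and is not the code of the R package; the sample export, the column names and the helper functions are all made up, and real exports may be laid out differently.

```python
import csv
import io

# Hypothetical export: row 1 holds the questions, row 2 the answer
# categories plus some other labels. Real exports may differ.
SAMPLE = """RespondentID,How old are you?,Which fruit do you like?,
,Open-Ended Response,Apple,Banana
1001,34,Apple,
1002,27,,Banana
"""

def read_two_row_header(text):
    """Merge the first two rows into one header and return (names, data rows)."""
    rows = list(csv.reader(io.StringIO(text)))
    questions, categories, data = rows[0], rows[1], rows[2:]
    names = []
    current = ""
    for question, category in zip(questions, categories):
        # An empty cell in row 1 is taken to mean "same question as the previous column".
        current = question or current
        names.append(f"{current}: {category}" if category else current)
    return names, data

def to_number(value):
    """Convert a cell to a number where possible, so it is not kept as text."""
    try:
        return float(value)
    except ValueError:
        return value  # leave text and empty cells untouched

names, data = read_two_row_header(SAMPLE)
records = [dict(zip(names, map(to_number, row))) for row in data]
```

Because the second row is folded into the column names instead of being read as data, the numeric columns contain only numbers and can be converted cleanly.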

I wrote a few lines of code which, I think, deal with that problem and turned that into an R package. Until recently it’d never have occurred to me to create an R package, but then I read this post by Hilary Parker, who describes the process so clearly that it actually appeared doable. I took some additional cues from this video by trestletech. The steps are described here.

I thought of adding a function to read data from Limesurvey, an open source alternative to Surveymonkey. But apparently, that functionality is already available (I haven’t tested it).

The package is available on Github.

Step by step: creating an R package

With the help of posts by Hilary Parker and trestletech I managed to create my first R package in RStudio (here’s why). It wasn’t as difficult as I thought and it seems to work. Below is a basic step-by-step description of how I did it (this assumes you have one or more R functions to include in your package, preferably in separate R-script files):

If you want, you can upload the package to Github. Other people will then be able to install it:


Sevillanas. The Spanish punk

Update 11 January: Spotify data added.
According to the English Wikipedia page, «Generally speaking, a sevillana is very light hea[r]ted, happy music». There’s certainly some bland stuff around, but many sevillanas are explosive and raw. In fact, sevillanas are the punk of Spanish music.

I wanted to back this claim up by pointing to the length of the songs on the legendary Sevillanas de los Cuarenta album. It’s a known fact that punk is a genre with very short songs: on average 2:58 according to this analysis by blogger Dale Swanson. It’s the shortest of all the genres he analysed. Well, the average song length on the Sevillanas de los Cuarenta album is 2:44.

However, there may be some problems with this argument. First, some of the songs on the album have a haunting quality about them (for example, A flamenca no me ganas by Gracia de Triana), which makes you wonder if they haven’t been played too fast when they were recorded for CD. This may be an issue, but even if you correct for this the songs on Sevillanas de los Cuarenta would still be shorter than punk songs (for details see below, Method).

More problematic is the fact that short songs appear to have been normal in the 1940s. According to this analysis by Rhett Allain, average song lengths rarely exceeded 3 minutes until the end of the 1960s (see also the debate in the comments on possible explanations). So the shortness of the songs on the Sevillanas de los Cuarenta album isn’t that impressive. In fact, a (possibly non-representative) sample of 1970s sevillanas has an average song length of 3:22, which appears to be quite typical for the 1970s judging by Allain’s data.

The Musicbrainz database used by Allain doesn’t seem to contain many sevillanas. However, the Discogs website, which has data on millions of songs, does contain a few hundred sevillanas. Since posting the first version of the article, I realised metadata can also be obtained from Spotify. Spotify has over 2,500 songs with «sevillanas» in the title but only a few hundred songs per genre for other genres (probably the genre tags aren’t applied consistently). Below is the song length of a number of genres in the Discogs and Spotify databases.

For jazz and house in particular, the durations in Spotify differ from those in Discogs. For the other genres, median song durations are very similar, which is quite remarkable given the differences between the datasets. In both datasets, sevillanas tend to be somewhat longer than punk songs, but shorter than the other genres in the analysis.
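As a rough sketch of how a median duration per genre could be computed from «m:ss» strings: the tracks below are made up to show the calculation and are not taken from the Discogs or Spotify data, and the helper names are hypothetical.

```python
from statistics import median

def to_seconds(duration):
    """Convert 'm:ss' or 'h:mm:ss' to seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def to_mmss(seconds):
    """Format a number of seconds as 'm:ss'."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}:{secs:02d}"

# Made-up durations, only to show the calculation.
tracks = [
    ("punk", "2:10"), ("punk", "2:58"), ("punk", "3:05"),
    ("sevillanas", "2:44"), ("sevillanas", "3:22"), ("sevillanas", "3:10"),
]

# Group the durations by genre, then take the median of each group.
by_genre = {}
for genre, duration in tracks:
    by_genre.setdefault(genre, []).append(to_seconds(duration))

medians = {genre: to_mmss(median(values)) for genre, values in by_genre.items()}
```

The median is less sensitive than the mean to a handful of very long tracks, which matters in datasets that may contain mixes or live recordings.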

An analysis by year might be interesting, but tricky: first because the release year in the Discogs data may refer to the year in which an album or song was re-released and second because the number of sevillanas tracks with sufficient information isn’t large enough for that level of precision. The Spotify dataset has no information on the release year of tracks (I guess if I really wanted I could have looked up the release date of the album each track is on).

All in all, the average sevillanas may be somewhat longer than a punk song. But you can still argue that a sevillanas song is in fact a series of even shorter songs, as illustrated by the plot of ¡Ay Sevilla! by Los de la Trocha shown above. The typical sevillanas is a series of short bursts of music that can be as abrupt as any punk song.


Scripts for the analyses are available here.

Songs on Sevillanas de los Cuarenta too fast?
Spotify has three versions of A flamenca no me ganas: the one from Sevillanas de los Cuarenta (2:29 on CD) and two others lasting 2:37 and 2:41. This suggests it’s possible that the «correct» version is up to 8% longer than the one on Sevillanas de los Cuarenta. Even if you assume all the songs on the album should last 8% longer, the average length would become 2:56, still less than for punk. On the other hand, it’s doubtful that all songs on Sevillanas de los Cuarenta are too short. For example, Sevillanas del Espartero by Concha Piquer lasts 2:57 on Sevillanas de los Cuarenta, but Spotify has versions lasting only between 2:27 and 2:35.
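The «up to 8%» figure follows from the durations quoted above. A small sketch of the arithmetic, using only those three durations (the helper name is hypothetical):

```python
def to_seconds(duration):
    """Convert 'm:ss' to seconds."""
    minutes, seconds = duration.split(":")
    return int(minutes) * 60 + int(seconds)

album_version = to_seconds("2:29")
other_versions = [to_seconds("2:37"), to_seconds("2:41")]

# How much longer is each alternative version than the album version?
for other in other_versions:
    extra = (other / album_version - 1) * 100
    print(f"{other} s is {extra:.1f}% longer than {album_version} s")
```

The longest alternative version is the one that gives the upper bound of roughly 8%.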

1970s sevillanas
The sample of 1970s songs is from albums C, D and F of the HISPAVOX Sevillanas de Oro collection (CD versions), containing songs by los Marismeños, Amigos de Gines and others (not all Sevillanas de Oro albums contain the release year of the songs, but these do).

Discogs data
The Discogs data are available through an API and as monthly data dumps. I thought I’d spare myself the trouble of figuring out how the API works, so I opted for the data dump (the one for 1 December 2014). The downside is that the data is 2.8 GB zipped and 19.2 GB unzipped, so downloading and analysing the data takes a while.

The data dump is XML (the API should return JSON). I’m not really familiar with XML so I used some not very sophisticated, but effective, regex to sort it out. The data is organised in releases (e.g., albums) that have tags (e.g., for the year in which it was released and for genres and styles). The releases contain tracks that have their own tags, including duration. In order to filter out excessive track lengths I ignored any release containing the string «mix» and any track with a duration longer than one hour.

Discogs uses hundreds of genre and style tags including some quite specific ones like ranchera and rebetiko, but not sevillanas. I decided to include only tracks with sevillanas in the title. This will exclude some legitimate sevillanas, but I reckon there probably won’t be too many false positives.
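The filtering described in this section could look roughly like the sketch below. It is not the script used for the analysis. The XML fragment is a simplified, made-up stand-in for the real dump, the tag layout is an assumption, and matching «mix» and «sevillanas» case-insensitively is my choice for the example.

```python
import re

# Simplified, hypothetical fragment in the spirit of the releases dump;
# the real files have many more tags and attributes.
SAMPLE = """
<release id="1"><title>Sevillanas de Oro</title><released>1975</released>
<tracklist>
<track><title>Sevillanas del adiós</title><duration>3:22</duration></track>
<track><title>Sevillanas en directo</title><duration>61:05</duration></track>
<track><title>Intro</title><duration></duration></track>
</tracklist></release>
<release id="2"><title>Summer Mix 2014</title><released>2014</released>
<tracklist>
<track><title>Sevillanas megamix</title><duration>74:10</duration></track>
</tracklist></release>
"""

RELEASE = re.compile(r"<release\b.*?</release>", re.S)
TRACK = re.compile(
    r"<track>.*?<title>(.*?)</title>.*?<duration>(.*?)</duration>.*?</track>", re.S
)

def to_seconds(duration):
    """Convert 'm:ss' or 'h:mm:ss' to seconds."""
    seconds = 0
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

def sevillanas_tracks(xml):
    """Return (title, seconds) for tracks that pass the filters described in the text."""
    found = []
    for release in RELEASE.findall(xml):
        if "mix" in release.lower():
            continue  # skip any release containing the string "mix"
        for title, duration in TRACK.findall(release):
            if not re.fullmatch(r"\d+(:\d{1,2}){1,2}", duration):
                continue  # missing or malformed duration
            seconds = to_seconds(duration)
            if seconds > 3600:
                continue  # skip tracks longer than one hour
            if "sevillanas" in title.lower():
                found.append((title, seconds))
    return found
```

On the sample, only the first track survives: the second is longer than an hour, the third has no duration, and the second release is dropped because of «mix». For a 19.2 GB file you would read and process one release at a time rather than load everything into memory.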

Spotify data
I accessed the Spotify data through their Web API. As indicated in the article, genre searches returned only a few hundred results per genre, which suggests these tags are often omitted.
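A minimal sketch of what such a search might look like, assuming the search endpoint of the Web API and the `duration_ms` field on track objects. The sample response is made up and cut down to the fields used here, no request is actually sent, and the current API expects an access token in an Authorization header, which is left out.

```python
import json
from urllib.parse import urlencode

SEARCH_URL = "https://api.spotify.com/v1/search"

def search_url(query, limit=50, offset=0):
    """Build a track search URL; paging is done by increasing the offset."""
    params = {"q": query, "type": "track", "limit": limit, "offset": offset}
    return f"{SEARCH_URL}?{urlencode(params)}"

def durations_in_seconds(payload):
    """Extract track durations from a search response, converted from ms to seconds."""
    return [item["duration_ms"] / 1000 for item in payload["tracks"]["items"]]

# Made-up response, reduced to the fields used above.
SAMPLE_RESPONSE = json.loads("""
{"tracks": {"total": 2, "items": [
  {"name": "Sevillanas uno", "duration_ms": 164000},
  {"name": "Sevillanas dos", "duration_ms": 202000}
]}}
""")

title_search = search_url("sevillanas")
genre_search = search_url("genre:punk")
durations = durations_in_seconds(SAMPLE_RESPONSE)
```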

Plotting a waveform
Based on this discussion, plotting a waveform from a .wav music file using Python should be simple, but saving the plot turned out to be a problem. Googling the error message OverflowError: Allocated too many blocks taught me I’m not the only one having that problem, but I didn’t find a solution that worked for me. Instead I turned to R and found that the tuneR package will let you read and plot .wav files without a problem.
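For what it’s worth, below is a sketch of an approach that might sidestep the problem in Python: reduce the samples to a minimum and maximum per bucket before plotting, so the plot has a few hundred points instead of millions. Whether this avoids that particular error is an assumption; I haven’t tested it against it. The code uses only the standard library, assumes a mono 16-bit file, and builds a short synthetic tone in memory to stand in for a real music file.

```python
import io
import math
import struct
import wave

def make_test_wav(seconds=1, rate=8000, freq=440):
    """Build a small mono 16-bit WAV in memory to stand in for a music file."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        frames = bytearray()
        total = seconds * rate
        for n in range(total):
            fade = n / total  # tone that gets louder over time
            sample = int(32767 * fade * math.sin(2 * math.pi * freq * n / rate))
            frames += struct.pack("<h", sample)
        wav.writeframes(bytes(frames))
    buffer.seek(0)
    return buffer

def envelope(wav_file, buckets=500):
    """Return a (min, max) pair per bucket of samples for a mono 16-bit WAV."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getnchannels() != 1 or wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles mono 16-bit files")
        count = wav.getnframes()
        samples = struct.unpack(f"<{count}h", wav.readframes(count))
    size = max(1, count // buckets)
    chunks = (samples[i:i + size] for i in range(0, count, size))
    return [(min(chunk), max(chunk)) for chunk in chunks]

env = envelope(make_test_wav())
# The pairs in env can be handed to any plotting library, e.g. as a filled
# band between the minimum and maximum of each bucket.
```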

Scooters often faster than cars

Minister Schultz wants to give Amsterdam the option of banning scooters from the cycle path and having them use the road, wearing a helmet. This should make the cycle path safer for cyclists and reduce the amount of particulate matter they inhale. Car and scooter lobbyists, however, argue that the speed difference between cars and scooters is too large: with cars driving 50 km/h, it would not be safe for scooters to ride on the road.

But do motorists actually reach 50 km/h in Amsterdam? «Cycling professor» Marco te Brömmelstroet has tweeted a map showing that speeds during the evening rush hour are often far below 50 km/h.

As part of an open data initiative, Amsterdam has released about 5 million speed measurements taken on the Hoofdnet Auto during January 2014. The chart above shows that, even on this main network, most measurements recorded a speed below 50 km/h, with a median of 31 km/h. During the evening rush hour, speeds are on average another 5 km/h lower than at night.

A 2011 study by the Fietsersbond found that scooters ride at an average of 36.9 km/h on cycle paths in Amsterdam. The map shows on which roads cars drive at an average of at least 36.9 km/h (thin red lines) or 50 km/h (thick red lines). Note that the Fietsersbond may have measured scooter speeds with a different method than the one used to measure car speeds.

There have been jokes that scooter riders don’t want to ride on the road because they would then be forced to slow down. The city’s figures show there is a kernel of truth in that.

Scripts for the data analysis can be found here.

Scooters often faster than cars

Minister Schultz wants to allow Amsterdam to ban scooters from cycle paths and make them use the road, wearing a helmet. This should make cycle paths safer for cyclists and reduce their exposure to air pollution. However, car and scooter lobbyists argue that the speed difference between scooters and cars is too large for scooters to ride safely on the road, with motorists driving 50 km/h.

So do motorists really reach 50 km/h in Amsterdam? «Cycling professor» Marco te Brömmelstroet has tweeted a map showing rush hour speeds far below 50 km/h.

As part of its open data initiative, Amsterdam has released some 5 million speed measurements on the «Hoofdnet Auto» (the network of major roads for cars) during the month of January 2014. The histogram above shows that even on these main roads, the majority of measurements recorded a speed below 50 km/h, with a median speed of 31 km/h. Average speeds during the afternoon rush hour were about 5 km/h lower than at night.

A 2011 study by cyclists’ organisation Fietsersbond found an average speed for scooters on Amsterdam’s cycle paths of 36.9 km/h. The map shows roads where motorists drive on average at least 36.9 km/h (thin red line) or 50 km/h (thick red line). Note that the method by which the Fietsersbond measured scooter speed may be different from the method used to measure car speed.
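The comparison can be sketched as follows. This is not the script used for the analysis. The measurements are made up, the layout of (road segment, hour, speed) is an assumption about how such data might be organised, and the hours counted as rush hour and night are my choice for the example.

```python
from statistics import mean, median

SCOOTER_SPEED = 36.9  # km/h, Fietsersbond figure quoted in the text
SPEED_LIMIT = 50      # km/h

# Made-up measurements: (road segment, hour of day, speed in km/h).
measurements = [
    ("A", 3, 55), ("A", 17, 48), ("A", 18, 52),
    ("B", 3, 41), ("B", 17, 35), ("B", 18, 38),
    ("C", 3, 30), ("C", 17, 22), ("C", 18, 25),
]

overall_median = median(speed for _, _, speed in measurements)

# Assumed definitions: rush hour is 16:00-18:59, night is before 06:00.
rush_hour = mean(s for _, hour, s in measurements if 16 <= hour <= 18)
night = mean(s for _, hour, s in measurements if hour < 6)

# Average speed per road segment decides how it is drawn on the map.
by_segment = {}
for segment, _, speed in measurements:
    by_segment.setdefault(segment, []).append(speed)

def line_style(average_speed):
    """Thick line at 50 km/h or more, thin line at scooter speed or more."""
    if average_speed >= SPEED_LIMIT:
        return "thick"
    if average_speed >= SCOOTER_SPEED:
        return "thin"
    return None

styles = {segment: line_style(mean(speeds)) for segment, speeds in by_segment.items()}
```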

There have been jokes that scooter riders don’t want to use the road because this would force them to reduce their speed. The city’s data show there’s actually some truth to this.

Scripts for processing the data can be found here.