JavaScript

Scraping Airbnb

Airbnb is not exactly keen to share data that might help analyse its impact on local housing markets. In 2016, the Amsterdam Municipality decided to collect Airbnb data using a scraper - a computer programme that automates the job of retrieving information from web pages.

Amsterdam is not the only government to use web scraping. Increasingly, this technique is used to obtain data about topics ranging from consumer prices to jobs vacancy statistics and business data. Collecting data from the internet has advantages, but it also poses some challenges. It may be difficult to aggregate data coming from different websites, and data found online may not cover all aspects of a phenomenon you’re trying to understand (for example, not all job vacancies are published online). On a more practical level, your web scraper code may break when websites change.

In March 2017, Amsterdam reported that its weekly scrapes of major platforms like Airbnb required little maintenance. But last week, it sent a report to the city council describing how Airbnb has been making changes to its website - perhaps in an attempt to frustrate efforts to collect information about its business practices. Initially, Amsterdam’s digital surveillance department succesfully updated its scraper, but following new changes to the Airbnb website since May 2018, Amsterdam now appears to have given up on scraping Airbnb.

This made me curious about the technical characteristics of the Airbnb website. Here are some observations, based on an (admittedly superficial) examination:

  • The initial download of a web page isn’t the final version: after downloading, the contents of the page are dynamically altered using Javascript. For some purposes like navigating search results, you may prefer the final version of the page, which you can get using Selenium. Selenium would especially come in handy for interacting with the calendar to get availability and price information, which seems to be rather tricky.
  • Some details on listings only appear to be available in the Javascript code. You can find them using patterns like '\"lat\"\:(.*?),\"lng\"\:(.*?),'
  • Airbnb uses NGINX to control access to its website. If you request too many pages too fast, you’ll hit a rate limit and get an error page. I guess it should be possible to avoid the rate limit by adding pauses to your programme, but it may take quite some time to figure out how often and how long they should be.

While it appears that barriers to scraping the Airbnb website may be surmountable, it’s quite possible that I underestimate what this would take. If you’d actually build a scraper and would use it to frequently collect information about all local listings, all kinds of new problems might arise.

Meanwhile, other sources of Airbnb data are available. In a previous post, I used data made available by Tom Slee and by Murray Cox’ Inside Airbnb. Slee has since stopped updating his data, but Inside Airbnb is still active. As the Amsterdam Municipality notes in its report, Inside Airbnb has succesfully adapted its scraping technique each time Airbnb changed its website.

UPDATE 13 May - See comments on Twitter: Jens von Bergmann from Vancouver also has a scraper that is working. Following some requests, Tom Slee recently updated his scraper; his code is available on Github.

Search Facebook by date

Henk van Ess and Daniel Endresz have created a tool to search Facebook by date or date range. The tool creates a url containing the search criteria (as with Facebook Graph). It uses Javascript to generate the search urls. For example, this is how the date range url is generated:

function generate_url_timerange() {
 
    var keyword = $('#input-timerange-keyword').val();
 
    var day1 = $('#select-timerange-day1').val();
    var month1 = $('#select-timerange-month1').val();
    var year1 = $('#select-timerange-year1').val();
 
    var day2 = $('#select-timerange-day2').val();
    var month2 = $('#select-timerange-month2').val();
    var year2 = $('#select-timerange-year2').val();
 
    var url = 'https://www.facebook.com/search/str/'+keyword+'/stories-keyword/'+day1+'/'+month1+'/'+year1+'/date-3/'+day2+'/'+month2+'/'+year2+'/date-3/stories-2/intersect'
 
    $('#btn-search-timerange').attr('href', url);
}

The tool has been published with an open source license. The creators indicate that they «respect your privacy and the cases you are working on, so we are not storing any searches you will make» - which is nice, even if it would seem to be of little consequence since you need to be logged into Facebook to use the tool.

Embedding tweets in Leaflet popups

I just created a map showing where so-called Biro’s (small cars) are parked on the pavement and annoying people. Twitter has quite a few photos of the phenomenon. In some cases, finding their location took a bit of detective work.

First you’ll need the embed code for the tweet. You can get it manually from the Twitter website, but if you want to automate your workflow, use a url like the one below. It’ll download a bit of json containing the embed code:

https://publish.twitter.com/oembed?url=https://twitter.com/nieuwsamsterdam/status/958761072214896640

When trying to embed the tweets in Leaflet popups, I ran into a few problems:

  • When popups open, the markers didn’t properly move down. As a result, most of the popup would be outside the screen. The problem and how to solve it are described here.
  • Twitter embed code contains a script tag to load a widget. Apparently you can’t execute javascript by adding it directly to the html for the popup content, but you can add it using a selector (cf here).

Here’s the code that’ll solve both problems:

map.on('popupopen', function(e) {
    $.getScript("https://platform.twitter.com/widgets.js");
    var px = map.project(e.popup._latlng); 
    px.y -= e.popup._container.clientHeight;
    map.panTo(map.unproject(px),{animate: true});
});

You may also want to do something about the width of the popups, because otherwise they will obscure most of the map on mobile screens and it will be difficult to close a popup (which you can normally do by clicking outside of the popup). You can change the width of embedded tweets, but this will not change the width of the popup itself. A simple solution is to give popups a maxWidth of 215 (.bindPopup(html, {maxWidth: 215})).

Of course, you could also vary maxWidth depending on screen width, but I think 215px works well on all screens. Further, embedded tweets appear to have a minimum width of about 200px, so if you want popups narrower than 215px you’ll have to figure out a way to fix that.

If you embed tweets, Twitter can track people who visit your webpage. Add <meta name="twitter:dnt" content="on"> to your page and Twitter promises they won’t track your visitors. I wasn’t sure whether this should be put in the web page itself or in the html content of the popups (I opted for both).

If the popups have a somewhat spartan look and do not contain photos: Good for you! You’re probably using something like Firefox with tracking protection enabled. This blocks sites which have been identified as ‘engaging in cross-site tracking of users’ - including, apparently, platform.twitter.com.

DuckDuckGo shows code examples

Because of Google’s new privacy warning, I finally changed my default search engine to DuckDuckGo.[1] So far, I’m quite happy with it. I was especially pleased when I noticed they sometimes show code snippets or excerpts from documentation on the results page.

Apparently, DDG has decided that it wants to be «the best search engine for programmers». One feature they’re using are the instant answers that are sometimes shown in addition to the ‘normal’ search results. These instant answers may get their contents from DDGs own databases - examples include cheat sheets created for the purpose - or they may use external APIs, such as the Stack Overflow API. Currently, volunteers are working to improve search results for the top 15 programming languages, including Javascript, Python and R.

One could argue that instant answers promote the wrong kind of laziness - copying code from the search results page rather than visit the original post on Stack Overflow. But for quickly looking up trivial stuff, I think this is perfect.


  1. I assume the contents of the privacy warning could have been reason to switch search engines, but what triggered me was the intrusive warning that Google shows in each new browsers session - basically punishing you for having your browser throw away cookies.  ↩

Interactive charts - are Dygraphs or Plotly alternatives for D3?

There are quite a few Javascript libraries that you can use to create interactive graphs (with the added bonus that your graphs look crisp: somehow my PNG images always end up looking blurry). Some of these libraries are based on D3.js and are intended to make coding with D3 easier:

The sheer number of D3 based charting tools gives a good indication of how much people love the D3’s functionality, and yet actually hate coding with D3 directly.

Personally, I don’t hate coding with D3. Still, because I only use it sporadically, I’m constantly figuring out how things worked again (unsurprisingly, this involves a lot of code recycling and a lot of googling). So I wouldn’t be averse to an easier alternative for creating basic line graphs that don’t really require the advanced capabilities of D3.

The other day, I posted an article on how newspapers use the word illegal, which included two rather dense spaghetti graphs (I’m using the term in the non-technical sense of the word). I added dropdowns to link labels to lines. Labour relations scholar Maarten Hermans suggested I take a look at Dygraphs instead. A few days later, I read at Flowing Data that the Plotly.js library is being open sourced. So let’s give them a try (note that I haven’t checked how they do on mobile devices).

Dygraphs

As far as I can tell, Dygraphs can only create line graphs. Below is a Dygraphs version of one of the spaghetti graphs from my article on the word illegal.

I have no problem with their somewhat dull colour scheme, although I’d probably change that. Note that you can click-and-drag to zoom in on a part of the chart and double-click to zoom out again.

This chart type isn’t really suitable for a graph with this many lines: when you hover your mouse over the graph, the labels obscure much of the graph. You could move the labels outside the graph area, but then you’d have to reserve quite a bit of space to accomodate them.

So I also did a version of a simpler graph, from an article in which I traced the use of the word machine for bicycle.

In this chart I made some modifications (line colour, position of the labels). I found that making these changes is relatively straightforward and well-documented. I’m not entirely happy yet with how the labels look, but other than that I think the graph looks pretty much OK.

Plotly

Same approach with Plotly: here’s a version of the spaghetti graph from the article on the word illegal:

Plotly has similar click-and-drag / double-click functionality to zoom in and out as Dygraphs (it works slightly different in that you can also zoom in vertically). I think the way they show the labels to the right of the chart is OK. If you click them, the line associated with the label disappears and reappears, so you can easily find the line associated with a label. It would be even better if clicking a label would make the associated line toggle between grey and coloured.

I’m not happy with what happens when you hover your mouse over the graph. You get a menu to the top that offers functionality that strikes me as superfluous. And the labels that pop up are obviously a mess with a spaghetti graph like this one.

So here’s a Plotly version of the simpler bicycle graph:

That’s better, although I still haven’t figured out how to get the title to left-align (CSS doesn’t seem to work and here they only explain how to set font-family and size). More importantly, I still think there’s too much stuff popping up when you hover your mouse over the graph.

Discussion

I can see myself using Dygraphs on occasion if I need a simple line graph. As for Plotly, I don’t like all the clutter in their charts and while I’m sure you can get rid of that, I’m afraid that would defeat the purpose of making it easier to create graphs.

Come to think of it, it seems that quite a few data visualisation libraries, and I’m not just thinking of Plotly here, haven’t really outgrown the «look at all the neat tricks I can do» phase. Until they move in the direction of Edward Tufte’s more minimalistic approach to data visualisation, I think I’ll rather keep struggling with D3.

Pages