Scraping Airbnb
Airbnb is not exactly keen to share data that might help analyse its impact on local housing markets. In 2016, the Amsterdam Municipality decided to collect Airbnb data using a scraper - a computer programme that automates the job of retrieving information from web pages.
Amsterdam is not the only government to use web scraping. Increasingly, this technique is used to obtain data about topics ranging from consumer prices to job vacancy statistics and business data. Collecting data from the internet has advantages, but it also poses some challenges. It may be difficult to aggregate data coming from different websites, and data found online may not cover all aspects of a phenomenon you’re trying to understand (for example, not all job vacancies are published online). On a more practical level, your web scraper code may break when websites change.
In March 2017, Amsterdam reported that its weekly scrapes of major platforms like Airbnb required little maintenance. But last week, it sent a report to the city council describing how Airbnb has been making changes to its website - perhaps in an attempt to frustrate efforts to collect information about its business practices. Initially, Amsterdam’s digital surveillance department successfully updated its scraper, but following new changes to the Airbnb website since May 2018, Amsterdam now appears to have given up on scraping Airbnb.
This made me curious about the technical characteristics of the Airbnb website. Here are some observations, based on an (admittedly superficial) examination:
- The initial download of a web page isn’t the final version: after downloading, the contents of the page are dynamically altered using JavaScript. For some purposes like navigating search results, you may prefer the final version of the page, which you can get using Selenium. Selenium would especially come in handy for interacting with the calendar to get availability and price information, which seems to be rather tricky.
- Some details on listings only appear to be available in the JavaScript code. You can find them using patterns like
'\"lat\"\:(.*?),\"lng\"\:(.*?),'
- Airbnb uses NGINX to control access to its website. If you request too many pages too fast, you’ll hit a rate limit and get an error page. I guess it should be possible to avoid the rate limit by adding pauses to your programme, but it may take quite some time to figure out how often and how long they should be.
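To make the last two observations a bit more concrete, here is a minimal sketch in Python. It applies the pattern shown above to a page's source and adds a pause between requests. The function names, the injected `fetch` callable and the five-second default pause are my own assumptions for illustration; I haven't worked out what pause Airbnb's rate limit would actually tolerate.

```python
import re
import time

# The pattern from the text, written as a raw string (the backslashes
# in the original are not needed for the quotes and colons).
COORD_PATTERN = re.compile(r'"lat":(.*?),"lng":(.*?),')


def extract_coordinates(page_source):
    """Return (lat, lng) as floats, or None if the pattern is not found."""
    match = COORD_PATTERN.search(page_source)
    if match is None:
        return None
    try:
        return float(match.group(1)), float(match.group(2))
    except ValueError:
        # The captured text was not a plain number
        return None


def fetch_politely(urls, fetch, pause_seconds=5.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests.

    `fetch` is any function that takes a URL and returns the page source
    as a string (hypothetical; plug in your own download code).
    The pause length is a guess, not a known safe value.
    """
    pages = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(pause_seconds)
        pages.append(fetch(url))
    return pages
```

Passing the download function in as an argument keeps the pausing logic separate from whatever you use to get pages, whether that's a plain HTTP request or Selenium.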
While it appears that barriers to scraping the Airbnb website may be surmountable, it’s quite possible that I underestimate what this would take. If you actually built a scraper and used it to frequently collect information about all local listings, all kinds of new problems might arise.
Meanwhile, other sources of Airbnb data are available. In a previous post, I used data made available by Tom Slee and by Murray Cox’s Inside Airbnb. Slee has since stopped updating his data, but Inside Airbnb is still active. As the Amsterdam Municipality notes in its report, Inside Airbnb has successfully adapted its scraping technique each time Airbnb changed its website.
UPDATE 13 May - See comments on Twitter: Jens von Bergmann from Vancouver also has a scraper that is working. Following some requests, Tom Slee recently updated his scraper; his code is available on GitHub.
UPDATE 20 November - Last spring, the city government relaunched its digital inspection efforts under the name ‘project Sahara’. This was revealed in a report by the local Accounting Office published today. Difficulties remain: «Bottlenecks include being dependent on the information available on the platforms’ websites and digital inspection being a labour-intensive process. Each time the structure of the information changes, the ‘scrape’ has to be adjusted as well.» Still, digital inspection contributes to more effective enforcement, the Accounting Office finds. Amsterdam intends to step up its digital research efforts and has expanded its team of data analysts.
UPDATE 14 December - Yet another Airbnb scraper. The purpose here is not to monitor Airbnb’s impact on the housing market, but rather to optimise return on investment by turning houses into Airbnb businesses.
UPDATE 19 April 2020 - Using webscraped data on Airbnb listings, as well as other indicators, the city of Amsterdam has decided to ban holiday rentals in the neighbourhoods Burgwallen Oude Zijde, Burgwallen Nieuwe Zijde and Grachtengordel Zuid, all located in the city centre. The ban will enter into force on 1 July this year.
According to the Volkskrant, Airbnb objected that 95% of houses offered for rent on its platform are not located within those three neighbourhoods. While this piece of information is irrelevant for the decision taken by the city, it does seem to validate the data collected by Inside Airbnb. In its dataset for Amsterdam published on 13 March 2020, 922 out of 19,635 Airbnb listings are within those three neighbourhoods, according to the coordinates provided.
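The comparison is simple arithmetic on the two counts quoted above; a quick check in Python (the variable names are mine):

```python
# Counts from the Inside Airbnb dataset for Amsterdam of 13 March 2020
listings_in_three_neighbourhoods = 922
total_listings = 19_635

share_inside = listings_in_three_neighbourhoods / total_listings
share_outside = 1 - share_inside

print(f"{share_inside:.1%} inside, {share_outside:.1%} outside")
# prints "4.7% inside, 95.3% outside"
```

So about 95% of listings in the dataset fall outside the three neighbourhoods, which is in line with the figure Airbnb cited.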
The City of Amsterdam’s selection of neighbourhoods where holiday rentals will be banned was based on an analysis (pdf) of ‘toeristische draagkracht’ (ability to support [more] tourism) of neighbourhoods. One of the indicators used is the number of unique Airbnb listings per 1,000 population, based on data collected through webscrapes carried out at ‘multiple random moments per month’. The analysis is limited to Airbnb listings, as Airbnb controls 80% of the holiday rental market in Amsterdam, and also because webscrapes of other platforms are less reliable.
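I don't know exactly how the city computes this indicator, but one plausible reading is: pool the listing IDs seen across all scrapes in a period, count each ID once, and scale by population. A rough sketch under that assumption, with a hypothetical function name and made-up numbers in the usage example:

```python
def unique_listings_per_1000(scrapes, population):
    """Unique listing IDs across several scrapes, per 1,000 residents.

    `scrapes` is a list of collections of listing IDs, one per scrape.
    This is my interpretation of the indicator, not the city's documented method.
    """
    if population <= 0:
        raise ValueError("population must be positive")
    unique_ids = set()
    for scrape in scrapes:
        unique_ids.update(scrape)
    return 1000 * len(unique_ids) / population


# Made-up example: three scrapes of a neighbourhood with 4,000 residents
example_scrapes = [{101, 102, 103}, {102, 103, 104}, {101, 105}]
example_rate = unique_listings_per_1000(example_scrapes, 4000)
print(example_rate)  # 5 unique listings -> 1.25 per 1,000 population
```

Counting unique IDs across scrapes matters because a listing that is visible in several scrapes would otherwise be counted more than once.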
Other indicators used include hotel capacity, coffeeshops per 1,000 population and reported liveability.
The new measure was announced at a moment when demand for Airbnb has dropped due to corona. Reportedly, houses are temporarily taken off Airbnb and offered for rent as ‘normal’ houses with three-month leases.