champagne anarchist | armchair activist

Data

How to do fuzzy matching in Python

Statistics Netherlands (CBS) has an interesting dataset containing data at the city, district and neighbourhood levels. However, some names of neighbourhoods have changed, specifically between 2010 and 2011 for Amsterdam. For example, Bijlmer-Centrum D, F en H was renamed Bijlmer-Centrum (D, F, H).

In some of those cases the neighbourhood codes have changed as well, and CBS doesn’t have conversion tables. So this is one of those cases where you need fuzzy string matching.

There’s a good Python library for that job: Fuzzywuzzy. It was developed by SeatGeek, a company that scrapes event data from a variety of websites and needed a way to figure out which titles refer to the same event, even if the names have typos and other inconsistencies.

Fuzzywuzzy will compare two strings and compute a score between 0 and 100 reflecting how similar they are. It can use different methods to calculate that score (e.g. fuzz.ratio(string_1, string_2) or fuzz.partial_ratio(string_1, string_2)). Some of those methods are described in this article, which is worth a read.
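
To get a feel for what such a score looks like, here is a minimal sketch that uses only the standard library. It is a stand-in, not Fuzzywuzzy itself: the helper name similarity is made up for this example, and it only approximates the idea behind a simple ratio by scaling difflib's similarity to 0-100.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0-100 similarity score for two strings (stdlib stand-in)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


old = 'Bijlmer-Centrum D, F en H'
new = 'Bijlmer-Centrum (D, F, H)'

# The renamed neighbourhood should score much higher than an unrelated one
print(similarity(old, new))
print(similarity(old, 'Vondelbuurt'))
```

Identical strings score 100, and the score drops as the strings share fewer characters in the same order.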

Alternatively, you can take a string and have Fuzzywuzzy pick the best match(es) from a list of options (e.g., process.extract(string, list_of_strings, limit=3) or process.extractOne(string, list_of_strings)). Here, too, you could specify the method to calculate the score, but you may want to first try the default option (WRatio), which will figure out which method to use. The default option seems to work pretty well.
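
The idea of picking the best match from a list can be sketched with the standard library as well. Again, this is an illustration under assumptions: extract_one and score are hypothetical helpers, and the scoring is a plain difflib ratio rather than Fuzzywuzzy's WRatio, so the scores will differ from what the library reports.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Plain 0-100 similarity score based on difflib (not WRatio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def extract_one(query, choices):
    """Return (best_choice, score) for the choice most similar to query."""
    scored = ((choice, score(query, choice)) for choice in choices)
    return max(scored, key=lambda pair: pair[1])


names_2011 = ['Bijlmer-Centrum (D, F, H)', 'Vondelbuurt', 'Ijplein/Vogelbuurt']

print(extract_one('Bijlmer-Centrum D, F en H', names_2011))
print(extract_one('IJplein en Vogelbuurt', names_2011))
```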

Here’s the code I used to match the 2010 CBS Amsterdam neighbourhood names to those for 2011:

import pandas as pd
from fuzzywuzzy import process
 
# Prepare data
 
colnames = ['name', 'level', 'code']
 
data_2010 = pd.read_excel('../data/Kerncijfers_wijken_e_131017211256.xlsx', skiprows=4)
data_2010.columns = colnames
data_2010 = data_2010[data_2010.level == 'Buurt']
names_2010 = data_2010['name']
 
data_2011 = pd.read_excel('../data/Kerncijfers_wijken_e_131017211359.xlsx', skiprows=4)
data_2011.columns = colnames
data_2011 = data_2011[data_2011.level == 'Buurt']
names_2011 = data_2011['name']
 
# Actual matching
 
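recode = {}  # maps each 2010 name to its best-matching 2011 name
for name in names_2010:
    # best_match[0] is the matched name, best_match[1] its score (0-100)
    best_match = process.extractOne(name, names_2011)
    if best_match[1] < 100:
        # Print imperfect matches so they can be checked by hand
        print(name, best_match)
    recode[name] = best_match[0]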
 

It prints all matches with a score below 100 so you can inspect them in case there are any incorrect matches (with larger datasets this may not be feasible). With process.extractOne and its default scorer I didn’t get any incorrect matches, but with fuzz.partial_ratio, IJplein en Vogelbuurt was matched with Vondelbuurt instead of Ijplein/Vogelbuurt.
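
Since the codes changed along with the names, the name matches can also be turned into a conversion table between old and new codes. Here is a rough sketch of that step, using the standard library instead of Fuzzywuzzy. The function build_code_table, the min_score cut-off of 90 and the codes in the example are all made up for illustration; they are not real CBS codes.

```python
from difflib import SequenceMatcher


def score(a, b):
    """Plain 0-100 similarity score based on difflib (stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def build_code_table(codes_old, codes_new, min_score=90):
    """Match names and return (table, review).

    codes_old and codes_new map neighbourhood name -> code.
    table maps old code -> new code; review lists the matches that
    scored below min_score and deserve a manual check.
    """
    table, review = {}, []
    for name, code in codes_old.items():
        best = max(codes_new, key=lambda candidate: score(name, candidate))
        best_score = score(name, best)
        table[code] = codes_new[best]
        if best_score < min_score:
            review.append((name, best, best_score))
    return table, review


# Made-up codes, for illustration only
codes_2010 = {'Bijlmer-Centrum D, F en H': 'old-01', 'Vondelbuurt': 'old-02'}
codes_2011 = {'Bijlmer-Centrum (D, F, H)': 'new-07', 'Vondelbuurt': 'new-02'}

table, review = build_code_table(codes_2010, codes_2011)
print(table)
print(review)
```

Keeping the doubtful matches in a separate list makes it easier to check them by hand before trusting the table.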

PS In case you’re actually going to work with the local CBS data, you should know that Amsterdam’s neighbourhoods (buurten) were reclassified as districts (wijken) in 2016, when a more detailed set of neighbourhoods was introduced. You can translate 2015 neighbourhood codes to 2016 district codes:

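def convert_code(x):
    """Turn a 2015 neighbourhood code (BU...) into a 2016 district code (WK...)."""
    # Swap the two-letter prefix for 'WK'
    x = 'WK' + x[2:]
    # Keep the first six characters and the last two, dropping those in between
    x = x[:6] + x[-2:]
    return x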
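
A quick usage example of the conversion. The input code below is invented for illustration, and I’m assuming the 2015 codes are ten characters long: 'BU' followed by eight digits.

```python
def convert_code(x):
    """Turn a 2015 neighbourhood code (BU...) into a 2016 district code (WK...)."""
    x = 'WK' + x[2:]     # swap the prefix
    x = x[:6] + x[-2:]   # keep the first six characters and the last two
    return x


# Invented example code in the assumed ten-character format
example = 'BU03630012'
print(convert_code(example))  # WK036312
```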

New contenders for the Amsterdam city council

Het Parool discusses the chances of a number of newcomers winning a seat on the Amsterdam city council for the first time on 21 March next year. By way of background, the table below shows what percentage of the vote they received in Amsterdam in the parliamentary election last March.



Party                    Result 2017 (%)
DENK                     6.9
Artikel 1                2.5
50PLUS                   1.9
ChristenUnie             1.5
Forum voor Democratie    1.2
Piratenpartij            0.5

The Amsterdam council has 45 seats, so you need a little over two percent of the vote to win a seat.
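
The two-percent figure is simply 100 percent divided by 45 seats. A small sketch of that arithmetic, using the percentages from the table above. It ignores how remainder seats are allocated, so treat the threshold as a rough guide only.

```python
SEATS = 45

# Share of the Amsterdam vote in the 2017 parliamentary election (%)
results_2017 = {
    'DENK': 6.9,
    'Artikel 1': 2.5,
    '50PLUS': 1.9,
    'ChristenUnie': 1.5,
    'Forum voor Democratie': 1.2,
    'Piratenpartij': 0.5,
}

# Share of the vote that corresponds to one full seat
quota = 100 / SEATS

# Parties whose 2017 share was at least one full seat's worth
above_quota = [party for party, share in results_2017.items() if share >= quota]

print(round(quota, 2))
print(above_quota)
```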

Of course, the result of a national election cannot simply predict next year’s outcome. For instance, Het Parool suggests that the Piratenpartij might benefit from the fact that on 21 March we will probably also get to vote on the Sleepwet (the ‘dragnet’ surveillance law). The PVV is not taking part in Amsterdam, which could work out well for FvD.

Wikinews is a great idea, but is it viable?

I’ve decided to start reposting relevant Dutch-language articles from Wikinews on News from Amsterdam. Wikinews is a sister project of Wikipedia. Its contributors describe themselves as follows:

We are a group of volunteers whose mission is to present reliable, unbiased and relevant news. All our content is released under a free license. By making our content perpetually available for free redistribution and use, we hope to contribute to a global digital commons. Wikinews stories are written from a neutral point of view to ensure fair and unbiased reporting.

It would seem a bit naive to simply claim that your writing is neutral and unbiased. The strength of Wikinews rather lies in providing a place where the merits of a story and its sourcing can be discussed on the basis of arguments. In times of clickbait, hoaxes and fake news, that’s an interesting concept.

Wikinews operates without ads (which matters). Their website may look a bit austere, but since it’s all open source, anyone can reuse the content with a different layout.

For now, Wikinews mainly consists of syntheses of news published by other media, but the site also invites other types of stories such as investigative reporting, (photo) reports and opinion articles. This could also provide room for another goal of Wikinews, which is to cover stories that are underreported in other media.

But is Wikinews viable? The consensus seems to be that it’s not. As Jonathan Dee of the New York Times put it in July 2007: «Wikinews … has sunk into a kind of torpor; lately it generates just 8 to 10 articles a day on a grab bag of topics that happen to capture the interest of its fewer than 26,000 users worldwide …».

That was ten years ago. Since then, activity on the English-language version of Wikinews has declined further, as the chart below shows. Meanwhile, the number of articles on Dutch-language Wikinews has risen remarkably.

Perhaps this is a temporary burst of enthusiasm that will fade after a while. Then again, maybe it’ll last and Wikinews will reinvent itself. It’ll be interesting to see how this develops.

Charts on mobile screens - always tricky

The chart shown here is from Dutch national statistics office CBS and it was created with Highcharts (or at least that name appears a few dozen times in the source code). It shows working people with second jobs and it’s from this page.

There’s a problem with the labels on the x-axis: they have been abbreviated and only show the first digit of the years. This could easily have been avoided by showing fewer labels when the chart is displayed on a narrow screen. In fact, that solution is used in the other graphs in the same article. Apparently, Highcharts doesn’t adjust the number of labels automatically, or it doesn’t do so consistently.
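
The idea of showing fewer labels on a narrow screen is simple enough to sketch. This is not Highcharts code (that would be JavaScript and I haven’t checked which settings CBS uses); it’s just an illustration in Python, with a made-up helper thin_labels and a made-up range of years.

```python
import math


def thin_labels(labels, max_labels):
    """Keep every n-th label so at most max_labels remain; blank out the rest."""
    if max_labels < 1:
        raise ValueError('max_labels must be at least 1')
    # Step size needed to stay within max_labels (at least 1)
    step = max(1, math.ceil(len(labels) / max_labels))
    return [label if i % step == 0 else '' for i, label in enumerate(labels)]


# Made-up range of years, for illustration only
years = [str(year) for year in range(2003, 2018)]

# On a narrow screen, allow only five labels but keep them complete
print(thin_labels(years, 5))
```

The labels that remain are shown in full, which is easier to read than fifteen labels cut down to a single digit.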

It appears that CBS has chosen Highcharts because it’s an easy way to create charts with added functionality. And some of that functionality seems to make sense: I can imagine people using the option of downloading a PNG to use it in a report or share it on social media.

However, it may not always be a good idea to rely on standard solutions. Here’s another example where the labels have been abbreviated. It’s a bit of a challenge to figure out what they mean.

I know it can be a pain to get charts to work well on different types of screens (let alone network graphs). Apparently, you cannot simply rely on Highcharts to get it right. CBS should probably assign someone to edit each chart individually, making sure it is displayed properly on different screen types.

As for the contents of the charts: here’s why it’s not OK that more and more Dutch workers need to take a second job to make ends meet (in Dutch).
