Embedding D3.js charts in a responsive website

UPDATE - better approach here.

For a number of reasons, I like to use D3.js for my charts. However, I’ve been struggling for a while to get them to behave properly on my blog, which has a responsive theme. I’ve tried quite a few solutions from Stack Overflow and elsewhere but none seemed to work.

I want to embed the chart using an iframe. The width of the iframe should adapt to the column width and the height to the width of the iframe, maintaining the aspect ratio of the chart. The chart itself should fill up the iframe. Preferably, when people rotate their phone, the size of the iframe and its contents should update without the need to reload the entire page.

Styling the iframe

Smashing Magazine has described a solution for embedding videos. You enclose the iframe in a div and use CSS to add a bottom padding of, say, 40% to that div (the percentage depending on the aspect ratio you want). Because percentage padding is calculated relative to the width of the containing block, the height of the div then follows its width. You can then set both width and height of the iframe itself to 100%. Here’s an adapted version of the code:

<style>
.container_chart_1 {
    position: relative;
    padding-bottom: 40%;
    height: 0;
    overflow: hidden;
}
 
.container_chart_1 iframe {
    position: absolute;
    top:0;
    left: 0;
    width: 100%;
    height: 100%;
}
</style>
 
<div class ='container_chart_1'>
<iframe src='https://dirkmjk.nl/2016/embed_d3/chart_1.html' frameborder='0' scrolling = 'no' id = 'iframe_chart_1'>
</iframe>
</div>

Making the chart adapt to the iframe size

The next question is how to make the D3 chart adapt to the dimensions of the iframe. Here’s what I thought might work but didn’t: in the chart, obtain the dimensions of the iframe using window.innerWidth and window.innerHeight (minus 16px - something to do with scrollbars apparently?) and use those to define the size of your chart.

Using innerWidth and innerHeight seemed to work - until I tested it on my iPhone. Upon loading a page it starts out OK, but then the update function increases the size of the chart until only a small detail is visible in the iframe (rotate your phone to replicate this). Apparently, iOS returns not the dimensions of the iframe but something else when innerWidth and innerHeight are used. I didn’t have that problem when I tested on an Android phone.

Adapt to the iframe size: Alternative solution

Here’s an alternative approach for making the D3 chart adapt to the dimensions of the iframe. Set width to the width of the div that the chart is appended to (or to the width of the body) and set height to width * aspect ratio. Here’s the relevant code:

var aspect_ratio = 0.4;
var frame_width = $('#chart_2').width();
var frame_height = aspect_ratio * frame_width;

The disadvantage of this approach is that you’ll have to set the aspect ratio in two places: both in the CSS for the div containing the iframe and in the HTML page that is loaded in the iframe. So if you decide to change the aspect ratio, you’ll have to change it in both places. Other than that, it appears to work.

Reloading the chart upon window resize

Then write a function that reloads the iframe content upon window resize, so as to adapt the size of the chart when people rotate their phone. Note that on mobile devices, scrolling may trigger the window resize. You don’t want to reload the contents of the iframe each time someone scrolls the page. To prevent this, you may add a check whether the window width has changed (a trick I picked up here). Also note that with Drupal, you need to use jQuery instead of $.

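// Remember the current window width, to tell real resizes from scroll-triggered ones
var width = jQuery(window).width();
jQuery(window).resize(function(){
    if(jQuery(window).width() != width){
        // Reassigning src reloads the iframe content
        var iframe = document.getElementById('iframe_chart_1');
        iframe.src = iframe.src;
        width = jQuery(window).width();
    }
});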

In case you know a better way - do let me know!

FYI, here’s the chart used as illustration in its original context.

Assignment 2-4

In previous assignments I’ve looked into the association between union membership and political participation among paid employees, using the Outlook On Life surveys dataset. I found that respondents who have a union member in their household are more likely to have engaged in political participation over the past 2 years. This was consistent with what I expected on the basis of a study by Kerrissey and Schofer.

In the present assignment we’re to check for a potential moderator. The study by Kerrissey and Schofer found that the association between union membership and political participation is stronger for lower educated respondents, possibly because they have fewer other sources of political capital at their disposal.

Against this background I decided to test the association between union membership and political participation for different subgroups based on education. The OOL dataset has a variable with four education levels (less than high school; high school; some college; bachelor’s degree or higher). Since there are relatively few respondents with less than high school, I decided to lump together the first two categories.

First of all, here’s a grouped bar chart showing what percentage of respondents have engaged in political participation, by union membership (at household level) and by education level. Political participation levels appear higher for higher educated respondents, which will not come as a surprise. More surprisingly, the association between union membership and political participation appears stronger for higher educated respondents.

So let’s take a look at the chi squares for the different education levels. The entire Python script for my analysis can be found here. Below I copy some of the output from the code:

measure: "political_participation", group: "employees"
 
 
Results for "low"
union                     No  Yes
political_participation          
0.0                      158   36
1.0                       87   24
 
chi-square value, p value, expected counts
(0.24815891922850686, 0.61837443272471648, 1, array([[ 155.83606557,   38.16393443],
       [  89.16393443,   21.83606557]]))
 
 
Results for "medium"
union                     No  Yes
political_participation          
0.0                      130   26
1.0                      124   41
 
chi-square value, p value, expected counts
(2.7736422284672679, 0.095827887556796373, 1, array([[ 123.43925234,   32.56074766],
       [ 130.56074766,   34.43925234]]))
 
 
Results for "high"
union                     No  Yes
political_participation          
0.0                      157   27
1.0                      154   60
 
chi-square value, p value, expected counts
(9.5760783080978147, 0.0019712903653131314, 1, array([[ 143.77889447,   40.22110553],
       [ 167.22110553,   46.77889447]]))

The results show that the chi square value is smallest for the lowest education group and largest for the highest education group, and that it is only significant for the highest education group (note that a post-hoc test is not required because the explanatory variable has only two levels).
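To make it easier to see where these numbers come from, here is a minimal sketch that recomputes the chi square values from the observed counts printed above, using only the standard library. It is not the script linked above. It assumes that the values were produced with the continuity correction that scipy.stats.chi2_contingency applies by default to 2x2 tables; the function name chi2_2x2 is made up for this illustration.

```python
import math

def chi2_2x2(table):
    """Chi-square test of independence for a 2x2 table, with Yates' correction (df = 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # expected count if the two variables were unrelated
            expected = row_totals[i] * col_totals[j] / n
            # Yates: shrink |observed - expected| by 0.5, but never below zero
            diff = abs(table[i][j] - expected)
            diff -= min(0.5, diff)
            chi2 += diff ** 2 / expected
    # with one degree of freedom, the tail probability equals erfc(sqrt(chi2 / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Observed counts for paid employees, copied from the output above
# rows: political participation 0 / 1; columns: union No / Yes
tables = {
    'low': [[158, 36], [87, 24]],
    'medium': [[130, 26], [124, 41]],
    'high': [[157, 27], [154, 60]],
}
results = {level: chi2_2x2(table) for level, table in tables.items()}
for level, (chi2, p_value) in results.items():
    print(level, round(chi2, 2), round(p_value, 4))
```

The sketch only handles 2x2 tables; for larger tables there is more than one degree of freedom and the shortcut for the p value no longer applies.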

This comes as a surprise. Based on the study by Kerrissey and Schofer, I expected that the association between union membership and political participation would be stronger for the lower educated respondents. However, using the OOL data, the association is only significant for the highest education level.

Note for students reviewing this assignment: the elaboration below isn’t strictly speaking part of the assignment. I wouldn’t want to waste your time so feel free to skip the rest of the article and make your assessment based on the text above.

I can’t really explain why my analysis leads to a result that seems at odds with the Kerrissey and Schofer study, but here are some considerations.

First of all, it’s entirely possible that I made some silly mistake in my analysis. And if that’s not the case, the method applied by Kerrissey and Schofer is different in a number of ways from my analysis. For example, they did regression analyses taking a number of relevant background variables into account. Further, they found a significant interaction between union membership and education in two different datasets. One could argue that Kerrissey and Schofer’s analysis is superior and their finding therefore more credible. Even so, it would be nice to be able to explain why a simpler model results in an opposite outcome.

Second, characteristics of respondents might play a role. I have the impression that union members may be overrepresented in the OOL dataset, but I don’t immediately see how that would explain the different outcome. More importantly, I did my analysis on a subset consisting of respondents with paid employment. It’s entirely possible that paid employees tend to be higher educated than unemployed and retired respondents. I guess it wouldn’t hurt rerunning the analysis on the entire group of respondents.

Third, it may matter how you define and measure political participation. I used a measure that includes contacting an official, participating in a protest or march and signing a petition. Kerrissey and Schofer found an interaction for voting, protest and membership. It would be interesting to see what happens if I use just the protest variable instead of the composite measure.

All respondents, composite measure

When I run my analysis on the entire group of respondents rather than just paid employees, the outcome changes in that there’s now a significant association between union membership and participation, not just for the highest education group, but also the medium education group. For the lowest education group, there’s still no significant association. So this doesn’t really explain the difference.

measure: "political_participation", group: "all_respondents"
 
 
Results for "low"
union                     No  Yes
political_participation          
0.0                      466   73
1.0                      279   60
 
chi-square value, p value, expected counts
(2.4819670432296759, 0.11515814971338957, 1, array([[ 457.35193622,   81.64806378],
       [ 287.64806378,   51.35193622]]))
 
 
Results for "medium"
union                     No  Yes
political_participation          
0.0                      262   32
1.0                      279   82
 
chi-square value, p value, expected counts
(14.963425544693942, 0.00010961532600357433, 1, array([[ 242.83053435,   51.16946565],
       [ 298.16946565,   62.83053435]]))
 
 
Results for "high"
union                     No  Yes
political_participation          
0.0                      237   34
1.0                      307   95
 
chi-square value, p value, expected counts
(12.134008533047874, 0.00049510583539001945, 1, array([[ 219.05497771,   51.94502229],
       [ 324.94502229,   77.05497771]]))

All respondents, protest measure

Using the protest measure rather than the composite participation measure, the association is once again only significant for the highest educated group.

measure: "protest_demo", group: "all_respondents"
 
 
Results for "low"
union          No  Yes
protest_demo          
0.0           695  119
1.0            49   15
 
chi-square value, p value, expected counts
(2.9184647984510526, 0.087571143400387977, 1, array([[ 689.76765376,  124.23234624],
       [  54.23234624,    9.76765376]]))
 
 
Results for "medium"
union          No  Yes
protest_demo          
0.0           499   99
1.0            43   15
 
chi-square value, p value, expected counts
(2.5743436452143076, 0.10860914232973193, 1, array([[ 494.07926829,  103.92073171],
       [  47.92073171,   10.07926829]]))
 
 
Results for "high"
union          No  Yes
protest_demo          
0.0           487  104
1.0            52   23
 
chi-square value, p value, expected counts
(6.5436232099967668, 0.010526075473228987, 1, array([[ 478.3018018,  112.6981982],
       [  60.6981982,   14.3018018]]))

After these additional analyses, it’s clear that it makes a difference whether you include respondents who are not paid employees, but I don’t think that fully accounts for the difference between the analysis using the OOL dataset and Kerrissey and Schofer’s analyses. Using a ‘protest’ variable instead of a broader composite measure of political participation also didn’t help clear things up. I’m afraid I still don’t really have an explanation for the different outcomes.

‘Open company data played role in downfall of Spanish minister’

How transparent are countries when it comes to company data? Score of the Netherlands on Open Corporates’ Open Company Data Index, compared to other EU countries. Ordered by score and alphabetically by English name. Source Open Corporates, chart dirkmjk.nl.

«There is a delicious irony in Soria being brought down in part by open data», Open Corporates wrote on their blog a week ago. By Soria they refer to former minister José Manuel Soria of the right-wing Partido Popular, who had just stepped down. The story, as summarised by Open Corporates:

Soria was discovered in the Panama Papers, but denied any connection to the Bahamas company referenced in them. It turns out that a company of the same name, UK Lines Limited, had been incorporated in the UK, with officerships linked to him and his family. Further investigation into this company and another UK one, Oceanic Lines Limited, used company filings and shareholder documents to show that these were indeed connected with Soria and his family. Yesterday, newspaper El Mundo nailed the case showing Soria was also director of a Jersey company when he was already a politician.

Information about the UK connection was obtained from Open Corporates. Journalists in other countries - from Nigeria to Argentina - have similarly used data from Open Corporates to make sense of the Panama Papers.

The information they used may well have been available from official databases as well. However, the fact that countries like the UK have opened up company data, and that Open Corporates serves as a portal to such information, makes it much easier to investigate abuses compared to a situation in which you have to buy each document you want to take a look at.

So what about the ‘delicious irony’ mentioned at the beginning of this article? Spain happens to be one of the most secretive countries in the world when it comes to company data, according to Open Corporates:

you can’t even search to see if a company exists without giving your credit card, and they have been adamant that they will not open up the register, still less make it available as open data.

This earns Spain a score of 0/100 on the Open Company Data Index, which is even worse than the embarrassingly low score of 20/100 for the Netherlands. The good news here is that the Dutch Lower House has passed a motion asking the government to see whether it can open up the company register (KvK) as open data, and to report to Parliament this spring.

Assignment 2-2

I’m using the OOL Surveys dataset and I’m interested in the association between union membership and political participation in the US (more specifically, between union membership at the household level and having engaged in at least one out of four forms of political participation over the past 2 years).

In the current assignment, we’re asked to run a chi square test of independence to figure out whether two categorical variables are related. If the outcome is significant and the explanatory variable has more than two levels, we’re required to carry out and interpret a post-hoc test. This would mean carrying out comparisons between all pairs of categories for the explanatory variable and dividing the required significance level (for example, 0.05) by the number of comparisons.
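By way of illustration, here is a small sketch of the adjustment described above. It uses the four education levels of the OOL dataset as a hypothetical explanatory variable (my own explanatory variable has only two levels) and assumes a significance level of 0.05.

```python
from itertools import combinations

# Hypothetical explanatory variable with four levels, for illustration only
levels = ['less than high school', 'high school', 'some college',
          "bachelor's degree or higher"]
alpha = 0.05

# every pair of categories needs its own chi square test
pairs = list(combinations(levels, 2))

# divide the significance level by the number of comparisons
adjusted_alpha = alpha / len(pairs)

print(len(pairs))                # 6
print(round(adjusted_alpha, 4))  # 0.0083
```

With four levels, each pairwise test would have to reach p < 0.0083 rather than p < 0.05 to count as significant.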

I’m in a bit of luck this time. First, my original research question concerns the relation between two categorical variables, so there’s no need to recode quantitative variables to categorical ones or to look for other variables. Second, my explanatory variable has only two levels (respondents either do or don’t have a union member in their household), so there’s no need to do a post-hoc test.

The entire Python script for my analysis can be found here. Here’s an excerpt from the script:

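# imports needed by this excerpt; sub2 is a subset created earlier in the full script
import pandas
import scipy.stats

# contingency table of observed counts
ct1 = pandas.crosstab(sub2['ANY'], sub2['W1_P8'])
print(ct1)
print()

# column percentages
colsum = ct1.sum(axis=0)
colpct = 100 * ct1 / colsum
print(colpct)
print()

# chi-square (2x2 tables get a continuity correction by default)
print('chi-square value, p value, expected counts')
cs1 = scipy.stats.chi2_contingency(ct1)
print(cs1)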

And here’s the relevant output:

W1_P8   No  Yes
ANY            
No     445   89
Yes    365  125
 
W1_P8         No        Yes
ANY                        
No     54.938272  41.588785
Yes    45.061728  58.411215
 
chi-square value, p value, expected counts
(11.559955638910083, 0.00067387460877846761, 1, array([[ 422.40234375,  111.59765625],
       [ 387.59765625,  102.40234375]]))

Among respondents with union members in their household, the percentage who have engaged in political participation is higher (58%) than among other respondents (45%). There are 125 respondents who have a union member in their household and who have engaged in political participation; had there been no relation between the two variables, a lower number (102) would have been expected. For the other cells, the observed values also differ from the values that would have been expected if there were no relation between the variables.
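As a worked example, the expected count for a cell follows from the margins of the table of observed counts: the row total times the column total, divided by the total number of respondents. For respondents with a union member in their household who have engaged in political participation:

```latex
\[
E_{\text{Yes},\text{Yes}}
  = \frac{(365 + 125) \times (89 + 125)}{445 + 89 + 365 + 125}
  = \frac{490 \times 214}{1024}
  \approx 102.4
\]
```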

The chi square value is 11.6 and the p-value < 0.001. In other words, the outcome of the test is that there is indeed a significant relation between union membership (at household level) and political participation.

My entry for the Best Worst Viz competition

Number of tweets with hashtag #BestWorstViz, per date of the month April 2016 and time of the day. Times are UTC, 18 April is the deadline. Data updates every hour; clear browser history to refresh. Entry for Best Worst Viz competition, created by dirkmjk.

I love to hate bad graphs (who doesn’t), and I think Andy Kirk’s idea to organise a Best Worst Viz competition is quite brilliant. As he explains, there’s something fair about creating your own bad graph rather than criticising somebody else’s:

[..] picking on bad visualisation involves work by other people who we might never meet or have a chance to learn about what the true circumstances and intent of a project were. The essence of this challenge is based on your best worst visualisation - the best worst visualisation you can possibly make.

I had to give it a try. But how? An exploding 3D pie chart, truncated y-axis, out-of-control spaghetti chart - it all seemed a bit too obvious. I aimed for something different, drawing inspiration from the blink element of the early days of web design. The shifting colours of the stacked bar chart pointlessly illustrate the direction of time - or whatever. I think it’s pretty bad.

Standalone version of graph here.

Assignment 2-1

In this assignment, we’re asked to run an analysis of variance and then conduct post hoc paired comparisons. In earlier assignments I looked into the association between trade union membership and political participation, using the OOL Surveys dataset. The variables I considered are not suitable for the present assignment, so I’ll pick a different issue for now: possible regional variation in opinions about unionised workers (summary statistics about the latter variable here).

Background

Based on anecdotal evidence about successful union campaigns in the US - from Justice for Janitors to the Fight for 15 - I have the impression that unions are more active in some places (e.g. LA, San Francisco, Seattle, New York) than others. If this is correct, then I assume it’s possible that other aspects of trade unionism, such as union density and opinions of unionised workers, may also show regional variation.

By way of initial exploration, I created a map that shows the share of workers who are union members by state (I wrote Python scripts to scrape the data from a Wikipedia page and to modify this svg-map). Hover your mouse over the map.

Percentage of workers who are union members (Wikipedia); darker green represents higher density

The map suggests that union membership tends to be higher in states along the west coast and in the northeast of the US. Note that there could be overlap between these states and states where metropolitan areas are concentrated.

Code

While I think it’s more practical to link to a separate code file, the assignment says we should paste the code into the article, so here it is:

# Import relevant libraries
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
 
# Read data & print size of dataframe
data = pandas.read_csv('../../Data Management and Visualization/Data/ool_pds.csv', low_memory = False)
print (data.shape)
 
# Only variable W1_N1H contains missing values that need to be recoded
data['W1_N1H'] = data['W1_N1H'].replace(-1, numpy.nan).replace(998, numpy.nan)
 
print('ANOVA to compare means by MSA [metro] status')
model1 = smf.ols(formula = 'W1_N1H ~ C(PPMSACAT)', data = data)
results1 = model1.fit()
print(results1.summary())
sub1 = data[['W1_N1H','PPMSACAT']].dropna()
grouped = sub1.groupby('PPMSACAT')
sub2 = grouped['W1_N1H'].agg([numpy.median, numpy.mean, numpy.std, len])
print(sub2)
print()
 
print('Explore state-level data')
sub1 = data[['W1_N1H','PPSTATEN']].dropna()
grouped = sub1.groupby('PPSTATEN')
sub2 = grouped['W1_N1H'].agg([numpy.median, numpy.mean, numpy.std, len])
print(sub2)
print()
 
# Create subset including only respondents from states with at least 50 respondents
counts = dict(data['PPSTATEN'].value_counts())
include_states = [state for state in counts if counts[state] >= 50]
sub3 = data[data['PPSTATEN'].isin(include_states)]
sub4 = sub3.copy()
recode = {21: 'NY', 22: 'NJ', 23: 'PA', 31: 'OH', 33: 'IL', 34: 'MI', 43: 'MO', 52: 'MD', 54: 'VA', 56: 'NC', 58: 'GA', 59: 'FL', 62: 'TN', 63: 'AL', 74: 'TX', 93: 'CA'}
sub4['PPSTATEN'] = sub4['PPSTATEN'].map(recode)
 
print('ANOVA to compare means by state for states with at least 50 respondents')
model2 = smf.ols(formula = 'W1_N1H ~ C(PPSTATEN)', data = sub4)
results2 = model2.fit()
print(results2.summary())
print()
 
print('Post-hoc test [HSD] for state means')
sub5 = sub4[['W1_N1H', 'PPSTATEN']].dropna()
mc1 = multi.MultiComparison(sub5['W1_N1H'], sub5['PPSTATEN'])
res1 = mc1.tukeyhsd()
print(res1.summary())
print()
 
print('Print summaries per state')
grouped = sub5.groupby('PPSTATEN')
sub6 = grouped['W1_N1H'].agg([numpy.median, numpy.mean, numpy.std, len])
print(sub6)
print()

Analysis

All the code output can be found here.

First, I carried out an ANOVA to see whether opinions about union members tend to vary between metro and non-metro areas. There is a statistically significant difference (F = 7, p < 0.01), although the variance explained is small. Opinions about unionised workers are somewhat more favourable in metro areas (mean score 62 out of 100) than in non-metro areas (57). Since the metro variable has only 2 levels, there’s no need for a post-hoc test.

Exploratory analysis of the state variable reveals that a substantial number of states have very few respondents. I decided to create a subset consisting only of respondents in states with at least 50 respondents (I’ll admit I’m not sure whether this threshold makes sense and how to decide this).

Note that this is a somewhat unbalanced subset: most states that are included are from the eastern part of the US, with California and Texas as major exceptions. This probably makes sense if you look at the population size of states, but still it’s something to keep in mind when interpreting the findings.

The conclusion of an ANOVA testing differences in opinions between states is that there are indeed differences in mean opinion about unionised workers per state (F = 2.4, p < 0.01), although again the variance explained is quite small.

A post-hoc test (HSD) reveals that the differences can be attributed to the divergent position of Florida within this sample of states: respondents from Florida on average have less favourable opinions (a mean score of 54) than those from Illinois (67), Michigan (68), Missouri (69) and New York (66). The average rating by respondents from Texas (57) turns out not to be different from other states from the sample - at least not at a statistically significant level.

Links between businesses and politics II: revolving door and access to ministers

Eline Huisman and Ariejan Korteweg of the Volkskrant have done some good investigative journalism by finding out how often companies, organisations and individuals have visited the current ministers (this data wasn’t publicly available in the Netherlands). It’s interesting to compare the top–10 of companies with access to ministers to the top–10 of revolving door companies (companies where national politicians have or have had a position).

Position of companies on the access to ministers ranking and the revolving door ranking

                Access  Revolving door
Air France-KLM       1               6
Rabobank             2               1
Shell                3               2
ING Bank             4               5
ABN AMRO             5               3
Schiphol             6               -
Aegon                7               8
KPN                  8               -
SNS Reaal            9               -
KPMG                10               4
NS                   -               7
Delta Lloyd          -               9
PGGM                 -              10

I’m sure more can be said about this, but the comparison shows there’s considerable overlap between the two lists (for the geeks among you: the Jaccard index is 0.54). The following companies score high on both measures of political ties: Air France-KLM, Rabobank, Shell, ING Bank, ABN AMRO, Aegon and KPMG. Dutch Railways (NS) and PGGM don’t feature in the Volkskrant business ranking because the Volkskrant classifies them as semipublic.
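For those who want to check the Jaccard index: it is the number of companies that appear in both top–10 lists divided by the number of companies that appear in at least one of them. A minimal sketch using the rankings from the table above (the variable names are my own):

```python
# Top-10 of companies with access to ministers
access = {'Air France-KLM', 'Rabobank', 'Shell', 'ING Bank', 'ABN AMRO',
          'Schiphol', 'Aegon', 'KPN', 'SNS Reaal', 'KPMG'}

# Top-10 of revolving door companies
revolving_door = {'Rabobank', 'Shell', 'ABN AMRO', 'KPMG', 'ING Bank',
                  'Air France-KLM', 'NS', 'Aegon', 'Delta Lloyd', 'PGGM'}

overlap = access & revolving_door    # companies on both lists
combined = access | revolving_door   # companies on at least one list

jaccard = len(overlap) / len(combined)
print(len(overlap), len(combined))   # 7 13
print(round(jaccard, 2))             # 0.54
```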

Of course, these lists provide no basis for firm conclusions about cause and effect. However, one can imagine that companies that participate actively in the revolving door could have easier access to ministers.

The details of the Volkskrant investigation can be found in this visualisation, which unfortunately isn’t easily searchable. The underlying data are available here as csv. If you’d classify NS and PGGM as companies in the Volkskrant list, the overlap wouldn’t change because other companies would drop out of the top–10. Further, for comparability I’ve removed industry and lobby organisations such as employers’ organisation VNO-NCW from the access to ministers ranking. Alphabetical order was used where two companies have the same score.

Assignment 1-4

A little background: I’m using the Outlook on Life Surveys dataset and I’m interested in the relation between union membership and political participation (background here). In my previous assignment, I discussed variables on current employment status; union membership (household level); various political participation variables and a secondary variable on whether respondents have engaged in any of the forms of political participation. I’ve created a subset consisting only of respondents with paid jobs, which I’m using for my analyses.

In the fourth assignment, we’re asked to create univariate graphs for the variables to be used and to create a bivariate graph for the association between the independent and dependent variables. Here are some remarks on how I did this assignment:

  • All the variables I’ve discussed so far are categorical variables. Since I also want to discuss the distribution of a quantitative variable, I’ve added a variable on how respondents rate unionised workers (on a 0–100 scale).
  • In this blog post, I only show graphs for the following variables: opinion of unionised workers; union membership; and the composite political participation measure (and a graph showing the association between union membership and participation, of course). In my Python script, I provide graphs for all the variables that I’ve used so far.
  • I’ve decided to remove missing values before creating the graphs in this blog post.[1]
  • In this blog post, I’m not using the graphs generated by the Python script. Instead, I’ve created new versions using D3.js (a JavaScript library for data visualisation). The reasons are pragmatic: this results in crisper graphs than when you post image files. It also saves disk space.

I’ve posted my Python code here.

Opinion of unionised workers

How would you rate unionized workers on a 0-100 scale? Histogram of answers from OOL respondents with paid jobs

The histogram shows that the mode is between 50 and 60 (in fact, a look at the data reveals it’s 50); the median is 60 and the mean just above that. The measures of centrality suggest American respondents with paid jobs tend to have an indifferent or slightly positive view of unions.[2] The distribution is uneven: a substantial share of respondents have (very) negative views of unionised workers, a larger share have positive views and about one in four are indifferent.

The data suggests that the view Americans have of unionised workers is less polarised and somewhat more positive than I’d expected. Then again, as I noted earlier, there is a possibility that union members are overrepresented in the sample.

Union membership

Does anyone in your household currently belong to a union? Percentage for OOL respondents with paid jobs, missing values omitted.

The graph shows that about 20% of respondents indicate someone in their household is a union member. This is higher than I’d expect; as I just noted, there’s a possibility that union members are overrepresented in the sample.

Political participation

Have you engaged in any of the four forms of political participation in the last 2 years? Percentage for OOL respondents with paid jobs, missing values omitted.

The graph shows a composite measure of political participation. A positive score means respondents have contacted an official, participated in a protest or march or signed a petition in the past 2 years. This is the case for a little less than half the respondents.

Association between union membership and political participation

Percentage who have engaged in any of the four forms of political participation in the last 2 years, for OOL respondents with paid jobs, by union membership (at household level).

And finally, the association between union membership (at household level) and political participation. Consistent with the hypothesis I formulated in the first assignment, political participation (at least by this measure) is higher for respondents with a union member in their household than for other respondents. Of course, the graph doesn’t tell us whether the correlation is statistically significant nor whether there’s a causal relationship between the two phenomena.


  1. I’m not yet entirely sure this is the best approach, but here are my considerations: my last graph (association) is categorical / categorical. I could include missing values using a stacked bar chart with different colours for ‘Yes’, ‘No’ and ‘missing’, but I’m afraid that would come at the expense of clarity. If I choose to omit missing values in the last graph, then the consistent thing to do would be to omit them in the other graphs as well. Like I said, I’m not yet entirely sure this is the best way to go, and that’s because I think it’s important to keep track of where the missing values are (especially when there are more of them). Perhaps the solution would be to do a combination of a graph and a frequency table and have the latter include missing values. Note that in my Python script I’ve opted for a slightly different approach by showing counts instead of percentages.  ↩

  2. I’m assuming that respondents interpret the scale to mean that 50 stands for indifferent, higher values for positive and lower for negative. I have to admit I’m not sure this is how American respondents would interpret the scale. For comparison, Dutch respondents might associate the scale with the 0–10 scale used in schools, on which a 6 means you pass and a 5 you fail, and correspondingly interpret 50 on a 0–100 scale as negative.  ↩


Assignment 1-3

A little background: I’m using the Outlook on Life Surveys dataset and I’m interested in the relation between union membership and political participation (background here). The third assignment is similar to the second one, only we’re required to do some data management before outputting the data. Therefore, I’ll submit an adapted version of the programme and blogpost of the previous assignment.

The output’s supposed to be ‘interpretable (i.e. organized and labeled)’. For those who are logged in to the course website, I refer to this forum post discussing how I interpreted this requirement.

In terms of data management, I’ll perform the following steps: first recode ‘refused’ to NaN (not required for the first variable PPWORK, because it has no missing values) and recode the answers to labels (e.g. 1 = ‘Yes’). Next, I’ll create a secondary variable which indicates whether respondents have engaged in any of the four types of political participation discussed below.
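The recoding step can be sketched roughly as follows. This is a minimal illustration and not the actual programme: the numeric codes (1 = ‘Yes’, 2 = ‘No’, -1 = ‘refused’) and the names `YES_NO_LABELS`, `REFUSED` and `recode` are assumptions, and `None` stands in for NaN.

```python
# Hypothetical code-to-label mapping; the numeric codes are assumptions
# and should be checked against the codebook.
YES_NO_LABELS = {1: "Yes", 2: "No"}
REFUSED = -1  # assumed code for 'refused'


def recode(value, labels=YES_NO_LABELS, refused=REFUSED):
    """Return the label for a raw code; None (missing) for 'refused'.

    Codes that are not in the mapping are also returned as None.
    """
    if value == refused:
        return None
    return labels.get(value)


raw_answers = [1, 2, -1, 2, 1]
recoded = [recode(v) for v in raw_answers]
print(recoded)  # ['Yes', 'No', None, 'No', 'Yes']
```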

The programme itself is posted here. Below I’ll discuss some of the output. For the sake of convenience, I’ll only show percentages (the raw counts can be obtained by running the programme). First, the current employment status of respondents.

PPWORK: Current Employment Status
Percentage
Not working - retired                           21.011334
Not working - on temporary layoff from a job     1.264167
Not working - looking for work                  10.854403
Not working - disabled                           8.456844
Not working - other                              6.451613
Working - self-employed                          6.190061
Working - as a paid employee                    45.771578
dtype: float64

One of the variables I’m interested in is union membership. My understanding of the American situation is that union membership is often dependent on whether your workplace is organised (by contrast, in the Netherlands it’s not uncommon for unemployed or retired people to be union members). For that reason, it makes sense to look specifically at respondents who are working as paid employees. (The fact that union membership is measured at the household level complicates matters but that doesn’t change my preference to focus on paid employees.)

1,050 respondents (46%) are paid employees. This would seem to be a sufficiently large group for the purposes of the analyses I plan to do. In the programme, I created a subset of respondents who indicated they are working as paid employees. All output below is based on this subset.

Next, let’s take a look at the numbers for the variable on union membership (as indicated, at the household level).

W1_P8: Does anyone in your household currently belong to a union?
Percentage
NaN     1.142857
No     78.380952
Yes    20.476190
dtype: float64

Within the subset of respondents with paid employment, a little over 20% indicate that at least one person in their household is a union member. This compares to a union density of 11.1% among wage and salary workers in the US according to the Bureau of Labor Statistics.

Some of that difference can be explained by the fact that the 20% figure will include some respondents who aren’t union members themselves but who have someone in their household who is. On the other hand, the BLS is a bit more persistent in assessing union membership, and would likely classify some people as union members who wouldn’t be classified as such in the OOL surveys.[1] All in all, I’m inclined to say the 20% figure in the OOL surveys is higher than expected and that there is a possibility that the survey sample is in some way biased towards union members.

And finally the political participation measures.

W1_L4_A: [Contacted a public official or agency ] Please indicate if you have done any of the following activities in the last 2 years.
Percentage
NaN     2.190476
No     74.095238
Yes    23.714286
dtype: float64
 
W1_L4_B: [Attended a protest meeting or demonstration ] Please indicate if you have done any of the following activites in the last 2 years.
Percentage
NaN     2.190476
No     90.571429
Yes     7.238095
dtype: float64
 
W1_L4_C: [Taken part in a neighborhood march ] Please indicate if you have done any of the following activites in the last 2 years.
Percentage
NaN     2.095238
No     93.047619
Yes     4.857143
dtype: float64
 
W1_L4_D: [Signed a petition in support of something or against something ] Please indicate if you have done any of the following activites in the last 2 years.
Percentage
NaN     2.380952
No     58.190476
Yes    39.428571
dtype: float64

Respondents are more likely to have signed a petition or contacted an official than to have hit the streets. This is as expected.

I’ve created a secondary variable indicating whether respondents have participated in any of the discussed forms of political participation. Respondents who have answered ‘Yes’ to any of the four political participation questions will be assigned a value ‘Yes’; those who have answered ‘No’ to all four questions will be assigned a value ‘No’ and those who have not answered ‘Yes’ to any of the questions but who have refused to answer at least one of the questions will be treated as missing.
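The rule for the secondary variable can be sketched like this. It is only an illustration of the logic described above, not the code from the programme: the function name `any_participation` is hypothetical, the answers are assumed to be recoded already, and `None` stands in for a refused (missing) answer.

```python
def any_participation(answers):
    """Combine the recoded answers to the four participation questions.

    Each answer is 'Yes', 'No' or None (refused).
    """
    if "Yes" in answers:
        return "Yes"   # at least one 'Yes', whatever the other answers are
    if all(a == "No" for a in answers):
        return "No"    # 'No' to all four questions
    return None        # no 'Yes', but at least one refusal: missing


print(any_participation(["No", "Yes", None, "No"]))   # Yes
print(any_participation(["No", "No", "No", "No"]))    # No
print(any_participation(["No", None, "No", "No"]))    # None
```

Note that a single ‘Yes’ takes precedence over refusals on the other questions, so a respondent only ends up as missing when none of the answers is ‘Yes’.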

ANY: Respondent has engaged in any of the four forms of political participation in the last 2 years
Percentage
NaN     2.190476
No     50.952381
Yes    46.857143
dtype: float64

The frequency table shows that almost half the respondents have engaged in at least one of the discussed forms of political participation in the last 2 years.

Finally, a word on missing values. For all variables considered here, the percentage ‘refused’ is below 2.5%. This seems low enough that I don’t expect it to cause any problems.

PS One of the students who reviewed my first assignment suggested I include ‘canvassing’ as a measure of political participation, which seems to make sense. Unfortunately the dataset doesn’t seem to include this aspect, but there are variables on other types of political participation that I may add in the future.


  1. «Employed wage and salary workers are classified as union members if they answer “yes” to the following question: On this job, are you a member of a labor union or of an employee association similar to a union? If the response is “no” to that question, then the interviewer asks a second question: On this job, are you covered by a union or employee association contract? If the response is “yes,” then these persons, along with those who responded “yes” to being union members, are classified as represented by a union. If the response is “no” to both the first and second questions, then they are classified as nonunion.»  ↩

Assignment 1-2

A short recap of the previous assignment: I’m using the Outlook on Life Surveys dataset and I’m interested in the relation between union membership and political participation (details here).

We’re required to write a programme that outputs frequency tables for a number of variables and discuss the output. The output’s supposed to be ‘interpretable (i.e. organized and labeled)’. I’m not entirely sure what is meant by that, but I’ve decided to recode the variables (e.g. 1 = ‘Yes’) and print the variable names and questions above the output. (If you’re logged in as a student, see the forum.)
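As a rough sketch of what such a labeled frequency table involves (this is not the posted programme, which the output suggests uses pandas; the helper names `percentages` and `print_table` and the sample answers are made up for illustration):

```python
from collections import Counter


def percentages(values):
    """Return {category: percentage of all values}, sorted by category.

    Assumes all values are strings (already recoded to labels).
    """
    counts = Counter(values)
    total = len(values)
    return {cat: 100 * n / total for cat, n in sorted(counts.items())}


def print_table(name, question, values):
    """Print the variable name and question above the percentages."""
    print(f"{name}: {question}")
    print("Percentage")
    for cat, pct in percentages(values).items():
        print(f"{cat:<10}{pct:>10.6f}")


# Made-up sample answers, only to show the layout of the output.
sample = ["Yes", "No", "No", "Refused", "No"]
print_table("W1_P8",
            "Does anyone in your household currently belong to a union?",
            sample)
```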

The programme itself is posted here. Below I’ll discuss some of the output. For the sake of convenience, I’ll only show percentages (the raw counts can be obtained by running the programme). First, the current employment status of respondents.

PPWORK: Current Employment Status
Percentage
Not working - retired                           21.011334
Not working - on temporary layoff from a job     1.264167
Not working - looking for work                  10.854403
Not working - disabled                           8.456844
Not working - other                              6.451613
Working - self-employed                          6.190061
Working - as a paid employee                    45.771578
dtype: float64

One of the variables I’m interested in is union membership. My understanding of the American situation is that union membership is often dependent on whether your workplace is organised (by contrast, in the Netherlands it’s not uncommon for unemployed or retired people to be union members). For that reason, it makes sense to look specifically at respondents who are working as paid employees. (The fact that union membership is measured at the household level complicates matters but that doesn’t change my preference to focus on paid employees.)

1,050 respondents (46%) are paid employees. This would seem to be a sufficiently large group for the purposes of the analyses I plan to do. In the programme, I created a subset of respondents who indicated they are working as paid employees. The output below is based on this subset.

Next, the numbers for the variable on union membership (as indicated, at the household level).

W1_P8: Does anyone in your household currently belong to a union?
Percentage
No         78.380952
Refused     1.142857
Yes        20.476190
dtype: float64

Within the subset of respondents with paid employment, a little over 20% indicate that at least one person in their household is a union member. This compares to a union density of 11.1% among wage and salary workers in the US according to the Bureau of Labor Statistics.

Some of that difference can be explained by the fact that the 20% figure will include some respondents who aren’t union members themselves but who have someone in their household who is. On the other hand, the BLS is a bit more persistent in assessing union membership, and would likely classify some people as union members who wouldn’t be classified as such in the OOL surveys.[1] All in all, I’m inclined to say the 20% figure in the OOL surveys is higher than expected and that there is a possibility that the survey sample is in some way biased towards union members.

And finally the political participation measures.

W1_L4_A: [Contacted a public official or agency ] Please indicate if you have done any of the following activities in the last 2 years.
Percentage
No         74.095238
Refused     2.190476
Yes        23.714286
dtype: float64
 
W1_L4_B: [Attended a protest meeting or demonstration ] Please indicate if you have done any of the following activites in the last 2 years.
Percentage
No         90.571429
Refused     2.190476
Yes         7.238095
dtype: float64
 
W1_L4_C: [Taken part in a neighborhood march ] Please indicate if you have done any of the following activites in the last 2 years.
Percentage
No         93.047619
Refused     2.095238
Yes         4.857143
dtype: float64
 
W1_L4_D: [Signed a petition in support of something or against something ] Please indicate if you have done any of the following activites in the last 2 years.
Percentage
No         58.190476
Refused     2.380952
Yes        39.428571
dtype: float64

Respondents are more likely to have signed a petition or contacted an official than to have hit the streets. This is as expected.

Finally, a word on missing values. For all variables considered here, the percentage ‘refused’ is below 2.5%. This seems low enough that I don’t expect it to cause any problems.

PS One of the students who reviewed my first assignment suggested I include ‘canvassing’ as a measure of political participation, which seems to make sense. Unfortunately the dataset doesn’t seem to include this aspect, but there are variables on other types of political participation that I may add in the future.


  1. «Employed wage and salary workers are classified as union members if they answer “yes” to the following question: On this job, are you a member of a labor union or of an employee association similar to a union? If the response is “no” to that question, then the interviewer asks a second question: On this job, are you covered by a union or employee association contract? If the response is “yes,” then these persons, along with those who responded “yes” to being union members, are classified as represented by a union. If the response is “no” to both the first and second questions, then they are classified as nonunion.»  ↩
