Assignment 2-1

5 April 2016

In this assignment, we’re asked to run an analysis of variance and then conduct post hoc paired comparisons. In earlier assignments I looked into the association between trade union membership and political participation, using the OOL Surveys dataset. The variables I considered are not suitable for the present assignment, so I’ll pick a different issue for now: possible regional variation in opinions about unionised workers (summary statistics about the latter variable here).


Based on anecdotal evidence about successful union campaigns in the US - from Justice for Janitors to the Fight for 15 - I have the impression that unions are more active in some places (e.g. LA, San Francisco, Seattle, New York) than others. If this is correct, then I assume it’s possible that other aspects of trade unionism, such as union density and opinions of unionised workers, may also show regional variation.

By way of initial exploration, I created a map that shows the share of workers who are union members by state (I wrote Python scripts to scrape the data from a Wikipedia page and to modify this svg-map).

Percentage of workers who are union members (Wikipedia); darker green represents higher density

The map suggests that union membership tends to be higher in states along the west coast and in the northeast of the US. Note that there could be overlap between these states and states where metropolitan areas are concentrated.


While I think it’s more practical to link to a separate code file, the assignment says we should paste the code into the article, so here it is:

# Import relevant libraries
import pandas
import numpy
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi

# Read data & print size of dataframe
data = pandas.read_csv('../../Data Management and Visualization/Data/ool_pds.csv', low_memory = False)
print (data.shape)

# Only variable W1_N1H contains missing values that need to be recoded
data['W1_N1H'] = data['W1_N1H'].replace(-1, numpy.nan).replace(998, numpy.nan)

print('ANOVA to compare means by MSA [metro] status')
model1 = smf.ols(formula = 'W1_N1H ~ C(PPMSACAT)', data = data)
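# Fit the model and print the ANOVA summary (F-statistic, p-value, R-squared)
results1 = model1.fit()
print(results1.summary())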
sub1 = data[['W1_N1H','PPMSACAT']].dropna()
grouped = sub1.groupby('PPMSACAT')
sub2 = grouped['W1_N1H'].agg([numpy.median, numpy.mean, numpy.std, len])

print('Explore state-level data')
sub1 = data[['W1_N1H','PPSTATEN']].dropna()
grouped = sub1.groupby('PPSTATEN')
sub2 = grouped['W1_N1H'].agg([numpy.median, numpy.mean, numpy.std, len])

# Create subset including only respondents from states with at least 50 respondents
counts = dict(data['PPSTATEN'].value_counts())
include_states = [state for state in counts if counts[state] >= 50]
sub3 = data[data['PPSTATEN'].isin(include_states)]
sub4 = sub3.copy()
recode = {21: 'NY', 22: 'NJ', 23: 'PA', 31: 'OH', 33: 'IL', 34: 'MI', 43: 'MO', 52: 'MD', 54: 'VA', 56: 'NC', 58: 'GA', 59: 'FL', 62: 'TN', 63: 'AL', 74: 'TX', 93: 'CA'}
sub4['PPSTATEN'] = sub4['PPSTATEN'].map(recode)

print('ANOVA to compare means by state for states with at least 50 respondents')
model2 = smf.ols(formula = 'W1_N1H ~ C(PPSTATEN)', data = sub4)
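# Fit the model and print the ANOVA summary (F-statistic, p-value, R-squared)
results2 = model2.fit()
print(results2.summary())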

print('Post-hoc test [HSD] for state means')
sub5 = sub4[['W1_N1H', 'PPSTATEN']].dropna()
mc1 = multi.MultiComparison(sub5['W1_N1H'], sub5['PPSTATEN'])
res1 = mc1.tukeyhsd()

print('Print summaries per state')
grouped = sub5.groupby('PPSTATEN')
sub6 = grouped['W1_N1H'].agg([numpy.median, numpy.mean, numpy.std, len])


All the code output can be found here.

First, I carried out an ANOVA to see whether opinions about union members tend to vary between metro and non-metro areas. There is a statistically significant difference (F = 7, p < 0.01), although the variance explained is small. Opinions about unionised workers are somewhat more favourable in metro areas (mean score 62 out of 100) than in non-metro areas (57). Since the metro variable has only 2 levels, there’s no need for a post-hoc test.
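To make the F statistic and the "variance explained" concrete, here is a minimal sketch that computes both by hand for two groups. The ratings are invented for illustration and are not taken from the OOL data; the helper name `one_way_anova` is also my own. Variance explained is computed here as eta squared (between-group sum of squares divided by the total), which for a one-way ANOVA equals the R-squared that statsmodels reports.

```python
from statistics import mean

def one_way_anova(groups):
    """Return (F, eta_squared) for a list of groups of numeric scores."""
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k = len(groups)       # number of groups
    n = len(all_scores)   # total number of observations

    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual scores around their own group mean
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    eta_squared = ss_between / (ss_between + ss_within)
    return f_stat, eta_squared

# Invented ratings on a 0-100 scale (hypothetical, NOT the survey data)
metro = [60, 70, 55, 65, 60]
non_metro = [50, 65, 55, 60, 55]

f_stat, eta_sq = one_way_anova([metro, non_metro])
print('F = %.3f, eta squared = %.3f' % (f_stat, eta_sq))
```

With a large sample, even a small eta squared can go together with a significant F, which is the pattern described above.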

Exploratory analysis of the state variable reveals that a substantial number of states have very few respondents. I decided to create a subset consisting only of respondents in states with at least 50 respondents (I’ll admit I’m not sure whether this threshold makes sense, or how best to choose one).

Note that this is a somewhat unbalanced subset: most of the included states are in the eastern part of the US, with California and Texas as major exceptions. This probably makes sense if you look at the population size of states, but it’s still something to keep in mind when interpreting the findings.

An ANOVA comparing states finds that mean opinion about unionised workers does differ by state (F = 2.4, p < 0.01), although again the variance explained is quite small.

A post-hoc test (HSD) reveals that the differences can be attributed to the divergent position of Florida within this sample of states: respondents from Florida on average have less favourable opinions (a mean score of 54) than those from Illinois (67), Michigan (68), Missouri (69) and New York (66). The average rating by respondents from Texas (57) does not differ from that of the other states in the sample, at least not at a statistically significant level.
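As a rough illustration of why a post-hoc procedure is needed here rather than a series of plain pairwise tests: the subset contains 16 states (the ones in the recode table), which gives 120 possible pairs. The sketch below estimates the chance of at least one false positive across all those pairs. It assumes a significance level of 0.05 per comparison and treats the comparisons as independent, which is a simplification (they share data), so the number is only indicative.

```python
from math import comb

n_states = 16   # states with at least 50 respondents (entries in the recode table)
alpha = 0.05    # assumed significance level for each single comparison

# Number of distinct state pairs that could be compared
n_pairs = comb(n_states, 2)

# Chance of at least one false positive if every pair were tested
# separately at alpha, under the simplifying assumption of independence
familywise_error = 1 - (1 - alpha) ** n_pairs

print('pairs:', n_pairs)
print('chance of at least one false positive: %.3f' % familywise_error)
```

Tukey's HSD adjusts for this by controlling the error rate across the whole family of comparisons, which is why only a handful of pairs (all involving Florida) come out as significant.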
