Assignment

Assignment 3-3

In this assignment, we’re to report on a multiple regression analysis. I’m using the Outlook On Life (OOL) surveys dataset and I’ll look into the association between the explanatory variables age and interest in politics, and the response variable interview duration.

The Python code for the analysis is here. The output of the regression analysis is here:

OLS regression model for the association between age and interview duration
                            OLS Regression Results                            
===
Dep. Variable:               duration   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     3.194
Date:                Fri, 27 May 2016   Prob (F-statistic):             0.0413
Time:                        17:13:49   Log-Likelihood:                -13650.
No. Observations:                1589   AIC:                         2.731e+04
Df Residuals:                    1586   BIC:                         2.732e+04
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
---
Intercept    246.9807     32.674      7.559      0.000       182.891   311.070
interested   -10.3983     29.356     -0.354      0.723       -67.978    47.182
age           -5.2787      2.114     -2.497      0.013        -9.426    -1.132
===
Omnibus:                     2263.566   Durbin-Watson:                   2.026
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           470888.323
Skew:                           8.346   Prob(JB):                         0.00
Kurtosis:                      85.665   Cond. No.                         16.2
===
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

First of all, the R-squared for the model is only 0.004, which means the model only explains a tiny share of variance in the response variable. This finding alone should be reason to see if the model can be improved.
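As a rough cross-check, R-squared can be recovered from the F-statistic and the degrees of freedom in the output above, and the adjusted R-squared follows from it. This is a minimal sketch using the standard OLS identities; the function names are mine and aren’t part of the analysis script.

```python
def r_squared_from_f(f_stat, df_model, df_resid):
    """Recover R-squared from the overall F-statistic of an OLS model."""
    return (df_model * f_stat) / (df_model * f_stat + df_resid)


def adjusted_r_squared(r2, n_obs, df_model):
    """Penalise R-squared for the number of explanatory variables."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - df_model - 1)


# Values copied from the regression output above
r2 = r_squared_from_f(f_stat=3.194, df_model=2, df_resid=1586)
adj_r2 = adjusted_r_squared(r2, n_obs=1589, df_model=2)

print(round(r2, 3), round(adj_r2, 3))
```

Both values round to the figures in the table, so the model really does explain well under 1% of the variance.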

The beta coefficient for the interest variable has a negative value (–10.4), which suggests that respondents who are more interested in politics tend to have shorter interview durations. However, the p value is 0.72, so nowhere near significance (which can also be concluded from the fact that the confidence interval for this coefficient includes the value 0). In other words, we can’t conclude from the available data that interest is associated with interview duration.

The beta coefficient for age is –5.3, which suggests that older respondents tend to spend less time on the interviews. The association is significant at the 0.05 level (p = 0.013). However, the confidence interval suggests the actual coefficient is somewhere between –9.4 and –1.1, i.e. it’s not very precise.
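To make the link between the coefficients, standard errors, t values and confidence intervals explicit, here’s a small sketch that recomputes them from the table. It assumes a normal approximation (a multiplier of about 1.96) rather than the exact t quantile, so the interval bounds differ slightly from the ones in the output; `t_and_interval` is a name I made up for the illustration.

```python
from statistics import NormalDist


def t_and_interval(coef, std_err, level=0.95):
    """Return the t-statistic and an approximate confidence interval.

    Uses the normal quantile instead of the exact t quantile; with
    roughly 1,600 residual degrees of freedom the two nearly coincide.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return coef / std_err, (coef - z * std_err, coef + z * std_err)


# Coefficients and standard errors copied from the output above
t_age, ci_age = t_and_interval(-5.2787, 2.114)
t_int, ci_int = t_and_interval(-10.3983, 29.356)

print(round(t_age, 3), [round(x, 1) for x in ci_age])
print(round(t_int, 3), [round(x, 1) for x in ci_int])
```

The interval for age stays below zero, while the one for interest straddles zero, which is the same conclusion the p values give.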

Below are the plots for analysing the residuals.

The analyses of residuals (qq-plot and standardized residuals) suggest that the model can be substantially improved, for example by adding other explanatory variables. Finally, the influence plot suggests there are no observations with high leverage that are also outliers, so that’s not the reason why the model is so unsuccessful at explaining variance in interview duration.

Assignment 3-2

In this assignment, we’re required to report on a regression analysis with just one explanatory variable. If we’re using a categorical variable as explanatory variable, we’re to recode it to two categories with values 0 and 1. I’m using the Outlook On Life surveys dataset and I’ll look into the association between age and interview duration. My hypothesis is that duration will be shorter for younger respondents because they require less explanation.

Note that the mean duration of interviews is 245 minutes, or just over 4 hours, which is pretty long.

The Python code for the analysis is available here. Here’s a frequency table for the recoded explanatory variable age_cats, where 0 = 18-44 and 1 = 45+:

1    1048
0     553
Name: age_cats, dtype: int64
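The recoding into two categories can be sketched as follows. This is a plain-Python illustration, not the code from my script; `age_to_cat` and the list of ages are made up for the example.

```python
from collections import Counter


def age_to_cat(age):
    """Recode age in years: 0 for 18-44, 1 for 45 and over."""
    return 1 if age >= 45 else 0


# Hypothetical ages, for illustration only
ages = [18, 30, 44, 45, 62, 71]
age_cats = [age_to_cat(a) for a in ages]

# Frequency table of the recoded variable
print(Counter(age_cats))
```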

And here’s the output of the regression analysis:

OLS regression model for the association between age and interview duration
                            OLS Regression Results                            
===
Dep. Variable:               duration   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     7.237
Date:                Sun, 22 May 2016   Prob (F-statistic):            0.00721
Time:                        09:15:58   Log-Likelihood:                -13746.
No. Observations:                1601   AIC:                         2.750e+04
Df Residuals:                    1599   BIC:                         2.751e+04
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
---
Intercept    365.2966     55.153      6.623      0.000       257.117   473.477
age_cats    -183.3910     68.169     -2.690      0.007      -317.100   -49.682
===
Omnibus:                     2283.566   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           479121.259
Skew:                           8.367   Prob(JB):                         0.00
Kurtosis:                      86.080   Cond. No.                         3.16
===
 
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

First of all, there is a statistically significant association between age and interview duration (p < 0.01). However, the sign for the beta coefficient is negative, which means that interview duration is *shorter* for older respondents. My hypothesis is wrong. The beta coefficient is -183, which means that interviews for the older age category were on average over 3 hrs shorter than for the younger category. The intercept is 365, which is the predicted mean duration for the younger category; for the older category it is 365 - 183 = 182, so mean duration for younger respondents was about twice as long as for older respondents. This is quite a large difference that deserves further analysis.
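With a single 0/1 explanatory variable, the fitted values are simply the two group means, so the coefficients can be checked against the frequency table and the overall mean mentioned earlier. A minimal sketch of that arithmetic (the variable names are mine):

```python
# Values copied from the regression output and frequency table above
intercept = 365.2966       # predicted mean duration when age_cats = 0
beta_age_cats = -183.3910  # change in mean duration when age_cats = 1
n_younger, n_older = 553, 1048

mean_younger = intercept
mean_older = intercept + beta_age_cats

# The weighted average of the two group means gives the overall mean
overall_mean = (n_younger * mean_younger + n_older * mean_older) / (n_younger + n_older)

print(round(mean_younger), round(mean_older), round(overall_mean))
print(round(mean_younger / mean_older, 2))
```

The weighted average comes out at about 245, which agrees with the mean duration reported above.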

Assignment 3-1

In this assignment we’re required to describe the sample, data collection and data management for the dataset we’re using. I’ve opted to use the Outlook On Life surveys dataset to look into the association between union membership and political participation (although I suspect I may have to alter the research question in future assignments). In the text below I’ve used a few quotes from the OOL website.

Sample

The target population were non-institutionalized adults 18 years of age and older, with a large oversample of Black ethnics. Participants were drawn from the GfK Knowledge Network, a web panel designed to be representative of the United States population. Panel members are randomly recruited through probability-based sampling, and households are provided with access to the Internet and hardware if needed. Random-digit dialing and address-based sampling methodologies are used.

A total of 2,294 respondents participated in this study; 1,601 were reinterviewed. For the analysis, I created a subset containing only respondents who are working as a paid employee. The level of analysis is individual.

Data collection

The purpose of the 2012 Outlook Surveys was to study political and social attitudes in the United States. The specific purpose of the survey is to consider the ways in which social class, ethnicity, marital status, feminism, religiosity, political orientation, and cultural beliefs or stereotypes influence opinion and behavior.

The data was collected through a web-based survey in the United States. The project included two surveys fielded between August and December 2012 using a sample from an Internet panel. Wave 1 was carried out between 16 August and 31 December 2012; Wave 2 between 13 December and 28 December 2012. The response rate was 55.3% for wave 1 and 75.1% for wave 2.

Data management

The explanatory variable measures whether anyone in the respondent’s household is a union member (values Yes, No and missing). For the response variable, I created a variable which measures whether respondents have engaged during the past two years in any of the following forms of political participation: contacted a public official or agency; attended a protest meeting or demonstration; taken part in a neighbourhood march; or signed a petition (values Yes, No and missing).
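The composite response variable can be sketched like this. It’s an illustration in plain Python rather than the code I actually ran, and the handling of partly missing answers is an assumption: a respondent counts as Yes if any item is Yes, as missing only if every item is missing, and as No otherwise.

```python
def any_participation(answers):
    """Combine several Yes/No items into one Yes/No/missing indicator.

    `answers` holds one value per item: True (Yes), False (No) or
    None (missing). Assumption for this sketch: the result is True if
    any item is True, None if every item is missing, and False otherwise.
    """
    if any(a is True for a in answers):
        return True
    if all(a is None for a in answers):
        return None
    return False


# Hypothetical respondents; item order: contact, protest, march, petition
print(any_participation([False, False, False, True]))  # signed a petition
print(any_participation([False, None, False, False]))  # no reported activity
print(any_participation([None, None, None, None]))     # nothing answered
```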

Assignment 2-4

In previous assignments I’ve looked into the association between union membership and political participation among paid employees, using the Outlook On Life surveys dataset. I found that respondents who have a union member in their household are more likely to have engaged in political participation over the past 2 years. This was consistent with what I expected on the basis of a study by Kerrissey and Schofer.

In the present assignment we’re to check for a potential moderator. The study by Kerrissey and Schofer found that the association between union membership and political participation is stronger for lower educated respondents, possibly because they have fewer other sources of political capital at their disposal.

Against this background I decided to test the association between union membership and political participation for different subgroups based on education. The OOL dataset has a variable with four education levels (less than high school; high school; some college; bachelor’s degree or higher). Since there are relatively few respondents with less than high school, I decided to lump together the first two categories.
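Lumping the first two categories together amounts to a simple lookup. The sketch below assumes the four levels are coded 1 to 4 in the order listed above (worth checking against the codebook); the low/medium/high labels are the ones used in the output further down.

```python
# Assumed numeric codes for the four education levels (1 = less than
# high school ... 4 = bachelor's degree or higher); check the codebook.
EDUCATION_GROUPS = {
    1: "low",     # less than high school
    2: "low",     # high school
    3: "medium",  # some college
    4: "high",    # bachelor's degree or higher
}


def education_group(code):
    """Map a four-level education code to low/medium/high (None if unknown)."""
    return EDUCATION_GROUPS.get(code)


print([education_group(c) for c in (1, 2, 3, 4)])
```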

First of all, here’s a grouped bar chart showing what percentage of respondents have engaged in political participation, by union membership (at household level) and by education level. Political participation levels appear higher for higher educated respondents, which will not come as a surprise. More surprisingly, the association between union membership and political participation appears stronger for higher educated respondents.
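The percentages behind the chart can be recomputed from the cross-tabulations for paid employees that are reported further down. A minimal sketch, with the counts typed in by hand:

```python
# Counts copied from the cross-tabulations further down (paid employees).
# Each tuple is (did not participate, participated).
tables = {
    "low":    {"No": (158, 87),  "Yes": (36, 24)},
    "medium": {"No": (130, 124), "Yes": (26, 41)},
    "high":   {"No": (157, 154), "Yes": (27, 60)},
}

rates = {}
for level, by_union in tables.items():
    rates[level] = {}
    for union, (not_participated, participated) in by_union.items():
        total = not_participated + participated
        rates[level][union] = 100 * participated / total

for level, by_union in rates.items():
    gap = by_union["Yes"] - by_union["No"]
    print(f"{level:<6} No: {by_union['No']:.1f}%  Yes: {by_union['Yes']:.1f}%  gap: {gap:.1f} points")
```

The gap between union and non-union households widens from the low to the high education group, which is the pattern visible in the chart.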

So let’s take a look at the chi squares for the different education levels. The entire Python script for my analysis can be found here. Below I copy some of the output from the code:

measure: "political_participation", group: "employees"
 
 
Results for "low"
union                     No  Yes
political_participation          
0.0                      158   36
1.0                       87   24
 
chi-square value, p value, expected counts
(0.24815891922850686, 0.61837443272471648, 1, array([[ 155.83606557,   38.16393443],
       [  89.16393443,   21.83606557]]))
 
 
Results for "medium"
union                     No  Yes
political_participation          
0.0                      130   26
1.0                      124   41
 
chi-square value, p value, expected counts
(2.7736422284672679, 0.095827887556796373, 1, array([[ 123.43925234,   32.56074766],
       [ 130.56074766,   34.43925234]]))
 
 
Results for "high"
union                     No  Yes
political_participation          
0.0                      157   27
1.0                      154   60
 
chi-square value, p value, expected counts
(9.5760783080978147, 0.0019712903653131314, 1, array([[ 143.77889447,   40.22110553],
       [ 167.22110553,   46.77889447]]))

The results show that the chi square value is smallest for the lowest education group and largest for the highest education group; and only significant for the highest education group (note that a post-hoc test is not required because the explanatory variable has only two levels).
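For readers who want to see where these numbers come from, here’s a hand-rolled version of the chi-square test for a 2x2 table. It’s a sketch rather than the code from my script: the output format above suggests scipy.stats.chi2_contingency, which applies Yates’ continuity correction by default for 2x2 tables, and the function below mimics that. `chi_square_2x2` is a made-up name.

```python
from math import erfc, sqrt


def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table of counts.

    Returns (statistic, p_value, expected_counts). With yates=True the
    continuity correction is applied to each cell.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    stat = 0.0
    for obs_row, exp_row in zip(table, expected):
        for obs, exp in zip(obs_row, exp_row):
            diff = abs(obs - exp)
            if yates:
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / exp

    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value, expected


# Counts for the "high" education group among paid employees
stat, p_value, expected = chi_square_2x2([[157, 27], [154, 60]])
print(round(stat, 3), round(p_value, 5))
```

Passing the counts for the other education groups should reproduce their values in the same way.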

This comes as a surprise. Based on the study by Kerrissey and Schofer, I expected that the association between union membership and political participation would be stronger for the lower educated respondents. However, using the OOL data, the association is only significant for the highest education level.

Note for students reviewing this assignment: the elaboration below isn’t strictly speaking part of the assignment. I wouldn’t want to waste your time so feel free to skip the rest of the article and make your assessment based on the text above.

I can’t really explain why my analysis leads to a result that seems at odds with the Kerrissey and Schofer study, but here are some considerations.

First of all, it’s entirely possible that I made some silly mistake in my analysis. And if that’s not the case, the method applied by Kerrissey and Schofer is different in a number of ways from my analysis. For example, they did regression analyses taking a number of relevant background variables into account. Further, they found a significant interaction between union membership and education in two different datasets. One could argue that Kerrissey and Schofer’s analysis is superior and their finding therefore more credible. Even so, it would be nice to be able to explain why a simpler model results in an opposite outcome.

Second, characteristics of respondents might play a role. I have the impression that union members may be overrepresented in the OOL dataset, but I don’t immediately see how that would explain the different outcome. More importantly, I did my analysis on a subset consisting of respondents with paid employment. It’s entirely possible that paid employees tend to be higher educated than unemployed and retired respondents. I guess it wouldn’t hurt rerunning the analysis on the entire group of respondents.

Third, it may matter how you define and measure political participation. I used a measure that includes contacting an official, participating in a protest or march and signing a petition. Kerrissey and Schofer found an interaction for voting, protest and membership. It would be interesting to see what happens if I use just the protest variable instead of the composite measure.

All respondents, composite measure

When I run my analysis on the entire group of respondents rather than just paid employees, the outcome changes in that there’s now a significant association between union membership and participation, not just for the highest education group, but also the medium education group. For the lowest education group, there’s still no significant association. So this doesn’t really explain the difference.

measure: "political_participation", group: "all_respondents"
 
 
Results for "low"
union                     No  Yes
political_participation          
0.0                      466   73
1.0                      279   60
 
chi-square value, p value, expected counts
(2.4819670432296759, 0.11515814971338957, 1, array([[ 457.35193622,   81.64806378],
       [ 287.64806378,   51.35193622]]))
 
 
Results for "medium"
union                     No  Yes
political_participation          
0.0                      262   32
1.0                      279   82
 
chi-square value, p value, expected counts
(14.963425544693942, 0.00010961532600357433, 1, array([[ 242.83053435,   51.16946565],
       [ 298.16946565,   62.83053435]]))
 
 
Results for "high"
union                     No  Yes
political_participation          
0.0                      237   34
1.0                      307   95
 
chi-square value, p value, expected counts
(12.134008533047874, 0.00049510583539001945, 1, array([[ 219.05497771,   51.94502229],
       [ 324.94502229,   77.05497771]]))

All respondents, protest measure

Using the protest measure rather than the composite participation measure, the association is once again only significant for the highest educated group.

measure: "protest_demo", group: "all_respondents"
 
 
Results for "low"
union          No  Yes
protest_demo          
0.0           695  119
1.0            49   15
 
chi-square value, p value, expected counts
(2.9184647984510526, 0.087571143400387977, 1, array([[ 689.76765376,  124.23234624],
       [  54.23234624,    9.76765376]]))
 
 
Results for "medium"
union          No  Yes
protest_demo          
0.0           499   99
1.0            43   15
 
chi-square value, p value, expected counts
(2.5743436452143076, 0.10860914232973193, 1, array([[ 494.07926829,  103.92073171],
       [  47.92073171,   10.07926829]]))
 
 
Results for "high"
union          No  Yes
protest_demo          
0.0           487  104
1.0            52   23
 
chi-square value, p value, expected counts
(6.5436232099967668, 0.010526075473228987, 1, array([[ 478.3018018,  112.6981982],
       [  60.6981982,   14.3018018]]))

After these additional analyses, it’s clear that it makes a difference whether you include respondents who are not paid employees, but I don’t think that fully accounts for the difference between the analysis using the OOL dataset and Kerrissey and Schofer’s analyses. Using a ‘protest’ variable instead of a broader composite measure of political participation also didn’t help clear things up. I’m afraid I still don’t really have an explanation for the different outcomes.

Assignment 2-3

In previous assignments I’ve been looking into the association between union membership and political participation (both categorical variables), using the Outlook On Life surveys. For our present assignment we’re to generate a correlation coefficient, so I had to use other variables. I decided to test whether younger respondents tend to have more positive views of Occupy Wall Street.

Here’s the code:

# Import relevant libraries
import pandas
import numpy
import seaborn
import scipy.stats
 
# Read data & print size of dataframe
data = pandas.read_csv('../../Data Management and Visualization/Data/ool_pds.csv', low_memory = False)
print (data.shape)
 
# Only variable W1_D16 contains missing values that need to be recoded
data['W1_D16'] = data['W1_D16'].replace(-1, numpy.nan).replace(998, numpy.nan)
 
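# Keep only the two variables of interest and drop rows with missing values
sub = data[['PPAGE', 'W1_D16']].dropna()
# Scatter plot of age against opinion of OWS, with a fitted regression line
scat = seaborn.regplot(x="PPAGE", y="W1_D16", fit_reg=True, data=sub)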
 
print ('Association between age and opinion of OWS')
print (scipy.stats.pearsonr(sub['PPAGE'], sub['W1_D16']))

And here’s the output:

Association between age and opinion of OWS
(-0.050642104468121348, 0.030560880228248256)

There’s a negative and statistically significant (p < 0.05) correlation between age and opinions on OWS, so yes, younger people do seem to be likely to have a more positive view of OWS. However, the correlation coefficient is very small, -.05, which implies that age could explain a mere 0.26% of variation in opinions on OWS (r squared is about 0.0026).
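The share of variation explained is just the square of the correlation coefficient. The sketch below works that out from the output above and, for reference, shows how Pearson’s r is computed by hand; `pearson_r` and the toy data are made up for the illustration and are not part of my script.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Coefficient copied from the output above
r = -0.050642104468121348
share_explained = r ** 2
print(f"{share_explained:.2%}")

# Hypothetical toy data: a perfectly linear negative relationship
print(pearson_r([20, 40, 60], [3, 2, 1]))
```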
