In this assignment, we’re required to report on a regression analysis with just one explanatory variable. If the explanatory variable is categorical, we’re to recode it into two categories with values 0 and 1. I’m using the Outlook On Life surveys dataset and I’ll look into the association between age and interview duration. My hypothesis is that duration will be shorter for younger respondents because they require less explanation.
Note that the mean duration of interviews is 245 minutes, or just over 4 hours, which is pretty long.
The Python code for the analysis is available here. Here’s a frequency table for the recoded explanatory variable age_cats, where 0 = 18-44 and 1 = 45+:
Name: age_cats, dtype: int64
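For readers who want to see what the recoding step looks like, here is a minimal sketch in plain Python. It is not the code linked above: the ages are made up, and the names `ages` and `recode_age` are hypothetical. Only the cut-off (0 = 18-44, 1 = 45+) and the variable name age_cats come from this post.

```python
from collections import Counter

# Hypothetical ages, made up for illustration; not taken from the survey.
ages = [19, 23, 37, 44, 45, 52, 61, 70]


def recode_age(age):
    """Return 0 for respondents aged 18-44 and 1 for those aged 45 or older."""
    return 0 if age < 45 else 1


# Recode every age into the binary explanatory variable.
age_cats = [recode_age(a) for a in ages]

# Frequency table: how many respondents fall into each category.
counts = Counter(age_cats)
for category in sorted(counts):
    print(category, counts[category])
```

With real survey data the same idea would usually be applied to a DataFrame column, but the logic of the cut-off stays the same.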
And here’s the output of the regression analysis:
OLS regression model for the association between age and interview duration
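                            OLS Regression Results
==============================================================================
Dep. Variable:               duration   R-squared:                       0.005
Model:                            OLS   Adj. R-squared:                  0.004
Method:                 Least Squares   F-statistic:                     7.237
Date:                Sun, 22 May 2016   Prob (F-statistic):            0.00721
Time:                        09:15:58   Log-Likelihood:                -13746.
No. Observations:                1601   AIC:                         2.750e+04
Df Residuals:                    1599   BIC:                         2.751e+04
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    365.2966     55.153      6.623      0.000       257.117   473.477
age_cats    -183.3910     68.169     -2.690      0.007      -317.100   -49.682
==============================================================================
Omnibus:                     2283.566   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           479121.259
Skew:                           8.367   Prob(JB):                         0.00
Kurtosis:                      86.080   Cond. No.                         3.16
==============================================================================
Standard Errors assume that the covariance matrix of the errors is correctly specified.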
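A useful fact when reading output like this: with a single 0/1 explanatory variable, the OLS intercept equals the mean of the group coded 0, and the slope equals the difference between the two group means. The sketch below checks this with the closed-form formulas for simple linear regression. The durations are made up for illustration, and `simple_ols` is a hypothetical helper, not part of the analysis code.

```python
from statistics import mean

# Made-up durations in minutes, for illustration only.
durations = [300.0, 420.0, 360.0, 150.0, 210.0, 180.0]
age_cats = [0, 0, 0, 1, 1, 1]


def simple_ols(x, y):
    """Closed-form least-squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


intercept, slope = simple_ols(age_cats, durations)

# Group means computed directly, for comparison with the fit.
mean_0 = mean(d for d, c in zip(durations, age_cats) if c == 0)
mean_1 = mean(d for d, c in zip(durations, age_cats) if c == 1)

print("intercept:", intercept, "mean of group 0:", mean_0)
print("slope:", slope, "difference in means:", mean_1 - mean_0)
```

So a regression on a binary variable is a comparison of two group means, with a standard error and p-value attached to the difference.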
First of all, there is a statistically significant association between age and interview duration (p = 0.007). However, the sign of the beta coefficient is negative, which means that interview duration is *shorter* for older respondents, so my hypothesis is not supported. The beta coefficient is -183, which means that interviews for the older age category were on average just over 3 hours shorter than for the younger category. Because age_cats is coded 0 for the younger group, the intercept of 365 is that group’s mean duration, and the predicted mean for the older group is 365 - 183 = 182 minutes. In other words, mean duration for younger respondents was about twice as long as for older respondents. This is quite a large difference that deserves further analysis.
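The arithmetic behind this interpretation can be checked from the two reported coefficients. This is a small sketch that assumes, as stated earlier in the post, that duration is measured in minutes; the variable names are my own.

```python
# Coefficients copied from the regression output above.
intercept = 365.2966       # mean duration for age_cats = 0 (ages 18-44)
coef_age_cats = -183.3910  # change in mean duration for age_cats = 1 (45+)

# Predicted mean duration for each group, in minutes.
mean_younger = intercept
mean_older = intercept + coef_age_cats

# Size of the gap in hours, assuming duration is recorded in minutes.
diff_hours = abs(coef_age_cats) / 60

# How many times longer the younger group's interviews were.
ratio = mean_younger / mean_older

print(f"younger: {mean_younger:.1f} min, older: {mean_older:.1f} min")
print(f"difference: {diff_hours:.2f} hours, ratio: {ratio:.2f}")
```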