This assignment calls for a report on a multiple regression analysis. I’m using the Outlook on Life (OOL) surveys dataset to look into the association between two explanatory variables, age and interest, and the response variable, interview duration.
The Python code for the analysis is here. The output of the regression analysis is shown below:
OLS regression model for the association between age, interest and interview duration
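OLS Regression Results
==============================================================================
Dep. Variable:               duration   R-squared:                       0.004
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     3.194
Date:                Fri, 27 May 2016   Prob (F-statistic):             0.0413
Time:                        17:13:49   Log-Likelihood:                -13650.
No. Observations:                1589   AIC:                         2.731e+04
Df Residuals:                    1586   BIC:                         2.732e+04
Df Model:                           2
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    246.9807     32.674      7.559      0.000       182.891   311.070
interested   -10.3983     29.356     -0.354      0.723       -67.978    47.182
age           -5.2787      2.114     -2.497      0.013        -9.426    -1.132
==============================================================================
Omnibus:                     2263.566   Durbin-Watson:                   2.026
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           470888.323
Skew:                           8.346   Prob(JB):                         0.00
Kurtosis:                      85.665   Cond. No.                         16.2
==============================================================================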
 Standard Errors assume that the covariance matrix of the errors is correctly specified.
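For readers who want to produce this kind of output themselves, here is a minimal sketch of how a model like this could be fitted with the statsmodels formula API. The column names duration, interested and age are taken from the output above. The data frame is filled with made-up random numbers (an assumption for illustration, not the OOL data), so the estimates it prints will not match the table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Made-up stand-in data: NOT the OOL survey, only the column names are real.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "interested": rng.integers(0, 2, size=n),   # hypothetical 0/1 coding
    "age": rng.integers(1, 8, size=n),          # hypothetical age coding
})
df["duration"] = 30 + rng.normal(0, 10, size=n)  # arbitrary durations

# Multiple regression: duration explained by interest and age.
model = smf.ols("duration ~ interested + age", data=df).fit()

print(model.summary())      # same layout as the table above
print(model.params)         # coefficients
print(model.conf_int())     # 95% confidence intervals
```

With the real survey data loaded into df, the same formula should give a summary in the layout shown above.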
First of all, the R-squared for the model is only 0.004, which means the model explains only about 0.4% of the variance in the response variable. The overall F-test is significant at the 0.05 level (p = 0.041), but with so little variance explained, this finding alone should be reason to see if the model can be improved.
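As a quick sanity check on these summary figures, the sketch below recomputes R-squared and adjusted R-squared from the number of observations, the number of explanatory variables and the F-statistic reported in the output. It uses the standard textbook relations between these quantities; because the inputs are rounded, the results only agree with the table to about three decimals.

```python
# Values copied from the regression output above.
n_obs = 1589      # No. Observations
k = 2             # Df Model (number of explanatory variables)
f_stat = 3.194    # F-statistic

df_resid = n_obs - k - 1   # residual degrees of freedom

# R-squared implied by F = (R2 / k) / ((1 - R2) / df_resid)
r2 = (k * f_stat) / (k * f_stat + df_resid)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid

print(f"Df Residuals:   {df_resid}")
print(f"R-squared:      {r2:.4f}")
print(f"Adj. R-squared: {adj_r2:.4f}")
```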
The beta coefficient for the interest variable has a negative value (–10.4), which would suggest that respondents who are more interested in politics tend to have shorter interview durations. However, the p value is 0.72, nowhere near significance (this can also be seen from the fact that the 95% confidence interval for this coefficient, –68.0 to 47.2, includes the value 0). In other words, we can’t conclude from the available data that interest is associated with interview duration.
The beta coefficient for age is –5.3, which suggests that, controlling for interest, older respondents tend to spend less time on the interviews. The association is significant at the 0.05 level (p = 0.013). However, the 95% confidence interval puts the actual coefficient somewhere between –9.4 and –1.1, i.e. the estimate is not very precise.
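To show where the t values, p values and confidence intervals for the two coefficients come from, here is a small sketch that recomputes them from the coefficients and standard errors in the output. It assumes a normal approximation instead of the t distribution with 1586 degrees of freedom that statsmodels uses, so the last decimals differ slightly from the table.

```python
from statistics import NormalDist

# Coefficients and standard errors copied from the regression output above.
estimates = {
    "interested": (-10.3983, 29.356),
    "age": (-5.2787, 2.114),
}

std_normal = NormalDist()
z_crit = std_normal.inv_cdf(0.975)   # roughly 1.96 for a 95% interval

results = {}
for name, (coef, se) in estimates.items():
    t_value = coef / se                                  # test statistic
    p_value = 2 * (1 - std_normal.cdf(abs(t_value)))     # two-sided p value
    ci = (coef - z_crit * se, coef + z_crit * se)        # approximate 95% CI
    results[name] = (t_value, p_value, ci)
    print(f"{name:<11} t = {t_value:6.3f}  p = {p_value:.3f}  "
          f"CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

The interval for interested straddles 0 while the interval for age does not, which is the same conclusion the p values give.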
Below are the plots for analysing the residuals.
The analyses of residuals (Q-Q plot and standardized residuals) suggest that the model can be substantially improved, for example by adding other explanatory variables. The strong skew (8.3) and kurtosis (85.7) reported in the regression output also indicate that the residuals are far from normally distributed. Finally, the influence plot suggests there are no observations with high leverage that are also outliers, so that is not the reason why the model is so unsuccessful at explaining variance in interview duration.
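The regression output offers a numerical counterpart to the residual plots: the Jarque-Bera statistic, which measures departure from normality using the skew and kurtosis of the residuals. The sketch below recomputes it from the rounded values in the output using the standard formula, so it only approximately reproduces the reported figure of 470888.323.

```python
# Values copied from the regression output above.
n_obs = 1589
skew = 8.346
kurtosis = 85.665   # a normal distribution has kurtosis 3

# Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4)
jb = n_obs / 6 * (skew ** 2 + (kurtosis - 3) ** 2 / 4)

print(f"Jarque-Bera: {jb:.1f}")
```

For normally distributed residuals this statistic would be close to 0, so a value in the hundreds of thousands points the same way as the Q-Q plot.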