Some chatter from the internets

2016 Election

Question at hand: How will Obama's 46% approval rating affect his party's candidate in the 2016 presidential election?



How would you visualize this data?


Why is it ridiculous?

Inference for Regression

We can fit a line through any cloud of points that we please, but if we just have a sample of data, any trend we detect doesn't necessarily demonstrate that the trend exists in the population at large.

Plato's Allegory of the Cave

Statistical Inference

Goal: use statistics calculated from data to make inferences about the nature of parameters.

In regression,

  • parameters: \(\beta_0\), \(\beta_1\)
  • statistics: \(b_0\), \(b_1\)

Classical tools of inference:

  • Confidence Intervals
  • Hypothesis Tests

Unemployment and elections

Reigning theory: voters will punish candidates from the President's party at the ballot box when unemployment is high.

Unemployment and elections

Some evidence of a negative linear relationship between unemployment level and change in party support. Or is there?
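A quick way to eyeball this claim is to plot change in party support against unemployment and overlay a least-squares line. A minimal sketch, assuming the `ump` data frame with columns `unemp` and `change` used later in these notes:

library(ggplot2)
qplot(x = unemp, y = change, data = ump) +
  geom_smooth(method = "lm", se = FALSE)  # overlay the fitted line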

H-test for Regression

\(H_0:\) There is no relationship between unemployment level and change in party support.

\(H_0: \beta_1 = 0\)

Method

If there is no relationship, the pairing between \(X\) and \(Y\) is artificial and we can randomize:

  1. Create synthetic data sets under \(H_0\) by shuffling \(X\).
  2. Compute a new regression line for each data set and store each \(b_1\).
  3. See where your observed \(b_1\) falls in the distribution of \(b_1\)'s under \(H_0\).

library(ggplot2)                                  # for qplot()
ump_shuffled <- ump                               # start from a copy of the data
ump_shuffled$unemp <- sample(ump_shuffled$unemp)  # shuffle X to break any X-Y pairing
qplot(x = unemp, y = change, col = party, data = ump_shuffled)
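Steps 2 and 3 repeat this shuffle many times. A minimal sketch of the full randomization test, assuming the same `ump` data frame (1000 shuffles is an arbitrary choice):

# Step 2: slope from each synthetic data set generated under H0
b1_null <- replicate(1000, {
  shuffled <- ump
  shuffled$unemp <- sample(shuffled$unemp)
  coef(lm(change ~ unemp, data = shuffled))["unemp"]
})
# Step 3: compare the observed slope to the null distribution
b1_obs <- coef(lm(change ~ unemp, data = ump))["unemp"]
hist(b1_null)                        # null distribution of b1
abline(v = b1_obs)                   # observed b1
mean(abs(b1_null) >= abs(b1_obs))    # two-sided randomization p-value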

First \(b_1\)

Second \(b_1\)

100 \(b_1\)'s

Sampling dist. of \(b_1\)

H-tests for regression

m0 <- lm(change ~ unemp, data = ump)
summary(m0)
## 
## Call:
## lm(formula = change ~ unemp, data = ump)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.011  -7.861  -0.183   7.389  16.140 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   -6.714      5.457   -1.23     0.23
## unemp         -1.001      0.872   -1.15     0.26
## 
## Residual standard error: 9.11 on 25 degrees of freedom
## Multiple R-squared:  0.0501, Adjusted R-squared:  0.0121 
## F-statistic: 1.32 on 1 and 25 DF,  p-value: 0.262

H-tests for regression

  • Each coefficient line in the summary table reports a hypothesis test that the corresponding parameter is zero.
  • Under certain conditions, the test statistic associated with each \(b\) follows a \(t\) distribution with \(n - p\) degrees of freedom.

\[ \frac{b - \beta}{SE} \sim t_{df = n - p}\]

t_stat <- (-1.0010 - 0)/0.8717  # (estimate - hypothesized value) / standard error
pt(t_stat, df = 27 - 2) * 2     # two-sided p-value: n = 27 elections, p = 2 coefficients
## [1] 0.262
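The same estimate and standard error also give a confidence interval for \(\beta_1\), the other classical tool mentioned earlier. A sketch at the 95% level, using the fitted model above (output not shown):

# By hand: b1 plus or minus t* times SE, with n - p = 25 degrees of freedom
-1.0010 + c(-1, 1) * qt(0.975, df = 27 - 2) * 0.8717
# Or directly from the fitted model object
confint(m0, "unemp", level = 0.95)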

Conditions for inference

  1. Linearity: linear trend between \(X\) and \(Y\), check with residual plot.
  2. Independent errors: check with residual plot for serial correlation.
  3. Normally distributed errors: check that the residuals fall along a straight line in a normal qq-plot.
  4. Errors with constant variance: look for constant spread in residual plot.
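
These checks can be run on the fitted model object; a minimal sketch using base R's built-in diagnostics for `lm` objects:

plot(m0, which = 1)              # residuals vs fitted: linearity and constant variance
plot(m0, which = 2)              # normal qq-plot of the residuals: normality
plot(residuals(m0), type = "b")  # residuals in data order: look for serial correlation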