Inference for Means II

Naive analysis

Consider a data set containing the IQs of 36 men and 36 women. Can we use these data to test the following hypotheses?

\[ H_0: \mu_{M} - \mu_{F} = 0 \\ H_A: \mu_{M} - \mu_{F} \ne 0 \]

##     IQ    sex
## 24 112   male
## 4  113   male
## 44 120 female
## 43 119 female
## 52 118 female
## 30 111   male

(ds <- d %>%
  group_by(sex) %>%
  summarize(mean = mean(IQ),
            s = sd(IQ),
            n = n()))

## # A tibble: 2 × 4
##      sex  mean     s     n
##   <fctr> <dbl> <dbl> <int>
## 1 female   118  6.50    36
## 2   male   115  3.48    36

Two sample t-test

We have a point estimate

diff(ds$mean)

## [1] -3.39

We can calculate a standard error

sqrt(ds$s[1]^2/ds$n[1] + ds$s[2]^2/ds$n[2])

## [1] 1.23

We can calculate the df

min(ds$n[1] - 1, ds$n[2] - 1)

## [1] 35
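The point estimate, standard error, and df combine into a test statistic and a two-sided p-value. The following is a minimal sketch, assuming the rounded summary statistics printed earlier (s = 6.50 and 3.48, n = 36 per group) and the point estimate of -3.39. The variable names are illustrative and are not part of the original analysis.

```r
# Summary statistics copied from the output above (rounded values)
s <- c(female = 6.50, male = 3.48)   # sample standard deviations
n <- c(female = 36, male = 36)       # sample sizes
est <- -3.39                         # point estimate: male mean - female mean

# Standard error of the difference in means, treating the groups as independent
se <- sqrt(sum(s^2 / n))

# Conservative degrees of freedom: the smaller of n1 - 1 and n2 - 1
df <- min(n - 1)

# Test statistic under H0: mu_M - mu_F = 0, and the two-sided p-value
t_stat <- (est - 0) / se
p_value <- 2 * pt(-abs(t_stat), df = df)

c(se = se, df = df, t = t_stat, p = p_value)
```

This calculation is only meaningful if the two groups really are independent samples.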

Two sample t-test (cont.)

But we need to check conditions
- Nearly normal populations (barplots looked OK)
- Independent observations

Original Data

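Data were collected from schools in a large city on a set of thirty-six children who were identified as gifted soon after they reached the age of four. The data include the IQ of each child's father and mother (fatheriq and motheriq), so every male IQ in the naive analysis is naturally matched with one female IQ.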

head(gifted)

##   score fatheriq motheriq speak count read edutv cartoons
## 1   159      115      117    18    26  1.9  3.00     2.00
## 2   164      117      113    20    37  2.5  1.75     3.25
## 3   154      115      118    20    32  2.2  2.75     2.50
## 4   157      113      131    12    24  1.7  2.75     2.25
## 5   156      110      109    17    34  2.2  2.25     2.50
## 6   150      113      109    13    28  1.9  1.25     3.75

Paired data

If there is a natural pairing between observations in two groups of size n, it can make more sense to analyze them as a single sample of n differences.

gifted %>%
  mutate(diff = fatheriq - motheriq) %>%
  select(fatheriq, motheriq, diff)

##    fatheriq motheriq diff
## 1       115      117   -2
## 2       117      113    4
## 3       115      118   -3
## 4       113      131  -18
## 5       110      109    1
## 6       113      109    4
## 7       118      119   -1
## 8       117      120   -3
## 9       111      128  -17
## 10      122      120    2
## 11      111      117   -6
## 12      112      120   -8
## 13      119      126   -7
## 14      120      114    6
## 15      114      129  -15
## 16      111      118   -7
## 17      111      115   -4
## 18      115      111    4
## 19      126      111   15
## 20      115      109    6
## 21      114      124  -10
## 22      115      122   -7
## 23      115      118   -3
## 24      112      121   -9
## 25      115      124   -9
## 26      117      118   -1
## 27      116      128  -12
## 28      114      119   -5
## 29      116      123   -7
## 30      111      117   -6
## 31      112      117   -5
## 32      115      111    4
## 33      111      101   10
## 34      119      113    6
## 35      111      121  -10
## 36      114      123   -9

Paired t-test

\[ H_0: \mu_{diff} = 0 \\ H_A: \mu_{diff} \ne 0 \]

Check conditions

- Independent observations
- Nearly normal population
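One way to examine the nearly normal condition is to look at the distribution of the 36 differences. The sketch below is an assumption about how such a check might be done, not the check used in the original slides. It enters the two IQ columns by hand from the paired-data table and uses base R's stem() for a text display; the name d_iq is illustrative.

```r
# Father and mother IQs, typed in from the paired-data table above
fatheriq <- c(115, 117, 115, 113, 110, 113, 118, 117, 111, 122, 111, 112,
              119, 120, 114, 111, 111, 115, 126, 115, 114, 115, 115, 112,
              115, 117, 116, 114, 116, 111, 112, 115, 111, 119, 111, 114)
motheriq <- c(117, 113, 118, 131, 109, 109, 119, 120, 128, 120, 117, 120,
              126, 114, 129, 118, 115, 111, 111, 109, 124, 122, 118, 121,
              124, 118, 128, 119, 123, 117, 117, 111, 101, 113, 121, 123)
d_iq <- fatheriq - motheriq   # one difference per child

# Stem-and-leaf display: look for rough symmetry and no extreme outliers
stem(d_iq)

# Numeric companion: quartiles, mean, and range of the differences
summary(d_iq)

# Graphical alternatives for an interactive session:
# hist(d_iq)
# qqnorm(d_iq); qqline(d_iq)
```

Independence has to be argued from how the data were collected: the differences come from 36 different children, one difference per child.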

Paired t-test (cont.)

Compute a test statistic

(gs <- gifted %>%
  mutate(diff = fatheriq - motheriq) %>%
  summarize(mean = mean(diff), s = sd(diff), n = n()))

##    mean    s  n
## 1 -3.39 7.45 36

(t_obs <- (gs$mean - 0)/(gs$s/sqrt(gs$n)))

## [1] -2.73

\(df = n - 1\)
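With the test statistic and df in hand, the remaining step is the p-value. The following sketch assumes the IQ columns are entered by hand from the paired-data table (so it runs without the gifted data frame) and cross-checks the manual calculation against base R's t.test(); the name d_iq is illustrative.

```r
# Father and mother IQs, typed in from the paired-data table above
fatheriq <- c(115, 117, 115, 113, 110, 113, 118, 117, 111, 122, 111, 112,
              119, 120, 114, 111, 111, 115, 126, 115, 114, 115, 115, 112,
              115, 117, 116, 114, 116, 111, 112, 115, 111, 119, 111, 114)
motheriq <- c(117, 113, 118, 131, 109, 109, 119, 120, 128, 120, 117, 120,
              126, 114, 129, 118, 115, 111, 111, 109, 124, 122, 118, 121,
              124, 118, 128, 119, 123, 117, 117, 111, 101, 113, 121, 123)
d_iq <- fatheriq - motheriq

n <- length(d_iq)
t_obs <- (mean(d_iq) - 0) / (sd(d_iq) / sqrt(n))   # same statistic as above
df <- n - 1

# Two-sided p-value: probability of a statistic at least this extreme under H0
p_value <- 2 * pt(-abs(t_obs), df = df)

# Cross-check with the built-in paired test
tt <- t.test(fatheriq, motheriq, paired = TRUE)

c(t = t_obs, df = df, p = p_value)
tt
```

The manual p-value and the one reported by t.test() should agree, since both use a t distribution with n - 1 degrees of freedom.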

Paired compared

gs$s/sqrt(gs$n)

## [1] 1.24

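sqrt(ds$s[1]^2/ds$n[1] + ds$s[2]^2/ds$n[2])

## [1] 1.23

The point estimate is the same in the paired and independent analyses, but the standard errors differ because the paired SE accounts for the dependency within pairs. When paired observations are positively correlated, the paired SE is smaller than the independent-samples SE. Here the two are nearly equal (1.24 vs. 1.23), which indicates that the fathers' and mothers' IQs in this sample are close to uncorrelated.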

Pairing is widely used in experimental design, e.g. pre- and post-test measurements on the same subjects.
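The relationship between the two standard errors follows from the identity Var(X - Y) = Var(X) + Var(Y) - 2 Cov(X, Y): the independent-samples SE drops the covariance term, so pairing reduces the SE when the covariance is positive. The sketch below illustrates this, assuming the IQ columns are entered by hand from the paired-data table; the variable names are illustrative.

```r
# Father and mother IQs, typed in from the paired-data table above
fatheriq <- c(115, 117, 115, 113, 110, 113, 118, 117, 111, 122, 111, 112,
              119, 120, 114, 111, 111, 115, 126, 115, 114, 115, 115, 112,
              115, 117, 116, 114, 116, 111, 112, 115, 111, 119, 111, 114)
motheriq <- c(117, 113, 118, 131, 109, 109, 119, 120, 128, 120, 117, 120,
              126, 114, 129, 118, 115, 111, 111, 109, 124, 122, 118, 121,
              124, 118, 128, 119, 123, 117, 117, 111, 101, 113, 121, 123)
n <- length(fatheriq)

# Variance of the differences equals var(x) + var(y) - 2 * cov(x, y)
v_diff <- var(fatheriq - motheriq)
v_sum  <- var(fatheriq) + var(motheriq)
cov_xy <- cov(fatheriq, motheriq)
all.equal(v_diff, v_sum - 2 * cov_xy)

# The two standard errors differ only through the covariance term
se_paired      <- sqrt(v_diff / n)
se_independent <- sqrt(v_sum / n)
c(paired = se_paired, independent = se_independent,
  r = cor(fatheriq, motheriq))
```

In a pre- and post-test design the two measurements on a subject are usually strongly positively correlated, which is where the reduction in SE from pairing is largest.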