Exercise 3.1

Suppose that \(8\%\) of college students are vegetarians.

  1. False. Here we need both conditions \[ \begin{align} n p &\geq 10 \\ n (1-p) &\geq 10 \end{align} \] to hold. With \(n=60\), the first condition is not satisfied: \[ \begin{align} n p &= 60 \times 0.08 \\ &=4.8 \end{align} \]

  2. True. Because the body of the distribution (centered near \(p=0.08\)) is so close to the lower bound of zero, the distribution will be right skewed. The figure in exercise 2.11 on page 115 shows an example of such a sampling distribution. As the sample size increases, the spread of the sampling distribution becomes small with respect to that bound, and the distribution becomes more symmetrically distributed around \(p\).

  3. We can construct a confidence interval around our point estimate (\(\hat{p} = 12\%\)). Note that this situation just meets the criteria discussed in part 1 (\(125 \times 0.08=10\)). For an arbitrary estimator, \(\hat{\theta}\), the confidence interval is \(\hat{\theta} \pm z^{*} \times SE\), where \(z^{*}\) is the critical value for our confidence level and SE is the standard error of the sampling distribution. Since we are working with a single proportion, the standard error is \[ SE = \sqrt{\frac{p(1-p)}{n}} \] For \(n=125\) and \(\hat{p}=0.12\) the standard error is \[ \begin{align} SE &= \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \\ &= \sqrt{\frac{0.12 \times (1-0.12)}{125}} \\ &= 0.029 \end{align} \]

This gives us a \(95\%\) confidence interval of \[ \begin{align} \hat{p} &\pm z^{*} \times SE \\ \hat{p} &\pm 1.96 \times 0.029 \\ \hat{p} &\pm 0.057~{\rm or} \\ (0&.063, 0.177) \end{align} \] This interval contains the null proportion of \(8\%\). Thus, we do not consider our point estimate to be unusual.

Or, with a hypothesis test: \[ \begin{align} H_{0} &: p = 0.08 \\ H_{A} &: p \neq 0.08 \end{align} \] with z-score \[ \begin{align} z &= \frac{(\hat{p} - 0.08)}{SE} \\ &= \frac{(\hat{p} - 0.08)}{ \sqrt{ \frac{\hat{p} (1-\hat{p})}{n} } } \\ &= \frac{(0.12 - 0.08)}{ \sqrt{ \frac{0.12 (1-0.12)}{125} } } \\ &= 1.38 \end{align} \] With this z-score (two-sided p-value \(\approx 0.17\)) we fail to reject the null hypothesis, i.e. this value is not unusual.
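As a quick numerical check, here is a minimal sketch of the part 3 calculations in Python (assuming SciPy is available; the helper name `one_prop_summary` is my own, not something from the text):

```python
# A sketch of the one-proportion confidence interval and z-test used in part 3.
# Following the solution above, the standard error is built from the point estimate.
from math import sqrt
from scipy.stats import norm

def one_prop_summary(p_hat, p0, n, conf=0.95):
    """Return (SE, confidence interval, z, two-sided p-value) for a single proportion."""
    se = sqrt(p_hat * (1 - p_hat) / n)        # standard error from the point estimate
    z_star = norm.ppf(1 - (1 - conf) / 2)     # critical value (1.96 for 95%)
    ci = (p_hat - z_star * se, p_hat + z_star * se)
    z = (p_hat - p0) / se                     # z-score against the null proportion p0
    return se, ci, z, 2 * norm.sf(abs(z))     # two-sided p-value

print(one_prop_summary(0.12, 0.08, 125))
# roughly: SE = 0.029, CI = (0.063, 0.177), z = 1.38, p-value = 0.17
```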

  4. Now we have a sample with \(n=250\) and \(\hat{p}=0.12\). We estimate the standard error the same way as in part 3 \[ \begin{align} SE &= \sqrt{\frac{\hat{p}(1-\hat{p})}{n}} \\ &= \sqrt{\frac{0.12 \times (1-0.12)}{250}} \\ &= 0.021 \end{align} \] And we obtain the corresponding (\(95\%\)) confidence interval. \[ \begin{align} \hat{p} &\pm z^{*} \times SE \\ \hat{p} &\pm 1.96 \times 0.021 \\ \hat{p} &\pm 0.040~{\rm or} \\ (0&.08, 0.16) \end{align} \] The lower bound of our confidence interval falls right on the null proportion. So, although this is not the strongest evidence that this observation is unusual, we should be suspicious that it could be.

Or, using the same hypothesis test as in part 3, we can calculate the z-score \[ \begin{align} z &= \frac{(\hat{p} - 0.08)}{SE} \\ &= \frac{(\hat{p} - 0.08)}{ \sqrt{ \frac{\hat{p} (1-\hat{p})}{n} } } \\ &= \frac{(0.12 - 0.08)}{ \sqrt{ \frac{0.12 (1-0.12)}{250} } } \\ &= 1.95 \end{align} \] This puts the observation at roughly the 97.4th percentile, for a two-sided p-value of about 0.05. In other words, suggestive, but not convincing, that this observation is unusual.
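Continuing the sketch above, the same helper applied to part 4 (again assuming SciPy; `one_prop_summary` is defined in the previous block):

```python
# Part 4: doubling the sample size tightens the interval and raises the z-score.
print(one_prop_summary(0.12, 0.08, 250))
# roughly: SE = 0.021, CI = (0.080, 0.160), z = 1.95, p-value = 0.05
```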

  5. False. As we can see, assuming that both samples are drawn from a population with the same known parameter, \(p\), \[ \begin{align} \frac{SE_{1}}{SE_{2}} &= \frac{\sqrt{\frac{p(1-p)}{n_{1}}}}{\sqrt{\frac{p(1-p)}{n_{2}}}} \\ &= \frac{\sqrt{n_{2}}}{\sqrt{n_{1}}} \\ &= \sqrt{\frac{n_{2}}{n_{1}}} \end{align} \]

The constant of proportionality that relates the two standard errors is the square root of the ratio of the sample sizes (\(\sqrt{\frac{n_{2}}{n_{1}}}\)), not the ratio of the sample sizes (\(\frac{n_{2}}{n_{1}}\)). In the case of parts 3 \(\&\) 4, the standard error is reduced by a factor of \(\frac{1}{\sqrt{2}} \approx 0.707\), not by half.
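A one-line check of that scaling, using the known \(p = 0.08\):

```python
# Part 5: going from n = 125 to n = 250 shrinks the SE by 1/sqrt(2), not by 1/2.
from math import sqrt

se_125 = sqrt(0.08 * 0.92 / 125)
se_250 = sqrt(0.08 * 0.92 / 250)
print(se_250 / se_125, 1 / sqrt(2))   # both roughly 0.707
```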

Exercise 3.35

This problem looks at a randomized drug trial for HIV-positive women giving birth. The sample size is \(n_{Tot} = 240\) with \(n_{Nev}=120\) and \(n_{Lop}=120\). We also know that the counts of virologic failure were \(n_{Nev, failure} =26\) and \(n_{Lop, failure}=10\).

  1. We can present these results in a two-way table:

| Drug | Failure | No Failure |
|------------|---------|------------|
| Nevaripine | 26 | 94 |
| Lopinavir | 10 | 110 |
  2. If we want to test for independence of treatment and virologic failure, then we want to see whether the proportions are the same for each group or not, so our hypotheses will be: \[ \begin{align} H_{0} &: p_{Nev} - p_{Lop} = 0 \\ H_{A} &: p_{Nev} - p_{Lop} \neq 0 \end{align} \]
  3. The conditions for two proportions are the same as for one proportion, and both samples must satisfy them. Since \(\hat{p}_{Nev}=\frac{26}{120} = 0.22\) gives \(n\hat{p}=26\), and \(\hat{p}_{Lop}=\frac{10}{120} = 0.08\) gives \(n\hat{p}=10\) (with \(n(1-\hat{p})\) equal to 94 and 110, respectively), we will call these conditions (barely) satisfied, and use a z-test.

\[ \begin{align} z &= \frac{(\hat{p}_{Nev} - \hat{p}_{Lop}) - (p_{Nev} - p_{Lop})}{SE} \\ &= \frac{(\hat{p}_{Nev} - \hat{p}_{Lop})}{\sqrt{\frac{p_{Nev}(1-p_{Nev})}{n_{Nev}} + \frac{p_{Lop}(1-p_{Lop})}{n_{Lop}}}} \end{align} \] where the second line uses the fact that, under \(H_{0}\), \(p_{Nev} - p_{Lop} = 0\).

Now at this stage, \(p_{Nev}\) and \(p_{Lop}\) can be replaced with \(\hat{p}_{Nev}\) and \(\hat{p}_{Lop}\) in the standard error, or the pooled standard error can be used: \[ SE = \sqrt{\hat{p}_{pooled}(1-\hat{p}_{pooled}) \left( \frac{1}{n_{1}} + \frac{1}{n_{2}} \right)} \\ {\rm with} \\ \hat{p}_{pooled} = \frac{\hat{p}_{1} n_{1} + \hat{p}_{2} n_{2}}{n_{1}+n_{2}} \\ \] If we use the former, \[ \begin{align} z &= \frac{(0.22 - 0.08)}{\sqrt{\frac{0.22(1-0.22)}{120} + \frac{0.08(1-0.08)}{120}}} \\ &= 3.1 \end{align} \] With a z-score of 3.1, the probability of seeing a difference in proportions at least this large, if the null hypothesis were true, is about \((1-0.999) \times 2 \approx 0.002\). Therefore, we reject the null hypothesis and conclude that the proportion of virologic failure does depend on treatment.
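A minimal sketch of this two-proportion z-test, assuming SciPy is available. It uses the exact counts, so it reports \(z \approx 2.9\); the hand calculation above rounds the proportions to 0.22 and 0.08, which is why it lands on 3.1. The conclusion is the same either way.

```python
# Two-proportion z-test for Exercise 3.35 (nevaripine vs. lopinavir, virologic failure).
from math import sqrt
from scipy.stats import norm

x_nev, n_nev = 26, 120
x_lop, n_lop = 10, 120
p_nev, p_lop = x_nev / n_nev, x_lop / n_lop

# Unpooled standard error (plugging the sample proportions into the SE formula)
se_unpooled = sqrt(p_nev * (1 - p_nev) / n_nev + p_lop * (1 - p_lop) / n_lop)

# Pooled standard error (the usual choice when the null hypothesis is p_Nev = p_Lop)
p_pool = (x_nev + x_lop) / (n_nev + n_lop)
se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_nev + 1 / n_lop))

for se in (se_unpooled, se_pooled):
    z = (p_nev - p_lop) / se
    print(z, 2 * norm.sf(abs(z)))   # z of roughly 2.9, two-sided p-value below 0.01
```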

Exercise 3.37

  1. False, the \(\chi^{2}\) distribution has one parameter, degrees of freedom (\(k-1\), where \(k\) is the number of categories).
  2. True, the degree of its skewness decreases as the degrees of freedom increase.
  3. True, the \(\chi^{2}\) statistic is the sum of a series of squared numbers, hence it is (almost) always positive. (Also accept with increased credit: False, if the observed and expected counts exactly match in every category then \(\chi^{2}=0\), so it is always non-negative, but not always positive.)
  4. False, as the degrees of freedom increase the \(\chi^{2}\) distribution becomes less skewed. See, also, part (b) of this exercise.
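A small illustration of parts 2 and 4, assuming SciPy is available (the skewness of a \(\chi^{2}\) distribution with \(k\) degrees of freedom is \(\sqrt{8/k}\)):

```python
# The chi-square distribution becomes less skewed as the degrees of freedom grow.
from scipy.stats import chi2

for df in (2, 4, 9, 25, 100):
    print(df, float(chi2.stats(df, moments='s')))   # skewness = sqrt(8/df)
```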

Exercise 3.39

  1. The hypotheses are: \[ \begin{align} H_{0} &: p_{purchase} = 0.60, p_{print~web} = 0.25, p_{read~online}=0.15 \\ H_{A} &: {\rm At~least~one~proportion~is~different} \end{align} \]
  2. The professor expected \(n \times p_{j}\) students in each of the \(j\) categories, with \(n=126\), thus: \[ \begin{align} Expected_{purchase} &= 126 \times 0.60 = 75.6\\ Expected_{print~web} &= 126 \times 0.25 = 31.5 \\ Expected_{read~online} &= 126 \times 0.15 = 18.9 \\ \end{align} \]

  3. The conditions for a \(\chi^{2}\) test are
    1. Independent observations
    2. Each expected cell count is at least 5
    3. There are three or more categories (for two categories we would use a difference of proportions test)

The expected counts in each cell are greater than 5, there are three categories, and independence seems reasonable.

  4. The \(\chi^{2}\) statistic is: \[ \chi^2 = \sum_{j=1}^{k} \frac{(Obs_{j}-Exp_{j})^{2}}{Exp_{j}} \] which in this case is \[ \begin{align} \chi^2 &= \sum_{j=1}^{k} \frac{(Obs_{j}-Exp_{j})^{2}}{Exp_{j}} \\ &=\frac{(71-75.6)^{2}}{75.6} + \frac{(30-31.5)^{2}}{31.5} + \frac{(25-18.9)^{2}}{18.9} \\ &= 0.28 + 0.07 + 1.97 \\ &= 2.32 \end{align} \] Since there are three categories, there are two degrees of freedom. The p-value associated with this \(\chi^{2}\) value with two degrees of freedom is greater than 0.3 (about 0.31; see the sketch after the next part).

  5. Based on the above p-value, we fail to reject the null hypothesis. That is, if the null hypothesis were true, counts like those the professor observed would not be unusual.
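A minimal sketch of this goodness-of-fit calculation, assuming SciPy is available:

```python
# Chi-square goodness-of-fit test for Exercise 3.39.
from scipy.stats import chisquare

observed = [71, 30, 25]                           # purchase, print/web, read online
expected = [126 * p for p in (0.60, 0.25, 0.15)]  # 75.6, 31.5, 18.9
result = chisquare(observed, f_exp=expected)
print(result.statistic, result.pvalue)            # roughly 2.32 and 0.31
```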

Exercise 3.42

  1. The appropriate test in this case is a \(\chi^{2}\) test for independence.

  2. The competing hypotheses are: \[ \begin{align} H_{0} &: {\rm Coffee~consumption~and~clinical~depression~are~independent} \\ H_{A} &: {\rm Coffee~consumption~and~clinical~depression~are~not~independent} \end{align} \]
  3. The overall proportion of women with clinical depression is \(P(Yes~Depression) = \frac{2607}{50739} = 0.051\). Thus, the proportion of women without clinical depression is \(P(No~Depression) = \frac{48132}{50739} = 0.949\).

  4. We know that the expected count for this cell is \[ \begin{align} n_{row~i} \times p_{j} &= \frac{n_{column~j}}{n_{total}} \times n_{row~i} \\ &= \frac{6617}{50739} \times 2607 \\ &= 339.99 \end{align} \] The contribution to the total \(\chi^{2}\) from this cell is \[ \begin{align} \frac{(Observed - Expected)^{2}}{Expected} &= \frac{(373-339.99)^{2}}{339.99} \\ &= 3.21 \end{align} \] (These values are checked in the sketch after part 7.)

  5. We can compare \(\chi^{2} = 20.93\) to a \(\chi^{2}\) distribution to get the p-value. We have two rows and five columns, for four degrees of freedom. \[ \begin{align} df &= (n_{rows} - 1) \times (n_{columns} - 1) \\ &= (2 - 1) \times (5 - 1) \\ &= 4 \end{align} \] To reject the null hypothesis at a significance level of \(0.001\) we would need a \(\chi^{2}\) of at least 18.47, and \(20.93 > 18.47\).

  6. We reject the null hypothesis.

  7. Yes, this study tells us that clinical depression is not independent of the amount of coffee consumed. The full nature of this relationship, and therefore any health-policy recommendations, will require further study.
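A sketch checking the expected count and cell contribution from part 4, and the p-value and critical value from part 5, assuming SciPy is available:

```python
# Exercise 3.42: expected count and chi-square contribution for the highlighted cell,
# plus the p-value and the 0.001 critical value for 4 degrees of freedom.
from scipy.stats import chi2

row_total, col_total, grand_total = 2607, 6617, 50739
observed = 373

expected = row_total * col_total / grand_total
contribution = (observed - expected) ** 2 / expected
print(expected, contribution)      # roughly 340.0 and 3.21

print(chi2.sf(20.93, 4))           # p-value, roughly 0.0003
print(chi2.ppf(0.999, 4))          # critical value at the 0.001 level, roughly 18.47
```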