**Categorical Data: Comparing Two Proportions**

**Learning Objectives:**

- Be able to conduct a hypothesis test comparing two proportions
- Understand what a “pooled estimate” is for conducting these calculations
- Be able to interpret the results of a hypothesis test comparing two proportions

In class, we have already learned how to conduct a single sample hypothesis test for proportions. In those tests, we compared a single sample’s proportion to what we expected the true population proportion to be.

Today, we take it a step further. We want to compare the proportions of two samples in order to determine *whether there is a difference between the two populations from which they are taken.*

**Learning Objective 1: Conduct a Hypothesis Test Comparing Two Proportions**

To compare two populations, we can estimate the difference between their sample proportions. To compare population proportions $\pi_1$ and $\pi_2$, we can look at the difference between the sample proportions, $\hat{\pi}_1$ and $\hat{\pi}_2$. (Remember the sample proportion is denoted $\hat{\pi}$ while the true population proportion is denoted $\pi$).

The quantity $\hat{\pi}_2 - \hat{\pi}_1$ then becomes its own statistic with its own sampling distribution.

This is exactly the same logic as when we used $\hat{\pi}$ with one sample to estimate $\pi$. Just like the other estimators we work with, $\hat{\pi}_2 - \hat{\pi}_1$ has a sampling distribution that should be centered around the true population value of $\pi_2 - \pi_1$.
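To see this centering in action, here is a small simulation sketch. The population proportions (0.30 and 0.50), the sample sizes, and the function name `simulate_differences` are all made up for illustration; they are not from the example used later in these notes.

```python
import random

def simulate_differences(pi_1, pi_2, n_1, n_2, reps=2000, seed=0):
    """Draw many pairs of samples and record hat(pi_2) - hat(pi_1) each time."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(reps):
        # Each observation is a "success" with the population's true proportion.
        hat_1 = sum(rng.random() < pi_1 for _ in range(n_1)) / n_1
        hat_2 = sum(rng.random() < pi_2 for _ in range(n_2)) / n_2
        diffs.append(hat_2 - hat_1)
    return diffs

# Hypothetical populations: pi_1 = 0.30 and pi_2 = 0.50, 100 observations each.
diffs = simulate_differences(pi_1=0.30, pi_2=0.50, n_1=100, n_2=100)
mean_diff = sum(diffs) / len(diffs)

print(f"true difference:              {0.50 - 0.30:.3f}")
print(f"average simulated difference: {mean_diff:.3f}")
```

The average of the simulated differences should land close to the true difference of 0.20, even though any single sample difference can miss it by a fair amount.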

Remember from our single sample hypothesis tests that when we perform a hypothesis test, we have a null hypothesis that tells us the true population proportion is equivalent to some hypothesized value. Through the hypothesis test, we are attempting to see how surprised we would be to obtain our sample proportion given that the null hypothesis were true.

When conducting a hypothesis test comparing two proportions, we use the same logic as before. We still go through the five parts of a hypothesis test. Here are the first three.

**Assumptions:** As always, we assume that the data are obtained randomly and that the sample size is large enough that the sampling distribution of $\hat{\pi}_2 - \hat{\pi}_1$ is approximately normal.

**Hypotheses:** We are trying to see if there is a difference between two population proportions. The null hypothesis is that there is no difference; the alternative hypothesis is that there is a difference. That is, $H_0: \pi_2 - \pi_1 = 0$ (equivalently, $\pi_2 = \pi_1$) and $H_a: \pi_2 - \pi_1 \neq 0$ (equivalently, $\pi_2 \neq \pi_1$).

**Calculate a Test Statistic:** Just as we did with the single sample hypothesis test, we want to calculate how many standard errors our estimate is from the null hypothesis value. We do this by using the z score (remember, we are assuming a large sample).

$$ Z = \frac{(\hat{\pi}_2 - \hat{\pi}_1) - 0}{se} $$

**Learning Objective 2: Understand what a “pooled estimate” is.**

Here, I will stop to explain how to calculate the standard error. Because we are assuming under our null hypothesis that there is no difference between $\pi_1$ and $\pi_2$, we have to calculate a *pooled estimate* of the sample proportion to use in our standard error. A pooled estimate estimates the common value of $\pi_1$ and $\pi_2$ by the sample proportion for the entire sample.

What does that mean, though? Think about it with an example. Let’s say that we want to know whether there is a difference in the proportion of A’s in math class between students who participated in a tutoring program and students who did not. There are 40 students who did the tutoring program, and 14 of them got A’s. There are 52 students who did not do the tutoring program, and 12 of them got A’s. So the proportion of A’s for the tutored group is 14/40 = .35, and the proportion of A’s for the group that was not tutored is 12/52 = .2308.

Because we are assuming that there is no difference between the two groups, the pooled estimate is the proportion of A’s received by all of the students.

$$\hat{\pi}_{pooled}=(14+12)/(40+52) = 26/92 = .2826 $$
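The same arithmetic can be checked in a few lines of Python. This is only a sketch of the calculation above; the variable names are mine.

```python
# Counts from the tutoring example.
a_tutored, n_tutored = 14, 40        # A's and group size, tutored
a_untutored, n_untutored = 12, 52    # A's and group size, not tutored

# Separate sample proportions for each group.
p_tutored = a_tutored / n_tutored
p_untutored = a_untutored / n_untutored

# Pooled estimate: all A's divided by all students, ignoring group membership.
pooled = (a_tutored + a_untutored) / (n_tutored + n_untutored)

print(f"tutored:     {p_tutored:.4f}")
print(f"not tutored: {p_untutored:.4f}")
print(f"pooled:      {pooled:.4f}")
```

Notice that the pooled estimate falls between the two group proportions, and sits closer to the proportion of the larger group.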

If we are not given the raw numbers that make up the proportions, we can still calculate the pooled estimate as:

$$ \hat{\pi}_{pooled}=\frac{(\hat{\pi}_1) (n_1) + (\hat{\pi}_2)(n_2) }{n_1 + n_2} $$
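As a quick check that this formula agrees with the count-based calculation, here is a minimal sketch. The helper name `pooled_from_proportions` is hypothetical, chosen just for this illustration.

```python
def pooled_from_proportions(p_1, n_1, p_2, n_2):
    """Pooled estimate as a sample-size-weighted average of two sample proportions."""
    return (p_1 * n_1 + p_2 * n_2) / (n_1 + n_2)

# Tutoring example: the proportions are 14/40 and 12/52.
from_proportions = pooled_from_proportions(14 / 40, 40, 12 / 52, 52)
from_counts = (14 + 12) / (40 + 52)

print(f"from proportions: {from_proportions:.4f}")
print(f"from raw counts:  {from_counts:.4f}")
```

Multiplying each proportion by its sample size simply recovers the number of successes in that group, which is why the two approaches match. If you start from rounded proportions (such as .2308), expect a small rounding difference.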

The pooled estimate can be thought of as the sample proportion under the null hypothesis. It is $\hat{\pi}$ when we assume that the explanatory variable (in our example, the math tutoring program) has no effect.

Using this pooled estimate, we calculate the standard error with the formula

$$\sqrt{(se_1)^2 + (se_2)^2}$$

which is the same thing as

$$\sqrt{\frac{\hat{\pi}_{pooled} (1-\hat{\pi}_{pooled})}{n_1} +\frac{\hat{\pi}_{pooled} (1-\hat{\pi}_{pooled})}{n_2}}$$

Since $\hat{\pi}_{pooled}$ is the same, we can simplify our standard error formula under the null hypothesis to:

$$ se = \sqrt{\hat{\pi}_{pooled}(1-\hat{\pi}_{pooled}) \times \left(\frac{1}{n_1}+\frac{1}{n_2}\right)} $$

(Remember that $\sqrt{\frac{\hat{\pi}(1-\hat{\pi})}{n}}$ is the formula for the standard error of a proportion.)
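Here is a sketch of the standard error and z score for the tutoring example. The notes do not say which group is group 1 and which is group 2, so I am assuming group 1 is the students who were not tutored and group 2 is the tutored students. Swapping the labels would only flip the sign of z.

```python
import math

# Assumed labeling: group 1 = not tutored, group 2 = tutored.
x_1, n_1 = 12, 52
x_2, n_2 = 14, 40

p_hat_1 = x_1 / n_1
p_hat_2 = x_2 / n_2
pooled = (x_1 + x_2) / (n_1 + n_2)

# Standard error under the null hypothesis, using the pooled estimate.
se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))

# Number of standard errors between the observed difference and 0.
z = (p_hat_2 - p_hat_1 - 0) / se

print(f"pooled estimate: {pooled:.4f}")
print(f"standard error:  {se:.4f}")
print(f"z score:         {z:.3f}")
```

With these numbers the standard error comes out to roughly 0.095 and the z score to roughly 1.26, meaning the observed difference is a little more than one standard error away from zero.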

**P-value:** The p-value is once again the two-tail probability, assuming that the null hypothesis is true and there is no difference between $\pi_1$ and $\pi_2$, of getting a z score that would exceed the observed z score in absolute value.
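If you would rather not use a z table, the two-tail probability can be computed with `statistics.NormalDist` from Python's standard library (Python 3.8 or later). The function name `two_tailed_p_value` is mine, and the z scores passed in are just illustrative values.

```python
from statistics import NormalDist

def two_tailed_p_value(z):
    """Probability, under a standard normal, of a z score at least as extreme as z in absolute value."""
    upper_tail = 1 - NormalDist().cdf(abs(z))
    return 2 * upper_tail  # double it to count both tails

print(f"z = 1.96 -> p = {two_tailed_p_value(1.96):.3f}")
print(f"z = -1.96 -> p = {two_tailed_p_value(-1.96):.3f}")
print(f"z = 0.50 -> p = {two_tailed_p_value(0.50):.3f}")
```

Because the function takes the absolute value of z, a positive and a negative z score of the same size give the same p-value, which is what "two-tail" means.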

**Learning Objective 3: Interpret the Results of a Hypothesis Test Comparing Two Proportions**

**Conclusion:** Is our p-value greater than or less than our alpha level? If it is greater than our alpha level, we fail to reject the null hypothesis that there is no difference between the two populations. If our p-value is less than our alpha level, we reject the null hypothesis and conclude that the two population proportions differ.
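Putting all the steps together, here is a minimal sketch of the whole test applied to the tutoring example. The function name `two_proportion_z_test`, the group labeling (group 1 = not tutored, group 2 = tutored), and the alpha level of 0.05 are my assumptions for illustration; the notes do not fix an alpha level.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x_1, n_1, x_2, n_2):
    """Two-sided z test of H0: pi_2 - pi_1 = 0 using the pooled standard error."""
    p_hat_1 = x_1 / n_1
    p_hat_2 = x_2 / n_2
    pooled = (x_1 + x_2) / (n_1 + n_2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_1 + 1 / n_2))
    z = (p_hat_2 - p_hat_1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Assumed labeling: group 1 = not tutored (12 of 52), group 2 = tutored (14 of 40).
z, p_value = two_proportion_z_test(12, 52, 14, 40)

alpha = 0.05  # assumed alpha level
reject_null = p_value < alpha

print(f"z = {z:.3f}, p-value = {p_value:.3f}")
if reject_null:
    print("Reject the null hypothesis: the proportions appear to differ.")
else:
    print("Fail to reject the null hypothesis of no difference.")
```

For these data the p-value is about 0.21, which is larger than 0.05, so under these assumptions we would fail to reject the null hypothesis. The tutored group had a higher share of A’s in the sample, but a difference that size is not surprising if tutoring had no effect.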

The above link does a really good job (in under 3 minutes) of explaining in more detail the test statistic used in a difference of two proportions hypothesis test and the pooled estimate.