The two-sample t-test (also known as the independent samples t-test) is a method used to test whether the unknown population means of two groups are equal or not.
The two-sample t-test is also commonly used to analyze the results of A/B tests.
You can use the test when your data values are independent, are randomly sampled from two normal populations and the two independent groups have equal variances.
If you have more than two groups, use a multiple comparison method. Analysis of variance (ANOVA) is one such method. Other multiple comparison methods include the Tukey-Kramer test of all pairwise differences, analysis of means (ANOM) to compare group means to the overall mean, or Dunnett’s test to compare each group mean to a control mean.
If the variances of the two groups are not equal, you can still use the two-sample t-test, but with a different estimate of the standard deviation.
If your sample sizes are very small, you might not be able to test for normality. You might need to rely on your understanding of the data. When you cannot safely assume normality, you can perform a nonparametric test that doesn’t assume normality.
The sections below discuss what is needed to perform the test, checking our data, how to perform the test and statistical details.
For the two-sample t-test, we need two variables. One variable defines the two groups. The second variable is the measurement of interest.
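As an illustration of this layout, the sketch below stores each observation as a (group, measurement) pair and then splits the measurements by group. The records and the `split_by_group` helper are hypothetical and are not the body fat data discussed later.

```python
from collections import defaultdict

# Hypothetical records: (group label, measurement). Not the article's data.
records = [("women", 21.0), ("men", 14.5), ("women", 24.2), ("men", 16.1)]

def split_by_group(rows):
    """Collect the measurement values under each group label."""
    groups = defaultdict(list)
    for label, value in rows:
        groups[label].append(value)
    return dict(groups)

groups = split_by_group(records)
print(groups)
```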
We also have an idea, or hypothesis, that the means of the underlying populations for the two groups are different. Here are a couple of examples:
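To conduct a valid test, the data values must be independent, the data in each group must be a random sample from its population, the data in each group must be normally distributed, and the variances of the two independent groups must be equal.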
For very small groups of data, it can be hard to test these requirements. Below, we'll discuss how to check the requirements using software and what to do when a requirement isn’t met.
One way to measure a person’s fitness is to measure their body fat percentage. Average body fat percentages vary by age, but according to some guidelines, the normal range for men is 15-20% body fat, and the normal range for women is 20-25% body fat.
Our sample data is from a group of men and women who did workouts at a gym three times a week for a year. Then, their trainer measured the body fat. The table below shows the data.
You can clearly see some overlap in the body fat measurements for the men and women in our sample, but also some differences. Just by looking at the data, it's hard to draw any solid conclusions about whether the underlying populations of men and women at the gym have the same mean body fat. That is the value of statistical tests – they provide a common, statistically valid way to make decisions, so that everyone makes the same decision on the same set of data values.
Let’s start by answering: Is the two-sample t-test an appropriate method to evaluate the difference in body fat between men and women?
Before jumping into analysis, we should always take a quick look at the data. The figure below shows histograms and summary statistics for the men and women.
Figure 1: Histogram and summary statistics for the body fat data
The two histograms are on the same scale. From a quick look, we can see that there are no very unusual points, or outliers. The data look roughly bell-shaped, so our initial idea of a normal distribution seems reasonable.
Examining the summary statistics, we see that the standard deviations are similar. This supports the idea of equal variances. We can also check this using a test for variances.
Based on these observations, the two-sample t-test appears to be an appropriate method to test for a difference in means.
For each group, we need the average, standard deviation and sample size. These are shown in the table below.
| Group | Sample size (n) | Average body fat % (X-bar) | Standard deviation (s) |
|---|---|---|---|
| Women | 10 | 22.29 | 5.32 |
| Men | 13 | 14.95 | 6.84 |
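The three summary values for a group can be computed with Python's standard library, as in the minimal sketch below. The `sample` values are hypothetical placeholders, not the gym measurements, and `summarize` is an illustrative helper name. Note that `statistics.stdev` divides by n - 1, which is the sample standard deviation the t-test uses.

```python
import statistics

def summarize(values):
    """Return (sample size, average, sample standard deviation)."""
    return len(values), statistics.mean(values), statistics.stdev(values)

# Hypothetical measurements for one group (not the article's data).
sample = [18.0, 22.0, 20.0, 24.0]
n, avg, sd = summarize(sample)
print(n, avg, round(sd, 3))
```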
Without doing any testing, we can see that the averages for men and women in our samples are not the same. But how different are they? Are the averages “close enough” for us to conclude that mean body fat is the same for the larger population of men and women at the gym? Or are the averages too different for us to make this conclusion?
We'll further explain the principles underlying the two-sample t-test in the statistical details section below, but let's first proceed through the steps from beginning to end. We start by calculating our test statistic. This calculation begins with finding the difference between the two averages:
$ 22.29 - 14.95 = 7.34 $
This difference in our samples estimates the difference between the population means for the two groups.
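Next, we calculate the pooled standard deviation. This builds a combined estimate of the overall standard deviation. The estimate adjusts for different group sizes. First, we calculate the pooled variance:
$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(9)(5.32^2) + (12)(6.84^2)}{21} = \frac{254.72 + 561.43}{21} \approx 38.86 $
Next, we take the square root of the pooled variance to get the pooled standard deviation. This is:
$ s_p = \sqrt{38.86} \approx 6.23 $
We now have all the pieces for our test statistic. We have the difference of the averages, the pooled standard deviation and the sample sizes. We calculate our test statistic as follows:
$ t = \frac{\text{difference of averages}}{s_p \sqrt{1/n_1 + 1/n_2}} = \frac{7.34}{6.23 \sqrt{1/10 + 1/13}} = \frac{7.34}{2.62} \approx 2.80 $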
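The steps above can be sketched in a few lines of Python, working only from the summary statistics in the table. The function name `pooled_t_statistic` is illustrative. Because the inputs are rounded to two decimals, the result agrees with the software output shown later only to about two decimal places.

```python
import math

def pooled_t_statistic(n1, mean1, sd1, n2, mean2, sd2):
    """Two-sample t statistic using the pooled (equal-variance) estimate."""
    diff = mean1 - mean2
    # Pooled variance: each group's variance weighted by its degrees of freedom.
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    # Standard error of the difference between the two averages.
    std_err = pooled_sd * math.sqrt(1 / n1 + 1 / n2)
    return diff, pooled_var, pooled_sd, diff / std_err

# Women: n=10, average 22.29, s=5.32. Men: n=13, average 14.95, s=6.84.
diff, pooled_var, pooled_sd, t_stat = pooled_t_statistic(
    10, 22.29, 5.32, 13, 14.95, 6.84
)
print(round(diff, 2), round(pooled_var, 2), round(pooled_sd, 2), round(t_stat, 2))
```

The computed test statistic is about 2.80.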
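To evaluate the difference between the means in order to make a decision about our gym programs, we compare the test statistic to a theoretical value from the t-distribution. This activity involves four steps: (1) we decide on the risk we are willing to take of declaring a difference when there is none, which for the body fat data is 5% (α = 0.05); (2) we calculate the test statistic, which is 2.80 for the body fat data; (3) we find the theoretical value from the t-distribution; and (4) we compare the test statistic to that theoretical value.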
To find this value, we need the significance level (α = 0.05) and the degrees of freedom. The degrees of freedom (df) are based on the sample sizes of the two groups. For the body fat data, this is:
$ df = n_1 + n_2 - 2 = 10 + 13 - 2 = 21 $
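In practice the theoretical value comes from statistical software or a t table. As a rough sketch that uses only the Python standard library, the code below integrates the t density numerically (Simpson's rule) and finds the two-sided cutoff by bisection. The function names and the numerical method are assumptions made for illustration, not how any particular package computes the value.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def t_critical(alpha, df):
    """Two-sided critical value: the point that leaves alpha/2 in each tail."""
    lo, hi = 0.0, 100.0
    target = 1 - alpha / 2
    for _ in range(60):  # bisection on the monotone CDF
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

df = 10 + 13 - 2
critical = t_critical(0.05, df)
print(df, round(critical, 3))
```

For 21 degrees of freedom and α = 0.05, the result is close to the tabulated value of 2.080.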
Let’s look at the body fat data and the two-sample t-test using statistical terms.
Our null hypothesis is that the underlying population means are the same. The null hypothesis is written as:
$ H_0: \mu_1 = \mu_2 $
The alternative hypothesis is that the means are not equal. This is written as:
$ H_a: \mu_1 \neq \mu_2 $
We calculate the average for each group, and then calculate the difference between the two averages. This is written as:
$ \bar{x}_1 - \bar{x}_2 $
We calculate the pooled standard deviation. This assumes that the underlying population variances are equal. The pooled variance formula is written as:
$ s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} $
The formula shows the sample size for the first group as $ n_1 $ and the second group as $ n_2 $. The standard deviations for the two groups are $ s_1 $ and $ s_2 $. This estimate allows the two groups to have different numbers of observations. The pooled standard deviation is the square root of the pooled variance and is written as $ s_p $.
What if your sample sizes for the two groups are the same? In this situation, the pooled estimate of variance is simply the average of the variances for the two groups:
$ s_p^2 = \frac{s_1^2 + s_2^2}{2} $
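A quick numerical check of this special case is sketched below. The equal group size `n` is hypothetical, and the standard deviations are borrowed from the body fat example purely for illustration.

```python
def pooled_variance(n1, sd1, n2, sd2):
    """Pooled variance: variances weighted by each group's degrees of freedom."""
    return ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)

n = 12  # hypothetical: both groups have the same size
sd1, sd2 = 5.32, 6.84
pooled = pooled_variance(n, sd1, n, sd2)
simple_average = (sd1**2 + sd2**2) / 2
print(round(pooled, 4), round(simple_average, 4))
```

With equal group sizes the weights (n - 1) cancel, so both expressions give the same value.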
The test statistic is calculated as:
$ t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{1/n_1 + 1/n_2}} $
The numerator of the test statistic is the difference between the two group averages. It estimates the difference between the two unknown population means. The denominator is an estimate of the standard error of the difference between the two unknown population means.
Technical Detail: For a single mean, the standard error is $ s/\sqrt{n} $. The formula above extends this idea to two groups that use a pooled estimate for s (standard deviation), and that can have different group sizes.
We then compare the test statistic to a t value with our chosen alpha value and the degrees of freedom for our data. Using the body fat data as an example, we set α = 0.05. The degrees of freedom (df) are based on the group sizes and are calculated as:
$ df = n_1 + n_2 - 2 = 10 + 13 - 2 = 21 $
The formula shows the sample size for the first group as $ n_1 $ and the second group as $ n_2 $. Statisticians write the t value with α = 0.05 and 21 degrees of freedom as:
$ t_{0.05, 21} $
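The t value with α = 0.05 and 21 degrees of freedom is 2.080. There are two possible results from our comparison. If the absolute value of the test statistic is less than or equal to 2.080, we fail to reject the null hypothesis of equal means and conclude that the data do not provide evidence of a difference. If the absolute value of the test statistic is greater than 2.080, we reject the null hypothesis of equal means and conclude that the population means differ.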
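The comparison can be written as a small helper, sketched below. The name `decide` is hypothetical; the values 2.80 and 2.080 are the test statistic and t value from the body fat example.

```python
def decide(t_stat, t_crit):
    """Two-sided decision rule: compare |t| with the critical t value."""
    if abs(t_stat) > t_crit:
        return "reject the null hypothesis of equal means"
    return "fail to reject the null hypothesis of equal means"

decision = decide(2.80, 2.080)
print(decision)
```

Taking the absolute value makes the rule two-sided, so a large negative statistic leads to the same decision as a large positive one.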
When the variances for the two groups are not equal, we cannot use the pooled estimate of standard deviation. Instead, we take the standard error for each group separately. The test statistic is:
$ t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}} $
The numerator of the test statistic is the same. It is the difference between the averages of the two groups. The denominator is an estimate of the overall standard error of the difference between means. It is based on the separate standard error for each group.
The degrees of freedom calculation for the t value is more complex with unequal variances than equal variances and is usually left up to statistical software packages. The key point to remember is that if you cannot use the pooled estimate of standard deviation, then you cannot use the simple formula for the degrees of freedom.
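For readers who want to see the calculation, the sketch below computes the unequal-variance test statistic together with the Welch-Satterthwaite approximation for the degrees of freedom. That approximation is the standard one for this test, but it is an assumption here that it is the formula the software uses. The function name is illustrative, and the inputs are the rounded summary statistics from the body fat table.

```python
import math

def welch_t_and_df(n1, mean1, sd1, n2, mean2, sd2):
    """Unequal-variance t statistic and approximate degrees of freedom."""
    se1_sq = sd1**2 / n1  # squared standard error, group 1
    se2_sq = sd2**2 / n2  # squared standard error, group 2
    t_stat = (mean1 - mean2) / math.sqrt(se1_sq + se2_sq)
    # Welch-Satterthwaite approximation (usually not a whole number).
    df = (se1_sq + se2_sq) ** 2 / (
        se1_sq**2 / (n1 - 1) + se2_sq**2 / (n2 - 1)
    )
    return t_stat, df

welch_t, welch_df = welch_t_and_df(10, 22.29, 5.32, 13, 14.95, 6.84)
print(round(welch_t, 2), round(welch_df, 2))
```

The degrees of freedom come out just under 21, which is why they cannot be obtained from the simple formula $ n_1 + n_2 - 2 $.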
The normality assumption is more important when the two groups have small sample sizes than for larger sample sizes.
Normal distributions are symmetric, which means they are “even” on both sides of the center. Normal distributions do not have extreme values, or outliers. You can check these two features of a normal distribution with graphs. Earlier, we decided that the body fat data was “close enough” to normal to go ahead with the assumption of normality. The figure below shows a normal quantile plot for men and women, and supports our decision.
Figure 2: Normal quantile plot of the body fat measurements for men and women
You can also perform a formal test for normality using software. The figure above shows results of testing for normality with JMP software. We test each group separately. Both the test for men and the test for women show that we cannot reject the hypothesis of a normal distribution. We can go ahead with the assumption that the body fat data for men and for women are normally distributed.
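The coordinates of a normal quantile plot can be sketched as follows: each sorted data value is paired with the standard normal quantile for its plotting position. The plotting position (i - 0.5)/n used here is one common convention and may differ from what JMP uses. The data values and the helper name are hypothetical.

```python
from statistics import NormalDist

def normal_quantile_points(values):
    """Pair each sorted value with its theoretical standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # With a 0-based index i, (i + 0.5) / n is the (rank - 0.5) / n position.
    return [(std_normal.inv_cdf((i + 0.5) / n), x) for i, x in enumerate(ordered)]

# Hypothetical measurements (not the article's data).
points = normal_quantile_points([19.4, 25.1, 22.0, 17.8, 23.3])
for theoretical, observed in points:
    print(round(theoretical, 3), observed)
```

If the points fall roughly on a straight line when plotted, the data are consistent with a normal distribution.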
Testing for unequal variances is complex. We won’t show the calculations in detail, but will show the results from JMP software. The figure below shows results of a test for unequal variances for the body fat data.
Figure 3: Test for unequal variances for the body fat data
Without diving into details of the different types of tests for unequal variances, we will use the F test. Before testing, we decide to accept a 10% risk of concluding the variances are unequal when they are in fact equal. This means we have set α = 0.10.
Like most statistical software, JMP shows the p-value for a test. This is the probability, assuming the null hypothesis is true, of finding a test statistic at least as extreme as the one observed. It’s difficult to calculate by hand. For the figure above, with the F test statistic of 1.654, the p-value is 0.4561. This is larger than our α value: 0.4561 > 0.10. We fail to reject the hypothesis of equal variances. In practical terms, we can go ahead with the two-sample t-test with the assumption of equal variances for the two groups.
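As a rough check on these numbers, the sketch below forms the F ratio as the larger sample variance divided by the smaller and integrates the F density numerically to get a two-sided p-value. Several things here are assumptions: the function names, the use of Simpson's rule, the definition of the two-sided p-value as twice the smaller tail, and the restriction to more than 2 numerator degrees of freedom. The inputs are the rounded standard deviations from the table, so the results differ slightly from the software output.

```python
import math

def f_pdf(x, d1, d2):
    """Density of the F distribution; assumes d1 > 2 so the density is 0 at x = 0."""
    if x <= 0:
        return 0.0
    log_beta = (math.lgamma(d1 / 2) + math.lgamma(d2 / 2)
                - math.lgamma((d1 + d2) / 2))
    log_pdf = (0.5 * (d1 * math.log(d1 * x) + d2 * math.log(d2)
                      - (d1 + d2) * math.log(d1 * x + d2))
               - math.log(x) - log_beta)
    return math.exp(log_pdf)

def f_cdf(x, d1, d2, steps=2000):
    """P(F <= x) by Simpson's rule on [0, x]; steps must be even."""
    h = x / steps
    total = f_pdf(0.0, d1, d2) + f_pdf(x, d1, d2)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f_pdf(i * h, d1, d2)
    return total * h / 3

def f_test_two_sided(sd_a, n_a, sd_b, n_b):
    """F ratio (larger variance on top) and a two-sided p-value."""
    if sd_a >= sd_b:
        ratio, d1, d2 = sd_a**2 / sd_b**2, n_a - 1, n_b - 1
    else:
        ratio, d1, d2 = sd_b**2 / sd_a**2, n_b - 1, n_a - 1
    upper_tail = 1 - f_cdf(ratio, d1, d2)
    return ratio, min(1.0, 2 * min(upper_tail, 1 - upper_tail))

f_ratio, f_p_value = f_test_two_sided(6.84, 13, 5.32, 10)
print(round(f_ratio, 3), round(f_p_value, 2))
```

The p-value is well above α = 0.10, in line with the decision not to reject the hypothesis of equal variances.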
Using a visual, you can check to see if your test statistic is a more extreme value in the distribution. The figure below shows a t-distribution with 21 degrees of freedom.
Figure 4: t-distribution with 21 degrees of freedom and α = .05
Since our test is two-sided and we have set α = .05, the figure shows that the value of 2.080 “cuts off” 2.5% of the distribution in each of the two tails. Only 5% of the distribution overall is further out in the tails than 2.080. Because our test statistic of 2.80 is beyond the cut-off point, we reject the null hypothesis of equal means.
The figure below shows results for the two-sample t-test for the body fat data from JMP software.
Figure 5: Results for the two-sample t-test from JMP software
The results for the two-sample t-test that assumes equal variances are the same as our calculations earlier. The test statistic is 2.79996. The software shows results for a two-sided test and for one-sided tests. The two-sided test is what we want (Prob > |t|). Our null hypothesis is that the mean body fat for men and women is equal. Our alternative hypothesis is that the mean body fat is not equal. The one-sided tests are for one-sided alternative hypotheses – for example, for an alternative hypothesis that mean body fat for men is less than that for women.
The software shows a p-value of 0.0107. We decided on a 5% risk of concluding the mean body fat for men and women are different when they are not, and 0.0107 is less than 0.05. We can therefore reject the hypothesis of equal mean body fat for the two groups and conclude that we have evidence body fat differs in the population between men and women. It is important to decide on the risk level before doing the statistical test.
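The two-sided p-value can be approximated without a statistics package by integrating the t density numerically, as sketched below. The function names and the use of Simpson's rule are assumptions made for illustration. The test statistic and degrees of freedom are the ones reported above.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] plus symmetry."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area

def two_sided_p_value(t_stat, df):
    """Probability of a t value at least as far from zero as t_stat."""
    return 2 * (1 - t_cdf(abs(t_stat), df))

p_value = two_sided_p_value(2.79996, 21)
print(round(p_value, 4))
```

The result should agree with the reported p-value of 0.0107 to about three decimal places.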
The figure also shows the results for the t-test that does not assume equal variances. This test does not use the pooled estimate of the standard deviation. As was mentioned above, this test also has a complex formula for degrees of freedom. You can see that the degrees of freedom are 20.9888. The software shows a p-value of 0.0086. Again, with our decision of a 5% risk, we can reject the null hypothesis of equal mean body fat for men and women.
If you have more than two independent groups, you cannot use the two-sample t-test. You should use a multiple comparison method. ANOVA, or analysis of variance, is one such method. Other multiple comparison methods include the Tukey-Kramer test of all pairwise differences, analysis of means (ANOM) to compare group means to the overall mean or Dunnett’s test to compare each group mean to a control mean.
If your sample size is very small, it might be hard to test for normality. In this situation, you might need to use your understanding of the measurements. For example, for the body fat data, the trainer knows that the underlying distribution of body fat is normally distributed. Even for a very small sample, the trainer would likely go ahead with the t-test and assume normality.
What if you know the underlying measurements are not normally distributed? Or what if your sample size is large and the test for normality is rejected? In this situation, you can use nonparametric analyses. These types of analyses do not depend on an assumption that the data values are from a specific distribution. The Wilcoxon rank sum test is a nonparametric alternative to the two-sample t-test.
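To give a sense of how the rank sum test works, the sketch below computes only its statistic: all values from both groups are ranked together (tied values share the average rank), and the ranks belonging to the first group are summed. The helper name and the data are hypothetical, and finding the p-value for the statistic is left to statistical software.

```python
def rank_sum(group_a, group_b):
    """Sum of the pooled ranks that belong to group_a (ties get average ranks)."""
    combined = sorted(
        (value, idx) for idx, value in enumerate(list(group_a) + list(group_b))
    )
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        average_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[combined[k][1]] = average_rank
        i = j + 1
    return sum(ranks[: len(group_a)])

# Hypothetical measurements (not the article's data).
w_stat = rank_sum([21.0, 24.2, 19.5], [14.5, 16.1, 22.3])
print(w_stat)
```

Because the statistic depends only on the ordering of the values, it is not affected by how far an extreme value lies from the rest of the data.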