Editor’s note: Neil Helgeson is senior methodologist with Winona Research, Minneapolis.

After conducting a study to compare two products or services, there are a number of questions that a researcher could ask when comparing the two along a relevant dimension:

1. Which one is better, and by how much?

2. Did the two differ?

3. How likely is it that the two differ?

4. How likely is it that we obtained the observed difference if the two did not actually differ?

Most researchers would like the answer to #1, would settle for #2 or #3, and would say "Who cares?" about #4. Unfortunately, #4 is the question answered when we use an inferential statistical test on our data. Most users of these tests have no idea that it is really #4 they are answering when they ask for statistical testing.

Inferential statistics are required when we work with samples rather than whole populations. When we use samples, we cannot be certain that the conclusions we reach based upon them accurately reflect the populations from which the samples are drawn. To guide us in using our results, we can perform a statistical test. The approach is called "null hypothesis testing," and while the calculations vary with the specific test we perform, conceptually every test involves the same steps:

1. Assume the null hypothesis is true. The null hypothesis is the opposite of the conclusion we are testing (which we call the alternative hypothesis). For example, if we think that stated purchase intents for product A and product B are different (our alternative hypothesis), our null hypothesis would be that the population purchase intent for product A is equal to the population purchase intent for product B. The null hypothesis and alternative hypothesis must be mutually exclusive and exhaustive - one and only one of them is true. It is important to note that these are statements about population parameters, not sample statistics, as our goal is to draw conclusions about populations. If we wanted to draw conclusions only about the samples, we could simply compare their means directly. That is simple, but we usually do not care about our samples, only about the populations they represent.

2. Calculate a statistic whose distribution is known, given the null hypothesis. The statistic we calculate depends upon the test - t, chi-square (χ²), etc. If we are doing a t-test, we calculate t - a statistic that involves the difference in the sample means, the variability in the samples, and the sample sizes. If the null hypothesis is true, we know how likely it would be to get t values in a given range - the range of interest usually being "as large or larger" than the value we obtained.

3. Determine whether it is reasonable to keep assuming the null hypothesis is true. If data producing t values as extreme as ours would occur only infrequently when the null hypothesis is true, we reject our assumption of the null hypothesis in favor of the alternative hypothesis. Common cutoffs for the probability of obtaining such data before we reject the null hypothesis are 5 percent or 10 percent. This number is called alpha (α); it represents the type I error rate, the probability of rejecting the null hypothesis when it is true.

If we reject our null hypothesis about products A and B at an alpha of .05, it tells us that, if the null hypothesis is true, we would get a statistic (t in our example) as large as or larger than the one we obtained only 5 percent of the time.
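As a concrete illustration, here is a minimal sketch in Python that carries out these steps with an independent-samples t-test from scipy. The purchase-intent ratings are invented for illustration, not drawn from any real study.

# A minimal sketch, assuming hypothetical 5-point purchase-intent ratings.
from scipy import stats

product_a = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5]
product_b = [3, 4, 3, 3, 4, 2, 3, 4, 3, 3]

alpha = 0.05  # the type I error rate we are willing to accept

# Step 2: calculate t, whose distribution is known if the null hypothesis is true.
t_stat, p_value = stats.ttest_ind(product_a, product_b)

# Step 3: decide whether it is reasonable to keep assuming the null hypothesis.
if p_value < alpha:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: reject the null hypothesis")
else:
    print(f"t = {t_stat:.2f}, p = {p_value:.3f}: fail to reject the null hypothesis")

# Note what p_value is not: it is not the probability that the population
# means are equal, and 1 - p_value is not the probability that they differ.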

There are a great many things that it does not tell us, the most frequently mistaken conclusion being:

"There is a 95 percent chance that the products are different."

The prevalence of this misinterpretation can be seen in the use of the phrases "95 percent confidence level" and "90 percent confidence level" rather than "alpha of .05" or "alpha of .10." The conclusion we reach is based upon the probability of getting data like ours if the null hypothesis is true; it is not based on the probability of the null hypothesis being true. This is a crucial distinction. Since our real concern is whether the products differ, it is convenient to assume that this is what the test tells us, but the probability of the products being different is unknowable under most circumstances. From Bayes’ Theorem we know that the probability of the two products being different given our data is:

P(data given the alternative hypothesis) × P(alternative hypothesis)
----------------------------------------------------------------------------------------------------------------------------
P(data given the alternative hypothesis) × P(alternative hypothesis) + P(data given the null hypothesis) × P(null hypothesis)

All of these quantities are unknown to us. While we might be able to estimate reasonable values for the probability of the null hypothesis and the alternative hypothesis, knowing the probability of the data given the alternative hypothesis requires knowing the "real" difference in product means. If we knew that, there would be no reason to do the statistical test! Under normal circumstances, there is no way of knowing the probability of the products being different based on the analysis of experimental data. Any phrasing of the analysis of experimental results that states or implies a certain probability of the products being the same or different is simply inaccurate. If we find two means different at the "95 percent confidence level," all we can say is that, if the null hypothesis were true, a difference as large as or larger than the one we obtained would occur only 5 percent of the time. We are not 95 percent confident that the products differ.
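To make the distinction concrete, the sketch below plugs purely hypothetical numbers into Bayes’ Theorem: an assumed prior probability that the products really differ, and assumed probabilities of obtaining data like ours under each hypothesis. None of these inputs comes from the significance test itself; the point is only that the probability of a real difference depends on them, not just on the p value.

# Purely hypothetical inputs; a significance test supplies none of them.
p_alt = 0.5                # assumed prior probability the products really differ
p_null = 1 - p_alt         # assumed prior probability that they do not
p_data_given_null = 0.05   # chance of data this extreme if the null hypothesis is true
p_data_given_alt = 0.60    # chance of data this extreme if the products differ (unknowable in practice)

# Bayes' Theorem: probability the products differ, given the data.
posterior_alt = (p_data_given_alt * p_alt) / (
    p_data_given_alt * p_alt + p_data_given_null * p_null
)

print(f"P(products differ | data) = {posterior_alt:.2f}")
# With these assumed inputs the result is about .92, not .95, and it
# changes whenever the assumed prior or likelihoods change.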

Another common misinterpretation of the results of a statistical test is that they tell us that:

"The differences are "real," or the findings are "valid."

The finding of statistical significance does not tell us that the differences we observed are "real." We may choose to treat them as real if we find them to be statistically significant, but that is not what is being tested. No matter what the results of a test, there may or may not be a difference, and the difference we observe may or may not accurately reflect the size of that difference. It is important to remember that the means and the differences between means we observe are our best guess of the population means and differences, regardless of the results of any significance testing. If we observe a mean purchase intent of 3.86, 3.86 is our best guess of the population purchase intent, although we realize that the actual value is probably different. Finding that the 3.86 is significantly different from another value does not tell us that the 3.86 is correct, and finding that it is not significantly different does not tell us it is incorrect. The precision of our numbers is not directly addressed by the significance testing.

Another common misinterpretation is that:

"Failure to achieve significance shows that the means are the same."

The observed sample statistics are our best guess of the population parameters. If we find a difference, our best guess is that there is a difference, even if that difference is not significant. Failure to find that a difference is significant may mean that we do not treat the difference as "real," but it does not tell us that there is no difference.

The p values we calculate in reaching a decision about the null hypothesis are not particularly useful in drawing other conclusions. In particular, it is not true that:

"Smaller p values indicate larger differences."

In testing means, the sample size, the variability, and the absolute difference in means all enter into the calculations. If we hold all else constant, increasing the size of a difference will ultimately lower the p value when we check our test statistic, but since other factors enter in as well, the p value should not be used as a measure of the size of the difference.

An example of this misapplication can be seen in a situation where our product is compared to a competitor’s product on a series of dimensions, each dimension measured by a question. It would be possible to statistically test the differences in means on each question and calculate p values for each comparison. It would not be correct to say that our greatest superiority is on those dimensions where we have higher means with the smallest p values, and less superiority on those dimensions with larger p values. While the sizes of the differences do enter into the calculations, larger p values may also be due to greater variability in responses to a question (whether from differing understanding of the question or differing expectations among respondents), or to a reduced sample size when a larger number of respondents fail to answer a question because they do not understand it or believe it does not apply to them.
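The sketch below illustrates the point with hypothetical summary statistics: two comparisons with the same half-point difference in means, where noisier responses and fewer respondents produce a much larger p value, and a third comparison where a smaller difference measured with less variability produces a smaller p value than the noisy half-point comparison.

# Hypothetical summary statistics (mean, standard deviation, sample size per group);
# none of these numbers come from a real study.
from scipy import stats

# Comparison 1: a 0.5-point difference, low variability, 100 respondents per group.
t1, p1 = stats.ttest_ind_from_stats(4.0, 0.8, 100, 3.5, 0.8, 100)

# Comparison 2: the same 0.5-point difference, but noisier responses and
# fewer respondents answering the question.
t2, p2 = stats.ttest_ind_from_stats(4.0, 1.8, 40, 3.5, 1.8, 40)

# Comparison 3: a smaller 0.3-point difference, measured with low variability.
t3, p3 = stats.ttest_ind_from_stats(4.0, 0.6, 100, 3.7, 0.6, 100)

print(f"Same 0.5 difference: p = {p1:.4f} versus p = {p2:.4f}")
print(f"Smaller 0.3 difference: p = {p3:.4f}")
# The 0.3-point difference yields a smaller p value than the noisy 0.5-point
# difference, so p is not a measure of the size of a difference.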

Given the limitations of the questions addressed by significance testing, why use it? We use it because it provides a threshold that keeps us from being constantly buffeted by chance variation due to sampling. We realize that a difference we observe may be due to sampling, and that the populations may not really differ. By requiring statistical significance, we ensure that some threshold of evidence has been reached before we act.

We should not mindlessly apply the testing, but adjust it according to the consequences of the actions we may take. If we will use the results of a study to implement a costly change, we should set our threshold high; an alpha of .01 may be appropriate, to reduce the chance of incorrectly rejecting a true null hypothesis. If the gains to be made are large, we may want to set our threshold lower, with an alpha of .10 or more, to reduce the chance of failing to reject a false null hypothesis. We should consider the consequences of each type of error and set our criterion appropriately.

When the results of testing are irrelevant, we should not test. The results will just confuse us. Suppose we are testing 10 potential new product formulations, with the goal of selecting the best three for further development. Assuming there are no cost differences, etc., whether the third-best is significantly better than the fourth-best is irrelevant. Failure to find statistical significance does not tell us that the third-best is no better than the fourth-best, and should not be used as a reason to choose anything other than the three best-performing formulations.

We should use alternatives to statistical testing when they more directly address our concerns. If we are interested in the precision of our numbers, that is, how close our 3.86 is to the true population purchase intent, we should calculate confidence intervals. Intervals constructed this way will contain the true value a stated percentage of the time; a 95 percent confidence interval of 3.66 to 4.06, for example, comes from a procedure that captures the population mean 95 percent of the time.
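A minimal sketch of that calculation follows; the ratings are invented so that the mean comes out near 3.86, and the interval bounds depend on the invented data, so they will not reproduce the 3.66 to 4.06 example exactly.

# A minimal confidence-interval sketch using hypothetical ratings.
import numpy as np
from scipy import stats

ratings = np.array([4, 4, 3, 5, 4, 3, 4, 5, 4, 3, 4, 4, 5, 2])
mean = ratings.mean()
sem = stats.sem(ratings)  # standard error of the mean

# 95 percent confidence interval based on the t distribution
# (degrees of freedom = n - 1).
low, high = stats.t.interval(0.95, len(ratings) - 1, loc=mean, scale=sem)
print(f"Mean = {mean:.2f}, 95% confidence interval = ({low:.2f}, {high:.2f})")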

If our concern is whether a difference is "meaningful," a measure of association such as eta-squared (η²) is appropriate. These statistics tell us what proportion of the total variance is explained by our manipulation. For example, if we obtained a value of .37 in a test of purchase intent for two products, it tells us that 37 percent of the variability in purchase intent can be explained by which product was being evaluated. This is quite large. On the other hand, if we obtained a value of .01, it tells us that only 1 percent of the variability in purchase intent can be explained by which product is being tested. Large sample sizes make it quite possible to achieve statistical significance with eta-squared values this low or lower.
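For a two-product comparison, eta-squared is simply the between-group sum of squares divided by the total sum of squares. Here is a sketch using the same kind of hypothetical ratings as above.

# Hypothetical eta-squared calculation for a two-product comparison.
import numpy as np

product_a = np.array([4, 5, 3, 4, 4, 5, 3, 4, 4, 5])
product_b = np.array([3, 4, 3, 3, 4, 2, 3, 4, 3, 3])

all_ratings = np.concatenate([product_a, product_b])
grand_mean = all_ratings.mean()

# Between-group sum of squares: how far each product's mean sits from the grand mean.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in (product_a, product_b))
# Total sum of squares: how far every individual rating sits from the grand mean.
ss_total = ((all_ratings - grand_mean) ** 2).sum()

eta_squared = ss_between / ss_total
print(f"eta-squared = {eta_squared:.2f}")
# This is the proportion of the variability in purchase intent explained
# by which product was being evaluated.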

Statistical testing has its place in marketing research, but its proper role is smaller than the role it currently plays. The somewhat convoluted logic of null hypothesis testing fails to provide answers to the questions that interest researchers the most. Failure to understand what these tests really tell us can lead to costly errors in decision making, and can keep us from using the statistics that might provide more meaningful interpretations of our results.