By the Numbers: Ordered up wrong - the importance of using the correct statistical test

Editor’s note: Stephen J. Hellebusch is president of Hellebusch Research and Consulting, Inc., Cincinnati.

One phenomenon that has always baffled me is why anyone would hire an expert to work on a project and then order the expert to ignore his/her knowledge and do it wrong. A recent experience drove home the confusion in a pointed way.

As many marketing researchers familiar with statistics are aware, you use a different statistical test when you have three or more groups than you do when you have only two. The automatic statistical testing that is very helpful in survey data tables does NOT “know” this, and cheerfully uses the two-group test in every situation, regardless. Each statistical test addresses a slightly different question, and the question is very important in the selection of the correct test to use.

In a recent “pick-a-winner” study, we had three independent groups, each one based on a different version of a concept - Concepts A, B and C. We used a standard purchase intent scale (definitely will buy, probably will buy, might or might not buy, probably will not buy, definitely will not buy), and the question was: How do we test to see if there is any difference in consumer reaction to the three?

The first test was analysis of variance (ANOVA), which addresses the question: Do the mean values of these three groups differ? We used weights of 5 (for “definitely will buy”), 4, 3, 2 and 1 (for “definitely will not buy”) and generated mean values for each of the three concepts. The ANOVA showed that the means did not differ significantly at the 90 percent confidence level, which leads to the conclusion that consumer reaction to these three concepts on purchase intent does not differ, on average.
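For readers who want to see the arithmetic, here is a minimal Python sketch of a one-way ANOVA on weighted purchase-intent scores. The counts are hypothetical, made up for illustration, and are not the study's data. The function names are also our own. The p-value shortcut works only for three groups (numerator degrees of freedom of 2); any other number of groups would need a statistics library.

```python
# Hypothetical counts per scale point; these numbers are NOT from the study.
# Order: definitely will buy (5) down to definitely will not buy (1).
WEIGHTS = [5, 4, 3, 2, 1]
counts = {
    "A": [20, 30, 25, 15, 10],
    "B": [14, 24, 30, 20, 12],
    "C": [17, 27, 28, 17, 11],
}


def weighted_mean(freqs):
    """Mean purchase-intent score for one concept, from its scale counts."""
    return sum(w * f for w, f in zip(WEIGHTS, freqs)) / sum(freqs)


def one_way_anova(groups):
    """Return (F, df_between, df_within) for groups given as count lists."""
    sizes = [sum(g) for g in groups]
    means = [weighted_mean(g) for g in groups]
    total_n = sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / total_n
    # Variation of the group means around the grand mean
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    # Variation of individual scores around their own group mean
    ss_within = sum(
        f * (w - m) ** 2
        for g, m in zip(groups, means)
        for w, f in zip(WEIGHTS, g)
    )
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


def f_p_value_three_groups(f_stat, df_within):
    """Upper-tail p-value of F; closed form valid ONLY when df_between == 2."""
    return (df_within / (df_within + 2.0 * f_stat)) ** (df_within / 2.0)


means = {name: weighted_mean(f) for name, f in counts.items()}
f_stat, df_b, df_w = one_way_anova(list(counts.values()))
p_value = f_p_value_three_groups(f_stat, df_w)
significant_at_90 = p_value < 0.10  # 90 percent confidence level
print(means, round(f_stat, 3), round(p_value, 3), significant_at_90)
```

With these made-up counts the three means look different on paper, yet the F test does not reach significance at the 90 percent confidence level, which is the same kind of result described above.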

At this point in the project, the client was displeased, and told us to test using the “definitely/probably will buy” percentages (the top two box). This is another testing option that makes sense. The chi-square test addresses the question: Do these three percentages differ? It is the proper test to use when comparing percentages across three or more independent groups of people. We conducted it, and learned that the percentages did not differ significantly at the 90 percent confidence level. It told us that, with respect to positive purchase interest, consumer reaction to all three concepts, as measured by the top two box, was the same.
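The chi-square calculation for three top-two-box percentages can be sketched in a few lines of Python. Again, the counts below are hypothetical and are not the study's data, and the function name is ours. With three groups the test has 2 degrees of freedom, and in that special case the p-value reduces to a simple exponential. Other group counts would need a statistics library.

```python
import math

# Hypothetical data (NOT from the study): (top-two-box count, base size)
top_two = {"A": (50, 100), "B": (38, 100), "C": (44, 100)}


def chi_square_top_box(groups):
    """Pearson chi-square for a k x 2 table: top two box vs. everyone else."""
    total_hits = sum(hits for hits, _ in groups)
    total_n = sum(n for _, n in groups)
    pooled = total_hits / total_n  # overall top-two-box rate across all groups
    chi2 = 0.0
    for hits, n in groups:
        cells = ((hits, n * pooled), (n - hits, n * (1 - pooled)))
        for observed, expected in cells:
            chi2 += (observed - expected) ** 2 / expected
    return chi2, len(groups) - 1


chi2, df = chi_square_top_box(list(top_two.values()))
# Closed form valid ONLY for df == 2 (three groups)
p_value = math.exp(-chi2 / 2)
significant_at_90 = p_value < 0.10
print(round(chi2, 3), df, round(p_value, 3), significant_at_90)
```

The point of the sketch is that all three groups enter one calculation and produce one answer to one question. For these invented counts, 50, 38 and 44 percent do not differ significantly at the 90 percent confidence level.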

The wrong test

The client was displeased. Having conducted the testing himself, he learned that Concept B was significantly lower than Concept A, both in the top two box and in the top box, at the 90 percent confidence level. He told us not to use the chi-square, but to use the test the data tables use. The Z test addresses the question: Do these two percentages differ? When it is misused in a situation where there are three or more groups, each comparison looks at only two groups at a time, so the determination is made after the data from the remaining group have been thrown out. To please the client, we conducted multiple Z tests and determined that there were no statistically significant differences between any of the three pairs (A vs. B; A vs. C; B vs. C) at the 90 percent confidence level. The client had another person in his department conduct the test, and that testing showed, as the client’s had, that the top two box for A was significantly higher than B’s at the 90 percent confidence level.

Fairly confused at this point, we ran the data tables, which showed, exactly as the client said, that A was significantly higher than B at the 90 percent confidence level, both on the top two box and on top box percentages.

The less-preferred formula

We then conducted the three tests by hand, and compared our Z values with the client’s. We learned that the client, his department mate, and the statistical testing in the survey table program all used the less-preferred Z test formula. There are two versions of this test. One of them does not use the recommended correction for continuity. This, essentially, is a very small adjustment that should be made because the basic formula assumes a continuous variable (peanut butter) and we are actually working with a discrete variable - people (peanuts; the count of respondents making up the top two box). Normally, it makes no difference in the results, because it is so small. In this case, however, it made the difference between crossing the line into significance and not crossing it.
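To show how small the adjustment is, and how it can still tip a borderline result, here is a Python sketch of the two-group Z test computed both ways. The counts are hypothetical and are not the study's data. The correction shown is one common form (subtracting half of 1/n1 + 1/n2 from the absolute difference in proportions). We are assuming this form for illustration, and the function name is ours.

```python
import math


def two_proportion_z(hits1, n1, hits2, n2, continuity=False):
    """Two-sided Z test for two independent proportions (pooled variance)."""
    p1, p2 = hits1 / n1, hits2 / n2
    pooled = (hits1 + hits2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    diff = abs(p1 - p2)
    if continuity:
        # Assumed form of the correction: shrink the gap by half a
        # respondent's worth of percentage in each group.
        diff = max(0.0, diff - 0.5 * (1 / n1 + 1 / n2))
    z = diff / se
    p_value = math.erfc(z / math.sqrt(2))  # two-sided normal tail area
    return z, p_value


# Hypothetical top-two-box counts (NOT from the study): A = 50/100, B = 38/100
z_plain, p_plain = two_proportion_z(50, 100, 38, 100)
z_corrected, p_corrected = two_proportion_z(50, 100, 38, 100, continuity=True)

Z_CRITICAL_90 = 1.645  # two-sided cutoff at the 90 percent confidence level
print(round(z_plain, 3), z_plain > Z_CRITICAL_90)
print(round(z_corrected, 3), z_corrected > Z_CRITICAL_90)
```

For these invented counts, the uncorrected Z value lands just above the 90 percent cutoff and the corrected one lands just below it. Same data, same test, opposite conclusions.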

With that resolved, we discussed the client’s desire to test every row of the scale, with the wrong statistical test, using the less-preferred formula. We were told that the client always does this, and that we should do so. So, we did.

The wrong way

This procedure violates the fundamental logic behind testing. By testing the top two box, we have tested the difference between these three concepts on this scale. When we test the five rows of the scale (and various other combinations of rows), using multiple Z tests, the probabilities are so distorted that it is doubtful anyone knows the confidence level at which we are really operating.
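A rough sense of the distortion can be had from a back-of-the-envelope calculation, sketched here in Python. It assumes the tests are independent, which rows of the same scale certainly are not, so the figure is only illustrative; the true error rate is unknown, which is exactly the problem. The function name and the count of 15 tests (three pairs times five rows) are our own framing.

```python
def familywise_error(alpha, n_tests):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


alpha = 0.10           # 90 percent confidence level per test
n_pairs = 3            # A vs. B, A vs. C, B vs. C
n_rows = 5             # each row of the purchase intent scale
n_tests = n_pairs * n_rows

fwe = familywise_error(alpha, n_tests)
# Bonferroni: the per-test alpha needed to hold the overall rate near 0.10
bonferroni_alpha = alpha / n_tests

print(n_tests, round(fwe, 3), round(bonferroni_alpha, 4))
```

Under the independence assumption, 15 tests at a nominal 90 percent confidence level carry roughly a four-in-five chance of at least one false "significant" difference. Adding top box and top two box combinations pushes the number of tests, and the distortion, higher still.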

So, we successfully used the less appropriate formula with the wrong test and followed the wrong procedure for testing. We remain baffled.