Editor's note: The subject of this month's "Data Use" is a response from Albert Madansky, professor of business administration and director of the Center for International Business Education and Research at the University of Chicago, to two articles on significance that appeared in QMRR: "The Use, Misuse and Abuse of Significance," by Patrick M. Baldasare and Vikas Mittel (November 1994), and "What Is Significance?" by Hank Zucker (March 1994). Following Madansky's comments are replies from Baldasare and Mittel and from Zucker.

The recurrence of articles on the meaning of "significance levels" is clear evidence that the concept is murkily understood. The root cause of this may even hark back to poor (or, worse yet, incorrect) exposition of this concept in statistics textbooks. Indeed, Professor Gerd Gigerenzer of the University of Chicago Psychology Department has collected and published a number of misstatements about this concept in the plethora of "statistics for psychologists" books on the market.

That QMRR has published two articles on the meaning of "significance levels" within a year indicates that this concept is murkily understood by the marketing research profession as well. Unfortunately, both of these articles contain ambiguities which help to further muddy one's understanding of this concept. The purpose of this article is to set the record straight, hopefully in a clear enough fashion to dispel any erroneous notions readers may have about this concept.

To provide a context for my comments, consider the following quotes from "The Use, Misuse and Abuse of Significance," by Baldasare and Mittel (QMRR, November 1994) and "What Is Significance?" by Zucker (QMRR, March 1994).

"A significance level of, say 95 percent merely implies that there is a 5 percent chance of accepting something as being true based on the sample when, in fact, in the population it may be false." (Baldasare & Mittel)

"Given our particular sample size, there is a 5 percent chance that in the population represented by this sample the proportions for Group A and Group B are not different." (Baldasare & Mittel)

"It (statistical significance) only tells us the probability with which a difference found in the sample would not be found in the population." (Baldasare & Mittel)

"Significance levels show you how probably true a result is." (Zucker)

What's troublesome about these statements? First of all, the words "something," "it," and "result" (as underlined by me above), as referents of the adjective "true," are somewhat imprecise, which can lead the reader to erroneous conclusions about what "truth" is being assessed by significance testing. Secondly, when Baldasare and Mittel talk about the probability of finding a characteristic in the population, and Zucker talks about the probability of the truth of a conclusion, they are expressing a common misunderstanding of what the probability statement associated with a significance test is all about.

Let me illustrate with a simple example. Someone hands me a coin, and I'd like to determine whether the coin is fair. The coin either is or isn't a fair coin. At the moment, only God knows for sure (and perhaps so does the person who handed me the coin). But what does the expression "the probability that the coin is fair" mean? Objectively, that probability is either 1 (if the coin is in truth fair) or 0 (otherwise). Subjectively, one can interpret the expression as "What odds would I give that the coin is fair?" But my odds may not be the same as your odds, which is why I dubbed this interpretation "subjective." And I don't think this is what Baldasare, Mittel, and Zucker are talking about when they use the word "probability."

Let's continue with the example. Suppose I toss the coin 100 times and find that I come up with 60 heads. I can ask myself what is the probability of obtaining 60 or more heads in 100 tosses of a fair coin. In the parlance of significance testing, I postulated a null hypothesis (that the coin is fair) and asked what is the probability of my data (or data more inconsistent with the null hypothesis) arising when the null hypothesis is true. That probability is called a "p-value," and is the only probability calculated in the standard significance testing packages. (In my example, the p-value is .0284.)
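For readers who want to verify that figure, here is a minimal sketch of the calculation in Python. It assumes SciPy's binomial routines are available; the variable names are illustrative only, not Madansky's.

    # Sketch: one-sided p-value for 60 or more heads in 100 tosses of a fair coin.
    from scipy.stats import binom

    n_tosses, p_fair = 100, 0.5
    observed_heads = 60

    # P(X >= 60) under Binomial(100, 0.5); sf(k) gives P(X > k), so use k = 59.
    p_value = binom.sf(observed_heads - 1, n_tosses, p_fair)
    print(f"p-value: {p_value:.4f}")   # prints roughly 0.0284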

What's a significance level? To understand this concept, let me continue my story. Before I tossed my coin 100 times, I sat back and planned my analysis. As a professional statistician, I am asked to make a recommendation about whether or not to accept a posited null hypothesis. Only God knows whether the null hypothesis is true or false. Suppose God were to keep a scorecard on my recommendations, but only on those recommendations made when the null hypothesis is true. (God could also keep a separate scorecard of my recommendations when the null hypothesis is false, but we won't look at that scorecard now.) If I want my lifetime percentage of correct calls, given that the null hypothesis is true, to be 95 percent, I will adopt the following procedure:

1. calculate the p-value as defined above

2. if that p-value is at most .05, I will recommend rejecting the null hypothesis; if that p-value is greater than .05, I will recommend accepting the null hypothesis. (In my example, since the p-value was less than .05, I would recommend rejecting the null hypothesis. Indeed, using this procedure I would have recommended rejecting the null hypothesis if I had observed 59 or more heads out of 100, but not 58 or fewer; the sketch following this list works through those cutoffs.)
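The earlier SciPy-based sketch can be extended into this two-step rule, showing where the .05 cutoff falls for nearby head counts (again, the names and layout are illustrative assumptions, not anything from the original article):

    # Sketch of the decision rule: reject the null when the one-sided p-value <= .05.
    from scipy.stats import binom

    ALPHA = 0.05   # significance level used as the cutoff

    def recommend(heads, n_tosses=100, p_fair=0.5):
        """Step 1: compute the p-value.  Step 2: compare it with the cutoff."""
        p_value = binom.sf(heads - 1, n_tosses, p_fair)   # P(X >= heads) if fair
        return p_value, ("reject" if p_value <= ALPHA else "accept")

    for heads in (57, 58, 59, 60):
        p_value, decision = recommend(heads)
        print(f"{heads} heads: p-value = {p_value:.4f} -> {decision} the null")
    # 59 and 60 heads fall below the cutoff; 58 or fewer do not.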

On any one recommendation, I don't know whether I'm right or wrong. I can only tell you that the way I operate I'm right 95 percent of the time when the null hypothesis is true. The level of significance is just the p-value that I use as a cutoff in making my recommendations. The only correct statement about significance levels is the following restatement of one of those by Baldasare and Mittel, namely:

"A significance level of, say, 5 percent merely implies that, given my procedure for making inferences, there is a 5 percent chance of my rejecting a null hypothesis based on the sample when, in fact, in the population it (the null hypothesis) is true."

The concept associated with the Baldasare and Mittel quote, "the chance of accepting a hypothesis as being true based on the sample when, in fact, in the population it is false," is called the operating characteristic of a statistical test. More regularly referred to in the statistics literature is the power of the test, which is defined as 1 minus the operating characteristic, or "the chance of rejecting a hypothesis based on the sample when, in fact, in the population it is false." It is this that is being recorded on God's other scorecard, the one he keeps on the accuracy of my calls when the null hypothesis is false. This latter concept is also important in market research, in that it is the power of the test (and not the level of significance) that determines the required sample size. But this is off the main point of this article, and should itself be the subject of a future article in this publication.
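To make the power idea concrete without straying too far from the coin example, here is a hedged sketch that computes, for several sample sizes, the chance of rejecting the fair-coin hypothesis when the coin in fact lands heads 60 percent of the time. The alternative value of 0.6 and the sample sizes are hypothetical choices for illustration only; the pattern of power rising with the number of tosses is what ties power to the choice of sample size.

    # Sketch: power of the one-sided .05-level test against a hypothetical p = 0.6 coin.
    from scipy.stats import binom

    def power(n_tosses, p_alt=0.6, p_fair=0.5, alpha=0.05):
        # Smallest head count whose p-value under the null is at most alpha ...
        critical = next(k for k in range(n_tosses + 1)
                        if binom.sf(k - 1, n_tosses, p_fair) <= alpha)
        # ... and the chance of reaching that count when the coin is really biased.
        return binom.sf(critical - 1, n_tosses, p_alt)

    for n_tosses in (50, 100, 200, 400):
        print(f"{n_tosses:3d} tosses: power against p = 0.6 is {power(n_tosses):.2f}")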

--Albert Madansky

PATRICK BALDASARE AND VIKAS MITTEL'S REPLY:

We would like to thank Al Madansky for taking the time to carefully read the articles related to statistically discernible differences (a.k.a. statistical significance). While the points made by Madansky warrant consideration, we should point out that the differences between his work and ours are essentially due to a difference in orientation.

We take issue with Madansky's conclusion that publication of two articles in QMRR about significance testing indicates that "this concept is murkily understood by the marketing research profession." While some people within the marketing research field do not fully understand the concept, it is stereotypic thinking to draw judgments about the profession as a whole. Publications such as QMRR serve as vehicles for continuing education among professionals who may not have the time to take formal classes to refresh their skills. By publishing articles on topics that are of practical importance, QMRR and other such publications (1) provide a forum for professionals to brush up their skills and knowledge and (2) remind readers of the importance of basic concepts. Publishing more than one article on a given topic does not suggest ignorance or ambiguity on the profession's part. Rather, it shows the field's penchant for revisiting and reviving basic concepts that are useful.

Second, the words something, it, and result refer to the alternative hypothesis. While phrasing sentences in technical terms such as "the null hypothesis" and/or "the alternative hypothesis" may make the exposition seem more precise, it does not necessarily render it more understandable or readable. In fact, carrying Madansky's recommendation to the extreme, we could express the entire problem in terms of mathematical symbols. While this would make our exposition more technically appealing, it would not necessarily make it more practical or useful. Ironically, it is the same sort of difference that we pointed out between statistical significance and practical significance.

Third, the restatement of our point by Madansky says nothing new. On careful examination we find that his restatement is the same statement as ours, except that he phrases it in terms of the null hypothesis while we phrase it in terms of the alternative hypothesis. Practitioners are more used to thinking in terms of the alternative hypothesis than the null hypothesis. For instance, a manager is more likely to understand the statement that we are looking for differences, rather than the statement that we are trying to gather evidence against the assertion that there are no differences in the population. Therefore, to enhance the readability of the article we phrased our sentences accordingly. Is the glass half empty or is it half full? We believe such a debate only muddies the issues.

Nevertheless, Madansky's note is useful in itself because it describes the philosophy of undertaking a test of significance using a common example. Additionally, he highlights the dilemma practitioners wrestle with on a daily basis: what is the dividing line between technical clarity from a purist's perspective and practical clarity from an end user's viewpoint? This question does not have a right or a wrong answer. We leave the readers to draw their own conclusions.

--Patrick Baldasare and Vikas Mittel

HANK ZUCKER'S REPLY:

Prof. Madansky seems to have misunderstood the aims of my article "What is significance?" A primary aim was to avoid statistics jargon.

One of the reasons that the meaning of "significance levels" is so "murkily understood" (in Madansky's terms) is the unfortunate choice of words statistics professionals use to discuss the underlying concepts. Many statistical terms mean nothing to the non-expert. Worse, others have clear meanings in normal English that have nothing to do with their meanings in statistics. Significance is a prime example. A non-expert hearing or seeing the term significance level would likely think it refers to importance rather than to the chance of erroneously rejecting a null hypothesis. A key aim of my article was to correct this all-too-understandable mistake.

My article was originally written for a newsletter sent to users of our interviewing and tabulation software The Survey System. Our clients include many academics and long-time research professionals, but also many people new to survey research. The article was written primarily for the latter group. I attempted to give the non-expert a clear, generally correct understanding of the term significance, to explain how to read the probability notations provided by statistical packages and to caution the reader that significance tests do not measure all types of errors. Phrases like "something" or "a result" being true may be less precise than "rejecting a null hypothesis," but they are more easily understood by non-experts. As Baldasare and Mittel mention in their response to Madansky, some sacrifice in precision is often worthwhile for the sake of clarity.

Some readers may find Madansky's approach useful. Others may prefer a less jargonistic approach, especially since it allowed an article of similar length to include important information about issues related to statistical significance, not just a definition of the term.

--Hank Zucker