Editor's note: William S. Farrell, Ph.D., is director of marketing at Sociometrics Corp. in Los Altos, Calif.

I teach market research as well as conduct it, and when I come to the part of the course where significance testing enters the picture, it's never clear who is more worried - me or the students. We're worried about the same thing, of course: the difficulty of teaching (learning) the dauntingly complicated theory underlying significance testing. There are problems even when I try to avoid most of the theory - normal distribution, central limit theorem, etc. - and go with a "cookbook" approach.

I usually have my students analyze data using a spreadsheet package such as Excel, since few of them have access to a statistical package. As soon as they try to run their first t-test, however, they are forced to make decisions about "homoscedastic" vs. "heteroscedastic," among other things. And even if they are fortunate enough to have access to a true statistical package like SPSS, they don't know which of two p values to use for the t-test until they understand something about "Levene's F test for equality of variances."

Is it any wonder that my students react to statistics the way they react to Freddy Krueger? Fortunately, help is on the way (for practitioners as well as students), in the form of something known as the bootstrap technique.

I'll introduce it by way of an example. Let's say we're rolling a pair of dice (you didn't think you'd get through a statistics article without reading about dice, did you?) and we're curious about how often a seven will show up. We could answer the question with a formula from probability theory - if we remembered the formula - or we could do it another way.

First, we'd count how many ways there are to roll a seven: 1-6, 2-5, 3-4, 4-3, 5-2, 6-1 - six ways in all. Then we'd count the total number of ways two dice could come up: 1-1, 1-2, 1-3, etc. I'll spare you the list - there are 36 ways altogether.

So there's our answer: we simply divide six (ways to get a seven) by 36 (total combinations) and find that a seven should come up about 17 percent of the time, on average. You can bet on it.
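
If you'd like to see that counting spelled out, here is a minimal sketch in Python (my choice of language, nothing about the technique requires it) that enumerates all 36 outcomes and, for good measure, checks the answer with brute-force simulated rolls:

    import random
    from itertools import product

    # Enumerate every way two dice can land: 36 equally likely outcomes.
    outcomes = list(product(range(1, 7), repeat=2))
    sevens = [pair for pair in outcomes if sum(pair) == 7]
    print(len(sevens), "/", len(outcomes), "=", len(sevens) / len(outcomes))  # 6 / 36, about 0.17

    # The same answer, approximated by simulated rolls rather than enumeration.
    rolls = 100_000
    hits = sum(random.randint(1, 6) + random.randint(1, 6) == 7 for _ in range(rolls))
    print(hits / rolls)  # hovers around 0.17

The second half of the sketch is the spirit of the bootstrap in miniature: when we can't (or don't want to) enumerate every possibility, we let repeated random draws stand in for the full list.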

How does this relate to significance testing? Let's look at a hypothetical example more directly relevant to market research. Say you've just conducted your annual customer satisfaction survey and you find that customers in the Northeast give you a 9.2 rating on a 10-point scale, while customers in the South give you an 8.5 rating. You'd like to know if the difference of 0.7 is statistically significant.

One (good) way of re-stating your question is as follows: if chance factors alone were at work, how often would you get a difference as large as 0.7 between the means for these two groups of customers? That question can be answered using a traditional t-test, or we could apply the bootstrap method in a way that's analogous to what we just did with the dice. Theoretically, we'd list all possible ways your customers could have responded, then we'd calculate the proportion of those in which the difference between sample means was equal to or greater than 0.7.

Practically, we'd do something like this: let's say you have responses from 93 customers in the Northeast and 58 customers in the South. We'd put all 151 numbers into a pot; draw a sample of 93 with replacement and calculate the mean; draw a sample of 58 with replacement and calculate that mean; calculate the difference between the two means; and then store that difference. This process would be repeated perhaps a thousand times. When we were done, we'd calculate the proportion of differences that equaled or exceeded 0.7.
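
To make that recipe concrete, here is a minimal sketch of how it might look in Python. The Northeast and South ratings below are made-up placeholders - you would substitute your actual 93 and 58 responses - and I've compared absolute differences so the count corresponds to a two-tailed test:

    import random

    def bootstrap_p_value(group_a, group_b, iterations=1000, seed=1):
        # Pool both groups into one pot, resample with replacement at the
        # original group sizes, and count how often the resampled difference
        # in means is at least as large as the one actually observed.
        random.seed(seed)
        observed = abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))
        pot = group_a + group_b                              # all 151 numbers in one pot
        extreme = 0
        for _ in range(iterations):
            sample_a = random.choices(pot, k=len(group_a))   # 93 draws, with replacement
            sample_b = random.choices(pot, k=len(group_b))   # 58 draws, with replacement
            diff = abs(sum(sample_a) / len(sample_a) - sum(sample_b) / len(sample_b))
            if diff >= observed:
                extreme += 1
        return extreme / iterations                          # the bootstrap p value

    # Hypothetical placeholder ratings standing in for the survey responses.
    northeast = [random.choice([8, 9, 9, 10, 10]) for _ in range(93)]
    south = [random.choice([7, 8, 8, 9, 10]) for _ in range(58)]
    print(bootstrap_p_value(northeast, south))

Drawing both resamples from the combined pot is what builds the "chance factors alone" premise into the procedure: if the two regions really come from the same population, the shuffled differences show how large a gap could arise by luck alone.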

Though you may find this difficult to accept at first (I certainly did), that proportion is conceptually the same as the p value one could calculate in Excel or SPSS, and is in fact a more valid answer to the question of whether the two groups differ.

The bootstrap p value and the traditional p value are conceptually identical because they both tell us the following: If we repeated the customer satisfaction study many times, and there were no difference between the two populations, we would observe a sample difference of 0.7 or greater in a proportion p of those studies.

The bootstrap p value is more valid than the traditional p value because it doesn't depend on a major assumption underlying traditional significance testing; namely, that the distribution of what we're measuring is normal in the population (or alternatively, that the sample is large enough for the sampling distribution of the mean to be normal).

Alert readers will have noticed that in our hypothetical application of the bootstrap, we looked at only 1,000 shuffles of the customer data, not all possible combinations as we did with the dice. Is this kosher? It is, but the details would take us too far afield. Suffice it to say that in most implementations of the bootstrap, 1,000 to 3,000 iterations (depending on the specific problem) have been shown to produce extremely accurate p values.

Does the bootstrap work in the "real world" of market research? You can bet on it. I recently asked a national sample of physicians to rate, on a 10-point scale, the importance of 25 attributes of a medical device. I wanted to compare the ratings of two subgroups of physicians, to see if one group viewed any of the attributes as differentially important.

One group was much smaller than the other - 47 vs. 131. Despite this difference in sample sizes, SPSS told me that sample variances were equal for the two groups on 22 of the 25 attributes (remember Levene's F test?). For those 22 attributes, the two-tailed p value computed with a bootstrap procedure differed by no more than .006 from the p value calculated by SPSS in a traditional t-test. This was reassuring.

For the three attributes where SPSS said the groups had different variances, things got interesting. Differences for two of these attributes were deemed non-significant, both by SPSS and by the bootstrap. For the third attribute, SPSS computed a p value of .049, a value that meets the "standard" criterion for statistical significance. The bootstrap procedure computed a p value of .12 for this attribute - not even close to significant by most people's standards. Which one did I believe? I think you can guess.
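
Readers who want to run the same cross-check on their own data could do something like the following minimal sketch, assuming the Python packages numpy and scipy are available. The physician ratings here are placeholder values; the p values it prints are the ones you would compare against the bootstrap p from the earlier sketch:

    import numpy as np
    from scipy import stats

    # Placeholder arrays standing in for the two physician subgroups (47 vs. 131).
    rng = np.random.default_rng(0)
    small_group = rng.integers(5, 11, size=47).astype(float)
    large_group = rng.integers(5, 11, size=131).astype(float)

    # Levene's test for equality of variances - the gatekeeper SPSS consults.
    levene_stat, levene_p = stats.levene(small_group, large_group)

    # Traditional two-tailed t-test assuming equal variances...
    t_equal, p_equal = stats.ttest_ind(small_group, large_group, equal_var=True)

    # ...and the unequal-variances (Welch) version SPSS reports otherwise.
    t_welch, p_welch = stats.ttest_ind(small_group, large_group, equal_var=False)

    print(levene_p, p_equal, p_welch)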

The real question is why this technique is only now coming into widespread use, and the answer has a lot to do with computer power. Typical bootstrap significance tests that might take one to five minutes to run on a fast 486 today would have required hours on a fast 286 a decade ago.

You might be wondering why this technique, first described in 1979 by Stanford statistician Bradley Efron, is called the bootstrap. The term is a whimsical reference to the fictional Baron von Munchausen, who is said to have avoided drowning by pulling himself up by his bootstraps from the bottom of a lake. It reflects the notion that analysis is performed without the help of outside agencies, such as the normal distribution.

The bootstrap belongs to a family of closely related techniques that travel under a variety of rubrics, including distribution-free statistics, resampling statistics, exact inference testing and permutation statistics. They all have in common the notion of repeatedly sampling from the original data, calculating a statistic with each sample, and then inspecting the resulting distribution of that statistic.

The technique can be applied to data at all levels of measurement: nominal (categorical), ordinal (ranking), interval and ratio. It can be used to assess significance (p values) and to compute confidence intervals. The technique is not a new one, but it is becoming newly accessible to the vast majority of market researchers whose computing resources lie somewhere between a calculator and a Cray.
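
As an illustration of the confidence-interval side, here is a minimal sketch of the percentile-style bootstrap interval for a group mean, again in plain Python with placeholder data standing in for your own responses:

    import random

    def bootstrap_confidence_interval(data, iterations=2000, confidence=0.95, seed=1):
        # Resample the data with replacement, recompute the mean each time,
        # and read the interval straight off the sorted distribution of means.
        random.seed(seed)
        means = []
        for _ in range(iterations):
            resample = random.choices(data, k=len(data))
            means.append(sum(resample) / len(resample))
        means.sort()
        lower = means[int((1 - confidence) / 2 * iterations)]
        upper = means[int((1 + confidence) / 2 * iterations) - 1]
        return lower, upper

    # Placeholder satisfaction ratings; substitute your own survey data.
    ratings = [9, 8, 10, 7, 9, 9, 8, 10, 9, 7, 8, 9]
    print(bootstrap_confidence_interval(ratings))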

And compared to teaching the normal curve, central limit theorem, etc., I find it much easier to convey what boils down to a three-step process: (1) What's our result? (2) What are all the different results that could have occurred? (3) How many of the possible results equal or exceed ours?

I believe this paradigm will transform the way statistical analysis is taught and conducted. Stay tuned.