Editor's note: Julius Litman is managing partner of Force Four, Inc., a New City, N.Y., educational consulting firm.

By the mid-20th century, researchers had settled on probability sampling as the preferred technique for accumulating representative survey samples. This not only provided a scientific approach to ensuring representativeness in survey sampling, but also made possible the application of significance testing to the results obtained.

Sixty-plus years later, today's survey research is plagued by low response rates - a serious, worsening, intractable, and exceedingly well-documented phenomenon that has been a salient feature of the survey methods literature since the emergence of polling in the 1930s, and a regular feature of statistical and social science journals since the 1940s.1

Nonresponse - the inability to complete interviews with all qualified members of a sample - is of material importance because it adversely affects our ability to draw representative samples. More to the point: nonresponse diminishes survey research by robbing it of its claim to science. If we cannot trust the samples we draw to be "representative," then survey research cannot really be trusted to be science.

Ultimately this means survey research is destined to become, and be regarded as, more art than science. Lucrative when done well; nothing to be sniffed at; and certainly admired when it succeeds in "reading" an audience...but hardly science.

The challenge confronting today's survey researchers then is to reclaim the science. This article argues that by leveraging the Internet to conduct rapid-fire study replications - reproducing results and yielding converging data - researchers can re-actualize survey research's claim to science.

Survey research science

As practiced today, what - if anything - is scientific about survey research?

It's not the questions we ask nor the manner in which we ask them. It's certainly not the response alternatives we provide or the way survey research data are tabulated. Nor is it the analytical schemes or statistical tests we apply.

None of these is in any way scientific. All are matters more of art and judgment than science. They are almost always driven by budget constraints, and entirely too often by tradition, as in: "Well...this is the way we've always done it."

The one thing, the only thing, that imbues survey research with science is: sampling. It is in our insistence that a sample be scientifically drawn to be representative that science makes an appearance at all in the survey researcher's work.

If sampling is the only scientific aspect of survey research, then today's problems of nonresponse make it difficult for survey research to claim a foundation in science. Stated somewhat differently, if the "representativeness" of our samples is in doubt, then what science there is in survey research...evaporates.

So if the science in survey research can no longer come from adherence to probability sampling, is there some other aspect to the scientific method on which survey researchers might depend?

Fortunately, there is: the notion of reproducibility. Better still, we now have the technology to cost-effectively reproduce survey research findings, thus enabling researchers to re-establish survey research's grounding in science.

Replication and reproducibility

The Internet makes it possible for survey researchers to cost-effectively and rapidly replicate their studies. Because of the speed and lower cost with which interviews can be accumulated, we can harness the Internet to reproduce study findings rapidly and repeatedly.

The point is that instead of relying on sampling to ensure the science in survey research, we can leverage our ability to rapidly replicate research findings. This ability to reproduce findings goes a long way to assuring scientific credibility.

There are things, of course, reproducibility cannot do. It cannot, for example, guarantee valid results. Nor can it solve the problem of self-selection. Reproducibility in and of itself cannot cure all the ills of survey research. It is not a panacea. It is, however, a robust mechanism for self-correction and the enforcement of performance standards.

What reproducibility does do is assist in exposing spurious findings. It assuages our concerns about the reliability of our survey results by producing - or not producing - converging data. It facilitates decision making based on repeated trials and, most importantly, weans us from our reliance on the design and execution of single, one-shot, "definitive" studies.

Sampling

Today's survey researcher relies, in the main, on two methods of sampling: quota and probability sampling.

In quota sampling, samples are made representative through stratification. This stratification is typically achieved through the calibration of such control factors as gender, age, geographic distribution, race, ethnicity, education, religion, occupation, political affiliation, income, marital status, number of children, and so on.

Despite our reliance on quota sampling, there are a number of drawbacks to its use. It presumes an accurate estimate of the incidence of the various control factors in the population from which the sample is being drawn. It assumes, as well, a robust understanding of the control factors themselves, including their relevance to the unknown characteristic being measured.

Then there are the practical difficulties associated with attempting to mesh several controls together. This is especially true when attempting to stratify within a stratum, e.g., controlling for specific age cohort distributions by gender.
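To make the idea of interlocking controls concrete, here is a minimal Python sketch of quota cell targets. The gender-by-age population shares are illustrative assumptions of mine, not census figures, and the cell structure is only one of many a researcher might use.

# Minimal sketch of interlocking quota targets (illustrative shares, not real data).
# Each cell is a gender-by-age stratum; the targets are the completes a quota
# sample of 1,000 would need in order to mirror the assumed population.
population_shares = {
    ("female", "18-34"): 0.14, ("female", "35-54"): 0.18, ("female", "55+"): 0.19,
    ("male", "18-34"): 0.15, ("male", "35-54"): 0.17, ("male", "55+"): 0.17,
}
sample_size = 1000

quota_targets = {cell: round(share * sample_size)
                 for cell, share in population_shares.items()}

for (gender, age), target in sorted(quota_targets.items()):
    print(f"{gender:>6} {age:>6}: {target} completes")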

Finally there is the uneasy feeling that a sample's forced similarity with a population generates undesirable byproducts of the stratification process itself. In short, we become concerned that the more we try to control a sample, the more distorted it becomes along other relevant but unseen dimensions.

By the early 1950s, therefore, survey researchers' thinking about sampling coalesced around the notion of the probability sample2, i.e., a sample in which the selection of all respondents is by random methods such that every qualified member of the population has a known chance of being included in the sample. Or more precisely, in the words of Frankel and Frankel:

"Probability sampling (is) the process of selecting elements or groups of elements from a well-defined population by a procedure which gives each element in the population a calculable non-zero probability of inclusion in the sample. (Where) the phrase 'calculable non-zero probability of inclusion' means that every element in the population has some chance of being included in the sample and that the probability of its inclusion can be determined."3

The problem of nonresponse

Much has been said about nonresponse. The purpose here is not to belabor that literature, bemoan the causes of nonresponse, besmirch the tactics designed to improve response rates, nor decry compensating for nonresponse by data imputation or weighting. Nonetheless, several points about the historical treatment of response rates, and survey researchers' reactions to them, are in order.

As exhaustively documented by Tom Smith of NORC4, response rates are seldom reported. The likelihood that published response rates are absent because they are uniformly robust seems slim. It is more likely that survey researchers avoid reporting response rates because these are so lackluster.

In Smith's own words: "Overall, reporting of nonresponse is a rarity in the mass media and in public polls. Nonresponse is more regularly documented in academic and governmental studies, but is still sporadic at best (emphasis added). Cites in books and project reports also are very uncommon (emphasis added)."5

We are forced to confront as well the lack of conformity in the way response rates are calculated. Again in Smith's words: "First, response rates and other outcome rates are usually not reported. Second, when reported, they typically are not defined. Third, when their basis is documented, various meanings and definitions are employed. As Richard Lau (1995, p. 5) has noted, 'Unfortunately, survey organizations do not typically publish response rates; indeed there is not even a standard way of determining them.' And as James Frey (1989, p. 50) has observed, 'Response rates are calculated in various ways by different research organizations. The basis on which these rates are calculated is not often reported.'"6

Instead of dwelling on the under-reporting of response rates and the absence of standardization, we consider instead the importance of response rates and what might be called the Sheatsley Admonition7: "A probability sample in which only 60 percent of the assigned cases have been completed can be subject to severe bias, since the 40 percent who are 'hard to get' may have quite different characteristics.... Generally, (an) 80 percent (completion rate) is regarded as acceptable; anything below 75 percent may be viewed with suspicion." 8

Or more recently, in 1979 guidelines issued by the U.S. government's Office of Management and Budget (OMB):

"It is expected that data collection based on statistical methods will have a response rate of at least 75 percent. Proposed data collections having an expected response rate of less than 75 percent require special justification. Data collection activities having a response rate of under 50 percent should be terminated. Proposed data collection activities having an expected response rate of less than 50 percent will be disapproved. As a general rule, no request for clearance of a general purpose statistical survey or report having an anticipated response rate of less than 75 percent will be approved unless the Office of Federal Statistical Policy and Standards of the Department of Commerce concurs in the request." 9

Well, things have certainly changed. Nowadays we, and the U.S. Government, content ourselves with response rates that are well below 50 percent10. This, of course, is at the peril of probability sampling, which - as we noted at the start - underlies survey research's claim to a basis in science. To understand why this is true, we consider some issues of sample size.

The law of large numbers

The law of large numbers is a straightforward proposition. It is the mathematical proof that a statistic calculated from a sample - a relative frequency, a mean, a standard deviation - will be more accurate with a large sample than with a small one.

It is the law of large numbers that enables us to put our faith in survey research as sample size increases. Simply stated: the more randomly selected, independent observations we have, the more confident we are that those observations reflect reality.

O.K., you say, if the law of large numbers applies, then all we need do is make sure we complete interviews with large samples of respondents. And in fact this mistaken rationale governs the way most of today's survey research is conducted.

Whether probability sample or quota sample, the emphasis in the conduct of survey research is on the accumulation of large sample sizes; as if that alone suffices to ensure the accuracy, reliability, and (most importantly) the science behind our survey research findings.

Let's see if we can convince ourselves of this.

A common sense approach

Suppose we set out to interview people, randomly and independently selected, about their interest in the color blue. We approach 10 people, and only one or two of the 10 answer our question. That is to say, we have between a 10 percent and 20 percent response rate.

Common sense says that it's probably neither reasonable nor prudent to assume that we know the responses of the other eight or nine people who do not speak to us based on the feelings of the one or two people we do speak to.

We could, of course, drastically increase our sample size. This time we succeed in asking 1,000 people about the color blue. Does this much larger number of respondents really make us more comfortable with the notion that the answers received from those 1,000 respondents represent the views of the other 4,000 to 9,000 people we contacted, but who made no response?

Don't we suspect in both cases that the 10 percent to 20 percent of respondents who did answer our question(s) differed somehow from the 80 percent to 90 percent - i.e., the vast majority - of all potential, qualified respondents who did not respond?

Uncertain? Well, let's roll some dice.

An empirical approach11

We begin with a single six-sided die. Assuming that it is a fair die, each of the six possible events that can take place on a given trial is equally probable. This means on each roll there is: a one in six chance of throwing a one; a one in six chance of throwing a two; a one in six chance of throwing a three; and so on.

Now consider our ability to estimate the probability of obtaining these six outcomes as the number of trials changes. The law of large numbers says that our ability to estimate the probabilities associated with each outcome should improve as the number of trials increases. So as the number of trials increases, we expect the proportions of ones, twos, threes, fours, fives, and sixes thrown to settle toward equality. This is true because the law of large numbers says we get a truer picture of the underlying distribution of outcomes as the number of trials increases.
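As a quick illustration - this simulation is mine, not part of the original argument - the following Python sketch rolls a fair die and shows the largest deviation of the observed face proportions from 1/6 shrinking as the number of trials grows.

# Minimal sketch of the law of large numbers with one fair die: the observed
# proportion of each face approaches the true probability of 1/6 as the
# number of rolls increases.
import random
from collections import Counter

def face_proportions(n_rolls, seed=1):
    rng = random.Random(seed)
    counts = Counter(rng.randint(1, 6) for _ in range(n_rolls))
    return {face: counts[face] / n_rolls for face in range(1, 7)}

for n in (60, 600, 60_000):
    worst = max(abs(p - 1 / 6) for p in face_proportions(n).values())
    print(f"{n:>6} rolls: largest deviation from 1/6 = {worst:.4f}")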

To complicate the game a bit, let's take two six-sided dice and toss them. We will now use the sum of the two dice as our outcome. Obviously, there is now a different number of base events that can give rise to each of the possible outcomes. The probabilities of these outcomes are shown below.

Outcome   Events                                   Proportion
2         (1,1)                                    1/36
3         (1,2)(2,1)                               2/36
4         (1,3)(2,2)(3,1)                          3/36
5         (1,4)(2,3)(3,2)(4,1)                     4/36
6         (1,5)(2,4)(3,3)(4,2)(5,1)                5/36
7         (1,6)(2,5)(3,4)(4,3)(5,2)(6,1)           6/36
8         (2,6)(3,5)(4,4)(5,3)(6,2)                5/36
9         (3,6)(4,5)(5,4)(6,3)                     4/36
10        (4,6)(5,5)(6,4)                          3/36
11        (5,6)(6,5)                               2/36
12        (6,6)                                    1/36

The law of large numbers is still at work and our ability to estimate the probabilities associated with any of the outcomes improves as the number of trials is increased.

But what happens if, for some unknown reason, our view of one or more outcomes, say the (3,1) or (6,2) combinations, is intermittently obscured or (worse) occurs but is never reported at all? Obviously, our ability to estimate population parameters is adversely affected, and this remains true no matter how many times the dice are rolled. So long as some outcomes are obscured, never reported, or over-reported, our population estimates are going to be wrong no matter how many times we roll the dice.
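A small Python sketch makes the point; it is my illustration, assuming the (3,1) and (6,2) combinations are simply never reported. The estimated probabilities of the affected sums stay biased no matter how many times the dice are rolled.

# Minimal sketch: two fair dice, but rolls of (3,1) and (6,2) are never
# "reported" - an analogue of nonresponse. The estimates for sums 4 and 8
# remain biased regardless of the number of trials.
import random
from collections import Counter

HIDDEN = {(3, 1), (6, 2)}

def estimate_sum_probabilities(n_rolls, seed=1):
    rng = random.Random(seed)
    counts, reported = Counter(), 0
    for _ in range(n_rolls):
        pair = (rng.randint(1, 6), rng.randint(1, 6))
        if pair in HIDDEN:        # these outcomes are obscured / unreported
            continue
        counts[sum(pair)] += 1
        reported += 1
    return {s: counts[s] / reported for s in counts}

for n in (3_600, 360_000):
    est = estimate_sum_probabilities(n)
    print(f"{n:>7} rolls: P(4) ~ {est[4]:.3f} vs 3/36 = {3/36:.3f}; "
          f"P(8) ~ {est[8]:.3f} vs 5/36 = {5/36:.3f}")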

This is the problem with nonresponse. No matter how large our samples, so long as 80 percent to 90 percent of all respondents contacted refuse to cooperate, we cannot rule out that some outcomes are obscured or go unreported altogether - and, therefore, that the 10 percent to 20 percent of respondents who do cooperate are unrepresentative of the population from which the sample is drawn.

So if survey research as practiced for the last 20-plus years is scientifically suspect because of our problems in drawing representative samples, is there some other aspect of the scientific method we can deploy to reintroduce science to the discipline of survey research?

Fortunately there is.

Reproducibility

Consider cold fusion. The central claim of cold fusion is that the fusion of deuterium can be initiated and maintained in an electrochemical apparatus not much different from that used to demonstrate the breakdown of water into its component gases in a high school chemistry class.

This claim, first reported in March of 1989, was put forward by two respectable scientists: Fleischmann of the University of Southampton in England and Pons of the University of Utah. Regrettably, this initial claim of "cold fusion" was quickly met by counterclaims from equally respectable labs and investigators to the effect that the initial findings could not be replicated.

In the midst of the ensuing controversy, the Department of Energy had its Energy Research Advisory Board (ERAB) conduct a study of the situation. But even before the ERAB panel issued its report, the consensus of mainstream researchers was that the claims of cold fusion's adherents were invalid. The ERAB report formalized and solidified that consensus - a consensus which remains unchallenged to this day.12

Reproducibility in survey research

Reproducibility - the replication of results - is the sine qua non of the scientific method. Science does not accept as "scientific" results that cannot be reproduced. The notion behind reproducibility is that if the results obtained by one scientist cannot be replicated by a similarly skilled scientist using similar equipment, then there is a concern of spuriousness in the original finding. Reproducibility is the check scientists apply to ensure that science is science and not mere idiosyncrasy.

Until very recently the replication of survey research studies was impractical. This was for several reasons, but most often because of the cost involved. Few organizations could afford to launch more than one execution of a survey research study.

However, even where cost is not a constraint, the amount of elapsed time required to conduct, analyze, and report the replication of a study inevitably means that any observed differences between the original effort and its replication might equally be the results of "history" and "maturation"13.

These three considerations - cost, history, and maturation - doomed most efforts at study replication from the start; and, ironically, have driven the need to create the "definitive" one-shot study. After all, if we have neither the time (because it takes too long) nor budget (because it costs too much) to do it more than once, then we'd best get it right the first time. Survey researchers have therefore been at pains to design research, execute the design, get an answer, and live with the results.

Not being fools, experienced survey researchers therefore rely not on any single study, but instead look for converging data. This search for converging data is a weak form of study replication. We look for confirming data from multiple, independent sources to buttress (or dispel) the findings of our survey research efforts.

Reintroducing science to the art of survey research

The advent of Internet-based market research makes it possible to introduce the key element of reproducibility to survey research. Using Internet-based techniques we can rapidly and cost-efficiently accumulate and simultaneously report results from multiple locations. By insisting on reproducibility we incorporate a key requirement of the "scientific method" in our survey research efforts.

More concretely, consider our ability to screen respondents for their use of widgets and then interview 1,000 widget users in the space of an evening by intercepting them at several of the popular search engines or widget enthusiast Web sites. Let's say we did this for five consecutive nights with detailed results for each night available the next morning.

Now imagine that the result for a question about willingness to buy a $10.95 blue widget was 20 percent on Monday, 22 percent on Tuesday, 18 percent on Wednesday, 17 percent on Thursday, and 23 percent on Friday, as shown in the table below.

Night           Sample size   Willingness to buy $10.95 blue widget
Monday          1,000         20%
Tuesday         1,000         22%
Wednesday       1,000         18%
Thursday        1,000         17%
Friday          1,000         23%
Total/Average   5,000         20%

If we found nothing unusual about the demographic composition and several widget purchase correlates for these five samples of 1,000 respondents, we'd likely feel comfortable concluding that 20 percent of the widget-using population would buy a $10.95 blue widget. We might even feel justified in saying the estimate is 20 percent ±3 percent, though this would not be a confidence interval.

Our ±3 percent interval around the mean of 20 percent is an estimate based on empiricism. It's an observed, not a computed, range. It is based on repeated trials instead of on assertions about the random nature of our sample and the normal distribution of blue widget purchase behavior among widget users.
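In code, the arithmetic is trivial. This minimal Python sketch, using the hypothetical widget figures above, simply reports the mean of the five nightly readings and the largest observed deviation from it.

# Minimal sketch: the +/-3 point figure is the observed spread of repeated
# measurements around their mean, not a computed confidence interval.
nightly = {"Mon": 0.20, "Tue": 0.22, "Wed": 0.18, "Thu": 0.17, "Fri": 0.23}

values = list(nightly.values())
mean = sum(values) / len(values)
spread = max(abs(v - mean) for v in values)

print(f"mean = {mean:.0%}, observed spread = +/-{spread:.0%}")  # mean = 20%, spread = +/-3%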

One can easily think of variations that would strengthen our faith in these replications. We could replicate more often: in the very early morning and on weekend nights in addition to evening mealtime hours. We could, if we wanted, replicate frequently but over extended periods to check for seasonality effects.

We could replicate across a broad range of search engine or widget enthusiast sites. We also could replicate across related subject area Web sites or other Internet locations where we might expect to find widget users.

We would replicate the study as often as practicable until we obtained converging data. Depending on the importance of the decision at hand, our concerns about false acceptance versus false rejection, and the investigator's personal view of what constitutes converging data, we might standardize on more rather than fewer replications before agreeing that "convergence" had occurred.
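What counts as "convergence" is ultimately a judgment call. One possible rule - my own sketch, not a standard from the survey literature - is to keep replicating until the last k readings all fall within a chosen tolerance of their running mean.

# Minimal sketch of one possible convergence rule: the most recent k readings
# must all lie within +/-tolerance of their own mean before we declare convergence.
def has_converged(readings, k=4, tolerance=0.03):
    if len(readings) < k:
        return False
    recent = readings[-k:]
    mean = sum(recent) / k
    return all(abs(r - mean) <= tolerance for r in recent)

readings = [0.20, 0.22, 0.18, 0.17, 0.23]   # the five nightly results above
print(has_converged(readings))              # True for k=4, tolerance=0.03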

If the data never converged we'd probably suspect something wrong with the questions being asked. We would reword the questionnaire and try again. As likely, we would also re-examine our underlying assumptions about the sources, quality, and/or self-selected nature of our samples.

Rather than attempt to design the definitive one-shot study, executed and reported over a 14-to-30-day period (or longer), we look instead for converging data from multiple iterations of a study executed and reported in rapid-fire fashion. The science underlying this approach has nothing to do with sampling theory. It depends instead on the common-sense requirement that before accepting a study we insist on the reproducibility of its results.

Representativeness

Well what about representativeness? Do many, rapidly executed iterations of a study across multiple Internet contact points on a 24x7x365 basis mean we are more likely to have a representative sample?

Unlikely. Again, response rates and self-selection come into play. While current experience with Web site intercepts yields response rates of 40 percent to 60 percent, it is almost certain that these response rates will decay over time.

More important, perhaps, is the concern that the respondents who answer Web site intercepts are strongly involved, positively or negatively, in the event or activity being experienced. Perhaps, too, they are simply people who enjoy responding to surveys - so-called professional respondents.

None of this matters. It is no different from traditional survey research efforts, where respondent self-selection is likely for much the same reason(s). Use of the Internet to replicate a study, therefore, is neither worse nor better than current survey research techniques in terms of the reasons respondents self-select to respond.

Rapid iterations of a study across multiple Internet contact points on a day-in/day-out basis are superior, however, in other ways. There is the constant, ongoing, rapid-fire replication of a study; the speedy development of converging data; and the ability to say: "No matter how many times I sample from within this population, the measure remains about the same."

As the frequency of these sampling iterations approaches continuous measurement, barometer-like information is obtained. In such instances the absolute value of a reading may be of less importance than its directionality: is the reading going up, going down, or is it flat, unchanged?

Continuous measurement

Continuous measurement is the logical extension of the notion of rapidly repeated trials, except that instead of sampling repeatedly, we sample continuously. In situ continuous measurement is everything survey research aspires to be but cannot achieve, because by its very nature current survey research practice is either retrospective or projective.

We can ask someone what they did and how they felt when they did it; or we can ask someone what they're likely to do and why they think they might do it. It is only with the advent of the Internet that we have the ability to intercept large numbers of respondents in the very course of their experiencing an activity or event, akin to the way an ethnographer might study a population.

The new science in survey research

Given the impacts of poor and decaying response rates on sampling, it is increasingly clear that sampling can no longer be depended upon for the science in survey research. That science must come from some other aspect of the scientific method.

Instead of "definitive" one-shot study designs with uncertain representativeness, we must broaden the search for converging data. This search is facilitated by the Internet, as its technologies enable survey researchers to generate multiple iterations of a study quickly, efficiently, and cost-effectively - across time (24x7x365) and across space; worldwide, as adoption of the Internet continues its wildfire proliferation.

Whether executed and reported in staccato-like rapid-fire fashion, or implemented as continuous measurement, the ability to amass and quickly analyze data collected from multiple iterations of a study facilitates not only the pursuit of converging data, but also the acceptance of research study findings based on their reproducibility.

The science underlying this approach has less to do with sampling theory and more to do with the fundamental requirement of the scientific method that before investigators accept a study finding, they insist on the reproducibility of its results.

References

1 Smith, Tom W., Developing Nonresponse Standards, International Conference on Survey Nonresponse, Portland, Oregon, October, 1999, p. 1.

2 Lockley, Lawrence C., "History and Development of Marketing Research," in Ferber, Robert (ed.), Handbook of Marketing Research, McGraw-Hill, New York, 1974, p. 1-15.

3 Frankel, Martin R., and Frankel, Lester R., "Probability Sampling," in Ferber, Robert (ed.) op.cit., p. 2-231.

4 Smith, op. cit.

5 Smith, op. cit., p. 31.

6 Smith, op. cit., p. 35.

7 Sheatsley, Paul R., "Survey Design," in Ferber, Robert (ed.), op. cit., p. 2-75.

8 Seconded by Paul Erdos, among others, who notes approvingly that "The Advertising Research Foundation recommends an 80 percent or better response on mail surveys,...." in Erdos, P., Professional Mail Surveys, Krieger Publishing Company, Malabar, Fla., 1983, p. 144.

9 Quoted in Smith, op. cit., p. 14.

10 See for example the response rate data supplied by the Council for Marketing and Opinion Research (CMOR) initiative to collect and report survey cooperation, refusal and response rates at www.mra-net.org/docs/resources/coop_rates/coop_rates_avg.cfm. A recent visit to this site revealed a mean 12 percent response rate among RDD samples (n= 404) and a mean 28 percent response rate among telephone list samples (n=185).

11 I found very nearly the same dice-based examples below at various places on the Internet. I am indebted to Toby Dye of the Parmly Hearing Institute at Loyola University Chicago for the version posted by him at www.parmly.luc.edu/statistics/stat6.pdf.

12 Paul Jaffe, Introduction to the Internet Edition of: Cold Fusion Research, A Report of the Energy Research Advisory Board to the United States Department of Energy, Washington, DC DOE/S-0073 DE90 005611, November, 1989, posted at http://www.ncas.org/erab/intro.htm.

13 "History, the specific events occurring between the first and second measurement in addition to the experimental variable. Maturation, processes within the respondents operating as a function of the passage of time per se (not specific to the particular events), including growing older, growing hungrier, growing more tired, and the like." Campbell, Donald T., and Stanley, Julian C., Experimental and Quasi-Experimental Designs for Research, p. 5, Rand McNally College Publishing Company, Chicago,1966.