Editor’s note: Terry Grapentine is principal of Grapentine Company LLC, an Ankeny, Iowa, research firm.

In their article “The use, misuse, and abuse of significance” in the November 1994 issue of this magazine, authors Patrick Baldasare and Vikas Mittel made the case that there is a difference between statistical and practical significance. Even when a statistical test indicates, at the 95 percent confidence level, that the difference between two means or percentage scores is statistically significant, that difference may not possess practical significance. For instance, the difference may lie in an attitudinal measure that does not influence consumer behavior, or in a demographic measure that has no relevance to marketing communications.

As they concluded near the end of their article: “Our logic is the following: ... the relevance of a statistically significant difference should be determined based on practical criteria including the absolute value of the difference, marketing objectives, strategy and so forth. The mere presence of a statistical significance does not imply that the difference is large or that it is of noteworthy importance.”

Baldasare and Mittel’s discussion focused primarily on the relationships among random error, sample size and statistical significance. Their article did not examine other sources of error that can affect statistical testing and conspire to distract management from discovering meaningful differences and similarities lurking in a data set.

Therefore, I want to build on their observations by describing additional sources of error in survey research that make identifying statistically significant differences problematic, and by showing how large sample sizes can render the subject of statistical significance moot altogether. In particular, I discuss the effects that sampling and measurement error have on statistical tests and the misleading sense of scientific precision that statistical tests project onto research reports. I conclude with a recommendation on how to report statistical significance.

Keep in mind that statistical testing does not render a verdict on the validity of your data. On a given measure, your statistical analysis software may reveal statistically significant differences (or not) between two or more respondent groups, but such differences (or lack of differences) could be caused by sampling biases and/or measurement error. Your statistical software assumes that your data are completely valid, which is never the case.

Two kinds of error

There are two kinds of error - random and systematic. Any survey statistic will therefore be a function of the true value of the parameter being estimated, plus random and systematic error. Consider the following:

X̄ = µ + random error + systematic error

Where,
X̄ = the observed sample mean of X
µ = the true but unknown population mean of X.

Random error is error variance that does not bias the data; the expected value of X̄ remains µ. For example, the particular mood of a respondent may affect how he answers a question. Presumably, when drawing a sample from a population, these various moods and their effects on respondents’ answers will be random across all respondents.

In contrast, systematic error biases statistical estimates, although the direction of the bias may be unknown. For example, if you are measuring how much people weigh and your scale systematically subtracts five pounds from a person’s actual weight, your weight measures will be biased.
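To make the distinction concrete, here is a minimal simulation sketch in Python, assuming a hypothetical true average weight of 150 pounds, respondent-level noise, and the five-pound-light scale described above (all numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
true_mean = 150.0   # hypothetical true average weight, in pounds
n = 1_000           # sample size

# Random error: respondent-to-respondent noise that averages out to zero
random_error = rng.normal(loc=0.0, scale=10.0, size=n)

# Systematic error: a miscalibrated scale that subtracts five pounds from everyone
systematic_error = -5.0

observed = true_mean + random_error + systematic_error

print(f"True mean:     {true_mean:.1f}")
print(f"Observed mean: {observed.mean():.1f}")  # settles near 145, not 150
```

Increasing the sample size shrinks the contribution of the random error but does nothing to the five-pound bias; only fixing the scale - the measurement instrument - removes systematic error.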

These two kinds of error can come from mistakes in your sample (sampling error) or from the questions that appear or don’t appear on your questionnaire (measurement error).

Sampling error

Consider the following sources of sampling error that may underlie your data: a) under-coverage; b) nonresponse; and c) self-selection. Whether this error is random or systematic will be a function of how you draw your sample.

Under-coverage. This is a situation in which a segment of the target population is underrepresented. One famous example is the 1936 Literary Digest survey covering that year’s presidential election between Franklin Roosevelt and Alfred Landon. A major portion of potential survey respondents were identified via telephone book listings, which, in 1936, underrepresented lower-income, Democratic households.

We face similar sampling challenges today. Consider: a) half of heads-of-households 25 to 29 years of age do not have a landline phone; b) consumers who are infrequently home in the evenings can be underrepresented in phone surveys; and c) sampling/panel companies may not have access to students’ college e-mail addresses or telephone numbers when classes are in session.

Nonresponse. Some people are simply unwilling or disinclined to participate in a survey. A major manifestation of this problem today is consumers’ growing unwillingness to participate in telephone surveys. Richard Curtin et al. report on one study showing telephone response rates declining from approximately 80 percent in 1979 to near 40 percent in 2003 (“Changes in telephone survey nonresponse over the past quarter century,” Public Opinion Quarterly, Spring 2005, pp. 87-98).

Self-selection. One way this can occur is when a respondent can exercise control over completing a survey. For example, an Internet panel participant qualifies and agrees to take an online survey but subsequently finds that she is becoming bored with the subject matter and quits. Bias can therefore be introduced if a disproportionate share of one’s sample is completed by respondents who are not representative of the population of interest (e.g., the sample has a disproportionate number of respondents who simply like the topic).

Measurement error

This kind of error can be attributed to questions that appear or don’t appear on your survey, and may result in either random or systematic error depending upon the particular situation.

Question interpretation. One source of data variance due to question interpretation is simply asking respondents a vague or ambiguous question such as the following:

On a scale of 0 to 10, where 0 denotes poor performance and 10 denotes excellent performance, how would you rate the Acme Company on being innovative?

Innovative is a vague term. For example, some respondents may interpret innovative to refer to service innovation and others may think it refers to product innovation. An estimate of the mean score on this attribute would be biased if the researcher intended innovative to refer to services but many respondents interpreted the term to mean tangible products.

Respondent assumptions. Even relatively well-constructed questions will have some level of vagueness with respect to assumptions respondents make before answering a question. For example, product performance ratings can be influenced by the extent to which respondents consider the following issues prior to giving their rating: a) how much the product costs; b) how performance accords with one’s forecast of product performance; c) recent experience with the product vs. one’s use of the product over time; and d) whether the performance of the product being rated is being compared to similar products in the respondent’s mind.

Question order. Where a question appears in a survey can affect how respondents answer it. For instance, asking an overall satisfaction question at the beginning of a survey can elicit a different rating compared to placing it at the end of a survey, where exposure to preceding questions can affect the overall satisfaction rating (e.g., the preceding questions prime either positive or negative memories of one’s experience with the product).

Method variance. I had the opportunity to analyze a restaurant chain’s customer satisfaction data that were collected via two modalities - online and interactive voice response (IVR). Both surveys were identical in their questions and scales. Study findings revealed that data from the online survey had greater variance than data collected via the IVR system.

Additionally, there was some systematic bias - restaurant ratings were higher in the IVR vs. the online format over several time periods in which the surveys were administered. One hypothesis explaining the different findings was that visually exposing respondents to the survey’s rating scales promoted use of a wider range of scale values and more validly reflected the respondents’ views.
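One way to probe for this kind of mode effect is to compare both the spread and the level of ratings across the two modes. The sketch below uses made-up 0-10 ratings and hypothetical variable names, not the restaurant chain’s actual data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical 0-10 satisfaction ratings collected under two survey modes
online = np.clip(rng.normal(7.5, 2.0, size=400), 0, 10)  # wider spread
ivr    = np.clip(rng.normal(8.2, 1.2, size=400), 0, 10)  # higher, tighter

# Levene's test: do the two modes produce different variances?
lev_stat, lev_p = stats.levene(online, ivr)

# Welch's t-test: do the two modes produce different mean ratings?
t_stat, t_p = stats.ttest_ind(online, ivr, equal_var=False)

print(f"Variance  online={online.var():.2f}  IVR={ivr.var():.2f}  (Levene p={lev_p:.4f})")
print(f"Mean      online={online.mean():.2f}  IVR={ivr.mean():.2f}  (Welch p={t_p:.4f})")
```

Of course, even when such tests flag a mode difference, they cannot say which mode is the more valid one; that judgment rests on hypotheses like the scale-visibility explanation above.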

Attribute wording. Even the most finely-crafted attribute statements can be reworded, and doing so can affect how respondents answer them. For example, consider the following three alternatives to the question, “With which aspect of our service were you most satisfied?”

With which aspect of our service were you most . . . ?
pleased
delighted
happy.

True, these alternative wordings have slightly different connotations. Nevertheless, many words have synonyms, and sometimes it’s a coin toss as to which wording one uses. Differently-phrased questions can produce different answers.

Omitting important questions. The most prevalent example of systematic measurement error in marketing research is omitting an important variable from your survey. For example, in a multiple regression equation, this can result in a less important independent variable being both statistically significant and judged to be the most influential, when the omitted variable would have been the most important predictor in your model had it been included in the study.
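A rough illustration of the point, using a small simulated data set in which word-of-mouth (the omitted variable) actually drives purchase intent and happens to be correlated with an included attribute rating; all variable names and coefficients are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

word_of_mouth = rng.normal(size=n)                            # the true driver
attribute_rating = 0.7 * word_of_mouth + rng.normal(size=n)   # correlated stand-in
purchase_intent = 2.0 * word_of_mouth + rng.normal(size=n)    # outcome

# Model that omits word_of_mouth: the attribute rating soaks up its effect
fit_omitted = sm.OLS(purchase_intent, sm.add_constant(attribute_rating)).fit()

# Model that includes the omitted variable
X_full = sm.add_constant(np.column_stack([attribute_rating, word_of_mouth]))
fit_full = sm.OLS(purchase_intent, X_full).fit()

print("Omitted model - attribute coef:", fit_omitted.params[1].round(2),
      " p =", fit_omitted.pvalues[1].round(4))
print("Full model - attribute, WOM coefs:", fit_full.params[1:].round(2))
```

In the misspecified model the attribute rating appears highly significant and influential; once word-of-mouth enters the model, the attribute’s coefficient collapses toward zero.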

Random or systematic error

Sometimes factors that one might expect to introduce only random error into a data set can actually introduce systematic error. Underrepresenting important members of a population could result in systematic bias in the data. If a question is worded such that respondents systematically misinterpret what the researcher meant, systematic error will result. For example, suppose you ask respondents what was the most important factor influencing their recent purchase. Most respondents think of tangible attributes of the product when, in reality, the most influential factor was word-of-mouth recommendations.

Not the same thing

Unfortunately, including the results of statistical tests in a report confers a specious air of “scientific” precision and validity on a study. Precision and validity are not the same thing. A study can be very precise in its sample design or measures yet have its validity severely compromised by the factors discussed earlier.

My experience suggests that when a layman - especially one unfamiliar with the points made above - sees that two measures are “statistically significantly different,” the phrase attracts attention and implies that “this is something you need to pay attention to,” when the actual case may be just the opposite.

Need to provide guidance

The above examples suggest that we can never be totally sure whether to trust our statistical tests. Yet researchers need to provide guidance to their audience regarding whether the differences reported in a study should be taken to heart or ignored. Therefore, I share Baldasare and Mittel’s recommendation of reporting not “statistical significance” but rather “managerial significance.” Additionally, I recommend reporting non-statistically-significant results that have managerial implications.

Managerial significance. Identify differences whose magnitude has relevance to decision making. I emphasize those two words for the following reasons:

Magnitude: Virtually all crosstab statistical tests take the following form:

H0 : µ1 = µ2
HA : µ1 ≠ µ2

With a sufficiently large sample size, you will always reject the null hypothesis. In reality, if you carry the decimals out far enough, virtually no two µ’s are ever precisely equal.

Relevance: The managerially relevant question is not whether two means are different - with a sufficiently large sample size they always are - but whether the difference is large enough to matter to decision makers.
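To see how sample size alone can manufacture “significance,” consider a quick sketch in which two groups’ true means differ by a practically trivial 0.05 scale points (the effect size, spread and sample sizes are invented for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def p_value(n):
    """Welch t-test p-value for two groups whose true means differ by only 0.05."""
    group_a = rng.normal(7.00, 1.5, size=n)
    group_b = rng.normal(7.05, 1.5, size=n)
    return stats.ttest_ind(group_a, group_b, equal_var=False).pvalue

for n in (100, 1_000, 100_000):
    print(f"n per group = {n:>7}:  p = {p_value(n):.4f}")
# At large n the trivial 0.05-point gap is declared "statistically significant."
```

The p-value shrinks as n grows even though the underlying difference never becomes any more meaningful to a decision maker.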

The danger of excessive reliance on statistical testing in marketing research - given all the factors discussed earlier that can confound the interpretation of these tests - is illustrated by the story of the man who invented Student’s t-test, a test that ironically forms the basis for most of the blindly-followed statistical testing done in marketing research today.

William Sealy Gosset (1876-1937), creator of Student’s t-test, was also a brew master for the Guinness Brewery in Dublin. He was the head experimental brewer whose primary responsibility was to understand how various ingredients could affect the quality of Guinness. Economic constraints limited the number of batches of Guinness he could brew in order to test the effects that various combinations of yeast chemistry, barley, hops, water quality and so on had on the product’s quality.

Gosset knew his experimental designs were not perfect (think of our previous discussion of sampling and measurement error) and that small sample sizes could disguise important findings if he overly relied on statistical tests - even his own. He used his Student’s t-test (published under the pseudonym Student) only as a tool. He relied on that tool and judgment to identify factors that had substantive or economic significance - regardless of their statistical significance! From Gosset’s book (emphasis is mine):

“I thought that perhaps there might be some degree of probability which is conventionally treated as sufficient in such work as ours and I advised that some outside authority in mathematics should be consulted as to what certainty is required to aim at in large scale work. However it would appear that in such work as ours the degree of certainty to be aimed at must depend on the pecuniary advantage to be gained by following the result of the experiment, compared with the increased cost of the new method, if any, and the cost of each experiment.”

Insignificance. Just because a statistical test may indicate that two populations are not statistically significantly different on a measure does not mean that your report should gloss over this finding. For example, two competing products’ image attribute ratings may not be statistically significantly different. Yet, if one brand has significantly more market share than the other, this may suggest that other factors outside of brand image may account for this difference, and such factors should be further investigated (e.g., store location, marketing communication effectiveness).

Many factors

In summary, many factors can affect the validity of our statistical testing, from how we draw our samples to how we ask respondents questions. Additionally, if our sample sizes are large enough, virtually every statistical test will show a significant difference.

When pondering how to address this issue in your next study, think of William Sealy Gosset. Use statistical tests the same way he used them to understand the chemistry of a fine beer - as a tool to discover, not to define, practical insights.