A comparison of missing value options in regression analysis



Article ID:
19951211
Published:
December 1995
Author:
Gary M. Mullet

Article Abstract

Regression analysis is one tool for evaluating customer satisfaction measurement. Non-response is problematic for multiple regression analysis because most software discards all of a respondent’s data when it encounters a missing value. This article discusses options for coping with item non-response in regression runs, comparing run results based on a real data set.

Editor's note: Gary M. Mullet, Ph.D., is president of Gary Mullet Associates, Lawrenceville, Ga.

Whenever you manage to get off the telephone long enough to even glance at your in-box, you're sure to notice that a large amount of correspondence deals with various facets of customer satisfaction measurement (CSM). It also seems that more and more promotion, compensation and retention decisions are based, at least in part, on the results of CSM studies.
One tool, although certainly not the only one, for evaluating such studies is regression analysis. As readers of this column are aware, regression analysis is certainly widely used in other types of marketing research studies. One bugaboo of multiple regression analysis is item non-response. When (most) computer packages encounter a missing value, they pitch all of the other data from the given respondent, by default.
There are various options for coping with item non-response in regression runs. We will compare the results of some of these below, using a real, albeit disguised, data set. If your livelihood depends on the results of a CSM study, you should be interested in the differing conclusions which may be drawn from these comparisons. All of the results reported below use a 95 percent confidence (5 percent significance) criterion and stepwise regression runs. There are certainly myriad other options available which are not examined below.

Listwise deletion
As already noted, the default option in most programs is listwise deletion. In a very small nutshell, this means that if a respondent fails to answer even one of the many ratings, that respondent ceases to exist for the regression in question. As a case in point, a recent regression on 1200+ respondents yielded not a single valid case for a regression trying to use only 15 (out of 60-some) independent variables to predict overall satisfaction. While this is extreme, it is not unusual to lose 50 percent or more of the respondents to item non-response. Thus, conclusions (and compensation) may be based on fewer than half of the respondents in your carefully designed study!
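To see how quickly listwise deletion eats into the base, here is a minimal sketch in plain Python. The five toy respondents and the variable names (overall, X1 to X3) are made up for illustration and are not the article's data; `None` stands for an unanswered rating, and the helper `listwise_delete` is a hypothetical name, not a function from any statistics package.

```python
# Hypothetical toy data: None marks a rating the respondent skipped.
respondents = [
    {"overall": 8, "X1": 7, "X2": 9, "X3": 6},
    {"overall": 6, "X1": None, "X2": 5, "X3": 7},
    {"overall": 9, "X1": 8, "X2": None, "X3": 9},
    {"overall": 5, "X1": 4, "X2": 6, "X3": None},
    {"overall": 7, "X1": 6, "X2": 7, "X3": 8},
]


def listwise_delete(rows, variables):
    """Keep only respondents with a valid answer on every listed variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


variables = ["overall", "X1", "X2", "X3"]
complete = listwise_delete(respondents, variables)

print(f"Questionnaires in hand: {len(respondents)}")
print(f"Regression base after listwise deletion: {len(complete)}")
```

Each of the three dropped respondents skipped only a single rating, yet all of their other answers are discarded along with it.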
Our example comes from a data set of 500 respondents who were asked 10 ratings that were potentially related to an overall opinion measure. For proprietary reasons, the 10 scales used for the independent variables will be denoted below as X1, X2, ..., X10, rather than given more meaningful labels. The results of the first regression, using the listwise (default) option, are noted in Table 1 under column A.
As a variation on listwise deletion, some analysts use a portion of the column A results only to see which set of variables is significant and then instruct the computer to run another regression, using only those attributes and pretending that the others don't exist. This can accomplish a couple of things. First, almost assuredly, the base size will increase since fewer variables require answers from everyone. Second, (partial) regression coefficient magnitudes may change, as well as order of entry of the variables -- just look below. In some cases, attributes that are statistically significant in the first pass through the data will not be so in this second pass. The results from this "variable screening" analysis are listed under column B.
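The base-size effect of variable screening can be sketched the same way. The toy data are again invented, and the choice of X1 and X3 as the "significant" attributes is purely an assumption for illustration; in practice that list would come from the first regression run. The helper `base_size` is a hypothetical name.

```python
# Hypothetical toy data: None marks a rating the respondent skipped.
rows = [
    {"overall": 8, "X1": 7, "X2": 9, "X3": 6},
    {"overall": 6, "X1": None, "X2": 5, "X3": 7},
    {"overall": 9, "X1": 8, "X2": None, "X3": 9},
    {"overall": 5, "X1": 4, "X2": 6, "X3": None},
    {"overall": 7, "X1": 6, "X2": 7, "X3": 8},
]


def base_size(rows, variables):
    """Count respondents who answered every variable in the list."""
    return sum(all(r[v] is not None for v in variables) for r in rows)


# First pass: listwise deletion across all candidate predictors.
first_pass = base_size(rows, ["overall", "X1", "X2", "X3"])

# Second pass: assume (for illustration only) that X1 and X3 were the
# significant attributes, so X2 no longer needs a valid answer.
second_pass = base_size(rows, ["overall", "X1", "X3"])

print(f"Base with all predictors: {first_pass}")
print(f"Base with screened predictors only: {second_pass}")
```

The respondent who skipped only X2 comes back into the second run, which is why coefficients and order of entry can shift between the two passes.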

Pairwise option
In this variation of regression, attributes are (essentially) looked at two-by-two (sounds like Noah's Ark). Without beating anyone over the head with statistical theory, invoking this option changes the matrix upon which the computer program operates to find the estimated regression coefficients: each entry is computed from every respondent who answered both attributes in that pair, rather than only from respondents who answered everything. The results of the pairwise option are under column C in Table 1.
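A rough sketch of the two-by-two idea, assuming the matrix in question is a Pearson correlation matrix: every pair of variables gets its own base, made up of whoever answered both. The data and the function name `pairwise_corr` are hypothetical.

```python
from itertools import combinations
from math import sqrt

# Hypothetical toy data: None marks a rating the respondent skipped.
rows = [
    {"overall": 8, "X1": 7, "X2": 9},
    {"overall": 6, "X1": None, "X2": 5},
    {"overall": 9, "X1": 8, "X2": None},
    {"overall": 5, "X1": 4, "X2": 6},
    {"overall": 7, "X1": 7, "X2": 7},
]


def pairwise_corr(rows, a, b):
    """Pearson correlation of a and b over respondents who answered both.

    Returns the correlation and the number of respondents it is based on.
    """
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    n = len(pairs)
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    var_a = sum((x - mean_a) ** 2 for x, _ in pairs)
    var_b = sum((y - mean_b) ** 2 for _, y in pairs)
    return cov / sqrt(var_a * var_b), n


results = {}
for a, b in combinations(["overall", "X1", "X2"], 2):
    results[(a, b)] = pairwise_corr(rows, a, b)
    r, n = results[(a, b)]
    print(f"{a} vs {b}: r = {r:.3f} on a base of {n}")
```

Notice that different cells of the matrix rest on different respondents, so no single base size describes the regression as a whole.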

Mean substitution
Be careful here! Mean substitution for missing values is a very attractive option since it's easy to invoke -- just push a computer key -- and dramatically increases the base size on which these personnel and/or other decisions are made. The mean substitution option fills in the arithmetic mean value for everyone who did answer a given rating for the void existing for those who did not. Thus, everyone is assumed to be "average" on anything that they failed to answer.
Then why be careful? First, if you blithely select mean substitution without any filtering of the data, the mean on the dependent variable, here overall opinion, is also substituted for those who didn't answer it. You will then be running regressions that include a substantial number of people who did not give a rating on the criterion measure -- be they no longer customers, no longer product users or whatever. See column D for this type of mean substitution.
O.K., let's say you're alert enough to run the mean substitution option on only those who gave an answer to the overall opinion question. The results, in column E of Table 1, still include several respondents who answered only one or two of the independent variable ratings, which may cause an eyebrow or two to be raised if the results are broadcast.
Finally, let's look at more intelligent mean substitution. You need to ask yourself, "How many questions should a respondent answer to convince me that they have a grasp of the interview?" For the data which we are looking at, the answer to this (arbitrarily) was set at eight. Then, mean substitution was used for those who met two criteria. First, there had to be a valid answer to the overall opinion question. Second, there had to be at least eight legal answers to the 10 predictor attribute ratings. The regression coefficients are shown in column F of the table.
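The three flavors of mean substitution described above (columns D, E and F) differ only in who is allowed into the run before the means are filled in. The sketch below uses invented data with three predictors and a cutoff of two answered, standing in for the article's eight of 10. It also assumes the means are computed from the respondents retained by the filter, which the article does not specify. All function names are hypothetical.

```python
# Hypothetical toy data: None marks a rating the respondent skipped.
rows = [
    {"overall": 8, "X1": 7, "X2": 9, "X3": 6},
    {"overall": None, "X1": 5, "X2": 5, "X3": 7},
    {"overall": 9, "X1": None, "X2": None, "X3": 9},
    {"overall": 5, "X1": 4, "X2": 6, "X3": None},
    {"overall": 7, "X1": 6, "X2": 7, "X3": 8},
]
PREDICTORS = ["X1", "X2", "X3"]


def mean_substitute(rows, variables):
    """Fill each missing value with the mean of those who did answer."""
    means = {}
    for v in variables:
        answered = [r[v] for r in rows if r[v] is not None]
        means[v] = sum(answered) / len(answered)
    return [
        {**r, **{v: means[v] if r[v] is None else r[v] for v in variables}}
        for r in rows
    ]


def answered_count(row, variables):
    return sum(row[v] is not None for v in variables)


# Column D style: no filtering, so the dependent variable is filled in too.
col_d = mean_substitute(rows, ["overall"] + PREDICTORS)

# Column E style: only respondents who answered the overall question.
has_overall = [r for r in rows if r["overall"] is not None]
col_e = mean_substitute(has_overall, PREDICTORS)

# Column F style: also require a minimum number of answered predictors
# (2 of 3 here; the article used 8 of 10).
MIN_ANSWERED = 2
qualified = [r for r in has_overall if answered_count(r, PREDICTORS) >= MIN_ANSWERED]
col_f = mean_substitute(qualified, PREDICTORS)

for label, data in (("D", col_d), ("E", col_e), ("F", col_f)):
    print(f"Column {label} style base: {len(data)}")
```

In the column D version, the second respondent ends up with an overall opinion rating they never gave, which is exactly the trap described above.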

Respondent mean substitution
Many feel that the major drawback to using the automatic mean substitution option is that an individual with missing values is treated like everyone else; the mean of all who did answer is substituted as the value for those who did not, as already noted, variable-by-variable. Respondent mean substitution treats each individual as an independent entity; the mean for the questions that were answered (which may require some reverse coding) for each individual respondent is substituted for the value(s) for which there is no answer for that respondent and that respondent only. This, then, makes use of scale usage differences between individuals or genuinely different (average) ratings on the independent variables between individuals. As before, the resulting regression may be run irrespective of the number of ratings which a respondent did answer, but in column G you'll find the results of substituting the respondents' own mean for items which had no answer for, as before, those who answered at least eight of the predictors and also gave an overall opinion rating.
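Respondent mean substitution can be sketched as follows. The data are invented, the cutoff of two answered predictors out of three stands in for the article's eight of 10, and the sketch assumes all scales already run in the same direction, so no reverse coding is shown. The function name is hypothetical.

```python
# Hypothetical toy data: None marks a rating the respondent skipped.
rows = [
    {"overall": 8, "X1": 7, "X2": 9, "X3": 6},
    {"overall": 5, "X1": 4, "X2": 6, "X3": None},
    {"overall": 9, "X1": 9, "X2": None, "X3": 8},
    {"overall": 6, "X1": None, "X2": None, "X3": 7},
    {"overall": None, "X1": 5, "X2": 5, "X3": 5},
]
PREDICTORS = ["X1", "X2", "X3"]
MIN_ANSWERED = 2  # stand-in for the article's "at least eight of 10"


def respondent_mean_substitute(row, predictors):
    """Fill a respondent's missing predictors with that respondent's own mean."""
    answered = [row[p] for p in predictors if row[p] is not None]
    own_mean = sum(answered) / len(answered)
    filled = {p: own_mean if row[p] is None else row[p] for p in predictors}
    return {**row, **filled}


# Same two criteria as before: a valid overall rating and enough predictors.
qualified = [
    r for r in rows
    if r["overall"] is not None
    and sum(r[p] is not None for p in PREDICTORS) >= MIN_ANSWERED
]
col_g = [respondent_mean_substitute(r, PREDICTORS) for r in qualified]

for r in col_g:
    print(r)
```

A respondent who rates everything low gets a low fill-in value and one who rates everything high gets a high one, which is the scale usage difference the overall mean would erase.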

Are we done yet?
Just about. We'll leave perusal of Table 1 to you during your scarce leisure time. Note, however, that there are some common and uncommon threads between the columns. Depending on your actual application of regression analysis, none of these differences may be daunting at all. Certainly, in some applications they are somewhat scary.
It should be obvious by now that there are still other analytical variations, such as using the pairwise option on the respondent mean substitution data. That's not the point. The important conclusion to draw from the above mathematical manipulations is that it is essential for the analyst to know exactly which options are used on any regression analysis before blindly trying to implement the results, whether they be for sales force compensation, new product share forecasting, brand image analysis or whatever. As always, clear, careful, concise communication is what it's all about. And please, please don't use total mean substitution just to be able to show a regression base equal to the number of questionnaires in hand. While that sounds like a no-brainer, it has been done.
