
A comparison of missing value options in regression analysis



Article ID:
19951211
Published:
December 1995
Author:
Gary M. Mullet

Article Abstract

Regression analysis is one tool for evaluating customer satisfaction measurement. Non-response is problematic for multiple regression analysis because most software discards all of a respondent’s data when it encounters a missing value. This article discusses options for coping with item non-response in regression runs, comparing run results based on a real data set.

Editor's note: Gary M. Mullet, Ph.D., is president of Gary Mullet Associates, Lawrenceville, Ga.

Whenever you manage to get off the telephone long enough to even glance at your in-box, you're sure to notice that a large amount of correspondence deals with various facets of customer satisfaction measurement (CSM). It also seems that more and more promotion, compensation and retention decisions are based, at least in part, on the results of CSM studies.
One tool, although certainly not the only one, for evaluating such studies is regression analysis. As readers of this column are aware, regression analysis is certainly widely used in other types of marketing research studies. One bugaboo of multiple regression analysis is item non-response. When (most) computer packages encounter a missing value, they pitch all of the other data from the given respondent, by default.
There are various options for coping with item non-response in regression runs. We will compare the results of some of these below, using a real, albeit disguised, data set. If your livelihood depends on the results of a CSM study, you should be interested in the differing conclusions which may be drawn from these comparisons. All of the results reported below use a 95 percent confidence (5 percent significance) criterion and stepwise regression runs. There are certainly myriad other options available which are not examined below.

Listwise deletion
As already noted, the default option in most programs is listwise deletion. In a very small nutshell, this means that if a respondent fails to answer even one of the many ratings, that respondent ceases to exist for the regression in question. As a case in point, a recent regression on 1200+ respondents yielded not a single valid case for a regression trying to use only 15 (out of 60-some) independent variables to predict overall satisfaction. While this is extreme, it is not unusual to lose 50 percent or more of the respondents to item non-response. Thus, conclusions (and compensation) may be based on fewer than half of the respondents in your carefully designed study!
Our example comes from a data set of 500 respondents who were asked 10 ratings that were potentially related to an overall opinion measure. For proprietary reasons, the 10 scales used for the independent variables will be denoted below as X1, X2, ..., X10, rather than given more meaningful labels. The results of the first regression, using the listwise (default) option, are noted in Table 1 under column A.
As a variation on listwise deletion, some analysts use a portion of the column A results only to see which set of variables is significant and then instruct the computer to run another regression, using only those attributes and pretending that the others don't exist. This can accomplish a couple of things. First, almost assuredly, the base size will increase, since fewer variables require answers from everyone. Second, (partial) regression coefficient magnitudes may change, as well as the order of entry of the variables -- just look below. In some cases, attributes that are statistically significant in the first pass through the data will not be so in this second pass. The results from this "variable screening" analysis are listed under column B.
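To make the base-size effect concrete, here is a minimal Python sketch using pandas. The six-respondent data set, the column names and the helper `listwise_base` are all made up for illustration; they are not the article's data, and the choice of which variable to drop is simply assumed rather than taken from a real stepwise run.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 6 respondents, 3 ratings and an overall opinion.
# NaN marks an item the respondent did not answer.
ratings = pd.DataFrame({
    "X1": [5, 4, np.nan, 3, 2, 4],
    "X2": [4, np.nan, 3, 3, 1, 5],
    "X3": [np.nan, 4, 2, 3, 2, 4],
    "overall": [5, 4, 3, 3, 1, 5],
})

def listwise_base(df, predictors, target="overall"):
    """Keep only respondents who answered every variable in the model."""
    return df[predictors + [target]].dropna()

# Column A style: every predictor must be answered.
full = listwise_base(ratings, ["X1", "X2", "X3"])

# Column B style: assume X3 was screened out, then rerun on what is left.
screened = listwise_base(ratings, ["X1", "X2"])

print(len(ratings), len(full), len(screened))
```

Respondents who skipped only the dropped variable come back into the base, which is why the second pass can rest on more cases and produce different coefficients.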

Pairwise option
In this variation of regression, attributes are (essentially) looked at two-by-two (sounds like Noah's Ark). Without beating anyone over the head with statistical theory, invoking this option changes the matrix on which the computer program operates to find the estimated regression coefficients: each entry is computed from whichever respondents answered both variables in that pair, so different entries may rest on different base sizes. The results of the pairwise option are under column C in Table 1.
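A rough Python sketch of the idea follows. It relies on the fact that pandas' `DataFrame.corr()` excludes missing values pair by pair, and it derives standardized coefficients from the correlation matrix, which is one common way such software works but not necessarily what any particular package does. The data and the helper `standardized_betas` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 8 respondents, 2 ratings and an overall opinion.
ratings = pd.DataFrame({
    "X1": [5, 4, np.nan, 3, 2, 4, 1, 5],
    "X2": [4, np.nan, 3, 3, 1, 5, 2, 4],
    "overall": [5, 4, 3, 3, 1, 5, 1, np.nan],
})
predictors = ["X1", "X2"]

# Number of respondents behind each pair of variables.
answered = ratings.notna().astype(int)
pair_counts = answered.T @ answered

# Pairwise option: each correlation uses everyone who answered both items.
pairwise_corr = ratings.corr()
# Listwise option, for contrast: complete cases only.
listwise_corr = ratings.dropna().corr()

def standardized_betas(corr, predictors, target="overall"):
    """Solve R_xx * beta = r_xy for standardized regression coefficients."""
    r_xx = corr.loc[predictors, predictors].to_numpy()
    r_xy = corr.loc[predictors, target].to_numpy()
    return pd.Series(np.linalg.solve(r_xx, r_xy), index=predictors)

print(pair_counts)
print(standardized_betas(pairwise_corr, predictors))
print(standardized_betas(listwise_corr, predictors))
```

Because each cell of the pairwise matrix can come from a different subset of respondents, there is no single base size to report, and the matrix is not guaranteed to be positive definite.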

Mean substitution
Be careful here! Mean substitution for missing values is a very attractive option since it's easy to invoke -- just push a computer key -- and dramatically increases the base size on which these personnel and/or other decisions are made. The mean substitution option fills in the arithmetic mean value for everyone who did answer a given rating for the void existing for those who did not. Thus, everyone is assumed to be "average" on anything that they failed to answer.
Then why be careful? First, if you blithely select mean substitution without any filtering of the data, the mean on the dependent variable, here overall opinion, is also substituted for those who didn't answer it. You will then be running regressions that include a substantial number of people who did not give a rating on the criterion measure -- be they no longer customers, no longer product users or whatever. See column D for this type of mean substitution.
O.K., let's say you're alert enough to run the mean substitution option on only those who gave an answer to the overall opinion question. The results, in column E of Table 1, still include several respondents who answered only one or two of the independent variable ratings, which may cause an eyebrow or two to be raised if the results are broadcast.
Finally, let's look at more intelligent mean substitution. You need to ask yourself, "How many questions should a respondent answer to convince me that they have a grasp of the interview?" For the data which we are looking at, the answer to this (arbitrarily) was set at eight. Then, mean substitution was used for those who met two criteria. First, there had to be a valid answer to the overall opinion question. Second, there had to be at least eight legal answers to the 10 predictor attribute ratings. The regression coefficients are shown in column F of the table.
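The three flavors of mean substitution described above can be sketched in Python as follows. The data are invented, the threshold `MIN_ANSWERED` is set to 2 of 3 here only because the toy example has three predictors (the article used 8 of 10), and the sketch assumes the substituted means are computed from the respondents who remain after filtering, which the article does not specify.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 5 respondents, 3 ratings and an overall opinion.
ratings = pd.DataFrame({
    "X1": [5, np.nan, 3, np.nan, 2],
    "X2": [4, 4, np.nan, np.nan, 1],
    "X3": [5, 3, 3, 2, np.nan],
    "overall": [5, 4, np.nan, 2, 1],
})
predictors = ["X1", "X2", "X3"]
MIN_ANSWERED = 2  # assumption for this toy example only

def mean_substitute(df):
    """Fill each missing value with the mean of those who answered that item."""
    return df.fillna(df.mean())

# Column D style: blanket substitution, dependent variable included.
blanket = mean_substitute(ratings)

# Column E style: only respondents who gave an overall opinion.
has_target = ratings[ratings["overall"].notna()]
target_filtered = mean_substitute(has_target)

# Column F style: overall opinion present AND enough predictors answered.
n_answered = has_target[predictors].notna().sum(axis=1)
enough = has_target[n_answered >= MIN_ANSWERED]
screened = mean_substitute(enough)

print(len(blanket), len(target_filtered), len(screened))
```

Each filter shrinks the base, but it also removes respondents whose "answers" would otherwise be mostly manufactured.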

Respondent mean substitution
Many feel that the major drawback to using the automatic mean substitution option is that an individual with missing values is treated like everyone else; the mean of all who did answer is substituted as the value for those who did not, as already noted, variable-by-variable. Respondent mean substitution treats each individual as an independent entity; the mean for the questions that were answered (which may require some reverse coding) for each individual respondent is substituted for the value(s) for which there is no answer for that respondent and that respondent only. This, then, makes use of scale usage differences between individuals or genuinely different (average) ratings on the independent variables between individuals. As before, the resulting regression may be run irrespective of the number of ratings which a respondent did answer, but in column G you'll find the results of substituting the respondents' own mean for items which had no answer for, as before, those who answered at least eight of the predictors and also gave an overall opinion rating.
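Here is a minimal Python sketch of respondent mean substitution with the same two eligibility criteria. The data, the helper `respondent_mean_substitute` and the threshold of 3 of 4 predictors are hypothetical stand-ins for the article's 8 of 10, and the sketch assumes all scales already run in the same direction, so no reverse coding is shown.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 4 respondents, 4 ratings and an overall opinion.
ratings = pd.DataFrame({
    "X1": [5, np.nan, 2, np.nan],
    "X2": [4, 4, np.nan, np.nan],
    "X3": [5, 3, 1, 2],
    "X4": [4, 5, 3, np.nan],
    "overall": [5, 4, 1, 2],
})
predictors = ["X1", "X2", "X3", "X4"]
MIN_ANSWERED = 3  # assumption for this toy example only

def respondent_mean_substitute(df, predictors):
    """Fill each respondent's gaps with that respondent's own mean rating."""
    out = df.copy()
    row_means = out[predictors].mean(axis=1)  # NaN values are skipped
    # fillna with a Series aligns on the row index, so each respondent
    # receives his or her own mean rather than the column mean.
    out[predictors] = out[predictors].apply(lambda col: col.fillna(row_means))
    return out

# Keep respondents with an overall opinion and enough answered predictors.
n_answered = ratings[predictors].notna().sum(axis=1)
eligible = ratings[ratings["overall"].notna() & (n_answered >= MIN_ANSWERED)]

filled = respondent_mean_substitute(eligible, predictors)
print(filled)
```

In this toy example the second respondent's missing X1 becomes 4.0, that person's own average, whereas ordinary mean substitution over the same eligible respondents would have filled in 3.5.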

Are we done yet?
Just about. We'll leave perusal of Table 1 to you, the reader, during your scarce leisure time. Note, however, that there are some common and uncommon threads between the columns. Depending on your actual application of regression analysis, none of these differences may be daunting at all. Certainly, in some applications they are somewhat scary.
It should be obvious by now that there are still other analytical variations, such as using the pairwise option on the respondent mean substitution data. That's not the point. The important conclusion to draw from the above mathematical manipulations is that it is essential for the analyst to know exactly which options are used on any regression analysis before blindly trying to implement the results, whether they be for sales force compensation, new product share forecasting, brand image analysis or whatever. As always, clear, careful, concise communication is what it's all about. And please, please don't use total mean substitution just to be able to show a regression base equal to the number of questionnaires in hand. While that sounds like a no-brainer, it has been done.
