A comparison of missing value options in regression analysis

Abstract

Regression analysis is one tool for evaluating customer satisfaction measurement. Non-response is problematic for multiple regression analysis because most software discards all of a respondent’s data when it encounters a missing value. This article discusses options for coping with item non-response in regression runs, comparing run results based on a real data set.

Editor's note: Gary M. Mullet, Ph.D., is president of Gary Mullet Associates, Lawrenceville, Ga.

Whenever you manage to get off the telephone long enough to even glance at your in-box, you're sure to notice that a large amount of correspondence deals with various facets of customer satisfaction measurement (CSM). It also seems that more and more promotion, compensation and retention decisions are based, at least in part, on the results of CSM studies.

One tool, although certainly not the only one, for evaluating such studies is regression analysis. As readers of this column are aware, regression analysis is certainly widely used in other types of marketing research studies. One bugaboo of multiple regression analysis is item non-response. When (most) computer packages encounter a missing value, they pitch all of the other data from the given respondent, by default.

There are various options for coping with item non-response in regression runs. We will compare the results of some of these below, using a real, albeit disguised, data set. If your livelihood depends on the results of a CSM study, you should be interested in the differing conclusions which may be drawn from these comparisons. All of the results reported below use a 95 percent confidence (5 percent significance) criterion and stepwise regression runs. There are certainly myriad other options available which are not examined below.

Listwise deletion

As already noted, the default option in most programs is listwise deletion. In a very small nutshell, this means that if a respondent fails to answer even one of the many ratings, that respondent ceases to exist for the regression in question. As a case in point, a recent regression on 1200+ respondents yielded not a single valid case for a regression trying to use only 15 (out of 60-some) independent variables to predict overall satisfaction. While this is extreme, it is not unusual to lose 50 percent or more of the respondents to item non-response. Thus, conclusions (and compensation) may be based on fewer than half of the respondents in your carefully designed study!
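To make the mechanics concrete, here is a minimal sketch of listwise deletion in plain Python. The six respondents and their ratings are invented for illustration (they are not the article's data), and `None` is assumed to mark an unanswered item.

```python
# Toy data, invented for illustration: None marks an unanswered item.
# Each row is one respondent: [X1, X2, X3, overall].
rows = [
    [7, 8, None, 8],
    [5, None, 6, 5],
    [9, 9, 8, 9],
    [None, 4, 5, 4],
    [6, 7, 7, 6],
    [8, 6, 7, None],
]


def listwise_delete(data):
    """Keep only respondents with no missing value on any variable."""
    return [row for row in data if all(value is not None for value in row)]


complete = listwise_delete(rows)
retained_share = len(complete) / len(rows)
print(f"{len(complete)} of {len(rows)} respondents kept ({retained_share:.0%})")
```

Each of the four discarded respondents skipped only one item, yet every one of their other answers is lost to the regression.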

Our example comes from a data set of 500 respondents who were asked 10 ratings that were potentially related to an overall opinion measure. For proprietary reasons, the 10 scales used for the independent variables will be denoted below as X1, X2, . . .X10, rather than given more meaningful labels. The results of the first regression, using the listwise (default) option, are noted in Table 1 under column A.

As a variation on listwise deletion, some analysts use a portion of the column A results only to see which set of variables is significant and then instruct the computer to run another regression, using only those attributes and pretending that the others don't exist. This can accomplish a couple of things. First, almost assuredly, the base size will increase since fewer variables require answers from everyone. Secondly, (partial) regression coefficient magnitudes may change, as well as order of entry of the variables - just look below. In some cases, attributes that are statistically significant in the first pass through the data will not be so in this second pass. The results from this "variable screening" analysis are listed under column B.
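The base-size effect of this second pass can be sketched as follows. The data are invented, and the choice of X1 as the only "significant" attribute is purely hypothetical; in practice that list would come from the first stepwise run.

```python
# Toy data, invented for illustration: None marks an unanswered item.
rows = [
    {"X1": 7, "X2": 8, "X3": None, "overall": 8},
    {"X1": 5, "X2": None, "X3": 6, "overall": 5},
    {"X1": 9, "X2": 9, "X3": 8, "overall": 9},
    {"X1": None, "X2": 4, "X3": 5, "overall": 4},
    {"X1": 6, "X2": 7, "X3": 7, "overall": 6},
    {"X1": 8, "X2": 6, "X3": 7, "overall": None},
]


def complete_cases(data, columns):
    """Listwise deletion restricted to the named columns."""
    return [r for r in data if all(r[c] is not None for c in columns)]


all_vars = ["X1", "X2", "X3", "overall"]
# Hypothetical screening result: assume only X1 was significant in pass one.
screened_vars = ["X1", "overall"]

first_pass_base = len(complete_cases(rows, all_vars))
second_pass_base = len(complete_cases(rows, screened_vars))
print(first_pass_base, second_pass_base)
```

Respondents who skipped only X2 or X3 rejoin the base in the second pass, which is why coefficients and order of entry can shift between the two runs.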

Pairwise option

In this variation of regression, attributes are (essentially) looked at two-by-two (sounds like Noah's Ark). Without beating anyone over the head with statistical theory, invoking this option changes the matrix upon which the computer program operates to find the estimated regression coefficients: each entry in that matrix is computed from all respondents who answered both of the two items involved, so different entries may rest on different base sizes. The results of the pairwise option are under column C, in Table 1.
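A rough sketch of the idea, under stated assumptions: the data are invented, `pairwise_corr` is a helper written for this illustration (not any package's routine), and only two predictors are used so that the standardized coefficients can be written in closed form from the three correlations.

```python
from math import sqrt


def pairwise_corr(x, y):
    """Pearson r from respondents who answered both items; returns (r, n)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy), n


# Toy ratings, invented for illustration: None marks an unanswered item.
x1 = [7, 5, 9, None, 6, 8, 4]
x2 = [8, None, 9, 4, None, 6, 5]
overall = [8, 5, 9, 4, 6, None, 4]

r_12, n_12 = pairwise_corr(x1, x2)
r_1y, n_1y = pairwise_corr(x1, overall)
r_2y, n_2y = pairwise_corr(x2, overall)

# Standardized coefficients for the two-predictor case, computed
# directly from the pairwise correlation matrix.
beta_1 = (r_1y - r_12 * r_2y) / (1 - r_12 ** 2)
beta_2 = (r_2y - r_12 * r_1y) / (1 - r_12 ** 2)

print("base sizes per correlation:", n_12, n_1y, n_2y)
print("standardized coefficients:", round(beta_1, 3), round(beta_2, 3))
```

Only three of these seven respondents would survive listwise deletion, whereas each pairwise correlation here draws on four or five of them. The price is that no single base size describes the whole regression.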

Mean substitution

Be careful here! Mean substitution for missing values is a very attractive option since it's easy to invoke - just push a computer key - and dramatically increases the base size on which these personnel and/or other decisions are made. For each rating, the mean substitution option computes the arithmetic mean among everyone who did answer and fills that value into the void left by those who did not. Thus, everyone is assumed to be "average" on anything that they failed to answer.

Then why be careful? First, if you blithely select mean substitution without any filtering of the data, the mean on the dependent variable, here overall opinion, is also substituted for those who didn't answer it. You will then be running regressions that include a substantial number of people who did not give a rating on the criterion measure - be they no longer customers, no longer product users or whatever. See column D for this type of mean substitution.
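The pitfall is easy to reproduce. The sketch below uses invented data and a hand-written `mean_substitute` helper (an assumption for illustration, not a particular package's option) applied blindly to every column, the dependent variable included.

```python
from statistics import mean

# Toy data, invented for illustration: None marks an unanswered item.
rows = [
    {"X1": 7, "X2": 8, "overall": 8},
    {"X1": 5, "X2": None, "overall": 5},
    {"X1": None, "X2": 4, "overall": None},  # never rated overall opinion
    {"X1": 9, "X2": 9, "overall": 9},
]


def mean_substitute(data, columns):
    """Fill each missing value with the mean of those who answered that item."""
    means = {c: mean(r[c] for r in data if r[c] is not None) for c in columns}
    return [
        {**r, **{c: means[c] for c in columns if r[c] is None}}
        for r in data
    ]


# Blanket substitution: the dependent variable is filled in as well.
filled = mean_substitute(rows, ["X1", "X2", "overall"])
print(len(filled), "respondents in the regression base")
print("third respondent's overall rating is now", round(filled[2]["overall"], 2))
```

The third respondent, who gave no overall opinion at all, now enters the regression with a manufactured criterion score equal to the sample mean.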

O.K., let's say you're alert enough to run the mean substitution option on only those who gave an answer to the overall opinion question. The results, in column E of Table 1, still include several respondents who answered only one or two of the independent variable ratings, which may cause an eyebrow or two to be raised if the results are broadcast.

Finally, let's look at more intelligent mean substitution. You need to ask yourself, "How many questions should a respondent answer to convince me that they have a grasp of the interview?" For the data which we are looking at, the answer to this (arbitrarily) was set at eight. Then, mean substitution was used for those who met two criteria. First, there had to be a valid answer to the overall opinion question. Second, there had to be at least eight legal answers to the 10 predictor attribute ratings. The regression coefficients are shown in column F of the table.
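The two criteria translate into a simple filter applied before substitution. This is a sketch under assumptions: the five respondents are invented, the names `eligible` and `filtered_mean_substitution` are made up for the illustration, and the item means are taken over the respondents who pass the filter (a package might instead compute them before filtering).

```python
from statistics import mean

PREDICTORS = [f"X{i}" for i in range(1, 11)]  # X1 .. X10
DEPENDENT = "overall"
MIN_ANSWERED = 8  # the (arbitrary) threshold used in the text


def eligible(row):
    """Valid overall rating and at least MIN_ANSWERED predictor ratings."""
    answered = sum(row[p] is not None for p in PREDICTORS)
    return row[DEPENDENT] is not None and answered >= MIN_ANSWERED


def filtered_mean_substitution(data):
    """Drop ineligible respondents, then fill gaps with item means."""
    kept = [r for r in data if eligible(r)]
    # Assumption: item means are computed among the retained respondents.
    means = {p: mean(r[p] for r in kept if r[p] is not None) for p in PREDICTORS}
    return [
        {**r, **{p: means[p] for p in PREDICTORS if r[p] is None}}
        for r in kept
    ]


def make_row(values, overall):
    return {**dict(zip(PREDICTORS, values)), DEPENDENT: overall}


N = None  # shorthand for an unanswered item
data = [  # toy data, invented for illustration
    make_row([7, 8, 6, 7, 8, 7, 6, 8, 7, 7], 8),  # complete
    make_row([5, N, 6, 5, N, 5, 6, 5, 4, 5], 5),  # 8 answered: kept
    make_row([9, N, N, N, 8, 9, 9, 8, 9, 9], 9),  # 7 answered: dropped
    make_row([6, 7, 6, 7, 6, 7, 6, 7, 6, 7], N),  # no overall: dropped
    make_row([N, 4, 5, 4, 5, 4, N, 4, 5, 4], 4),  # 8 answered: kept
]

result = filtered_mean_substitution(data)
print(len(result), "of", len(data), "respondents retained")
```

Changing `MIN_ANSWERED` is the one judgment call here, and it should be reported along with the results.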

Respondent mean substitution

Many feel that the major drawback to using the automatic mean substitution option is that an individual with missing values is treated like everyone else; the mean of all who did answer is substituted as the value for those who did not, as already noted, variable-by-variable. Respondent mean substitution treats each individual as an independent entity; the mean for the questions that were answered (which may require some reverse coding) for each individual respondent is substituted for the value(s) for which there is no answer for that respondent and that respondent only. This, then, makes use of scale usage differences between individuals or genuinely different (average) ratings on the independent variables between individuals. As before, the resulting regression may be run irrespective of the number of ratings which a respondent did answer, but in column G you'll find the results of substituting the respondents' own mean for items which had no answer for, as before, those who answered at least eight of the predictors and also gave an overall opinion rating.
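A small sketch shows how this differs from item-mean substitution. The two respondents are invented, the helper name is hypothetical, and all four scales are assumed to run in the same direction already (that is, any reverse coding has been done beforehand).

```python
from statistics import mean

PREDICTORS = ["X1", "X2", "X3", "X4"]


def respondent_mean_substitution(row, predictors=PREDICTORS):
    """Fill a respondent's gaps with the mean of that respondent's own answers.

    Assumes at least one predictor was answered; mean() raises otherwise.
    """
    own_mean = mean(row[p] for p in predictors if row[p] is not None)
    return {**row, **{p: own_mean for p in predictors if row[p] is None}}


# Toy data, invented for illustration: one generous rater, one harsh rater.
high_rater = {"X1": 9, "X2": None, "X3": 8, "X4": 10, "overall": 9}
low_rater = {"X1": 3, "X2": 2, "X3": None, "X4": 4, "overall": 3}

filled_high = respondent_mean_substitution(high_rater)
filled_low = respondent_mean_substitution(low_rater)

print("high rater's X2 becomes", filled_high["X2"])
print("low rater's X3 becomes", filled_low["X3"])
```

Item-mean substitution would have given both respondents the same sample-wide "average" values; here each filled value reflects that person's own use of the scale.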

Are we done yet?

Just about. We'll leave perusal of Table 1 to you during your scarce leisure time. Note, however, that there are some common and uncommon threads between the columns. Depending on your actual application of regression analysis, none of these differences may be daunting at all. Certainly, in some applications they are somewhat scary.

It should be obvious by now that there are still other analytical variations, such as using the pairwise option on the respondent mean substitution data. That's not the point. The important conclusion to draw from the above mathematical manipulations is that it is essential for the analyst to know exactly which options are used on any regression analysis before blindly trying to implement the results, whether they be for sales force compensation, new product share forecasting, brand image analysis or whatever. As always, clear, careful, concise communication is what it's all about. And please, please don't use total mean substitution just to be able to show a regression base equal to the number of questionnaires in hand. While that sounds like a no-brainer, it has been done.