Editor's note: Mike Fassino, Ph.D., is president of EnVision Knowledge Products, Media, Pa. This is the first installment of a three-part series on neural networks.

Throughout the 20th century, statistical models and procedures have dominated the practice of quantitative market research. The popularity of conjoint analysis and perceptual mapping procedures as well as the ubiquity of t- and Chi-square tests in the major crosstab packages attest to our hunger for the seeming "precision" of statistics. In some fundamental ways, however, statistical models are anachronistic, having been developed in an era when calculating was extraordinarily expensive but thinking was not. Today, the economics are reversed: the pervasiveness of personal computers has lowered the cost of a calculation to virtually nothing, so the value of market research has come to lie in thinking about what the calculations mean rather than doing the calculations. While our calculating machinery has vastly improved, most of the statistical techniques market researchers rely on continue to be:

  • linear
  • orthogonal
  • normal
  • correlational rather than causal

These four assumptions of classical statistics - linearity, orthogonality, normality and acausality - provide convenient shortcuts to calculating. For example, if I want to assess the nature and strength of the relationship between five or six independent variables (such as components of customer satisfaction) and a single dependent variable (such as overall satisfaction), and I am willing to believe that the relationship is linear, that the independent variables are not correlated with each other and that the errors of the model are normally distributed, I could do the calculating by hand. If, however, I allow for nonlinearity, collinearity and non-normality, the calculations become enormously complex and I would have to be willing to spend years calculating this one problem by hand. With a reasonably good PC, however, those years of calculations can be performed in a few minutes.
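
To make the contrast concrete, here is a minimal sketch in Python of the regression just described: five component-satisfaction ratings predicting overall satisfaction, fit by the normal equations. The data, weights and sample size are invented purely for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    n_respondents = 200

    # Hypothetical 1-to-10 ratings on five components of satisfaction
    X = rng.integers(1, 11, size=(n_respondents, 5)).astype(float)

    # Hypothetical overall satisfaction: a noisy linear blend of the components
    true_weights = np.array([0.4, 0.3, 0.1, 0.1, 0.1])
    y = X @ true_weights + rng.normal(0.0, 0.5, n_respondents)

    # Ordinary least squares via the normal equations: the calculation that once
    # took days by hand now takes a fraction of a second on a PC
    X1 = np.column_stack([np.ones(n_respondents), X])   # add an intercept column
    beta = np.linalg.solve(X1.T @ X1, X1.T @ y)          # solve (X'X)b = X'y
    print("Estimated intercept and component weights:", np.round(beta, 3))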

In this series of three articles, we will examine the application of neural networks to the analysis of quantitative market research data. Neural networks approach data analysis from a different perspective than classical statistical procedures: they are far more computer-intensive, but they let the researcher spend more time thinking about what the results mean instead of transforming variables so that the data (almost) fit the assumptions and preconceived constraints of a statistical procedure. Neural networks rarely "care" whether the classical assumptions are met; they rely on brute-force calculating rather than statistical theory to solve analytic problems.

As the name implies, neural networks were, originally at least, concerned with designing software to emulate the way the nervous system works. As we all know, the human brain contains billions of nerve cells, called neurons, that communicate with each other through electrical and chemical impulses. Each nerve cell integrates the impulses arriving from neighboring nerve cells and decides, based on that synthesis, whether to send an impulse of its own. Today, emulating the way the brain works is still an important area of neural network research and development. It has been demonstrated, however, that certain classes of neural networks are adept at solving difficult statistical problems. Whether they simulate or emulate the way the brain works is not relevant - they are useful analytic tools, apart from their neurobiological ancestry. The most extensively used type of neural network is called back-propagation and it is the focus of this article.

Back-propagation

Back-propagation is a fundamental neural network architecture, and understanding it is essential for understanding the self-organizing map and time-series forecasting networks that we cover in the two companion articles.

Figure 1 shows the essential features of a back-propagating neural network. There are three components, labeled A through C. Component A is the input layer. Here the neural network obtains information about the value of the independent or input variables, such as ratings of customers' satisfaction with various facets of a company's product or service. Component B contains a series of hidden units. This is where the neural network's heavy calculations occur. Most of this article is about what goes on in the hidden layer. Component C is the output layer, where the neural network provides its estimate of the dependent or outcome variable's value, such as a rating of overall satisfaction. The data requirements for a back-propagating neural network are:

  • values of the independent variables
  • values for the dependent or outcome variable.

Because values of both the independent and dependent variable are provided to the network, this particular type of neural network is known as a supervised learning network. In the companion articles, we will encounter two other types of networks where the network is not given values of both the independent and dependent variables: unsupervised learning and reinforcement learning.

Both the independent and dependent variable(s) can be categorical, ordinal, interval or ratio scaled. Rather than solving for the relationship between input and output variables, as is done in regression or conjoint analysis, a back-propagating neural network is trained on, or "learns," the relationship. Because the relationship is simply learned, very few assumptions are made about its form - it can be highly nonlinear and nonstationary (i.e., changing over time), and the independent variables can be arbitrarily correlated (three conditions that cause linear statistical techniques considerable trouble).
Notice in Figure 1 that there are lines connecting each of the independent variables to each of the processing units in the hidden layer. Similarly, there are lines connecting each processing unit in the hidden layer to the output layer. Each of these lines represents a weight. This network is known as fully connected since each and every independent variable connects to each and every unit in the hidden layer and each processing unit in the hidden layer connects to the output layer, but there is no direct connection from the input to the output layer: the relationship between inputs and output is completely mediated by the hidden layer.
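
As a concrete picture of this architecture (a sketch of my own, not the software used for the analyses below), the following Python fragment sets up the two weight matrices of a small fully connected network. Every line in Figure 1 corresponds to one entry in these matrices; the layer sizes here are arbitrary.

    import numpy as np

    n_inputs = 6    # e.g., six facets of satisfaction rated by each customer
    n_hidden = 4    # number of processing units in the hidden layer (analyst's choice)
    n_outputs = 1   # e.g., overall satisfaction

    rng = np.random.default_rng(1)

    # Each input connects to each hidden unit, and each hidden unit to the output;
    # every one of those connections is a weight.
    W_input_to_hidden = rng.normal(0.0, 0.1, size=(n_inputs, n_hidden))
    W_hidden_to_output = rng.normal(0.0, 0.1, size=(n_hidden, n_outputs))

    print("Input-to-hidden weights:", W_input_to_hidden.shape)    # (6, 4)
    print("Hidden-to-output weights:", W_hidden_to_output.shape)  # (4, 1)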

The process of learning is achieved by slowly and systematically adjusting each of the weights (remember, each line represents a weight) so that the network's estimate of the output variable, based on a weighted combination of the input variables' values, closely matches the actual output variable's value. The process of slowly and systematically adjusting the weights is referred to as training.

Back-propagation works through four steps:

Step 1: Obtain values for the independent (input) variables by randomly selecting a line of data from the database.

Step 2: Calculate an estimate for the dependent (output) variable. The network's estimate of the dependent variable is a weighted combination of the input values. Each processing unit in the hidden layer multiplies each of the independent variables' values by its own idiosyncratic weight and then adds all of these together, so each processing unit in the hidden layer takes on a single value composed of the weighted sum of its inputs. These single values are then transmitted from the hidden layer to the output layer. As implied in Figure 1, the single values from the hidden layer are also weighted by the output layer's idiosyncratic weights on their way from the hidden to the output layer. So the output layer's estimate of the dependent variable is a weighted sum of the hidden layer's output (which, in turn, is a weighted sum of the input variables).

Step 3: The network's estimate of the dependent variable is compared to the actual dependent variable. Any difference between these two is called error. The network then reverses its flow, sending information about the magnitude and direction of the error downward. The information sent downward through the network (i.e., propagated backwards, or back-propagated) tells the network how much each weight should be modified so as to minimize the error.

Step 4: Once all the weights have been adjusted, a new line of data is randomly selected and the process of feeding an estimate of the dependent variable forward and information on the error backward repeats itself. After a number of such iterations, the error either reaches zero, in which case the network's estimate of the dependent variable is equal to the actual dependent variable for all cases in the database, or (as is more often the case) there is no more improvement and the error rate stays at the same, hopefully small but non-zero level, for all successive iterations.
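
The following Python sketch walks through the four steps on a small synthetic data set. The particular choices it makes - a single sigmoid hidden layer, squared error, a fixed learning rate and adjusting the weights after every line of data - are illustrative assumptions rather than requirements of the method.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(2)

    # Synthetic "database": 50 cases, 6 input variables, 1 output variable
    X = rng.normal(size=(50, 6))
    y = np.sin(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]   # a deliberately nonlinear target

    n_hidden = 8
    W1 = rng.normal(0.0, 0.5, size=(6, n_hidden))   # input-to-hidden weights
    W2 = rng.normal(0.0, 0.5, size=n_hidden)        # hidden-to-output weights
    learning_rate = 0.05                            # an arbitrary, illustrative value

    for presentation in range(5000):
        # Step 1: randomly select one line of data from the database
        i = rng.integers(len(X))
        x_i, y_i = X[i], y[i]

        # Step 2: feed forward - weighted sums through the hidden layer to the output
        hidden = sigmoid(x_i @ W1)      # each hidden unit takes on a single value
        estimate = hidden @ W2          # the network's estimate of the output variable

        # Step 3: compare the estimate to the actual value and send the error back,
        # computing how much each weight should change to reduce the error
        error = estimate - y_i
        grad_W2 = error * hidden
        grad_W1 = np.outer(x_i, error * W2 * hidden * (1.0 - hidden))

        # Step 4: adjust all the weights a little, then repeat with a new random line
        W2 -= learning_rate * grad_W2
        W1 -= learning_rate * grad_W1

    final_error = np.mean((sigmoid(X @ W1) @ W2 - y) ** 2)
    print("Mean squared error after training:", round(final_error, 4))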

In the discussion of these four steps we have left out a tremendous amount of detail, such as whether the weights are adjusted after every single line of data or only after errors have been accumulated over many observations and a more complete picture of the error is back-propagated, or how the network "knows" exactly how much each of the many weights should be adjusted at any given iteration.

The references in the bibliography at the end of the third installment will provide mind-numbing details on these important but highly technical issues.

A critical turning point in neural network research came in 1989, when it was proved with full mathematical rigor that a neural network like that shown in Figure 1, operating through these four steps, could approximate a nonlinear function of any degree of complexity to any desired level of precision, as long as no limit was placed on the number of hidden units. This proof established that back-propagating neural networks are universal approximators: even the most complicated nonlinear function can be accurately modeled by a neural network with enough processing units in the hidden layer. Deciding how many hidden units to use and how best to preprocess the independent variables is still very much part of the art of deploying neural networks.
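
As a small illustration of the universal-approximation idea (again on invented data), the sketch below fits a deliberately wiggly function with the same kind of back-propagation network at several hidden-layer sizes; the fit improves as hidden units are added. The function, sample size and settings are arbitrary assumptions.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(3)
    X = rng.uniform(-3, 3, size=(400, 1))
    y = np.sin(3 * X[:, 0]) + 0.3 * X[:, 0] ** 2    # a complicated nonlinear target

    # The same back-propagation architecture, with more and more hidden units
    for n_hidden in (2, 8, 32):
        net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=5000, random_state=0)
        net.fit(X, y)
        print(n_hidden, "hidden units, R-squared on the training data:",
              round(net.score(X, y), 3))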

We will now apply a neural network to two familiar market research problems. Our first example is a conjoint analysis presented in Paul Green's classic text Research for Marketing Decisions. Green presents an orthogonal main-effects plan for a new carpet cleaner described by five product factors (package design, brand name, price, Good Housekeeping seal and money-back guarantee) using 18 cards. Table 1 shows the design along with one respondent's ranking of preference for each of the 18 product configurations.

The neural network shown in Figure 1 is reproduced in Figure 2 to illustrate its connection to the conjoint design (note that the input layer has a node representing each of the product factors). The network was trained using the data shown in Table 1. Following the four steps outlined above, lines of data were randomly presented to the network and information on the error - the difference between the network's estimate of the preference rank and the actual value - was propagated backward through the network and the weights adjusted.

Figure 3 shows the correlation between the actual value of the dependent variable and the network's estimate at various points in the training process. The figure is quite typical of the course of learning: at the beginning, the network has very bad estimates of the dependent variable. As learning proceeds, the weights quickly come into alignment so that after 161 presentations, the correlation is 0.9865 and by the 233rd presentation, the correlation is perfect. The perfect correlation implies that the network exactly reproduces the 18 respondent rankings. Table 2 shows the neural network's estimate of each card's rank, as well as the rank estimated with a simple linear regression. The matrix on the right of Table 2 shows that the correlation (R) between the ranks and the regression model is 0.987 (i.e., R2=0.974), while the R2 for the neural net is 1.0.

Since each of the lines in Figure 2 represents a weight, it is a relatively simple matter to "look inside" and discover what the network has learned. For example, Figure 4 compares what the neural network and the linear regression "learned" about price. Notice that the neural network learned a nonlinear mapping between price and preference; by being forced to model this relationship linearly, the regression model's R2 suffered. Figure 5 shows that the neural net found that Designs B and C had a slightly greater positive impact on preference than the conjoint model estimated. Finally, Figure 6 illustrates that a neural network can easily provide a measure of the relative importance of the factors, just as conjoint analysis does. Figure 6 was derived by examining the network weights. Because the neural network provides a better fit to the data, there are minor differences in relative importance, although the rank order is preserved.
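
How the categorical factor levels are turned into numbers for the input layer is a preprocessing decision; one common possibility is dummy (one-hot) coding, sketched below in Python with hypothetical level names standing in for the Table 1 design.

    import numpy as np

    # Hypothetical factor levels standing in for the Table 1 design
    factors = {
        "package design": ["A", "B", "C"],
        "brand name": ["brand 1", "brand 2", "brand 3"],
        "price": ["low", "medium", "high"],
        "Good Housekeeping seal": ["no", "yes"],
        "money-back guarantee": ["no", "yes"],
    }

    def card_to_inputs(card):
        """Turn one product profile (factor -> level) into a 0/1 input vector."""
        vector = []
        for factor, levels in factors.items():
            vector.extend(1.0 if card[factor] == level else 0.0 for level in levels)
        return np.array(vector)

    example_card = {
        "package design": "B",
        "brand name": "brand 2",
        "price": "low",
        "Good Housekeeping seal": "yes",
        "money-back guarantee": "no",
    }
    print(card_to_inputs(example_card))   # 13 inputs in all: 3 + 3 + 3 + 2 + 2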

We will conclude with one more brief example. In this case 1,250 farmers who used a particular product were asked to rate their overall satisfaction with the product on a 10-point scale and to rate the product's performance on five attributes. Farmers were divided into two groups, satisfied and unsatisfied, based on the distribution of the overall satisfaction score. We randomly selected 825 respondents and trained a neural network to predict whether a farmer would be in the satisfied or unsatisfied group from the five component satisfaction scores. We also ran these 825 respondents through discriminant analysis. We then used the neural network and the discriminant functions to predict into which group, satisfied or unsatisfied, the remaining 425 respondents fell.

The following table compares the accuracy of the two procedures:

                            Percent Correctly Classified
                               Training     Testing
Neural Net                        100          94
Discriminant Analysis              86          71
N                                 825         425

Although it is beyond the scope of this article to fully explore the superior performance of the back-propagating neural network in this simple classification task, even a superficial analysis of the data reveals the main reasons, illustrated in the sketch that follows the list:

1. The five component satisfaction scores are very highly correlated with each other, causing the discriminant analysis to suffer from multicollinearity. If the pattern of correlation in the test sample is not identical, the discriminant function's predictions falter.

2. The covariance of the five components is different for satisfied and unsatisfied customers, thus violating one of the basic assumptions of discriminant analysis.

3. The components interact: respondents who think the product does very well on two attributes are much more satisfied than would be expected by looking at respondents who are satisfied with either, but not both, of the components.

4. The relationship between satisfaction and some of the components is markedly nonlinear: if performance is below a certain threshold, satisfaction is very low. Satisfaction then increases slowly before "skyrocketing." This complex non-linearity is completely lost on the discriminant analysis whereas the neural network learns it within the first few hundred presentations of data.
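
These four properties are easy to mimic with synthetic data. The Python sketch below sets up a comparison of the same general shape - correlated components, an interaction, a threshold nonlinearity, an 825-case training sample and a holdout test sample - using off-the-shelf routines. The farmer data themselves are not reproduced here, so the exact accuracy figures will differ from the table above.

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    rng = np.random.default_rng(4)
    n = 1250

    # Five highly correlated component-performance scores
    base = rng.normal(size=n)
    components = np.column_stack([base + 0.3 * rng.normal(size=n) for _ in range(5)])

    # Overall satisfaction built to include a threshold nonlinearity and an
    # interaction between two of the components, then split at the median
    score = (np.where(components[:, 0] < -1.0, -3.0, components[:, 0])
             + components[:, 1] * components[:, 2])
    satisfied = (score > np.median(score)).astype(int)

    X_train, X_test, y_train, y_test = train_test_split(
        components, satisfied, train_size=825, random_state=0)

    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=3000, random_state=0)
    net.fit(X_train, y_train)
    lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

    print("Neural net            train/test accuracy:",
          round(net.score(X_train, y_train), 2), round(net.score(X_test, y_test), 2))
    print("Discriminant analysis train/test accuracy:",
          round(lda.score(X_train, y_train), 2), round(lda.score(X_test, y_test), 2))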

While the general tone of this article has been favorable to neural networks, they are not a panacea and certainly have their own inherent limitations. Six of these limitations are severe enough to warrant careful consideration:

1. In large problems with many input variables, it is very, very difficult to determine what the network has learned because of its proclivity to find nonlinear relationships. This inability to clearly document what the network learned often leads to the network being treated as a black box - it is able to predict with very high levels of accuracy, but exactly how it does this is a mystery.

2. You usually need lots of data to adequately train a neural network.

3. You cannot (easily) calculate confidence intervals or tests of significance for the weights.

4. Neural networks are prone to "overlearning" - they tend to learn so much about a database, including the random error and noise, that when you present a new set of data with different random error characteristics, the network has trouble providing accurate predictions.

5. Some neural networks take a very long time to train. For a database with 30,000 respondents and 150 variables, it might take two to three days of constant running of a Pentium 150 MHz computer.

6. There has been no strong theoretical assessment of sampling and measurement errors for neural nets, so many of the tools statisticians have come to rely on in evaluating model performance (such as confidence intervals) are unavailable.

Even though neural networks have their own limitations and problems, their appetite for nonlinear, nonstationary and highly interacting data might just make them perfect for market research. In the next article we will describe an unsupervised learning neural network, known as the Kohonen Self-Organizing Map, and show how it can be used in market segmentation and perceptual mapping contexts.