Editor’s note: Susie Sangren is president of Clearview Data Strategy, an Ithaca, N.Y., consulting firm.

Most researchers are already familiar with univariate statistical methods. Multivariate statistics were developed for the analysis of data with more complex, multi-dimensional dependence/interdependence structures. For example:

  • The variables may be divided into a dependent group and an independent group. Researchers are interested in finding the causal relationship between the independent and dependent groups of variables. They would choose such multivariate methods as: multivariate multiple regression, discriminant analysis, conjoint analysis, crosstab ANOVA analysis, categorical analysis, and logistic regression.
  • Data may be viewed as one big group of variables serving the same purpose. Researchers are interested in the interdependence structure of the data. Their focus is to restate the original variables in an alternative way to better interpret the meaning, or to group observations into similar patterns. They would choose such multivariate methods as: principal component analysis, factor analysis, canonical correlation analysis, cluster analysis, and multidimensional scaling.
  • Data may not be normally distributed. Researchers are concerned not with making broader inferences about the population, but with analyzing the specific data at hand. They would need multivariate methods that are tolerant of non-standard, non-metric data, which is less likely to be normally distributed.

Multivariate methods are derived from univariate principles, but are more empirical because they work backward from data to conceptualization. For many marketing applications, multivariate methods can outperform univariate methods.

Marketing problems are inherently multi-dimensional, and their solutions are often inexact. For example, customer types are classified along a range of customer characteristics; stores and brands are perceived and evaluated with respect to many different attributes; creditworthiness of a credit card applicant is judged on a variety of financial information. Multivariate methods are versatile tools allowing researchers to explore for fresh knowledge in huge consumer databases. They are used for market segmentation studies, customer choice and product preference studies, market share forecasts, and new product testing. Results are used in making decisions about: strategy for target-marketing campaigns, new product or service design, and existing product refinement.

Multivariate methods are also popular among marketing professionals because of their tolerance of less-than-perfect data. The data may violate too many univariate assumptions; they may be survey data with too much variable information and not enough observations (e.g., researchers ask too many redundant questions of too few respondents); or they may have problems resulting from poor sample or questionnaire designs.

Key characteristics of multivariate procedures

The research objective should determine the selection of a multivariate method. In this article, "observations" refers to entities such as people or subjects; "variables" (sometimes called "dimensions") are the characteristics of these entities, measured in quantitative or qualitative terms. Data consists of both observations and variables.

1. Principal component analysis
Principal component analysis restates the information in one set of variables in an alternate set of variables on the same observations. Principal components are linear combinations of the original variables constructed so that they are mutually independent. The method applies an orthogonal rotation to the axes that represent the original set of variables. Orthogonality ensures the independence of all components, while preserving 100 percent of the variance (synonymous with information) in the original variables. There can be as many principal components as there are variables.
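
A minimal sketch of the idea in Python with scikit-learn; the data matrix below is simulated, and standardizing the variables first is a common (but not mandatory) choice:

```python
# Minimal PCA sketch (illustrative; assumes scikit-learn and a numeric data matrix X)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))                     # 100 observations, 6 variables
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)    # make two variables nearly redundant

Z = StandardScaler().fit_transform(X)             # standardize so no variable dominates
pca = PCA()                                       # as many components as variables
scores = pca.fit_transform(Z)                     # principal component scores (orthogonal)

# Components come out in decreasing order of variance explained;
# together they preserve 100 percent of the variance in the standardized data.
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.cumsum())
```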

1.1 Applications

  • Variable reduction. Because principal components are extracted in decreasing order of variance importance, much of the information in the original set of variables can be summarized in just the first few components. Therefore you can often drop the last few components without losing much.
  • Regression predictors. Principal component scores can be used as independent predictors in regressions, thereby avoiding collinearity problems. Because principal components are orthogonal to (independent of) each other, they are an excellent choice for regressions where the original independent variables may be highly correlated.
  • Outlier detection. Outliers are observations with different behavior from the rest of the observations. They may be caused by measurement errors, but they can exert undue influence on the regression. It is easy to find outliers in a one- or two-dimensional space defined by one or two variables. With higher dimensions defined by more variables, it becomes difficult to find their joint outliers. Principal components analysis can help locate outliers in a higher dimensional space.

1.2 Example
Do a regression analysis predicting the number of baseball wins from the following baseball statistics (many of them redundant), from the 1990 professional baseball season:

Dependent variable - number of wins.

Independent variables - batting average; number of runs; number of doubles; number of home runs; number of walks; number of strikeouts; number of stolen bases; earned run average; number of complete games; number of shutouts; number of saves; number of hits allowed; number of walks allowed; number of strikeouts; league.
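
A hedged sketch of a principal component regression along these lines in Python with scikit-learn; the team-by-statistic matrix here is a random stand-in, not the 1990 season data:

```python
# Principal component regression sketch for the baseball example
# (hypothetical data; the columns are placeholders for the statistics listed above)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n_teams, n_stats = 26, 14                    # e.g., 26 teams, 14 numeric statistics
X = rng.normal(size=(n_teams, n_stats))      # stand-in for batting average, runs, ERA, ...
wins = rng.normal(size=n_teams)              # stand-in for number of wins

# Keep only the first few components to sidestep collinearity among the statistics.
pcr = make_pipeline(StandardScaler(), PCA(n_components=4), LinearRegression())
pcr.fit(X, wins)
print("R2 on the fitted data:", pcr.score(X, wins))
```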

1.3 Limitations
It is impossible to interpret the principal component scores computed for the observations. They are merely mathematical quantities for variable transformation. In regression, use of principal components in place of original variables is solely for prediction purposes. The resulting R2 and coefficients may be significant, but the principal component scores do not have clear meanings themselves. If you want to give meaningful interpretation to the components, you are better off doing a factor analysis instead.

2. Clustering of variables or observations
Clustering is a collection of ad hoc techniques for grouping entities (either observations or variables) according to a distance measure specified by the researcher. The distance measure is a pairwise proximity between observations based on all available variables. If this distance measures similarity, such as the squared correlation, then it is the "similarity proximity." If this distance measures dissimilarity, such as the Euclidean distance, then it is the "dissimilarity proximity."

Once the choice of distance measure is made, a clustering algorithm evaluates candidate groupings of the members, each round calculating the value of an objective function (e.g., the sum-of-squared-error, SSE, between clusters) from the predetermined distance measures. It finally settles on a cluster configuration that optimizes this objective function (e.g., the configuration giving the highest between-cluster SSE, which best separates the clusters).
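
A minimal sketch in Python with scikit-learn using k-means, whose objective function is the within-cluster sum of squared Euclidean distances (equivalently, it maximizes the between-cluster SSE); the data are simulated:

```python
# Clustering sketch: k-means groups observations to minimize the within-cluster
# sum-of-squared Euclidean distances (one common choice of objective function).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=0, size=(50, 3)),
               rng.normal(loc=4, size=(50, 3))])   # two loose groups of observations

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])      # cluster membership for the first 10 observations
print(km.inertia_)          # value of the within-cluster SSE objective
```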

2.1 Variable clustering method
Variable clustering, like principal component analysis, is a technique for investigating the correlation among variables. The goal is to reduce a large number of variables to a handful of meaningful, interpretable, non-overlapping ones for further analysis. The method applies an oblique rotation to the principal-component axes and assigns each variable individually to the rotated component (axis-cluster) with which it has the highest squared multiple correlation. Oblique rotation, contrary to orthogonal rotation, permits the rotated components - and hence the clusters - to be somewhat correlated.
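
There is no direct equivalent of this oblique-rotation procedure in the common Python libraries, so the sketch below only approximates the idea: variables are clustered hierarchically using 1 - r2 as the dissimilarity between them. The data and the choice of three clusters are illustrative assumptions:

```python
# Approximate variable clustering: group variables whose squared correlations are high.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 8))                 # 200 observations, 8 variables
X[:, 4:] += X[:, :4]                          # make some variables overlap

corr = np.corrcoef(X, rowvar=False)
dist = 1.0 - corr ** 2                        # similar variables -> small distance
np.fill_diagonal(dist, 0.0)

Z = linkage(squareform(dist, checks=False), method="average")
labels = fcluster(Z, t=3, criterion="maxclust")   # ask for 3 variable clusters
print(labels)                                     # cluster assignment for each variable
```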

2.2 Clustering applications

  • Variable reduction. Variable clustering is often used for variable reduction. After dividing the variables into clusters, you can calculate a cluster score for each observation on each of the clusters. In a regression where you have an inordinate number of potential independent variables, you can do a cluster analysis first to reduce that number. Once the variable clusters are found, you can regress your dependent variable on the clusters instead of the original variables. Better yet, you can pick the variables that best represent the clusters, and use that reduced set of variables for your regression.

Unlike principal component scores, where each score is a linear combination of all the variables (and there are many scores), cluster scores are simpler to interpret. A variable either belongs or doesn’t belong in a cluster, making the interpretation a lot cleaner.

  • Grouping entities. Observations or variables are grouped based on their overall similarity. Although clustering makes no attempt to look inside the cluster members, the resulting clusters need names or labels for identification. The labeling doesn’t have to be precise: you inspect the within-cluster values of the variables and compare them with the between-cluster values of the same variables to differentiate the clusters' characters. Since the cluster labels should reflect the larger differences in those variable values, you may even discover interesting patterns in the groupings.

2.3 Example of clustering observations
The Air Force trains recruits for many jobs. It is expensive to design and administer a training program for every job. Therefore, data (variables) is collected on each of the jobs (observations), and then the jobs are clustered. Clustering enables the Air Force to design training programs for the entire clusters of jobs, not just for specific jobs.

3. Factor analysis
Factor analysis is useful for exploring and understanding the internal structure of a set of variables. It describes the linkages among a set of "observable" variables in terms of "unobservable" or "underlying" constructs called factors.

In principal component analysis, you construct new variables from all the observed variables; in factor analysis, by contrast, you reconstruct the observed variables from two types of underlying factors: the estimated common factors (shared by all observations) and the unique factors (specific to individual observations). The interest lies in the factor loadings - the coefficients for estimating the observable variables from the underlying factors.

The general form for factor analysis looks like a linear ANOVA model:

Yij = µij + Eij,  where µij = Σk ßjk Xik

and where Yij is the observed value of subject i on variable j,

µij is the part of Yij accounted for by the common underlying factors Xik,

Eij is the error term unique to subject i on variable j, after accounting for µij,

ßjk is the factor loading estimate of variable j on the unknown common factor Xk; ßjk is the same for all subjects i.

In a linear model, you are given the values for X and Y, so it is possible to solve for a unique ß and E. With a factor model, X (the common factors), ß (the factor loadings), and E (the unique factors) are all unknown and have to be estimated. You can have an infinite number of solutions for the factor loadings, all of which fit the data equally well. Therein lies the first indeterminacy of factor analysis. To obtain a unique solution, you must impose the constraint that all the common and unique factors be mutually uncorrelated.

After you estimate the factor loadings, any orthogonal rotation of these estimates to derive the X factors works equally well at preserving the variance information. For convenience, researchers often rotate the factor axes so that the resulting axes - the factors - are easier to interpret. Therein lies the second indeterminacy of factor analysis.

Because of the amount of guesswork involved in doing a factor analysis (it is a fishing expedition), you can have different findings with different groups of respondents, with different ways of obtaining data, and with different mixes of variables. The technique should only be used as an exploratory tool to help untangle badly tangled data, which should then be followed up by a confirmatory analysis like the regular ANOVA.

3.1 Application

  • Exploratory variable analysis. If you can gain insights into the underlying factors, you may be able to learn something interesting about the observable variables, or you may even derive causal relations of how these factors influence the observed variables. Also, if you can show that a small number of underlying variables can explain a large number of observed variables, then you can significantly simplify your research.

3.2 Example
A car dealer has asked for several customers’ preference ratings on a variety of car models made by Mercedes, BMW and Toyota. Using factor analysis, the researcher is able to identify three common factors that underlie all customers’ preference ratings for these cars: style-consciousness, price-consciousness, and performance-consciousness. The common factor loadings estimated for each car show the extent to which customer ratings for that car depend on the degree of the customers' preferences for these three underlying factors. For example, a researcher may discover that the rating of the Mercedes sedan loads heavily on customers’ propensity toward style-consciousness and performance-consciousness.
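
A minimal sketch of how such an analysis might be run in Python with scikit-learn; the ratings are simulated, and the number of cars, the factor structure, and the varimax rotation option (available in recent scikit-learn versions) are assumptions for illustration:

```python
# Factor analysis sketch for the car-preference example (simulated ratings;
# the three "true" factors here are planted by the simulation, not real data).
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(4)
n_customers, n_cars = 300, 9
latent = rng.normal(size=(n_customers, 3))       # style, price, performance propensities
loadings = rng.normal(size=(3, n_cars))          # how each car draws on each factor
ratings = latent @ loadings + rng.normal(scale=0.5, size=(n_customers, n_cars))

fa = FactorAnalysis(n_components=3, rotation="varimax")  # rotation aids interpretation
fa.fit(ratings)

# Estimated loadings: one row per factor, one column per car.
# Large absolute loadings show which cars' ratings depend on which factor.
print(np.round(fa.components_, 2))
```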

4. Multidimensional scaling (MDS)
MDS is a descriptive procedure for converting pair-wise proximity measurements among objects (mostly observations, occasionally variables) into a geometric representation in a two-dimensional space. The goal is to plot the objects.

The method requires a single input: a set of pair-wise proximity measures on the objects, computed from all the variable information. MDS then applies iterative optimization/transformation procedures to derive new configurations, projecting the proximities onto a lower-dimensional space. MDS deals with configurations rather than groupings of the objects. It finds the relative locations of objects in a two-dimensional space for plotting purposes.

The proximity measure can take many forms: either as an absolute value (e.g., distance between two points), or as a calculated value (e.g., correlation between two variables). These are examples of proximity measures:

  • physical distance between two locations;
  • psychological measure of similarity or dissimilarity between two products, as viewed by the subjects;
  • a measure that reflects how well two things go together; for example, two kinds of foods served in a meal.

4.1 Application

  • Geometric representation of objects, and outlier detection. MDS enables you to create a plot of points in a two-dimensional space such that distances between points reflect the degree of their similarity or dissimilarity. These points can be objects representing anything you want (e.g., brands of product, groups of people, geographic locations, political candidates). Data for MDS analysis can be metric or non-metric, and they need not be absolutely precise (in that case, you would be drawing an approximate map).

By studying the spread of the data points on a plane, you may discover unknown variables (or dimensions) that affect the similarity and dissimilarity values, or the outliers that are distant from all other points.

4.2 Example
A market research firm was interested in knowing how customers perceive the similarities between various snack foods. They selected 14 popular snack foods and asked six subjects to rank every pair of snacks. MDS was used to transform the proximity data and plot the points (snacks) on a two-dimensional space. Points that were relatively close represent the snacks that were judged to be similar by the customers.
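
A minimal sketch under the assumptions of this example (14 snacks and a precomputed pairwise dissimilarity matrix; the dissimilarities below are random placeholders, not real judgments), using scikit-learn's MDS:

```python
# MDS sketch for the snack-food example: start from a symmetric pairwise
# dissimilarity matrix and recover a two-dimensional configuration for plotting.
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(5)
n_snacks = 14
d = rng.uniform(size=(n_snacks, n_snacks))
dissimilarity = (d + d.T) / 2.0             # symmetric pairwise dissimilarities
np.fill_diagonal(dissimilarity, 0.0)

mds = MDS(n_components=2, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(dissimilarity)   # (x, y) for each snack, ready to plot
print(coords[:5])
```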

You can expand the research if you have existing sales data for each of the snacks. You can build a regression model of sales on the properties represented by the two new dimensions (e.g., saltiness and crunchiness). Each snack food receives a score on each of the dimensions. The results of the regression can help you design a new snack food with properties that fall in an area of the plot where there are no existing snack foods but that promises high sales, as predicted by the regression.

5. Discriminant analysis
Discriminant analysis is a model-based regression technique for classifying individual observations into one of several groups based on a set of "continuous" discriminator (independent) variables. The modeled (calibrated) relationship can be applied to new members to predict their group membership.

In a discriminant analysis, the researcher selects the observations first before measuring their values on the discriminator variables to avoid violations of method assumptions. Discriminant analysis is not appropriate for situations where the researcher wants to select observations to guarantee that wide ranges of values are included for the independent variables (in such a case, use logistic regression instead).

5.1 Applications

  • Discovering the discriminant function that optimally discriminates among groups, and learning how they work. Discriminant analysis and cluster analysis both form groups. In a discriminant analysis, the group membership in the sample data is known, and the procedure is concerned with finding the meaningful relationship between group memberships and the discriminators. In a cluster analysis, the groupings are not known ahead of time, and the sole purpose is to find group memberships based on a composite distance measure on all possible variables.
  • Grouping of observations. Discriminant analysis uses the linear discriminant function to predict group memberships for a set of data points.

5.2 Example
Before granting a credit card to a customer, a bank wants assurance that the potential customer is a good credit risk. The bank may build a discriminant model from people whose credit behavior is known, based on such discriminator variables as income, amount of previous credit, and length of time employed. The modeled relationship can then be applied to new applicants’ values on the discriminator variables, which are known, to predict their future credit behavior.
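
A hedged sketch of how such a model might be fit in Python with scikit-learn; the applicant data, the variable names, and the rule generating "good risk" are simulated for illustration only:

```python
# Discriminant analysis sketch for the credit example (simulated applicants).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(6)
n = 500
income = rng.normal(50, 15, n)
prior_credit = rng.normal(10, 4, n)
years_employed = rng.normal(6, 3, n)
X = np.column_stack([income, prior_credit, years_employed])

# Known credit behavior for the modeling sample (1 = good risk, 0 = bad risk).
good_risk = (0.04 * income + 0.1 * prior_credit + 0.2 * years_employed
             + rng.normal(size=n) > 3.5).astype(int)

lda = LinearDiscriminantAnalysis().fit(X, good_risk)

# Apply the fitted discriminant function to new applicants.
new_applicants = np.array([[60, 12, 8], [25, 2, 1]])
print(lda.predict(new_applicants))          # predicted group membership
print(lda.predict_proba(new_applicants))    # posterior probabilities per group
```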

5.3 Limitations
With discriminant analysis, you can examine the extent to which the groups differ on the discriminators, which you cannot with ad-hoc procedures like cluster analysis.

It is very important to cross-validate the results of a discriminant analysis. A discriminant analysis gives overly optimistic results when it classifies the same observations used to develop the functional relationship. (The misclassification rate on new data will be higher than what the model predicts.) To properly cross-validate the discriminant function, you should have enough observations in your sample that a portion of them can be used to develop the function and the other portion used to cross-validate the result.

6. Canonical correlation analysis, or multivariate multiple regression
As in univariate multiple regression, you have several independent variables in multivariate multiple regression; unlike univariate regression, however, you also have several dependent variables. The goal of multivariate multiple regression is to find the joint effect of the independent variables on all dependent variables simultaneously.

You can do a separate univariate regression for each of the dependent variables. The problems with this approach are:

  • you would have a separate series of prediction equations for each dependent variable;
  • you would have no multivariate information about the relationship of the set of independent variables with the set of dependent variables;
  • you would have no information about the relationship within the dependent variables or the relationship within the independent variables.

6.1 Canonical correlation analysis
A canonical correlation analysis enables you to discover the linear relationship between two sets of variables, without regard to which set contains the independent variables and which the dependent variables. In a multivariate multiple regression, the canonical correlation (a multivariate R2) plays the role that the squared multiple correlation (the univariate R2) plays in a univariate multiple regression.

Canonical correlation analysis is able to simplify the following problems:
  • The dependent variables may be measuring redundant information. (A subset of them, or a smaller set of their linear combinations, may be sufficient.)
  • The independent variables may be measuring redundant information. (A subset of them, or a smaller set of their linear combinations, may be sufficient.)
  • The linear combinations of the independent variables may serve as predictors of the linear combinations of the dependent variables. This reduces the complexity of the analysis and may actually give you better insight into the data.

Canonical correlation analysis does this by a redundancy analysis, finding successive linear combinations of the variables in each of the two sets such that:

  • Each linear combination of the variables in one set is independent of the previous linear combinations of variables in the same set.
  • The correlation of any pair of linear combinations between two separate sets is the highest correlation there can be, subject to the constraint that they have to be orthogonal to previously selected pairs.

6.2 Applications of canonical correlation analysis

  • Redundancy analysis. The result of the analysis is a set of canonical variables: linear combinations of the original variables that optimally correlate with the corresponding linear combinations in the other set. By examining the coefficient and correlation structures of the variables used in forming the canonical variables, you learn the proportion of variance in one set of variables that is explained by the other set. These proportions, considered together, are called redundancy statistics.

Redundancy statistics are in fact the multivariate R2 of a multivariate multiple regression. By performing a canonical analysis on two sets of variables, one set identified as independent variables and the other as dependent variables, you can calculate the redundancy statistics that estimate the proportion of variance in the dependent variable set that the independent variable set can explain.
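
A minimal sketch of canonical correlation in Python using scikit-learn's CCA; the two variable sets are simulated so that they share some common signal:

```python
# Canonical correlation sketch: two sets of variables on the same observations,
# with the canonical correlations computed from the paired canonical variates.
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(7)
n = 200
shared = rng.normal(size=(n, 2))                       # common signal linking the sets
X = np.hstack([shared, rng.normal(size=(n, 3))])       # "independent" variable set
Y = np.hstack([shared @ rng.normal(size=(2, 2)),       # "dependent" variable set
               rng.normal(size=(n, 2))])

cca = CCA(n_components=2).fit(X, Y)
Xc, Yc = cca.transform(X, Y)                           # canonical variates for each set

# Correlation between each pair of canonical variates = canonical correlations.
canon_corrs = [np.corrcoef(Xc[:, k], Yc[:, k])[0, 1] for k in range(2)]
print(np.round(canon_corrs, 3))
```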

6.3 Example

A pharmaceutical company is interested in comparing the efficacy of a new psychiatric drug against an old drug, each at three different dosage levels. Six patients are randomly assigned to one of the six drug-dose combinations.

There is a set of three dependent variables: the gain scores from three psychological tests conducted on the patients before and after the trial. These scores are the HDRS (Hamilton gain score), YBOCS (Yale-Brown gain score), and NIHS (National Institute of Health gain score). There is a set of three independent variables in the model: drug (new or old), dosage level (50, 100, or 200 mg), and prior physical condition.

7. Conjoint analysis
Conjoint analysis is used to analyze product preferences and to simulate consumer choice. It is also used to study the factors that influence consumers’ purchasing decisions. Products can be characterized by attributes such as price, color, guarantee, environmental impact, reliability, etc. Consumers typically do not have the option of buying the product that is best in every attribute, particularly when one of those attributes is price. Consumers are constantly making trade-off decisions when they purchase products (e.g., large car size means increased safety and comfort, which must be traded off with higher cost and pollution). Conjoint analysis studies these trade-offs, under the realistic condition that many attributes in a product are presented to consumers together.

Conjoint analysis is based on an additive, simple main-effect analysis-of-variance model. This model assumes no interactions among the attributes (which may be unrealistic). Data is collected by asking participants about their preference ratings for the overall products defined by a set of attributes at specified levels. Conjoint analyses are performed for each customer, but usually the goal is to summarize or average the results across all participating consumers.

For each consumer, conjoint analysis decomposes his original overall ratings into part-worth utility scores for each of the attribute levels. The total utility for a particular product (viewed as a combination of attributes and their levels) is the sum of the relevant part-worth utilities. Large utilities indicate the preferred combinations, and small utilities the less preferred combinations. The attributes with the widest utility ranges are considered the most important in predicting this consumer’s preference. The average importance of an attribute across all consumers indicates its overall importance.

A consumer’s total utilities estimated for each of the attribute-level combinations of a product are then used to simulate expected market share - the proportion of times that a product combination would be purchased by him. The maximum utility model is often used to simulate market share. The model assumes that a customer will buy with 100 percent probability the product combination for which he has the highest utility, and 0 percent for all other combinations. The probabilities for each of the product combinations are averaged across consumers to get an overall market share.
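
A minimal sketch of the maximum utility simulation, assuming the per-consumer total utilities have already been estimated (the utility matrix below is invented for illustration):

```python
# Maximum-utility market share sketch: each consumer "buys" the product
# combination with his highest total utility; averaging those 0/1 choices
# across consumers gives a simulated market share.
import numpy as np

rng = np.random.default_rng(8)
n_consumers, n_combinations = 50, 8
total_utility = rng.normal(size=(n_consumers, n_combinations))   # per-consumer utilities

best = total_utility.argmax(axis=1)                 # each consumer's top combination
choices = np.zeros_like(total_utility)
choices[np.arange(n_consumers), best] = 1.0         # 100% to the top choice, 0% elsewhere

market_share = choices.mean(axis=0)                 # average across consumers
print(np.round(market_share, 2))                    # shares sum to 1 across combinations
```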

7.1 Example
A consumer is asked to rate his preference for eight chocolate candies. The covering is either dark or milk chocolate, the center is either hard or soft, and the candy does or does not contain nuts. Ratings are on a 1 to 9 scale where 1 indicates the lowest preference and 9 the highest. Conjoint analysis is used to determine the importance of each attribute of the product to the consumer, and his utility score for each level of an attribute.
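
A hedged sketch of how the part-worth utilities for this example might be estimated with a dummy-coded, main-effects linear regression in Python; the eight ratings are hypothetical:

```python
# Part-worth sketch for the candy example: a main-effects linear model fitted
# to one consumer's ratings of the 8 candies.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Full factorial: 2 coverings x 2 centers x 2 nut levels = 8 candies.
profiles = pd.DataFrame(
    [(cov, cen, nut) for cov in ("dark", "milk")
                     for cen in ("hard", "soft")
                     for nut in ("nuts", "no nuts")],
    columns=["covering", "center", "nuts"])
ratings = np.array([7, 5, 6, 4, 8, 6, 3, 2])     # hypothetical 1-9 preference ratings

X = pd.get_dummies(profiles, drop_first=True)    # main effects only, no interactions
model = LinearRegression().fit(X, ratings)

# The coefficients are the part-worth utilities of each attribute level,
# relative to the dropped reference level; the utility range within an
# attribute signals that attribute's importance for this consumer.
print(dict(zip(X.columns, np.round(model.coef_, 2))))
```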

7.2 Limitations
The deficiency of conjoint analysis is its lack of error degrees of freedom (too few observations and too many variables). It is like conducting an ANOVA on one subject over all levels of the attributes. The R2 from data points extracted from one individual will always be high, but that does not guarantee good fit or predictive power. Researchers prefer the main-effects model because it requires the fewest parameters to estimate, alleviating the burden of not having enough error degrees of freedom.

The second problem is the complexity of designing the experiment. The simplest design would be the full-factorial design, requiring all possible combinations of the attribute levels to be given to the consumers for rating. When the number of attributes is large (say, six or more) and the number of levels for each attribute is large (say, four or more), it is tiring for anyone to rate that many combinations. Instead, you can choose a smaller orthogonal (fractional factorial) design, which still leads to uncorrelated part-worth estimates.

While conjoint analysis is ideal for new product design, researchers are advised to confirm the conjoint analysis result with a standard ANOVA analysis, even though the latter works better with continuous measurements.

Tremendous firepower

There is tremendous firepower in multivariate methods for solving difficult marketing decision problems. However, these methods are theoretically complex and prone to misunderstanding and abuse. Skilled researchers should take special care to avoid pitfalls in the analysis and the erroneous decisions that can follow from them.