Multivariate analysis - some vocabulary

Abstract

People new to multivariate analysis can sometimes feel as though coworkers are speaking a foreign language. Gary Mullet, of the Georgia Institute of Technology, explains some of the requisite vocabulary for multivariate statistics and analysis.

Editor’s note: Gary Mullet recently joined the faculty at Georgia Institute of Technology after spending over five years with Sophisticated Data Research in Atlanta. In addition to teaching and research, he is serving as a staff consultant with SDR. He has previous work experience at Burke Marketing Research and has taught at the Universities of Michigan and Cincinnati. The author of numerous articles on applied statistics, Mullet is also an active member at meetings of various professional societies.

If you've been in marketing research as a client or a vendor for any longer than five minutes, you've undoubtedly heard (or thought that you did) something that sounded like, "After we regressed the eigenvalues on the discriminated clusters from the principal components maps, the factor loadings were clustered conjointly on the razzenfritzed centroidal variated hyacinths."

Well, to the neophyte in multivariate statistics, the above might as well have been what was actually said. It seems as if there are more buzzwords in statistics than in any other science, and it also seems that some researchers try to use as many of them at once as possible. Even when we're not really trying to impress someone, we're often forced to use several confused and confusing terms, just because there are no convenient alternatives.

Anyway, below you'll find several multivariate techniques listed and, I hope, defined for the user of marketing research (as opposed to the professional statistician). Within each broad topic, I'll try to tell you what the technique will do for you and also define some of the tool-specific words. Who knows, with a little practice you, too, may be able to say things like, "We really didn't need to consider the razzenfritzed centroidal variated hyacinths in this factor analysis." In each case, we're assuming that a sample of respondents has answered several questions on your survey.

Regression analysis

Regression analysis seems to be the grandfather of all multivariate analytical techniques. What it usually does is find an equation that relates a variable of interest, such as amount consumed in the past 30 days, purchase intent, number of items owned or any other numeric variable, to one or more other demographic, psychographic or behavioral variables. The variable of interest is called the dependent or criterion variable; the others are the independent or predictor variables.

When the dependent variable is either purchase interest or overall opinion of a product, some researchers say that they are building a "driver model." They're trying to find which product attributes "drive" overall opinion of the test product, say.

The major thing to recognize in regression analysis is that the dependent variable is supposed to be a quantity such as how much, how many, how often, how far? The computer won't tell you if you've defined the variable of interest wrong, either, so it's up to you or your colleagues. Most regression models will leave you with an equation that shows only the predictor variables which are statistically significant. One misconception that many people have is that the statistically significant variables are also those which are substantive from a marketing perspective. They won't necessarily be. It's up to you to decide which are which. A couple of buzzwords that come primarily from regression analysis are:

Multicollinearity. The degree to which your predictor variables are correlated or redundant. In a nutshell, it's a measure of the extent that two or more variables are telling you the same thing.
R-squared. A measure of the proportion of variance in, say, amount consumed that is accounted for by the variability in the other measures that are in your final equation. You shouldn't ignore it, but it's probably overemphasized.
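To make those two buzzwords a little more concrete, here is a minimal sketch in plain Python. The ratings and consumption figures are invented for illustration, and the variable names (taste, flavor, units) are hypothetical. It fits a one-predictor regression, computes R-squared, and uses the simple correlation between two predictors as a rough stand-in for a multicollinearity check.

```python
def mean(values):
    return sum(values) / len(values)

def pearson_r(x, y):
    """Simple correlation between two lists of numbers."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fit_line(x, y):
    """Least-squares intercept and slope for one predictor."""
    mx, my = mean(x), mean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope

def r_squared(x, y, intercept, slope):
    """Proportion of variance in y accounted for by the fitted line."""
    my = mean(y)
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - my) ** 2 for b in y)
    return 1 - ss_residual / ss_total

# Invented data: six respondents, two attribute ratings, units consumed.
taste = [3, 5, 6, 7, 8, 9]
flavor = [2, 5, 5, 8, 8, 9]
units = [2, 4, 5, 6, 8, 9]

intercept, slope = fit_line(taste, units)
r2 = r_squared(taste, units, intercept, slope)
overlap = pearson_r(taste, flavor)  # high value: the two ratings are largely redundant

print("units = %.2f + %.2f * taste, R-squared = %.2f" % (intercept, slope, r2))
print("correlation between taste and flavor ratings = %.2f" % overlap)
```

A real driver model would use several predictors and a statistics package; the point here is only what the two numbers measure.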

There are a variety of ways to get to the final equation for your data but the thing to recognize for now is that if you want to build a relationship between a quantitative variable and one or more other variables (either quantitative or qualitative), regression analysis will probably get you started.

Discriminant analysis

Discriminant analysis is very similar to regression analysis, except that here the dependent variable will be a category: Brand used most often, product usage (heavy, medium, light, not aware). The output from a discriminant analysis will be one or more equations which can be used to put people (usually) with a given profile into the appropriate slot or pigeonhole. As with regression, the predictor variables can be a mixed bag of both qualitative and quantitative.
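As a rough illustration of "an equation that puts people into a pigeonhole," the sketch below builds a two-group linear discriminant (Fisher's method) by hand for two predictors. Everything specific in it is an assumption for the example: the respondent profiles (age, household size), the "light" and "heavy" labels, and the choice of a cutoff halfway between the two group means, which treats both groups as equally likely.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(2)]

def pooled_scatter(groups):
    """Within-group sums of squares and cross-products, added over groups (2x2)."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    return s

def fit_discriminant(light, heavy):
    """Return the discriminant weights and the cutoff score."""
    m0, m1 = mean_vector(light), mean_vector(heavy)
    s = pooled_scatter([light, heavy])
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    midpoint = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    cutoff = w[0] * midpoint[0] + w[1] * midpoint[1]  # assumes equal group sizes/priors
    return w, cutoff

def classify(profile, w, cutoff):
    score = w[0] * profile[0] + w[1] * profile[1]
    return "heavy" if score > cutoff else "light"

# Invented profiles: (age, household size)
light_users = [(25, 1), (30, 2), (28, 1), (35, 2)]
heavy_users = [(40, 4), (45, 3), (50, 5), (42, 4)]

weights, cutoff = fit_discriminant(light_users, heavy_users)
print(classify((48, 4), weights, cutoff))
print(classify((27, 1), weights, cutoff))
```

With more than two categories, or with qualitative predictors, the bookkeeping grows, which is why this is normally left to a statistics package.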

Again, the computer packages around won't save you from yourself and tell you when you should use regression analysis and when to use discriminant analysis, so you'll have to be on your toes. Also you should be aware that the IRS is a big user of discriminant analysis. The categories of interest to them are "Audit" and "Don't Bother." You can imagine what the predictors are, especially if you're starting to fret over the new tax forms.

Marketing researchers frequently use discriminant analysis to profile users of various brands within a given product category. It's also used to determine what, if any, differences there are between, say, "Trier-acceptors," "Trier-rejecters" and "Non-triers." In the past it was heavily used in credit scoring. It probably still is. As with regression, you need to be concerned with statistical vs. substantive significance, multicollinearity and R-squared (or its equivalent). Used correctly, it's a powerful tool since so much marketing research data is categorical.

Logistic regression

Logistic regression does the same things as regression analysis as far as sorting out the significant predictor variables from the chaff, but the dependent variable is usually a 0-1 type, similar to discriminant analysis. However, rather than the usual regression type equation as output, a logistic regression gives the user an equation with all of the predicted values constrained to be between 0 and 1.

Why bother? Most users of logistic regression use it to develop such things as probability of purchase from concept tests. If a given respondent gives positive purchase intent, they're coded as "1" in the input data set; a negative intent yields a "0" for the input. Now looking at both the demographics of the respondents and their product evaluation data, a model is built that allows the researcher to say things like, "Males aged 35-49 have a .87 probability of buying this product, females who are between 18-35 have a .43 chance, ..." It can also be used instead of discriminant analysis when there are only two categories of interest.
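The sketch below shows the 0-1 coding and the "constrained between 0 and 1" idea with a single predictor. The data are invented (a 1-to-5 concept rating and a 0/1 purchase intent), and the fitting routine is a bare-bones gradient descent chosen for readability, not what a statistics package actually runs.

```python
import math

def sigmoid(z):
    """Squash any number into the interval between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, learning_rate=0.1, steps=5000):
    """Fit intercept b0 and slope b1 by gradient descent on the average log-loss."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            error = sigmoid(b0 + b1 * xi) - yi
            g0 += error
            g1 += error * xi
        b0 -= learning_rate * g0 / n
        b1 -= learning_rate * g1 / n
    return b0, b1

# Invented concept-test data: rating (1-5) and purchase intent (1 = positive, 0 = negative)
rating = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
intent = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

b0, b1 = fit_logistic(rating, intent)
probabilities = {r: sigmoid(b0 + b1 * r) for r in range(1, 6)}

for r in sorted(probabilities):
    print("rating %d: estimated purchase probability %.2f" % (r, probabilities[r]))
```

Whatever rating you plug in, the output stays between 0 and 1, which is what lets it be read as a probability.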

Factor analysis

There are several different methodologies which wear the guise of factor analysis. Generally, they're all attempting to do the same thing: find groups, chunks, clumps or segments of variables which are correlated within the chunk and uncorrelated with those in the other chunks. The chunks are called factors.

Most factor analyses depend on the correlation matrix of all pairs of variables across all of the respondents in the sample. Also, as it is commonly used, factor analysis refers to grouping the variables or items in your questionnaire together. However, Q-factor refers to putting the respondents together, again by similarity of their answers to a given set of questions. Two of the troublesome terms from factor analysis are:

Eigenvalue. Although mathematicians would blanch, all you really need to know about eigenvalues in a factor analysis is that they add up to the number of variables that you started with and each one is proportional to the amount of variance explained by a given factor. Analysts use eigenvalues to help decide when a factor analysis is a good one and also how many factors they'll use in a given analysis.
Rotation. In addition to doing it to your tires, doing it to an initial set of factors will give a result that will be much easier to interpret. It's a result of rotation that labels such as "price sensitive," "convenience" and so on are applied to the factor.
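For the curious, the claim that the eigenvalues add up to the number of variables can be checked directly. The sketch below uses an invented 3 x 3 correlation matrix and a small hand-rolled Jacobi routine (an assumption made here so the example needs no extra libraries; a statistics package would use its own solver). It covers only the principal components flavor of factor analysis, in which the eigenvalues come straight from the correlation matrix.

```python
import math

def jacobi_eigenvalues(matrix, sweeps=50, tol=1e-18):
    """Eigenvalues of a small symmetric matrix via repeated Jacobi rotations."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    for _ in range(sweeps):
        off = sum(a[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:  # off-diagonal entries are effectively zero
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if a[p][q] == 0.0:
                    continue
                diff = a[q][q] - a[p][p]
                theta = math.pi / 4 if diff == 0 else 0.5 * math.atan(2 * a[p][q] / diff)
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):  # rotate columns p and q
                    akp, akq = a[k][p], a[k][q]
                    a[k][p], a[k][q] = c * akp - s * akq, s * akp + c * akq
                for k in range(n):  # rotate rows p and q
                    apk, aqk = a[p][k], a[q][k]
                    a[p][k], a[q][k] = c * apk - s * aqk, s * apk + c * aqk
    return sorted((a[i][i] for i in range(n)), reverse=True)

# Invented correlations among three questionnaire items
correlations = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

eigenvalues = jacobi_eigenvalues(correlations)
total = sum(eigenvalues)  # should equal the number of variables, here 3
shares = [value / total for value in eigenvalues]  # proportion of variance per factor

print("eigenvalues:", [round(v, 3) for v in eigenvalues])
print("share of variance:", [round(s, 3) for s in shares])
```

Dividing each eigenvalue by the number of variables gives the share of variance for that factor, which is the quantity analysts look at when deciding how many factors to keep.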

Although the literature says that factor analysis should only be done on quantitative variables, we've seen some that are very understandable when conducted on yes-no type variables as well. As with most multivariate procedures, that seems to be the bottom line for factor analysis: Does it make sense? If yes, it's a good one; otherwise it's probably not, irrespective of what the eigenvalues say.

Cluster analysis

Now the clumps of interest are respondents, instead of variables. As with factor analysis, there are a number of algorithms around to do cluster analysis. Unlike factor analysis, though, clusters are usually not formed on the basis of correlation coefficients. Instead, the algorithms usually look at squared differences between respondents on the actual variables you're using to cluster. If two respondents have a large squared difference (relative to other pairs of respondents) they end up in different clusters. If the squared differences are small, they go into the same cluster.
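Here is a minimal sketch of that squared-difference logic, using one of the many clustering algorithms (k-means) on invented ratings. The two starting centers are simply picked by hand, which is an assumption for the example; real programs choose starting points in various ways.

```python
def squared_distance(a, b):
    """Sum of squared differences between two respondents' answers."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, centers, rounds=10):
    """Assign each respondent to the nearest center, then move the centers."""
    for _ in range(rounds):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda c: squared_distance(p, centers[c]))
            clusters[nearest].append(p)
        centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else center
            for members, center in zip(clusters, centers)
        ]
    return clusters, centers

# Invented data: each respondent rated two attributes on a 1-10 scale
respondents = [(1, 2), (2, 1), (1, 1), (8, 9), (9, 8), (9, 9)]

# Hand-picked starting centers: the first and the fourth respondent
clusters, centers = k_means(respondents, [respondents[0], respondents[3]])

print("small squared difference:", squared_distance(respondents[0], respondents[1]))
print("large squared difference:", squared_distance(respondents[0], respondents[3]))
print("clusters:", clusters)
```

Respondents with small squared differences land together; the pair with a large squared difference ends up in different clusters.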

One word of caution. Not all cluster software can easily handle categorical variables. For instance, if you're trying to cluster using brand used most often, which has four categories, you need to be sure to use a program which will cluster such nominal scale responses. Otherwise, you'll get a cluster mean on brand of 2.34 or some such, which is tough to interpret, at best. Most cluster programs do OK on either quantitative data or yes-no type data. A couple do handle multiple categories as well.
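One common workaround, sketched below under the assumption that your program only understands numbers, is to recode a four-category brand question into four yes-no columns before clustering. The brand codes are invented for the example.

```python
# Invented nominal codes for "brand used most often" (1-4 are labels, not quantities)
brand_codes = [1, 4, 2, 2, 3]

# Averaging the codes gives a number that corresponds to no brand at all
misleading_mean = sum(brand_codes) / len(brand_codes)

def one_hot(code, n_categories=4):
    """Recode one brand code as a row of yes-no (1/0) indicators."""
    return [1 if code == k else 0 for k in range(1, n_categories + 1)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

recoded = [one_hot(code) for code in brand_codes]

same_brand = squared_distance(one_hot(2), one_hot(2))   # respondents agree
other_brand = squared_distance(one_hot(1), one_hot(4))  # respondents differ

print("mean of raw codes:", misleading_mean)
print("recoded rows:", recoded)
print("squared difference, same brand:", same_brand, "different brands:", other_brand)
```

After recoding, any two different brands are equally far apart, so brand 1 is no longer treated as "closer" to brand 2 than to brand 4.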

Perceptual mapping

A perceptual map can be used to show relative similarities between such things as:

Brands
Product attributes
Both
Cluster groups
Factor scores
Most anything else of interest in marketing research.

An appropriate map can serve as an excellent data summary and presentation device. Several of the mapping programs do much the same as factor analysis. Some use regression. You can also map the results of a discriminant analysis.

One major thing to remember when you're faced with a perceptual map: What you see is only a two-dimensional picture of the interrelationships in your data set. It may take three or more dimensions to adequately represent your data; hence, your two-dimensional view might be leading you astray.

Most mapping procedures provide a measure or two of how well the two-dimensional map captures the data relationships. Be sure that you are given these measures.
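For maps built from a factor-analysis-style procedure, one such measure is the share of total variance captured by the two plotted dimensions. The sketch below assumes that kind of map and uses invented eigenvalues; other mapping procedures report different fit measures, so treat this as one example, not the measure.

```python
def share_captured(eigenvalues, dimensions=2):
    """Proportion of total variance carried by the largest `dimensions` eigenvalues."""
    ordered = sorted(eigenvalues, reverse=True)
    return sum(ordered[:dimensions]) / sum(ordered)

# Invented eigenvalues from a hypothetical five-attribute mapping study
eigenvalues = [2.4, 1.2, 0.9, 0.5]

two_d_share = share_captured(eigenvalues)
three_d_share = share_captured(eigenvalues, dimensions=3)

print("two dimensions capture %.0f%% of the variance" % (100 * two_d_share))
print("three dimensions capture %.0f%% of the variance" % (100 * three_d_share))
```

If the two-dimensional share is low and a third dimension adds a lot, the flat picture is hiding a meaningful part of the story.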

Another thing to keep in mind is that many maps are going to show you relative positioning or differences and not absolutes. Factor analysis, being based on correlations, does this too.

Combinations

At the risk of going overboard on jargon again, some studies use combinations of techniques. For instance, each brand might be scored on the factor results. Then, brands are used as criterion variables with factor scores to discriminate between them. A driver model might be evaluated for each brand using the raw data (not factored) and respondents could be clustered on their perceptions of a single brand plus their demographics. A perceptual map is constructed showing the cluster groups and brand ratings, another from the discriminant analysis. This is not an atypical scenario.

Ask, then invest

It's easy to overwhelm and be overwhelmed by the vocabulary alone of multivariate data analysis, let alone the interpretation of the same. Adding to the problem is computer literacy without attendant statistical literacy. Most programs/packages will do whatever analyses you request on whichever data you feed them. With the above information, I hope that, at a minimum, you'll be able to ask the right questions before investing in an unwarranted multivariate procedure.