Editor’s note: Michael Lieberman is founder and president of Multivariate Solutions, a New York statistical and market research consulting firm.

It’s all about bang for the buck. You have a database of, say, 10 million voters. Or 10 million consumers. Or 100,000 association members. The database is chock-full of goodies. Not only the normal stuff, such as demographics (gender, age, income, etc.), and political information (party affiliation, donations given, primary and general elections last voted in), but a wealth of other personal information. For example, whether the list holder rents or owns his home, the number of private schools in the district, whether there is a working woman in the household, the number of children in the home, whether they have a DVD player or have contributed to a health or environmental organization, whether they own a sport utility vehicle or subscribe to magazines, etc.

An election is looming, and in an effort to reach swing voters or energize those who potentially support your core issue, your organization would like to hit those potential targets with a direct mail piece or a phone call. Something that will energize or sway them. Or perhaps you work at a credit card company and would like to mail out a sampler to a million or so homes, but want fewer people to toss the piece in the trash without so much as a glance.

Let’s say you are trying to reach swing voters in the state of Utopia, a swing state where things are not perfect. From existing records, we expect about 20 percent of the list to be swing voters. You could mail out a flyer to all 10 million, knowing that the hit rate is about one-in-five. Or you could target only swing voters and dramatically raise your efficiency and lower your costs. Only you don’t know who they are. So you can’t actually target them. But you can build a model and make a very educated guess: You can virtually target them.

In addition to your primary database, you have, say, 10,000 records in which you are able to identify your target group. These records could be drawn from other lists, primary research where ID numbers allow you to match respondents back to the main list, or company databases. These records give you the ability to build a link between swing voters and the characteristics that help define them. Virtual targeting builds, and tests, a profile of who your target is.

The basics of virtual targeting

What distinguishes our target group, swing voters, from non-swing voters? Are there characteristics which could be used to identify them? How can we make, on the basis of several individual attributes, a single assessment of the likelihood that a given person is a swing voter?

Virtual targeting answers these questions using a blend of statistical techniques that 1) identifies distinguishing characteristics of the target group and then 2) builds a linear equation that can be applied to each of our 10 million records to calculate a score. When the records are sorted by score, the hope is that the group with the highest scores will contain the most likely swing voters.

The first step

The first step is to take the myriad variables available to us and, using our known swing voters, discover which variables distinguish our target group from our non-target group. There are two techniques that can be applied: regression analysis and CHAID (chi-squared automatic interaction detection), a chi-square technique that creates tree-like output. The variables at the top of the tree are the most useful for distinguishing between swing and non-swing, and as the variables run down the branches their importance diminishes. Still, a variable that emerges in the top five to six branches is a good candidate for the final model.
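As a rough illustration of the chi-square screening idea behind CHAID, the sketch below computes a Pearson chi-square statistic for one yes/no attribute against swing/non-swing status. The function name, attribute names and counts are hypothetical and are not taken from the Utopia data. A real CHAID run also merges categories and splits the sample recursively, which this sketch does not attempt.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for a 2x2 table of counts.

    a = attribute present, swing       b = attribute present, non-swing
    c = attribute absent, swing        d = attribute absent, non-swing
    """
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return 0.0  # an empty row or column carries no evidence of association
    return n * (a * d - b * c) ** 2 / margins


# Hypothetical counts for 10,000 known records, 2,000 of them swing voters.
candidates = {
    "independent": (900, 1600, 1100, 6400),
    "owns_suv": (410, 1590, 1590, 6410),
}

# Rank candidate attributes from strongest to weakest association.
ranked = sorted(
    candidates,
    key=lambda name: chi_square_2x2(*candidates[name]),
    reverse=True,
)
```

With one degree of freedom, a statistic above roughly 3.84 is significant at the 5 percent level, so in this made-up example the first attribute would survive the screen and the second would not.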

A detailed explanation of regression and CHAID analysis is beyond the scope of this article. In this case, each one creates a baseline measure of association between our target group and the characteristics available in the literally hundreds of variables in the database. It enables a weeding-out process.

In our example, the swing voters of Utopia, we have run through the first step in the virtual targeting. We have run a regression and CHAID, and the attributes shown in Figure 1 have come up significant.

Not surprisingly, many political variables, such as election frequency and party affiliation, have made it into the model. After all, what we are looking for is a potentially politically neutral bloc of voters. Naturally, a variable that marks those who are not politically neutral (for example, primary voters) would be an evident distinguishing variable.

However, other not-so-obvious demographic and social attributes made it into the model. For example, if the person is married, lives in a district with high private-school attendance and has a bank credit card, chances are that his or her swingness can be more easily identified.

Step two

Now we know what should be placed into the model. Either the initial regressions or CHAID trees have told us. So the next step is to run the model.

There are a number of multivariate techniques that can be used for this. They can have fancy names, such as logistic regression or forecasting membership with an exponential probability model. These work in the proper situations, and sound fancy enough to satisfy clients that we are adding enough oomph to the equation. In truth, many credit card companies run these techniques on their enormous databases with terrific results.

However, Utopia is a state that likes its meat and potatoes. And, to be fair, our interest is to produce clear results that can be easily back-coded to the main list. So here I choose to use discriminant analysis, a multivariate technique that measures our input variables and produces coefficients that give us a measure of how much each attribute discriminates between swing voters and those who are committed.

Discriminant analysis produces a discriminant function: a linear equation in which coefficients are multiplied by the respondent’s attributes to produce a score. From the discriminant score, the likelihood of group membership (i.e., swingness) is calculated, based on who we know are swing voters in the smaller sample. To put it simply, the respondent fills out the form and gets a score, which is then compared to a chart to see if he has a good chance of being a swing voter.
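For readers curious where such coefficients come from, here is a minimal sketch of a two-group linear discriminant (Fisher's rule), restricted to two predictors so the matrix inverse can be written by hand. The function names and toy records are hypothetical. Statistical packages typically rescale these weights and add a constant, so only the relative sizes and signs are comparable with software output.

```python
def mean_vector(rows):
    """Column means of a list of (x1, x2) rows."""
    n = len(rows)
    return [sum(row[j] for row in rows) / n for j in range(2)]


def pooled_covariance(group_a, group_b):
    """Pooled within-group 2x2 covariance matrix."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for rows in (group_a, group_b):
        m = mean_vector(rows)
        for row in rows:
            d = (row[0] - m[0], row[1] - m[1])
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
    dof = len(group_a) + len(group_b) - 2
    return [[s[i][j] / dof for j in range(2)] for i in range(2)]


def discriminant_weights(swing, committed):
    """Fisher weights: inverse pooled covariance times the difference in means."""
    s = pooled_covariance(swing, committed)
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0:
        raise ValueError("predictors are collinear; weights are not defined")
    diff = [a - b for a, b in zip(mean_vector(swing), mean_vector(committed))]
    return [
        (s[1][1] * diff[0] - s[0][1] * diff[1]) / det,
        (s[0][0] * diff[1] - s[1][0] * diff[0]) / det,
    ]


# Toy records: (is_independent, voted_in_primary), coded 1 = yes, 0 = no.
swing = [(1, 0), (1, 0), (1, 1), (0, 0)]
committed = [(0, 1), (0, 1), (1, 1), (0, 0)]
weights = discriminant_weights(swing, committed)
```

In this toy data the first weight comes out positive and the second negative: an attribute more common among the known swing voters pushes the score up, and one more common among committed voters pulls it down. The magnitudes mean nothing beyond the eight made-up records.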

As in all sophisticated statistical analyses, a blizzard of output accompanies the procedure. There are three outputs that we need to examine: the beta scores of the discriminant function (known as the raw coefficients), the standardized coefficients (which tell us which are the best variables), and the discriminant score coupled with the percentage likelihood that the score describes a member of our target group, swing voters.

The raw and standardized coefficients are used for descriptive and classification purposes, which I will cover below. The discriminant score, when calculated afterwards, is the instrument used for future classification.

In the virtual targeting model there is one more measure which is not necessarily used with discriminant analysis. We are looking for the strength of the model as it specifically applies to identifying swing voters. The method is straightforward. The software, after it runs the analysis, gives each respondent in the analysis a score. We sort the list from highest to lowest score, then look at, say, the top 10 percent. The idea is to see how much better the sorted list is than a random sample. For example, with our Utopian list, we expect 20 percent of voters to be swing. If we take the top 10 percent of our sorted list, and 30 percent of them are identified swing voters, we can see that our list is 50 percent more efficient than a random sample (30/20 = 1.5).

Let’s roll

Okay, let’s roll and see what happens. Not surprisingly, the most discriminating factor is that the person is an independent. That is, not a member of either political party. The two other telling factors in determining swing are that the person contributes to a religious group and that there is a working woman in the household.

Not surprisingly, having voted in the primary election in 2000, or in a recent general election, carries a high negative coefficient. Such a person is unlikely to be a swing voter.

Virtual targeting is both descriptive and predictive. The descriptive side, illustrated by Figure 2, explains which factors rise to the top (or bottom) when running the model. This can be very interesting information. However, the real power of the technique lies in the simple ability to predict a person’s group. This is where the real bang for the buck comes in.

The chart in Figure 3 illustrates how a given person receives a discriminant score. The raw coefficients (not the standardized ones discussed above) are each multiplied by the respondent’s corresponding answer, then summed to create one score. At the bottom of Figure 3 this example’s score has been calculated. It is 1.9950. So, is that good? Keep reading.
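The tally described for Figure 3 can be sketched as follows. The variable names, coefficient values and constant are hypothetical placeholders, not the figures from the article, so the result will not reproduce the 1.9950 score. Including a constant term is also an assumption about how the software reports the function.

```python
# Hypothetical raw coefficients; NOT the values shown in Figure 3.
RAW_COEFFICIENTS = {
    "independent": 1.20,
    "religious_donor": 0.60,
    "working_woman": 0.45,
    "voted_primary_2000": -0.90,
}
CONSTANT = -0.35  # hypothetical constant term (an assumption, see lead-in)


def discriminant_score(respondent, coefficients=RAW_COEFFICIENTS, constant=CONSTANT):
    """Constant plus the sum of coefficient * answer over the model variables.

    Answers are coded 1 = yes, 0 = no; a missing answer is treated as 0.
    """
    return constant + sum(
        weight * respondent.get(name, 0) for name, weight in coefficients.items()
    )


# One made-up respondent: an independent who gives to a religious group.
voter = {
    "independent": 1,
    "religious_donor": 1,
    "working_woman": 0,
    "voted_primary_2000": 0,
}
score = discriminant_score(voter)
```

Because the function is a plain weighted sum, it can be applied record by record to the full list once the coefficients are fixed.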

The final useful output in our example is a list of all discriminant scores and the probability of that score’s respondent being a swing voter. This output can be sorted and displayed in a table which is partially shown in Figure 4.

This table is rather long, and functions as a look-up table. When one person goes through the whole survey and has his score totaled, that score can be compared to the scores on this look-up table to find the percentage chance that the person is a swing voter. In our case, a score of 1.9950 corresponds to a 60 percent chance of being swing. Call him.
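A look-up of this kind can be sketched in a few lines. Only the 1.9950 to 60 percent pairing comes from the article; every other row, the function name, and the choice to fall back to the closest score at or below the respondent's score are assumptions made for illustration.

```python
from bisect import bisect_right

# (discriminant score, probability of being a swing voter), sorted by score.
# Only the (1.9950, 0.60) row is from the article; the rest are placeholders.
LOOKUP = [
    (-1.0000, 0.05),
    (0.0000, 0.15),
    (1.0000, 0.35),
    (1.9950, 0.60),
    (3.0000, 0.80),
]


def swing_probability(score, table=LOOKUP):
    """Probability listed for the closest table score at or below `score`."""
    scores = [s for s, _ in table]
    i = bisect_right(scores, score) - 1
    i = max(i, 0)  # scores below the table fall back to its first row
    return table[i][1]


chance = swing_probability(1.9950)
```

A real table would have one row per distinct score in the sample, so the fallback rule would rarely matter in practice.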

How good?

The last, most important step: How good is the model? Would it make a lot of sense to score all 10 million?

A reading of the chart in Figure 5 from left to right goes like this. When the known swing voters are scored, and the scores are sorted highest to lowest, what percentage of the top 10 percent are swing voters? The answer, according to this chart (second column), is 41 percent. We would expect one-in-five (20 percent) to be swing if people were just randomly selected. So, if you divide 41 by 20, you get 2.05. Or, in other words, the model has more than doubled the efficiency of finding swing voters. The index, which multiplies this number by 100, is 205. That is high.

If you look at the top 20 percent of the sorted sample, 35 percent of those are swing. That is, the model is 1.75 times as efficient as random selection, for an index of 175.

As we work our way down the sample in order of score, the efficiency lessens. This is to be expected, since lower scores indicate less likelihood of being a swing voter.
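The index calculation can be sketched as below. The function name is hypothetical, and the toy list of 1,000 records is constructed only so that its top 10 and 20 percent match the 41 and 35 percent rates quoted above; it is not the Utopia sample. The sketch assumes the list contains at least one known swing voter.

```python
def gains_table(scored, depths):
    """Return (depth, hit_rate, index) rows for each mailing depth.

    scored: list of (score, is_swing) pairs for the known records.
    depths: fractions of the sorted list to examine, e.g. 0.1 for the top 10 percent.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    total = len(ranked)
    total_swing = sum(1 for _, is_swing in ranked if is_swing)
    rows = []
    for depth in depths:
        k = max(1, round(total * depth))
        hits = sum(1 for _, is_swing in ranked[:k] if is_swing)
        hit_rate = hits / k
        # Index = 100 * (hit rate in the slice) / (hit rate in the whole list).
        index = round(100 * hits * total / (k * total_swing))
        rows.append((depth, hit_rate, index))
    return rows


# Toy list: 41 swing voters in the top 100, 70 in the top 200, 200 overall.
flags = [1] * 41 + [0] * 59 + [1] * 29 + [0] * 71 + [1] * 130 + [0] * 670
scored = [(len(flags) - i, flag) for i, flag in enumerate(flags)]
table = gains_table(scored, [0.1, 0.2, 1.0])
```

At a depth of 1.0 the slice is the whole list, so the index is 100 by construction; that row is a useful sanity check on any table of this kind.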

Think about it. The organization is sending out one million pieces. If it does not run the virtual targeting, it can expect to reach about 200,000 swing voters.

If it does run the virtual targeting, applies the scores to the general database and mails to the one million highest-scoring records (the top 10 percent of the list), it can expect to reach about 410,000 swing voters while spending the same amount of money. That’s bang for the buck.
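The closing arithmetic, written out. This assumes the 41 percent rate seen in the top-scored 10 percent of the known sample carries over to the full 10 million records; the function and variable names are hypothetical.

```python
def expected_reach(pieces_mailed, swing_percent):
    """Expected number of swing voters among the pieces mailed."""
    return pieces_mailed * swing_percent // 100


pieces = 1_000_000  # one million pieces, i.e. 10 percent of a 10 million list

random_reach = expected_reach(pieces, 20)    # 20 percent base rate
targeted_reach = expected_reach(pieces, 41)  # 41 percent in the top-scored slice
extra_swing_voters = targeted_reach - random_reach
```

Integer percentages keep the arithmetic exact; the difference between the two figures is the number of additional swing voters reached for the same mailing budget.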