Editor’s note: Ryan Jennings is an analyst at Fair, Isaac and Company Inc., an Englewood, Colo., business analytics firm. Tony Dubitsky is brand research specialist at Monigle Associates, Inc., a Denver brand consulting firm. The authors are grateful to Michael Sossi for his expert technical advice on earlier versions of this article.

Question: What kind of customer prospecting tool is inexpensive, easy to develop and implement, and powerful even if you don’t have extensive promotion history or individual-level demographic information on your customers?

Answer: A zip code profile model. This tool can be used to leverage what you know about the areas where your current customers live to improve your customer prospecting.

A zip code profile model uses the geographic features of your current customers to help you predict where your best prospects reside. That said, you should use this tool only if your customer base is geographically dispersed. For example, it would not be appropriate if applied to customers of a new product available only in a narrowly defined test market, or to an established product that had an extremely limited distribution.

A key assumption of the zip code profile model is that “birds of a feather flock together” - the demographic characteristics of a customer’s zip code area can stand in for the demographic characteristics of the individual customer. You may question this assumption, but it has been the basis for many effective geodemographic systems over the past 25 years.

Not surprisingly, zip code profile modeling has benefits and limitations. On the plus side, it’s definitely inexpensive. Indeed, the only data you’ll need are the Census Bureau’s STF 3B dataset, which costs $200, and a current list of U.S. zip codes, which can be found in most desktop mapping programs. Most of your cost will be incurred in manipulating rather than in purchasing this information.

Another plus is that a zip code profile model is easy to implement. The final outcome is a list of zip code areas that can be ranked based on your model’s prediction of customer penetration (the number of your customers, be they individuals or households, divided by the population). Once the model is complete and all U.S. zip codes are scored, there’s no need to deal with complex equations. To use the scores, all you need to do is match the zip codes on the scored list to the zip codes on your prospect list. Higher-scoring zip codes represent better prospecting opportunities than lower-scoring zips. But you don’t have to stop there. You can append the zip code profile model scores to individual customer records (in which case customers in the same zip code area will receive the same model score). You could also use this information with other individual-based marketing and promotion variables to build a customer-level prospecting model.
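
As a rough illustration of the matching step described above, the following Python sketch appends zip-level scores to a prospect list and ranks the prospects. The scores, names, and zip codes are made-up placeholders, and `score_prospects` is a hypothetical helper, not part of any particular package.

```python
# Hypothetical model scores: zip code -> predicted customer penetration.
zip_scores = {"80112": 0.042, "80202": 0.015, "10001": 0.027}

# Hypothetical prospect list; zips are kept as strings so leading zeros survive.
prospects = [
    {"name": "Prospect A", "zip": "80202"},
    {"name": "Prospect B", "zip": "80112"},
    {"name": "Prospect C", "zip": "99999"},  # zip not on the scored list
]

def score_prospects(prospects, zip_scores):
    """Append each prospect's zip-level score; unmatched zips get None."""
    scored = [dict(p, score=zip_scores.get(p["zip"])) for p in prospects]
    # Rank from highest to lowest score; unmatched prospects go last.
    return sorted(scored, key=lambda p: (p["score"] is None, -(p["score"] or 0.0)))

ranked = score_prospects(prospects, zip_scores)
for p in ranked:
    print(p["name"], p["zip"], p["score"])
```

Everyone in the same zip code receives the same score, so this ranking only separates prospects by where they live.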

On the minus side, you get what you pay for. Currently, zip code level data is only as recent as the 1990 Census. The good news is that an updated STF 3B file is scheduled for release in late 2002.

Another limitation is that the U.S. Postal Service (USPS) developed the zip code system as a way to improve mail delivery, not as a way to map markets. Consequently, the USPS may add, change, or delete zip codes without much, if any, forewarning. Nor is this unit of geography integrated within the hierarchy of other, more familiar Census-based areas, such as households, blocks, block groups, census tracts, counties, states, divisions, and regions. Since the USPS makes no effort to produce demographically homogeneous zip codes, demographic characteristics within a zip code can vary dramatically. As such, zip code demographics do not apply to each resident in a given zip code area, only to the area as a whole. In short, zip code-level demographics are not a substitute for individual-level demographics (the latter of which are relatively more expensive and may not match very well with customers in your file).

Finally, if you have a relatively small base (e.g., fewer than 20,000 customers), you may find that the data are so sparse at the five-digit zip code level (comprising about 30,000 unique “buckets”) or even at the rolled-up three-digit zip code level (comprising 881 “buckets”) that a model will be difficult, if not impossible, to build.
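
The roll-up from five-digit to three-digit zips mentioned above amounts to summing counts over the first three digits. A minimal Python sketch, with invented counts and a hypothetical function name:

```python
from collections import Counter

def roll_up_to_three_digit(counts_by_zip5):
    """Sum five-digit zip counts into three-digit zip prefixes."""
    totals = Counter()
    for zip5, n in counts_by_zip5.items():
        # zfill restores leading zeros lost if zips were ever stored as numbers.
        totals[zip5.zfill(5)[:3]] += n
    return dict(totals)

# Made-up customer counts by five-digit zip.
customer_counts = {"80112": 3, "80111": 0, "80202": 1, "02139": 2}
rolled = roll_up_to_three_digit(customer_counts)
print(rolled)
```

Rolling up reduces sparseness, but it also blurs demographic differences between neighboring five-digit areas.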

Rationale for building a model

With all these drawbacks, you may wonder why it’s still worthwhile to go through the trouble of building a model. After all, isn’t it simpler to rank your zip code areas by customer penetration, and target those areas with the highest levels? Absolutely not - the beauty and power of a zip code profile model is that it will generate predictions for zip code areas that are not represented in your customer base. Put another way, it enables you to identify “opportunity regions” within which to prospect, even if you currently have zero or low customer penetration in them. This is because your model is using all of the zip code areas in the total U.S. universe, rather than cherry-picking those that are highly penetrated.
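
One way to picture the “opportunity regions” idea is to list the highest-scoring zip codes that currently hold no customers. The sketch below assumes you already have a predicted penetration for every zip; all numbers and names are illustrative only.

```python
def opportunity_zips(predicted, customer_counts, top_n=2):
    """Return the top_n highest-scoring zips that have no current customers."""
    untapped = [z for z in predicted if customer_counts.get(z, 0) == 0]
    return sorted(untapped, key=lambda z: predicted[z], reverse=True)[:top_n]

# Hypothetical predicted penetration for every zip in the universe.
predicted = {"80112": 0.040, "80202": 0.010, "10001": 0.035, "60601": 0.030}
# Hypothetical current customer counts; absent zips have zero customers.
customer_counts = {"80112": 12, "80202": 3}

print(opportunity_zips(predicted, customer_counts))
```

A simple ranking by observed penetration could never surface these zips, because their observed penetration is zero.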

Model development and validation

The most appropriate statistical approach for building a zip code model is multiple regression, in which you’ll attempt to predict one outcome variable - the penetration of your customers in each zip code area - using many geodemographic variables from the U.S. Census.

Your first step in building the zip code profile model will be to extract the zip codes for each of your customers, then summarize customer counts at the zip code level. The result of this step will simply be a file with two fields, one with zip codes and the other with the corresponding number of your customers in each zip. Once this is done, the STF 3B dataset can be matched to your customer count file.
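
This first step can be sketched in a few lines of Python. The field names (`persons`, `males`) are simplified stand-ins rather than actual STF 3B field names, and the records are invented. Zips with no customers are deliberately kept, since the model needs the full universe of zip codes.

```python
from collections import Counter

# Hypothetical customer file: one record per customer.
customers = [
    {"id": 1, "zip": "80112"},
    {"id": 2, "zip": "80112"},
    {"id": 3, "zip": "80202"},
]

# Hypothetical Census extract keyed by zip code.
census = {
    "80112": {"persons": 30000, "males": 14700},
    "80202": {"persons": 10000, "males": 5200},
    "10001": {"persons": 20000, "males": 9800},
}

def build_modeling_file(customers, census):
    """Count customers per zip and attach the counts to every Census zip."""
    counts = Counter(c["zip"] for c in customers)
    rows = []
    for zip_code, fields in census.items():
        n = counts.get(zip_code, 0)  # zips without customers stay in, with 0
        persons = fields["persons"]
        penetration = n / persons if persons else None
        rows.append({"zip": zip_code, "customers": n,
                     "penetration": penetration, **fields})
    return rows

rows = build_modeling_file(customers, census)
for row in rows:
    print(row)
```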

At first glance, the STF 3B Census files contain an overwhelming number of potential predictor variables. These files contain zip-level data from the long Census form (mailed to about one in six households, but weighted up to the full population), with information on age, gender, household composition, income, ethnicity, and a wealth of other demographic attributes. The sheer number of fields is slightly misleading, in that the file consists of many underlying variables that have each been split into several category fields. For example, gender has been chopped up into its two levels - male and female - yielding a male variable and a female variable; race has been chopped up into the five variables of white, black, Indian, Asian, and other.

The variables are given in the Census files as raw counts; you’ll need to divide each by the appropriate base before using them in your model. For example, the number of males will need to be divided by the number of persons in each zip to derive the percent of the zip population that is male. This arithmetic makes the variables independent of the size of the zip code. You’ll want to do this because the size of the zip code is subject to the whims of the USPS; it’s not an attribute of those who reside in it.
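
The conversion from raw counts to percentages is simple arithmetic. Here is a small Python sketch, again with hypothetical field names and values:

```python
def to_percent(row, count_field, base_field):
    """Express a raw count as a percent of its base; None if the base is zero."""
    base = row[base_field]
    return 100.0 * row[count_field] / base if base else None

# Hypothetical zip-level record with raw counts.
row = {"zip": "80112", "persons": 30000, "males": 14700}
row["pct_male"] = to_percent(row, "males", "persons")
print(row["pct_male"])
```

The right base depends on the variable: person counts are divided by persons, household counts by households, and so on.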

The nature of Census data presents several challenges for the modeler. First, many potential predictors must be considered, and ultimately pared down, for a final model. Second, every attempt should be made to prevent highly related variables (such as “% Male” and “% Female”) from being entered together into a model; this leads to the phenomenon of multicollinearity, which can prevent a model from even being built.
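
A simple screen for highly related predictors is to compute pairwise correlations and flag pairs above a cutoff. The Python sketch below uses invented data, and the 0.95 cutoff is an arbitrary assumption. Pairwise correlation will not catch every form of multicollinearity, but it does catch cases like “% Male” and “% Female.”

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx and syy else 0.0

def collinear_pairs(columns, threshold=0.95):
    """List predictor pairs whose absolute correlation meets the threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 3)))
    return flagged

# Made-up zip-level predictors (one value per zip code).
columns = {
    "pct_male": [49.0, 51.5, 47.2, 50.3],
    "pct_female": [51.0, 48.5, 52.8, 49.7],
    "median_income": [42.0, 38.5, 61.0, 35.2],
}
flagged = collinear_pairs(columns)
print(flagged)
```

When a pair is flagged, keeping only one of the two variables is usually enough.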

These challenges make it all the more important for the analyst to have a thorough understanding of the dynamics of the data at simple and complex levels before attempting to build a model. For example, in a preliminary phase, the analyst may reject candidate predictors that show very little variability and/or those that show extremely low or high relationships with customer penetration.
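
As one possible form of this preliminary screening, the sketch below flags predictors whose standard deviation across zip codes falls under a cutoff. The cutoff of one percentage point and the variable names are assumptions chosen purely for illustration.

```python
from statistics import pstdev

def low_variability(columns, min_std=1.0):
    """Return names of predictors whose standard deviation is below min_std."""
    return [name for name, values in columns.items() if pstdev(values) < min_std]

# Made-up percentages, one value per zip code.
columns = {
    "pct_group_quarters": [0.10, 0.20, 0.10, 0.15],
    "pct_college": [12.0, 35.0, 48.0, 22.0],
}
rejected = low_variability(columns)
print(rejected)
```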

Sample sizes permitting, one best practice is to develop the model on one subset of zip codes, and to validate it on the rest. Alternatively, there are several “small-sample” re-sampling approaches (e.g., the bootstrap and the jackknife) that can be used for validation. Once the initial model is developed and validated, you are ready to generate a final model on the entire set of zip codes. This final model can be used to score and rank each zip code. A spreadsheet of the ranked zips can be used to guide future direct marketing efforts (e.g., selecting all members of a rental list whose zips can be found in the top two deciles).
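
The develop-then-validate routine can be sketched as follows. This is a bare-bones least-squares fit in plain Python on synthetic, noise-free data, so the result is easy to check; it is not the authors’ actual model. The odd/even split, the two predictors, and all function names are illustrative assumptions. In practice a statistics package would do the fitting.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(xs, ys):
    """Least-squares coefficients (intercept first) via the normal equations."""
    rows = [[1.0] + list(x) for x in xs]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve(xtx, xty)

def predict(coefs, x):
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], x))

def r_squared(coefs, xs, ys):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(coefs, x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

# Synthetic zip-level data: penetration is an exact linear function of
# two made-up predictors (income and distance), with no noise.
data = []
for i in range(12):
    zip_code = f"{80100 + i:05d}"
    income, distance = 30 + 5 * (i % 4), 1 + (i * 7) % 5
    data.append((zip_code, [income, distance], 0.5 + 0.02 * income - 0.1 * distance))

# Develop on odd-numbered zips, validate on even-numbered zips.
dev = [d for d in data if int(d[0]) % 2 == 1]
val = [d for d in data if int(d[0]) % 2 == 0]
coefs = fit_ols([d[1] for d in dev], [d[2] for d in dev])
val_r2 = r_squared(coefs, [d[1] for d in val], [d[2] for d in val])
print(coefs, val_r2)
```

With real data the validation fit will be lower than the development fit; a large drop suggests the model is overfitted.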

Table 1

Table 1 shows the results of a real-world zip code profile model for our client, a corporate owner of time-share resorts. The client had observed that customer penetration was particularly high in the Northeastern U.S. and the Central Census Region. There was interest in drilling down deeper to the zip code level and understanding what factors were driving penetration. After preliminary data analysis, a model was developed on odd-numbered zip codes and validated on even-numbered zip codes, with virtually no falloff in performance. We then estimated the final model using all zip codes, producing the lift charts shown here. Key variables from the final model were proximity to the primary resort area, affluence (household income and occupation), and suburban lifestyle (low household density in non-farm areas).

The highlight in the table shows that the zip code profile model identifies the 40 percent of households that account for over 80 percent of all customers. This standard report required the following steps:

  • A zip code profile model was built to predict customer penetration using Census data.
  • The model was used to generate a customer penetration prediction for each zip code in the U.S.
  • The zip codes were ranked in descending order based on the predicted customer penetration value.
  • Successive 10 percent “buckets” (i.e., deciles) of households were created; customers were counted within each decile.
  • The percent of customers within each decile was computed and cumulated.
  • Finally, the cumulative percent of customers within each decile was compared to the cumulative percent of households within each decile, creating an index representing the performance of the model versus “chance alone.”

Figure 1

Figure 1 displays the same information graphically. Again, with the benefit of the zip code profile model, the top four deciles account for over 80 percent of all customers. To apply the model, the zip codes accounting for the top deciles of households should be targeted in customer prospecting and list selection. They represent the cream of the crop, containing a preponderance of current customers, and they share the demographic characteristics of the areas in which current customers live.
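
The decile steps listed earlier can be sketched in Python as shown below. The ten zip codes, their household counts, and their customer counts are invented for illustration and are not the client data behind Table 1. Defining the index as the ratio of cumulative percent of customers to cumulative percent of households, times 100, is our assumption about how such an index is typically computed.

```python
def lift_table(zips, n_buckets=10):
    """zips: list of (predicted_penetration, households, customers) tuples.

    Returns one row per bucket:
    (bucket, cumulative % households, cumulative % customers, index).
    """
    ranked = sorted(zips, key=lambda z: z[0], reverse=True)
    total_hh = sum(z[1] for z in ranked)
    total_cust = sum(z[2] for z in ranked)
    bucket_hh = [0] * n_buckets
    bucket_cust = [0] * n_buckets
    seen_hh = 0
    for _, hh, cust in ranked:
        # Assign the whole zip to the bucket in which its households begin.
        b = min(seen_hh * n_buckets // total_hh, n_buckets - 1)
        bucket_hh[b] += hh
        bucket_cust[b] += cust
        seen_hh += hh
    table, cum_hh, cum_cust = [], 0, 0
    for b in range(n_buckets):
        cum_hh += bucket_hh[b]
        cum_cust += bucket_cust[b]
        pct_hh = 100.0 * cum_hh / total_hh
        pct_cust = 100.0 * cum_cust / total_cust
        index = 100.0 * pct_cust / pct_hh if pct_hh else None
        table.append((b + 1, pct_hh, pct_cust, index))
    return table

# Invented example: ten zips of 1,000 households each, scored 0.10 down to 0.01.
customers_by_rank = [30, 20, 15, 10, 8, 6, 5, 3, 2, 1]
zips = [(0.10 - 0.01 * i, 1000, c) for i, c in enumerate(customers_by_rank)]
table = lift_table(zips)
for row in table:
    print(row)
```

An index of 100 means the model does no better than chance; in the final bucket the index always returns to 100 because all households and all customers have been counted.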

Importance of data visualization

Just as a picture is worth a thousand words, a thematic map - the geodemographic equivalent of the bar chart - is worth thousands of zip codes. We can’t overemphasize the usefulness of thematic maps in converting a lifeless table of 30,000 zip codes and penetration levels into one easily understandable and visually compelling strategic document. Visualizing how zip code areas “go together” is difficult without this device, especially because consecutive zip codes aren’t necessarily adjacent to one another geographically. We especially recommend that you plot the opportunity regions mentioned above as well as the top deciles of scored zip code areas. Figure 2 shows the top two decile areas in the East Coast Region for the time-share resort client described earlier.

Figure 2

Efficient, inexpensive

Zip code profile modeling is an efficient and relatively inexpensive technique that can be used to drive customer-prospecting efforts. It is most effective when used in conjunction with thematic mapping technology. Key requirements are a fairly substantial customer base and a product or service showing geographic variability in penetration. A very useful result is a table of zip codes ranked by predicted customer penetration.