C. Ying Li is a demographic statistician at National Planning Data Corp, Ithaca, New York, where she is responsible for information product design and analysis. Marketing and related research are her primary interests. She immigrated to the United States from Taiwan 10 years ago, and has since obtained an M.A. in Chinese history from the University of California at Davis and an M.S. in statistics from Cornell University, and has worked extensively in computing, social/economic research, and teaching.

Not long ago, the most a market researcher could do for his/her client was to conduct a few surveys at strategic points, describe the results, and then come to some impressionistic conclusions. Such a simple procedure nonetheless allowed an experienced researcher to do a reasonable job.

Today, however, researchers are equipped with an abundance of information supplied by third parties (e.g. government censuses and private forecasts) and sophisticated tools (e.g. multivariate statistical methods). But even the best-trained statisticians are not always certain how to properly use these materials and methods. Clearly, researchers must take advantage of such technical advances, especially now that powerful, user-friendly computers make them easy to apply.

As a researcher who sees more and more data being collected and analyzed every day, I am encouraged by the increased use of these new sources of information and techniques, but troubled by the frequent lack of understanding of them apparent in much research.

This article focuses on a popular kind of canned demographic data: geo-demographic clusters. Geo-demographic clusters are marketed by their developers as the definitive answer to market-segmentation problems.

Demographics at a quick glance

Demographic data, used judiciously, can shed important light on a marketing phenomenon. They relate a product in a market area to its demographic profile defined primarily in terms of housing and population characteristics. As we all know, the Census Bureau provides the best data for two reasons.

  1. Only the Census Bureau, a federal agency, has the resources to collect data on 100% of the population, therefore making projection unnecessary.
  2. Even if population data are occasionally approximated from samples, they are calculated by trained staff following strict sampling procedures. The mathematical properties of these sample designs enable the Bureau staff to make the best population estimates with the least loss of information.

However, Census data may not be up to date, or they may not be oriented to the specific product-buying populations that concern market researchers. Currently, private data companies supply most postcensal, small-area projections and consumer/marketing information. The quality of their data depends a great deal on the models and techniques employed.

Most commercially available demographic data involve projection. A projection is a probabilistic statement about a larger phenomenon concerning the population. It is calculated by experts from past data samples under a set of restrictive assumptions. Since no set of assumptions is complete enough to account for all the forces that influence social and economic events, the projection is bound to be somewhat biased.

Despite imperfections, a soundly calculated projection is the best technique available for guiding research. A projection is considered sound if its underlying models are well understood and its assumptions can be demonstrated to be relatively realistic, and if its data are collected under a carefully planned sampling scheme. Such projections must be subjected to rigorous statistical tests, and measured against corresponding empirical data whenever the latter become available. Potential projection errors can also be estimated. From the users’ standpoint, those market researchers who depend on private companies to supply their research data should make a serious attempt to learn the projection assumptions (hence, the possible limitations). Such precautions forestall unanticipated results.

Lifestyle geo-demographic clustering

In the early 1980s, psychologists popularized the use of a multivariate technique called cluster (or classification) analysis in order to classify individuals into personality types based on a multitude of character measurements. The purpose of clustering is to discover some “natural” groupings among individuals so that the variation within groups is minimized while the variation between groups is maximized. In other words, individuals within the same group are closer to each other according to some measure than any is to members of other groups.
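One common way to make this objective precise is to measure variation by sums of squared distances. This criterion is an assumption for illustration; it is only one of several that a clustering program might use. Here x_i is the vector of measurements for individual i, C_k is the k-th of K groups with n_k members and mean \bar{x}_k, and \bar{x} is the overall mean.

```latex
W = \sum_{k=1}^{K} \sum_{i \in C_k} \lVert x_i - \bar{x}_k \rVert^2,
\qquad
B = \sum_{k=1}^{K} n_k \, \lVert \bar{x}_k - \bar{x} \rVert^2,
\qquad
T = \sum_{i} \lVert x_i - \bar{x} \rVert^2 = W + B
```

Because the total variation T is fixed by the data, a grouping that makes the within-group variation W small necessarily makes the between-group variation B large.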

A number of private data companies applied this technique to data from the 1980 Census. They defined the basic unit of geography for their analysis to be as small as a block group in an urbanized area or an enumeration district in a rural area. A key assumption was that within these units the population tends to manifest similar characteristics. They then clustered such geographic units for the entire nation into an arbitrarily determined number of so-called “lifestyle neighborhoods.” “Lifestyle” simply means demographic characteristics, and “neighborhoods” is the non-jargon expression for clusters. The final clusters (or groups) are subsequently identified with attractive yet vague labels. For example, the “Town and Country” cluster implies the rich and famous.

What the data companies have done amounts first to condensing hundreds of 1980 Census variables into six or seven “dimensional” factors by a factor analysis. These dimensions include mainly housing, income, age, education, social status, household composition, and ethnicity. These dimensions were claimed to embody the full explanatory power of all Census variables. After extracting these dimensions for all block groups/enumeration districts in the country, the data companies further condensed them into a single measure of “distance.” They were then able to group all those geographic units into a finite set of clusters, usually by using some kind of clustering computer program (of which there are many varieties) to distinguish the distances among the units.
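The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not any data company's actual method: it assumes the factor scores have already been extracted and standardized, uses ordinary Euclidean distance, and uses k-means, which is only one of the many clustering programs available. The scores, the function names, and the choice of two clusters are all hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two vectors of factor scores."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iterations=50):
    """Group points into k clusters; returns (labels, centers)."""
    # Farthest-first start: deterministic, but an arbitrary choice among many.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points,
                           key=lambda p: min(distance(p, c) for c in centers)))
    labels = []
    for _ in range(iterations):
        # Assign each unit to its nearest cluster center.
        labels = [min(range(k), key=lambda j: distance(p, centers[j]))
                  for p in points]
        # Move each center to the mean of the units assigned to it.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_centers.append(centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers

# Fictitious standardized factor scores (e.g. income, age, education)
# for six block groups.
units = [
    [1.2, 0.9, 1.1],
    [1.0, 1.1, 0.9],
    [1.3, 0.8, 1.2],
    [-1.1, -0.9, -1.0],
    [-1.2, -0.8, -1.2],
    [-0.9, -1.1, -1.0],
]
labels, centers = kmeans(units, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Note that the number of clusters is an input chosen by the analyst, not something the program discovers.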

Users of such methods know that no matter which cluster the blocks in question belong to, that cluster, as its description may indicate, specifies only the lifestyle of a large part of its residents. This description is in no way complete. For example, certain blocks of a city may be classified as belonging to “the rich folks” even though there are some poor people in the same community.

Once each of the smallest geographic units in the nation has been labeled with a cluster identity to highlight its predominant population, the data companies can break any user-defined marketplace down into block-group units, retrieve their cluster identities from the database, and aggregate the unit household counts for all clusters in that market area. They can also provide individual household addresses of the desirable clusters for direct mailing purposes.

Geo-demographic clusters are useful to researchers interested in segmenting markets for two reasons. First, for a given marketplace it is reasonable to assume that people’s decisions to buy are linked to their demographic characteristics. This assumption justifies the comparison between consumer behavior and demographic characteristics. If this association is in fact true not only within a local unit, but also among units with similar demographic profiles across the country, then such clusters can be treated as natural market segments for planning purposes. Second, it is much easier to have the data company define appropriate market segments than it is for a marketer to conduct such extensive research independently.
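The retrieval and aggregation that produce a market-area cluster profile amount to a simple lookup and sum. The sketch below assumes a hypothetical database keyed by block-group ID; the IDs, household counts, and all cluster names other than “Town and Country” are invented for illustration.

```python
from collections import Counter

# Hypothetical database: block-group ID -> (cluster label, household count).
BLOCK_GROUPS = {
    "BG-001": ("Town and Country", 420),
    "BG-002": ("Cluster B", 310),
    "BG-003": ("Town and Country", 275),
    "BG-004": ("Cluster C", 190),
    "BG-005": ("Cluster B", 505),
}

def households_by_cluster(market_area, database):
    """Sum household counts per cluster over the block groups in a market."""
    totals = Counter()
    for block_group in market_area:
        cluster, households = database[block_group]
        totals[cluster] += households
    return totals

# A user-defined marketplace, expressed as a list of block groups.
market = ["BG-001", "BG-002", "BG-003", "BG-005"]
profile = households_by_cluster(market, BLOCK_GROUPS)
total = sum(profile.values())
shares = {cluster: count / total for cluster, count in profile.items()}

for cluster, count in profile.items():
    print(f"{cluster}: {count} households ({shares[cluster]:.1%})")
```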

Users of cluster systems can verify the existence of an association between demographic characteristics and consumer buying patterns in a market area by calculating Spearman’s rho, a statistical quantity that measures, on the basis of ranks, the correlation between two sets of cluster data. Table 1 illustrates such calculations with an example.

The upper portion introduces a standard report from any cluster system with fictitious data. The lower portion shows the calculations of Spearman’s rho for those data. Market area households on the left are usually supplied by the data company while magazine-subscribing households on the right are supplied by the user. A large absolute value of Spearman’s rho (which ranges from -1 to 1), whether positive or negative, confirms the association, and hence the validity of such clusters for segmenting that market. Sometimes visual inspection of such an association may be sufficient. However, I would still recommend formal calculations.
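For readers who want to check the arithmetic, the calculation can be sketched as follows. The household counts are fictitious and are not those of Table 1, and the function names are illustrative. The sketch ranks the clusters on each count, averages the ranks of any ties, and then takes the ordinary correlation of the two sets of ranks; when there are no ties this agrees with the familiar shortcut rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference between a cluster's two ranks.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward, averaging the ranks of ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result

def pearson(x, y):
    """Ordinary product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: the correlation of the ranks of x and y."""
    return pearson(ranks(x), ranks(y))

# Fictitious counts for five clusters in one market area.
market_households = [1200, 950, 800, 600, 450]
subscribing_households = [310, 120, 150, 95, 40]

rho = spearman_rho(market_households, subscribing_households)
print(round(rho, 3))  # 0.9
```

Here only two clusters swap adjacent ranks, so the sum of squared rank differences is 2 and the shortcut gives 1 - 12/120 = 0.9, matching the computed value.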

What if no association is revealed by such techniques? How does one know whether there is a problem with one’s own product data, or whether the problem lies instead with the generic clusters? If the problem is with the product-ownership data, one must rely on the canned clusters. There is no knowing how strategies based on them will perform. If the problem lies with the clusters, then it is a good idea to check the clustering criteria and appropriateness of these clusters for the market in question. (This is, unfortunately, difficult to do since very few data companies are willing to disclose their “proprietary methodologies.”)

Clustering is not a single, cohesive set of techniques, but rather a collection of methods, each having an ad-hoc flavor for mending some inadequacies in the data. One cannot cluster without making subjective, sometimes arbitrary, decisions on:

  1. How many clusters should there be? For example, should there be 40 or 400 clusters to represent all possible lifestyles in the U.S.?
  2. How does one reconcile the different measurements in different units into a single distance (similarity/dissimilarity) measure?
  3. How does one decide on the appropriate boundaries for clusters, the descriptive label of each cluster, the method of clustering, and the criterion of statistical significance (that is, the measure of cluster compactness) of these clusters?
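The second question is easy to underestimate. The sketch below uses fictitious block groups measured on income (in dollars) and median age (in years) to show that the unit judged “most similar” to a given unit can change depending on how the measurements are reconciled. Rescaling each variable to z-scores is one common convention, assumed here for illustration; it is itself a subjective choice, and the names and figures are hypothetical.

```python
import math

# Fictitious block groups: (median household income in dollars, median age).
raw = {
    "A": (52000, 25),
    "B": (52500, 70),
    "C": (55000, 27),
    "D": (40000, 45),
    "E": (70000, 50),
}

def zscores(table):
    """Rescale every variable to mean 0 and standard deviation 1."""
    columns = list(zip(*table.values()))
    means = [sum(c) / len(c) for c in columns]
    sds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c))
           for c, m in zip(columns, means)]
    return {name: tuple((v - m) / s for v, m, s in zip(row, means, sds))
            for name, row in table.items()}

def nearest(table, name):
    """Name of the unit with the smallest Euclidean distance to `name`."""
    others = (n for n in table if n != name)
    return min(others, key=lambda n: math.dist(table[name], table[n]))

print(nearest(raw, "A"))           # B: dollar differences swamp age differences
print(nearest(zscores(raw), "A"))  # C: similar in both income and age
```

On the raw figures, A is paired with B, a unit 45 years older, simply because their incomes differ by only $500. After rescaling, A is paired with C.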

Because both the descriptive and statistical inferences employed by clustering techniques lack explicit structure, it is difficult to evaluate measures for describing cluster compactness, much less the predictive properties of clusters.

Because both the development and employment of clustering techniques involve so many arbitrary or impressionistic assumptions and decisions (each of which may lead to a completely different grouping), it is especially crucial to know what those decisions are and upon what assumptions they were based. It is very unlikely that a single set of clusters based only on demographic characteristics can work well for all products. The most effective segmentation strategy should vary from one product to another.

However, clusters tailored to a specific product can be derived by applying a discriminant analysis, a multivariate statistical technique similar to regression, to those initial clusters formed on only demographic characteristics. A discriminant model can effectively employ the product-ownership data (e.g. product needs, frequency of use, prices, consumer preferences relevant to the product) to modify those demographic clusters. The independent variables in the model are the product-ownership data, while the dependent variable is the demographic cluster. Such modified clusters should be more sensitive to particular marketing needs.
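A deliberately simplified sketch of the idea follows. It assumes only two demographic clusters and two product variables, and it uses Fisher's linear discriminant with equal prior weights, which is one standard form of discriminant analysis and not necessarily the form any data company uses. The data and names are fictitious. Units whose product data place them on the other side of the discriminant boundary are flagged as candidates for reassignment.

```python
def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, m):
    """2x2 matrix of summed squares and cross-products about the mean m."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - m[0], r[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_discriminant(group_a, group_b):
    """Return weights w and cutoff so that w.x > cutoff suggests group A."""
    ma, mb = mean(group_a), mean(group_b)
    sa, sb = scatter(group_a, ma), scatter(group_b, mb)
    # Pooled within-group scatter, inverted directly since it is 2x2.
    sw = [[sa[i][j] + sb[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [ma[0] - mb[0], ma[1] - mb[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Cutoff at the midpoint between the two group means (equal priors).
    cutoff = sum(wi * (a + b) / 2 for wi, a, b in zip(w, ma, mb))
    return w, cutoff

def classify(x, w, cutoff):
    return "A" if w[0] * x[0] + w[1] * x[1] > cutoff else "B"

# Fictitious product data per block group: two hypothetical product measures.
cluster_a = [(8, 3), (9, 5), (7, 4), (3, 4)]   # demographic cluster A
cluster_b = [(2, 4), (3, 6), (1, 5), (2, 5)]   # demographic cluster B

w, cutoff = fisher_discriminant(cluster_a, cluster_b)
labeled = [("A", p) for p in cluster_a] + [("B", p) for p in cluster_b]
moved = [(old, p) for old, p in labeled if classify(p, w, cutoff) != old]
print(moved)  # [('A', (3, 4))]
```

Only the last unit of cluster A, whose product figures resemble those of cluster B, is flagged. Whether such a unit should actually be moved remains a matter of judgment.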

Some data companies have indeed improved their generic clusters with syndicated consumer data (e.g. data on car sales registrations, magazine subscriptions, real estate transactions, and media surveys). Some companies claim to employ as many as 60 different sources in their discriminant models. However, if a model contains so many independent variables, each behaving quite differently in defining its customer base, then its ability to adjust clusters must be severely diminished because some variables might cancel the effect of others. Again, without knowing the mathematical forms of these models, it is difficult to evaluate their effectiveness.

It is no exaggeration to say that data users are at the mercy of data companies in the purchase of cluster systems. However, there are a few ways to avoid buying inappropriate clusters. One can purchase raw demographic (Census) data and do one’s own clustering. But this, of course, requires familiarity with cluster and discriminant analyses, and confidence in the quality of one’s own product data. Alternatively, one can require the data company to design a cluster analysis on a custom basis, specifying that they form the prototype clusters on the basis of demographic characteristics alone, and modify them later by a discriminant model using only one or two sources of syndicated data relevant to one’s own product. If this is impractical:

  1. Ask the data company to explain to you the parameters they used in their clustering computer programs.
  2. Ask them to provide examples of successful applications of their clusters to solve problems similar to yours.
  3. Spot check the detailed distribution tables of, say, income, age, or housing, for selected geographic units to see if the clusters truly represent the majority of their population.

It goes without saying that contacting past users can also be helpful.

Good market-segmentation studies combine technique and judgment in a manner suitable to the objectives and information of the research. Clusters or segments that are founded on purely demographic characteristics can be just as misleading as those without them.