Editor's note: Dr. Steven Struhl is vice president, senior methodologist at the Chicago office of Total Research.

Classification tree analysis (most commonly CHAID or CART) remains less familiar than many other analytical methods, although it often solves thorny problems with complex data that can defeat most other approaches. After a brief summary of these methods and what they do, we'll review KnowledgeSEEKER 3.1, the new CHAID/CART program.

CHAID and CART work by splitting the sample to create groups that differ as much as possible in terms of a dependent (or criterion) variable that the analyst chooses. As most statisticians use the nomenclature, CHAID (or Chi-Square Automatic Interaction Detection) works with categorical dependent variables such as region of the country, gender, market segments to which people belong, etc. CART (or Classification and Regression Trees) works with continuous dependent variables, such as dollars spent, number of boxes of SoggyOs cereal consumed, and so on.

Looking at "How Classification Tree Analysis Works" (Figure 1), we see first that the procedure starts with a sample (usually of moderate to large size like this one) and a dependent variable specified by the analyst. In this case, the dependent variable is likelihood to eat everybody's favorite breakfast substance, SoggyOs.

We will use classification tree analysis to examine all the other data we have collected concerning these 1,455 respondents and to identify all variables that lead to sub-groups differing significantly in likelihood to eat SoggyOs. Then we will select one of these variables as the first predictor, that is, the variable that defines the first subgroups. In our example, the variable we selected was the type of city or town where the respondent lived. Of those who live in suburban areas, some 22 percent eat SoggyOs. Among those living in either the city or rural areas, only 17 percent do.

This first split of the total sample shows some of the great flexibility this method has in identifying sub-groups. In this example, the procedure automatically combined two types of respondents (live in city and live in rural areas) into a single group. We did not have to instruct the program to do this.

Rather, it examined all possible ways of splitting the sample based on this variable. Since this variable comprised only three categories (city, suburban, rural), the sample could be split into at most three groups. Counting all possible two- and three-way splits of the sample, the program needed to compare four alternative splitting schemes. (One split would break the sample into three groups, each holding one geographic category, while each of the other three would combine two of the categories and contrast them with the one category remaining.) Doing these four comparisons does not seem like a great deal of work, but when the variable has more categories, the number of possible splits can run into the millions.
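To see how quickly that search grows, consider a small sketch (ours, not anything from an actual CHAID program) that counts the candidate splits for a freely-combining variable. Every way of sorting the variable's k categories into two or more non-empty groups is one candidate, which comes to the Bell number B(k) minus one (dropping the trivial everyone-in-one-group case):

```python
# A sketch of why candidate splits explode for a freely-combining
# variable: every partition of its k categories into two or more
# non-empty groups is a candidate split, i.e., Bell number B(k) minus 1.

def bell(k: int) -> int:
    """Bell number B(k), computed with the Bell triangle."""
    row = [1]
    for _ in range(k - 1):
        new_row = [row[-1]]
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[-1]

for k in range(3, 13):
    print(f"{k:2d} categories -> {bell(k) - 1:>9,} candidate splits")
# Three categories give the four schemes described above; by 12
# categories the count already tops four million.
```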

After performing the first split, you would then continue using the procedure to examine the subgroups just formed - in any sequence that you choose. We decided here to look at the subgroup living in the suburbs - already relatively rich in eaters of SoggyOs. Splitting this group further should produce smaller subgroups, in some of which eating this breakfast substance is highly prevalent.

Nothing in the method demands that we split this suburban subgroup any further or that we select them for investigation first (before the less enlightened consumers in the city and rural regions). CHAID and CART allow you great flexibility in how you analyze the model and where you stop.

We decided to select an attitudinal variable as the basis for the next split in the model. Focusing on the 776 suburban respondents, we found a significant differentiator in the extent to which they agreed that "There is no such thing as too much sugar." Of those who agreed completely with this modest proposition, some 28 percent ate SoggyOs. Of those who disagreed completely, only 10 percent did. Those with middling levels of agreement were about as likely as the overall average to be SoggyOs eaters (some 19 percent). Now, based on one attitudinal and one demographic variable, we have already identified several groups differing strongly in terms of the behavior we wish to understand (eating SoggyOs).
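For readers who want to see the arithmetic behind a split like this, here is a brief illustration. The counts are invented to roughly match the percentages just quoted among the 776 suburban respondents; the test CHAID applies to a categorical dependent variable is, at heart, the familiar chi-square test on the resulting table:

```python
# A hedged illustration (counts invented to approximate the percentages
# in the text) of the chi-square test behind a CHAID split.
from scipy.stats import chi2_contingency

# Rows: agree completely / middling / disagree completely.
# Columns: eats SoggyOs / does not.
table = [
    [121, 312],   # ~28 percent of 433 "agree completely"
    [ 42, 179],   # ~19 percent of 221 middling (illustrative size)
    [ 12, 110],   # ~10 percent of 122 "disagree completely" (illustrative)
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, dof = {dof}, p = {p:.2g}")
# A tiny p-value is what qualifies the attitude item as a significant
# predictor for this subgroup.
```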

After performing this split of the sample, we could further analyze the three new subgroups formed. For instance, we could look at the 433 respondents who both lived in the suburbs and agreed completely that "There is no such thing as too much sugar." Or perhaps we could go back and look at the 679 rural and urban respondents whom we have not yet analyzed. Panel 3 of Figure 1 describes the choice at this juncture in the analysis in more detail.

The analysis continues, creating smaller and smaller subgroups until it reaches some minimum group size (that the analyst sets) or until no more significant predictors emerge. (The analyst also sets the threshold for statistical significance.)
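Readers who like to see procedures in outline may find a sketch helpful. Below is a minimal, hypothetical version of this grow loop in Python - our pseudocode, not anything from an actual CHAID program - with the two analyst-set stopping controls shown as constants. The helper `find_best_split` is assumed, standing in for the search across candidate predictors:

```python
# A minimal, hypothetical sketch of the grow loop described above.
# `find_best_split` is an assumed helper that searches the candidate
# predictors and returns the winning split (with its p-value and the
# subgroups it forms), or None when no candidate qualifies.
MIN_GROUP_SIZE = 30   # analyst-set floor on subgroup size
ALPHA = 0.05          # analyst-set significance threshold

def grow(group, find_best_split, depth=0):
    split = find_best_split(group, min_size=MIN_GROUP_SIZE)
    if split is None or split.p_value > ALPHA:
        return  # no significant predictor remains: this node is a leaf
    print("  " * depth + f"split on {split.variable} (p={split.p_value:.3g})")
    for subgroup in split.subgroups:
        if len(subgroup) >= MIN_GROUP_SIZE:
            grow(subgroup, find_best_split, depth + 1)
```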

The full analysis appears in Panel 4 of Figure 1, in the form of a "tree diagram." This tree serves as a basic way of displaying and working with data in this form of analysis.

When completed, the tree shows all the splits that we chose to make in the course of differentiating those more likely to eat SoggyOs from those less likely to eat them. The last variable we chose to enter into the analysis was the respondent's age. We found that younger eaters consumed more SoggyOs than did older people, once we allowed for the other two variables (type of town or city, and agreement with "no such thing as too much sugar"). We now have 10 subgroups, ranging in likelihood of consuming from 5 percent to 29 percent. The groups range in size from 36 to 311.

For this example, we created a nice, fairly symmetrical tree, with the same variables appearing in all locations (or "nodes") in a given level or tier of the tree. You do not need to create trees that "match" across each level, as this one does. Creating this highly consistent diagram in fact caused the model to fall short of the apparently "optimal" one, by about 2 percent correct classification. In this case, we felt that this was a small penalty to pay in overall accuracy to arrive at a model that the client found completely intelligible and useful.

Some distinctive features of classification tree methods

Using classification tree methods, you almost always can construct many alternative models with nearly the same predictive power. (In somewhat more statistical terms, you can create many "nearly optimal" models.) Some statisticians have criticized this method because it can create models that look very different and yet have about the same predictive power overall. We could argue, though, that this is in fact a real strength of the method. For instance, you can create a "nearly optimal model" with just demographic items, another with just responses to a survey and other models combining the two. Once you get away from the idea that you need one "best" model, and understand that you can create many useful and "nearly best" models, this technique starts to appear much more powerful.

I never run KnowledgeSEEKER on its "automatic" model-building setting, although you can. You gain much greater insight into the data by using the program to find which variables would be significant predictors at each point in the splitting and re-splitting of the sample. Examining these, you can then decide which of these candidate variables will provide the most useful information and instruct the program to display the most significant sample-split based on that variable. After completing a model, I invariably go back and create alternative ones. This process of interactive model-building can give you a thorough understanding of the data, particularly about the characteristics of sub-groups you might not otherwise find.

Classification trees become increasingly valuable as the data we try to analyze becomes more complex or more "peppered" with the irregularities that often frustrate other analytical approaches. Dealing with complicated and sometimes incomplete or inconsistent data grows more critical as organizations try to make more strategic use of databases or to link databases with surveys.

These procedures have particular strength in handling missing data. With most multivariate methods (including, for instance, regression, discriminant analysis and factor analysis), missing data can pose serious problems. If respondents have any missing response, these procedures will (by default) drop them entirely from the analysis. You can fill in the missing values with means (or some other values), but by doing this you introduce something into the analysis not present in the original data.

Classification tree analysis treats a missing value as a type of response but with extra control over how it is handled. Missing responses can be allowed to combine freely with any responses, so that they lead to groups varying as much as possible in terms of the dependent variable. Alternatively, the procedure can hold them to one side, not allowing them to combine with any other response.

You specify how CHAID and CART handle missing values as part of the general definitions you make for all the independent, or predictor, variables in the analysis. You can handle these variables either as continuous or categorical. Continuous variables, like weight, dollars owed or boxes of SoggyOs consumed, can have one or several missing value codes specified. The program will report means and standard deviations for variables of this type.

You can treat categorical variables in several ways: as monotonic, floating, free or with no combination. Monotonic variables must define subgroups with their codes kept in strict sequence. If, for instance, you have codes 1, 2, 3 and 4, then you cannot create a group defined by those with codes 1, 2 and 4, contrasting with a group defined by code 3. This type of grouping often makes sense with rating scales but usually would not be useful if the codes 1, 2, 3 and 4 actually stood for nominal values such as North, South, East and West. With nominal variables, the free combination option works better. Suppose we had a variable with the values North, South, East and West; the program would then allow these regions to group in any way that led to the strongest differences in the dependent variable.

Floating variables are like monotonic ones, with the added feature that missing values are allowed to combine freely with any other group formed (or float). Actual codes must get grouped in sequence, but the missing responses can appear as a code in any of the groups formed. For example, suppose we had a variable with codes 1, 2, 3, 4 and missing. The procedure could not (for instance) create a group defined by those with codes 1 and 4 (and another with codes 2 and 3), but it could create a group defined by code 1 and missing responses vs. codes 2, 3 and 4. Or codes 1 and 2 could appear in one group and codes 3, 4 and missing in another - and so on. This gives great flexibility in dealing with missing responses, and also gives a way of "filling" these with values that will maximize differences in the dependent variable in the analysis.

Specifying no combination forces the program to create a subgroup corresponding to each code in the variable. A variable with four codes will lead to four subgroups, a variable with five codes will lead to five groups - and so on. If you specify a variable as no combination, and it then does not pass tests for statistical significance and minimum group size, it will not get onto the list of significant predictors. You probably will not use this option often; I have seen it used in two or three analyses over the last 10 years.
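To make these four combination rules concrete, here is a small sketch (our illustration, not ANGOSS code) that enumerates the two-group splits each rule permits for the codes 1 through 4 used in the examples above:

```python
# Enumerating the two-group splits each combination rule allows for
# codes 1-4 (an illustration; the program searches these same spaces,
# and far larger ones, internally).
from itertools import combinations

codes = (1, 2, 3, 4)

# Monotonic: codes must stay in sequence, so a two-group split is a cut.
monotonic = [(codes[:i], codes[i:]) for i in range(1, len(codes))]
# -> (1 | 234), (12 | 34), (123 | 4): three splits.

# Free: any non-empty subset may pair against its complement.
free = set()
for r in range(1, len(codes)):
    for subset in combinations(codes, r):
        rest = tuple(c for c in codes if c not in subset)
        free.add(frozenset((subset, rest)))
# -> seven splits, including the (1, 4 | 2, 3) grouping monotonic forbids.

# Floating: monotonic cuts, with "missing" free to join either side.
floating = [(a + ("missing",), b) for a, b in monotonic] + \
           [(a, b + ("missing",)) for a, b in monotonic]
# -> six splits, e.g., (1, missing | 2, 3, 4) as in the text.

# No combination: no grouping at all - one subgroup per code.
print(len(monotonic), len(free), len(floating))   # 3 7 6
```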

KnowledgeSEEKER 3.1 for Windows

KnowledgeSEEKER (from ANGOSS Software) is a fine piece of software that deserves far more recognition. In their recent thoughtful and scholarly review of CHAID and CART software, Chaturvedi and Green (1995) did not even mention this program. This is unfortunate, because KnowledgeSEEKER's analytical capabilities place it ahead of its principal competitors - strong as these products are in their own right. (These products are CHAID for Windows from SPSS and CART from Salford Associates. We will discuss them briefly in this review.) With its new release, KnowledgeSEEKER (KS) has become the most comprehensive and analytically powerful program available for classification tree analysis.

KnowledgeSEEKER excels in handling categorical variables with many categories. For instance, let's look at the hypothetical worldwide SoggyOs analysis in Figure 2. Here we have respondents from over 100 countries, along with how many boxes of SoggyOs they ate on average. (This comes from a real classification tree analysis, with SoggyOs substituted for the real dependent variable to protect the innocent.)

KnowledgeSEEKER analyzed this problem simply and easily. It sorted the 100-plus countries of origin into eight groups. The countries in each group differed significantly from the countries in all other groups in terms of how many boxes of SoggyOs respondents consumed. In the diagram, the first box shows the worldwide mean, its standard deviation and the total number of respondents. Boxes lower in the diagram dispense with the standard deviations and show just means and numbers in each group formed. The number directly below the phrase "country where consumed" is the significance level of the split. (This is some 3 x 10^-15, or very, very significant. The test used to determine significance comes directly below, showing an F-value of 118.7.) The parenthetical numbers next to the countries show the numbers of respondents from each. As you can see, KS has tackled a highly difficult comparison and has found just a few groups differing very dramatically in terms of boxes of SoggyOs eaten.
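The underlying test is worth a quick illustration. For a continuous dependent variable like boxes consumed, the significance of a many-way split comes from a one-way analysis of variance across the groups formed; the consumption figures below are made up purely for the demonstration:

```python
# An illustration (with invented consumption figures) of the F-test
# behind a split on a continuous dependent variable: a one-way ANOVA
# across the groups the split forms.
from scipy.stats import f_oneway

group_a = [6.1, 5.8, 7.0, 6.4, 5.9]   # boxes eaten, countries in group A
group_b = [3.2, 2.9, 3.8, 3.1, 3.4]   # countries in group B
group_c = [1.1, 0.8, 1.5, 1.2, 0.9]   # countries in group C
f_value, p_value = f_oneway(group_a, group_b, group_c)
print(f"F = {f_value:.1f}, p = {p_value:.2g}")
# Figure 2's analogous statistics are F = 118.7 and p of about 3 x 10^-15.
```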

The major competitive products could handle this problem to an extent - but either would require more work from you or provide less detail. CHAID from SPSS has a limit of 31 categories in a predictor variable; the procedure simply skips any variable with 32 or more categories in its responses. You can recode any offending variables manually (using the base program of SPSS or another statistical program), combining countries into 31 groups or fewer. This involves ample extra work, though, and might obscure some detail that you would find useful. Like KS, CHAID from SPSS can create anywhere from two to 15 subgroups at any point in the tree - as needed by the analysis, and as permitted by lower size limits on groups created. Again, the number of groups created gets determined by statistical tests, looking at all possible ways of splitting the sample.

CART from Salford Associates can handle many codes in dependent variables, but can only split the sample into two groups at any point (or, as statisticians will sometimes state, "bifurcate" the sample). Salford provides a rationale for limiting CART to two-way splits, in part maintaining that a series of two-way splits can lead to the same outcome as one many-way split. In this writer's experience, though, you do not get the same informational content or descriptive value from many two-way divisions as from a single many-way split that lays out the strongest contrast in one step. But here we get into a difference of opinion based on experience, rather than any hard and fast statistical rules. No matter what we believe statistically, though, restricting the analysis to two-way splits imposes a definite structure on the data. Splitting the sample in no more than 15 ways at a point imposes a structure also, but allowing the sample to split into as many as 15 subgroups seems closer to our (human) analytical limits - and so less of a real-world restriction.

KnowledgeSEEKER can handle categorical variables (like the country where SoggyOs were eaten) having up to about 2,000 categories. If you restrict the categorical variable to monotonic combination, it can handle 4,000 to 5,000 categories. KS can mix categorical with ordinal and continuous variables in one analysis. KS also guesses (usually correctly) how to handle each variable - as continuous or categorical, and if categorical, whether treated as monotonic, floating or freely combining. You can override its informed decisions - and checking the variable specifications (as we will discuss later) plays a critical role in any classification tree analysis.

In identifying variables by type and in determining statistical significance, highly complex rules (amounting to a form of artificial intelligence) come into play. Making complex many-way comparisons involves much more than at first may be apparent. Some early work in the development of KnowledgeSEEKER, published in IEEE proceedings on artificial intelligence (DeVille, 1990), pointed out the difficulties inherent in determining whether results are statistically significant. Exhaustive testing (in the form of "Monte Carlo simulations") showed that the methods typically used for handling many-way comparisons in analysis of variance tended to become inaccurate with larger numbers of comparisons. (For instance, the SNK method tended to be too liberal in declaring results "significant," while Scheffe's method was too stringent.) KnowledgeSEEKER's statistical rules for complex comparisons come directly from the empirical results of this extensive testing. As a result, the analyst can allow this program to attack data of incredible complexity with assurance that any results identified as statistically significant truly pass the requisite tests.

Working with KnowledgeSEEKER

We used KnowledgeSEEKER extensively on two IBM-compatible PCs, both "middling" in power. One was a 486-based DX2-50 with 16 MB of RAM, the other a 486 DX2-66 with 32 MB of RAM. We tried the program under Windows 3.11, the "final release" beta (or test) version of Windows 95, and the actual Windows 95 operating system. We saw no appreciable differences in the operation of the program based on the different computers or versions of the operating system used.

Setting up the program
Installation of the program is simple and ran smoothly from three floppy disks. The program occupies less than 4 MB of hard drive space, including example files and its graphing module. The program automatically makes its own group, or window, under Windows 3.1, but it is a simple matter to move the three KS icons elsewhere, should you wish.

Starting the program
KnowledgeSEEKER starts with a simple, blank screen. You must first identify the data file and then have the program convert it into KnowledgeSEEKER's proprietary format. Conversion times usually are nominal for survey-sized data files. With more cases and variables, though, and with more complex variables (usually nominal-level variables with many codes), conversion times can grow. For instance, conversion took 20 minutes in an analysis starting with 11,000 records and 450 variables (about a dozen of which were nominal-level with a few hundred possible responses).

Before you convert the file to KS format, you must have in mind the dependent variable (the one forming the basis for all splitting of the sample). You can later change this variable, but you must identify something for the analysis to begin. KS will examine this variable and make an informed guess about whether it is continuous or categorical. This matters, because with a continuous variable you will see an average (or mean) value at each point in the tree, while with a categorical variable, you will see the percentage in each category. (You can see examples of each type of dependent variable in the figures in this article.)

Once the data appears in KS format, several other menu selectors appear, in this order:

File Edit View Grow Reshape Options Window Help

Each of these selectors can sprout a menu at the appropriate time. Figure 3 shows how the full menu bar looks once a KS data file is open. The name of the file (in this case, the poetically titled T4368NB1.DAT) appears in the title bar.

You can start with data in an ASCII file, in dBASE, Paradox or Lotus worksheet formats, and SPSS or SAS file formats, among others. If you use SAS or SPSS files, KS will use both any long variable names and any long value labels that you have defined. This means that if, for instance, you have a variable with the long label "Region of the country," that long label will appear in your KS analysis. Similarly, if you have labels for the values, such as North, South, East, and West (for codes 1, 2, 3 and 4), then these descriptive labels (not the number codes) will appear in the analysis. You can see how KS uses long labels in Figure 2.

The first critical decisions: "Edit View"

You should save the file immediately once KS converts it. Then you should start checking all the variables to make sure that KS has correctly identified them, and so will be handling them in the way you want. This checking takes place under the Edit menu, within the operation called Edit View. Without doubt, this serves as one of the most important steps in the analysis. If you do not check how the program plans to handle variables, you inevitably will find yourself going back to Edit View to change definitions repeatedly. The program does the best it can, but it still does not understand the variables and what they mean the way that you do. Oddly enough, variables that get mis-specified (from your perspective) have a way of appearing often on the list of significant predictors.

Figure 4 shows the Edit View screen, with several variables selected for checking at once. You can check and change the variables one at a time or select a group (as in this figure) and change them all in various ways. Earlier, we mentioned that you should save the file before going into the Edit View window. That is important because this is one of the few places where KS might get bogged down and, under certain circumstances, stop working. In our experience, these rare failures always occurred when we were trying to change the types of too many complex variables at once. (In particular, changing several variables with many categories from being defined as "monotonic" to being defined as "free" - or freely combining - can bring the program to a halt.) The solution to this problem (aside from saving your work often) lies in simply not selecting too many highly complex variables to work on in the Edit View window at one time. KS does not often have this type of failure, but the last thing you want to see when working on a complex file is a message window saying "Assertion Failed," followed by the program shutting down. Perhaps the makers of KS can adjust the program to prevent the user from putting too many complex variables into the Edit View window at the same time. Or perhaps they can rewrite the program to prevent it from ever overloading. Either of these changes would rank high on a wish list for future development of KS.

"Mapping Data" - Also under the editing menu, this option allows you to do simple recoding of the data. (Mapping means recoding in current KS parlance.) If you have complex recoding to do, or need to recode many variables, you will find it faster and easier to use a full-featured statistics program. As mentioned above, KS will use both variable labels and value labels that you have defined in either SPSS or SAS files. Also, while KS imports data (reads it into its own format), it does not export data back out again. Any recoding that you do within KS will stay strictly within KS. You can generate a "view" of a data file in spreadsheet format, and then paste the entire thing into a program such as Lotus 1-2-3 or Excel, but this seems rather cumbersome. An "export" feature is also on the wish list for this program.

"Grow" - This menu houses the central activity of this program. Here you tell the program to find all possible ways to split the sample at a given point in the analysis or to "force" a split of your devising. The program makes good use of shortcut keys in this and other basic operations. Simply highlight the point in the tree you want to analyze and type a lower case "f" to find all splits. Alternatively, type an uppercase "F" to force or specify a particular way of splitting the sample.

Once you find splits, typing a "g" opens a window that shows all variables that pass the criteria you have set (more on criteria below). Figure 5 shows the window that opens, titled "Choose interesting split." (The selection hidden under the gray bar is "height [cm]," the variable that appears on the screen.) You can scroll through these possible predictors and quickly examine how the tree will look choosing each of the candidates for defining the split. Alternatively, you can see how all the possible predictors work, one at a time, in order of statistical significance, by repeatedly typing "i."

You also can let the program run on "automatic" from this point. Smart as KS is about statistical testing, though, do not expect it to think through your problem for you. You almost inevitably will find it more useful to examine possible ways of splitting the sample at each point and to select the one you find most useful.

"Filter" - Here we find another critical decision for growing trees, namely the minimum significance level that you will accept for a split. If you choose, for instance, the 0.01 (or 99 percent certainty) level, then the program will restrict the possible candidates for splitting the sample to those variables that pass the test. For each candidate variable, KS will find the one best way for defining a split of the sample.

KS allows you to set this as tightly or loosely as you wish. For instance, if you set this to 0.99, every variable with a 1 percent or better chance of being significant will appear on the list of candidates. This is one definitive way to settle all arguments with those who insist that some favorite concern of theirs must have just missed statistical significance by some fluke or the slightest of whiskers. You can, as a result, have the great satisfaction of (for instance) telling the second assistant brand manager that, as an idea for a premium, the membership in the great books club stands about a snowball's chance in hell of piquing more buying interest. And you will have the numbers to prove it.

"Reshape" - This lets you get rid of splits that you do not want and prune entire levels away from the tree. This often proves handy when the analysis goes in directions you do not find useful or the model grows too complex.

"Options" - Hidden in the Options menu, unfortunately, you will find one of the most crucial controls for running KS, labeled somewhat cryptically "tree growing" and then "create size." This sets the minimum size for any group that the program splits off from the rest. You need to set this. The default size is ONE.

While it may make sense with certain precision measurements to isolate one case from all others (and KS will perform the test correctly), this never works with real-world data. Set your minimum to something much larger than a single case.

If you want a good suggestion for a minimum size, look at the value the program selects for its "stop size," the smallest group that it will form running on automatic. We usually set the minimum group size at about 5 percent of the total for samples in the hundreds, and about 3 percent for samples in the thousands. Whatever the total sample size, we recommend that you regard as largely qualitative any splitting that results in groups of fewer than 20. Again, since KS uses statistical testing that has been optimized for small samples, it may tell you that everything seems just fine if you elect to split off tiny groups. But I know and you know that real data simply has too many anomalies to trust tiny subgroups, no matter what the computer says.
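Encoded as a (very rough) rule of thumb, the guidance above looks like this; treat it as a heuristic from our experience, not a statistical law:

```python
# Our rule of thumb for a minimum group size, not a statistical rule:
# roughly 5 percent of samples in the hundreds, 3 percent of samples
# in the thousands, and never below 20 respondents.
def suggested_min_group(n_total: int) -> int:
    share = 0.05 if n_total < 1000 else 0.03
    return max(20, round(n_total * share))

print(suggested_min_group(500))    # 25
print(suggested_min_group(4000))   # 120
```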

The Options menu also allows you to test how well you have done up to a given point in the analysis. The "Resub Error" option will show the correct classification level for a categorical dependent variable or the r-square for a continuous dependent.
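In case those two figures seem mysterious, here is a sketch of what each boils down to, assuming we already know which terminal group (leaf) each respondent occupies; `leaf` and `y` are hypothetical stand-ins for those assignments:

```python
# A sketch of what the two "Resub Error" figures amount to, given each
# respondent's leaf assignment (leaf) and dependent value (y).
from collections import defaultdict

def resub_accuracy(leaf, y):
    """Categorical dependent: classify each leaf by its majority
    category; report the share of respondents classified correctly."""
    by_leaf = defaultdict(list)
    for node, value in zip(leaf, y):
        by_leaf[node].append(value)
    correct = sum(vals.count(max(set(vals), key=vals.count))
                  for vals in by_leaf.values())
    return correct / len(y)

def resub_r_squared(leaf, y):
    """Continuous dependent: predict each respondent with the leaf's
    mean; report 1 - SSE/SST."""
    by_leaf = defaultdict(list)
    for node, value in zip(leaf, y):
        by_leaf[node].append(value)
    means = {node: sum(v) / len(v) for node, v in by_leaf.items()}
    grand = sum(y) / len(y)
    sse = sum((value - means[node]) ** 2 for node, value in zip(leaf, y))
    sst = sum((value - grand) ** 2 for value in y)
    return 1 - sse / sst
```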

A short KS wish list

This program rates a solid "excellent" (if not "indispensable") overall, but it still remains software, and as such, could do certain things better. We already mentioned its occasional tendency to quit during complex "Edit View" operations and its lack of file export capabilities as areas that could use some work. Perhaps the largest gap comes in its relatively slender data manipulation abilities. It will not create any new variables once you have the file in KS format, so you can't do a simple operation like adding two variables together or taking the difference of two variables. You must go to the program that created the data file, do the manipulations there, and then re-import the whole file. Similarly, KS cannot append variables to a file it has already saved in its format. Every time you think of something else you might have done with the data (and a thorough KS analysis will encourage plenty of this type of thought), you have to go back and start from scratch, rebuilding the entire KS file. This can get irritating with smaller files, and truly try your patience with larger files where the data transformation into KS format takes longer.

You must take the classification tree output from KS and paste it into another program to use it. KS does not save the finished classification tree as an entity separate from the data file. You will find it simple to edit the tree diagrams with most graphics and presentation packages. However, it would be much more convenient to have a means in KS for saving these images so that you can call them up later at your leisure. Please note that with Windows 95, and plenty of RAM (we used computers with 16 MB or more), you can keep KS open indefinitely, while you work with your other applications, moving back and forth, cutting and pasting until you reach exhaustion. (Windows 95 is another topic that we will tackle in a later issue of Quirk's.)

KS also produces some text-based output, in particular gains analyses, that you may find very useful. Gains analyses show all groups formed, in descending order of prevalence of the dependent variable. So if the dependent variable is boxes of SoggyOs consumed, then the groups get listed in order of incidence of eating SoggyOs. The group at the top of the list has the highest incidence and the unenlightened group at the bottom the lowest. KS will save these text files separately from the analysis data set.
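A gains table is simple enough to mock up. The sketch below (with invented group figures) shows the idea: the tree's terminal groups, sorted by incidence of the behavior under study:

```python
# A mock gains table: groups formed by the tree, listed in descending
# order of incidence of the dependent behavior. All figures invented.
groups = [
    {"label": "suburban / agree completely / under 25", "n": 36,  "eaters": 10},
    {"label": "city-rural / disagree / 45 and over",    "n": 311, "eaters": 16},
    {"label": "suburban / middling / 25-44",            "n": 140, "eaters": 27},
]
for g in sorted(groups, key=lambda g: g["eaters"] / g["n"], reverse=True):
    print(f'{g["label"]:<42} {g["eaters"] / g["n"]:6.1%} of {g["n"]}')
```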

Major advances

Overall, we find classification tree analysis, and particularly KnowledgeSEEKER, to provide major advances in analyzing complex data. KnowledgeSEEKER's great flexibility in handling different kinds of data and missing data makes it a particularly valuable tool for revealing strategically useful information. Once the new user grows accustomed to the classification tree as a way of looking at the data, the program is straightforward and quite friendly to use. The largest obstacle for many would-be users, in fact, seems to lie in the basic idea of splitting (and re-splitting) the sample to find contrasting groups. Once you pass this barrier, you should find KnowledgeSEEKER an indispensable tool for making informed decisions.

Note: KnowledgeSEEKER, from ANGOSS Software International, Toronto (416-593-1122), is also available from Sawtooth Technologies, Evanston, Ill. (708-866-0870).

References
Chaturvedi, A. & Green, P. (1995). "Software review: SPSS for Windows, CHAID 6.0." Journal of Marketing Research XXIII (May 1995), pp. 245-254.

DeVille, B. (1990). "Applying statistical knowledge to database analysis and knowledge base construction." Proceedings of the Sixth Conference on Artificial Intelligence Applications (IEEE Computer Society: Los Alamitos, Calif.).