Portrait of a data miner

Editor’s note: Karl Rexer is president, Paul Gearan is senior consultant and Heather N. Allen is senior consultant at Winchester, Mass.-based Rexer Analytics.

But what exactly is data mining and who are the data miners? According to Wikipedia, data miners are individuals who sort through large amounts of data and pick out “relevant information.” This sounds interesting and important enough, but what are the tools of this archeology, what are its goals and what are the typical pitfalls faced in this pursuit? At Rexer Analytics, we had the perspectives of our own experiences and those of our colleagues. However, we were interested in broadening our understanding of this eclectic population to which we belong, so we asked a few hundred data miners to tell us more about who they are and what they do.

Far-flung

The data mining community is far-flung with no epicenter. Data miners can be found in industry and academia, in countries such as China and Venezuela. There is no single umbrella or organization under which they congregate. Many do not even refer to what they do as data mining at all, preferring terms like knowledge discovery, business intelligence or just plain analysis. All of these factors make data miners a difficult population to learn about, and there have been few attempts to characterize this population to date (a notable exception being the single-item surveys offered by Gregory Piatetsky-Shapiro on his excellent KDnuggets site, www.kdnuggets.com).

In the spring of 2007 we set out to learn more about this amorphous community via an online survey. Data miners were queried about themselves (their location, education and experience), the challenges that they face, their datasets, the algorithms they favor, their preferred software and the software features most important to them.

Our first challenge (after winnowing the universe of questions we wanted to ask to a number that people might conceivably take the time to answer) was to find these data miners. In order to reach a broad array of individuals, we decided to employ the snowball method of data collection, in which direct contacts are asked to forward the survey to others within the data mining community. The snowball method has been found to be useful when sampling hidden populations for which direct access is difficult (the populations typically used as examples are drug users and prostitutes).

Aware of the potential biases introduced by the snowball methodology, we put several controls into place: 1) all software vendors included in the survey were contacted and given the chance to recruit respondents; 2) initial contacts were given an access code which had to be used in order to gain entry into the survey, which allowed tracking of the initial source of each respondent; 3) respondents were asked how they learned about the survey; and 4) respondents were specifically asked whether they worked for a data mining software vendor (and if so, which vendor).

We started our snowballs by posting links to our survey on newsgroups (such as KDnuggets), user groups and blogs and by sending e-mails to the organizers of various data mining conferences, to data mining software vendors and to personal contacts within our network of acquaintances.

We ultimately received responses from 314 individuals in over a dozen fields and from 35 countries. One hundred respondents who were employed by software firms that produce data mining tools were removed from the data before the primary analyses, due to the possible bias in their perspective or motivation.
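
The tracking and exclusion steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure actually used for the survey: the field names (access_code, works_for_vendor), the function names and the sample records are all hypothetical.

```python
from collections import Counter

def split_respondents(responses):
    """Separate vendor employees from the sample kept for analysis."""
    kept = [r for r in responses if not r["works_for_vendor"]]
    removed = [r for r in responses if r["works_for_vendor"]]
    return kept, removed

def count_by_source(responses):
    """Tally respondents by the access code of the initial contact."""
    return Counter(r["access_code"] for r in responses)

# Hypothetical records, for illustration only (not survey data).
responses = [
    {"access_code": "NEWSGROUP", "works_for_vendor": False},
    {"access_code": "VENDOR_A", "works_for_vendor": True},
    {"access_code": "NEWSGROUP", "works_for_vendor": False},
    {"access_code": "CONFERENCE", "works_for_vendor": False},
]

kept, removed = split_respondents(responses)
sources = count_by_source(kept)
```

Tallying the retained respondents by access code makes it easy to see whether any single recruiting channel dominates the sample, which is the kind of bias the controls were meant to expose.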

The 214 data miners remaining were a fairly diverse group. Fewer than half (44 percent) of our respondents were from the U.S., with Germany, the U.K. and Greece being the other countries with the most representation.

Most respondents held advanced degrees. About three in 10 held doctorate degrees, four in 10 held master’s degrees, 15 percent held bachelor’s or the international equivalent, and just under 10 percent held MBAs. There were also several respondents who held professional or “other” degrees.

Data miners also tended to have been in the field for a significant amount of time. One quarter of the respondents reported having worked in data mining for over 10 years, and another third for six to 10 years. Only 14 percent reported having worked in the field for less than two years.

As mentioned previously, data miners do not always refer to themselves as such. When asked their job role, only 46 percent of respondents labeled themselves as data miners. Another 12 percent called themselves business analysts, and a further 12 percent self-labeled as researchers.

Data miners are applying their craft across a number of fields. The most commonly identified fields were CRM/marketing, finance and academia. However, a significant number of data miners also reported working with data in fields such as telecommunications, retail, insurance, Internet, technology, medicine and government.

Processes and tools

In addition to gathering information about data miners themselves, we asked data miners about their data, processes and tools. Some of our main findings were:

  • Predictive modeling and segmentation/clustering are the most common types of analyses that data miners conduct (89 percent and 77 percent, respectively). This was certainly consistent with our experience and makes sense given the preponderance of respondents working with CRM/marketing data.
  • Correspondingly, the most commonly used algorithms are regression (79 percent), decision trees (77 percent) and cluster analysis (72 percent). Again, this reflects what we have seen in our own work. Regression certainly remains the algorithm of choice for large sections of the academic community and within the financial services sector. More and more data miners, however, are using decision trees, and cluster analysis has long been the bedrock of the marketing community.
  • SPSS, SPSS Clementine, and SAS are the three most frequently utilized analytic tools and were each used in 2006 by more than 40 percent of data miners. Forty-five percent of data miners also employed their own code in 2006. Respondents were asked about 26 different software packages, ranging from the powerhouses above to less visible and less widely used packages such as Chordiant, Fair Isaac and KXEN.
  • Comparisons of reported 2006 use and planned 2007 use show that there is increasing interest in the Oracle Data Mining tool, and decreasing interest in C4.5/C5.0/See5. It will be interesting to see how these trends develop over time and if other tools find greater prominence in the future.
  • The primary factors data miners consider when selecting an analytic tool are: 1) the dependability and stability of software, 2) the ability to handle large data sets, and 3) data manipulation capabilities. Data miners were least interested in the reputation of the software and the software’s compatibility either with other programs or with software used by colleagues.
  • The top challenges facing data miners are dirty data, data access and explaining data mining to others. Over three-quarters of data miners listed dirty data as one of the major challenges that they face. This is again consistent with our own experience and the conventional wisdom discussed at data mining conferences: a significant proportion of most projects consists of data understanding, data cleaning and data preparation.
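
The shares reported above add up to well over 100 percent, which suggests that respondents could select more than one option per question. A minimal Python sketch of how such multiple-response percentages might be tabulated follows. The function name and the sample answers are hypothetical and are not drawn from the survey data.

```python
from collections import Counter

def selection_rates(answers):
    """Return the percent of respondents who selected each option.

    `answers` holds one collection of selected options per respondent,
    so the resulting percentages need not sum to 100.
    """
    n = len(answers)
    if n == 0:
        return {}
    counts = Counter(
        option for selected in answers for option in set(selected)
    )
    return {option: round(100 * c / n) for option, c in counts.items()}

# Hypothetical answers, for illustration only (not survey data).
answers = [
    {"regression", "decision trees"},
    {"regression", "cluster analysis"},
    {"decision trees"},
    {"regression", "decision trees", "cluster analysis"},
]

rates = selection_rates(answers)
```

Dividing by the number of respondents rather than the number of selections is what allows each option to be reported as a share of all data miners surveyed.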

Findings vary

While there was more consensus than disagreement on the above issues, the main findings did vary somewhat, depending on the domain the data miner works in, the tools used, geography and several other dimensions. One of the more interesting distinctions was that text mining was more commonly conducted by those working in the United States than by those working in other countries. The use of link analysis also differed: it was more commonly reported by those working with government or military datasets.

The factors considered in selecting a tool differed somewhat by dataset domain. Those working with retail and telecommunications data felt that a tool’s speed was the top priority. However, those working with financial data were more concerned with the ability to automate repetitive tasks. Finally, those in academia were significantly less concerned with these practical issues.

Different priorities

The results of the present survey underscore that there is great diversity in the current data mining community. This community is both large and varied, with different constituencies reporting different priorities for tool selection, different analytic approaches and the use of a variety of software tools.

Notably, however, the challenges faced by data miners are more universal than disparate. As previously mentioned, more than three-quarters of respondents identified dirty data as a significant challenge that they face, while more than half identified data access and availability issues. Difficulty accessing clean datasets has always been and will probably always be a significant hurdle for those trying to transform data into knowledge.

It is important to remember that the results of this survey represent a mere snapshot in time. The field of data mining is currently undergoing explosive growth, expanding into new areas and developing new technologies. Even as recently as five years ago, it would have been surprising to field a survey in which more than half of respondents did not consider themselves data miners but were nonetheless users of data mining algorithms and familiar with a number of software tools.

As the methods and means of data mining move out of the realm of academics and analytic specialists and into wider populations, the needs of the data mining community will continue to shift, driving the development of new algorithms and tools. Many data mining products will likely have even greater plasticity, a wider range of available algorithms and applications and perhaps greater transparency to a broad audience of users.

Use expands

Tracking the future preferences and activities of this heterogeneous population will be both interesting and important as a window into understanding data mining as its use expands to touch many corners of our world. Keeping pace with the changing needs and preferences of practitioners will be a significant challenge for those who supply the tools for these endeavors. Demands in various industries will alter expectations for data miners, who will in turn seek tools that will yield expedient and powerful solutions. Data mining software providers will hear from data miners about what they need to fulfill these new visions.

In order to help fill this information need, we have prepared our second annual data miner survey. We have retained the core items of the current survey in order to begin tracking trends. However, we also hope to dig deeper into some of the issues that arose in the 2007 survey. In the 2008 version, we will learn more about how data miners respond to the significant challenges that they face and how these practitioners envision the future of data mining. We will also explore attitudes toward the term data mining itself, an area of increasing controversy.