Editor’s note: Isaiah Adams is the manager of social media development at marketing research and analytics firm Optimization Group, Mich. This is an edited version of a post that originally appeared here under the title, “Explaining data mining in useful language.”
For those not familiar with data mining, the mere mention of the term is often enough to make them mentally check out. The term is thrown around carelessly, leaving its definition unclear. The truth is, the subject is full of jargon, tedious detail and complicated math, but if you can understand some of the basic concepts, it can be extremely valuable.
To provide insight into this important topic, I enlisted the help of Optimization Group’s Director of IT, Jim Kenyon. We tend to look at this from a marketing perspective. In other words, what does the data tell us about our client’s marketing?
Before we look at some of the fundamentals of data mining, it’s important to understand one principle: data mining is most effective when you ask specific questions. We like to use the illustration of peeling an onion. By peeling the onion one layer at a time, you can see more of what’s really going on with your marketing and ask more specific questions.
To help you peel back the onion and better understand data mining, specifically as it relates to marketing, we’ve asked Kenyon to answer important data mining questions in language that’s easy to understand. Each question is designed to help you get the most out of data mining and understand what it takes to get started.
How much historical data do you recommend a client have before they can do any meaningful data mining?
We want as many observations as we can get, though we’ve had reasonable success with three years of monthly observations (36 observations). The quality of the model goes up with more data (to a point).
Is there a recommended time period classification (observation frequency)? In other words, does it make a difference if the data is recorded daily, weekly, monthly or annually?
Models that use macro-economic data tend to use monthly sampling, as most econometric data is reported monthly. It’s the least common denominator.
How do I know if my data is in good or bad shape? What are the indicators?
Do you have monthly observations across a continuous time range? Are there valid values for all observations? Valid values are values that fall within the expected range for a field. For example, if the field is “age of person,” negative numbers would not be valid. Missing values are another case – for example, if TV spending is missing, is it because there was no spending (in that case it should be a zero, not missing) or because accounting lost the data for that month?
Next, is the data recorded on the same scale/unit of measure for each observation? Is the format (file layout) of the data consistent from observation to observation (and year to year)? Is the data recorded in a common file format (CSV, Excel, fixed column width)?
If someone answers no to more than one of these questions, chances are the data is in bad shape and will need significant work. It doesn’t necessarily mean that their data is unusable – on the contrary, most of the engagements we see have data that’s in pretty rough shape.
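As a rough sketch of what these checks can look like in practice – assuming the data has already been pulled into a pandas DataFrame, with file and column names (month, tv_spend, customer_age) that are purely illustrative – you might flag gaps, out-of-range values and missing values like this:

```python
import pandas as pd

# Hypothetical monthly marketing data; the file and column names are illustrative only.
df = pd.read_csv("monthly_marketing.csv", parse_dates=["month"])

# 1. Continuous time range: are any months missing between the first and last observation?
expected = pd.date_range(df["month"].min(), df["month"].max(), freq="MS")
print("Missing months:", list(expected.difference(df["month"])))

# 2. Valid values: e.g., a person's age should fall within a plausible range.
invalid_age = df[(df["customer_age"] < 0) | (df["customer_age"] > 120)]
print("Rows with invalid ages:", len(invalid_age))

# 3. Missing values: is a blank TV spend really "no spending" (should be 0) or lost data?
print("Months with missing TV spend:", list(df.loc[df["tv_spend"].isna(), "month"]))
```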
If my data is in bad shape, how is it cleaned and prepared for data mining?
Data scientists can restructure the data and load it into a relational database. Once in the database, it is transformed into monthly observations of features. Descriptive statistics are calculated for each feature. Descriptive statistics are things like mean, mode, standard deviation, median, frequency distributions, etc.
Plots are created for each feature. Typical plots are frequency distribution and time series (value of a feature by time period, from the start of the range through the end, in chronological order). These statistics and plots are reviewed and cross-checked with original (raw) data to make sure the transformation did not alter the data. The data review is then conducted with client data stakeholders/providers to check for and explain anomalies.
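As a minimal sketch of that review step – again assuming a pandas DataFrame of monthly observations, with tv_spend as a stand-in feature name – the statistics and plots might be produced like this:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("monthly_observations.csv", parse_dates=["month"])

# Descriptive statistics (count, mean, standard deviation, quartiles, etc.) for each numeric feature.
print(df.describe())

# Frequency distribution and time-series plot for one feature, e.g. TV spend.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
df["tv_spend"].plot.hist(ax=ax1, bins=20, title="TV spend: frequency distribution")
df.sort_values("month").plot(x="month", y="tv_spend", ax=ax2, title="TV spend: time series")
plt.tight_layout()
plt.show()
```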
How does the quality of my data affect any potential data mining project?
Poor data quality can reduce the predictive accuracy of a model. It can, in the extreme, prevent model development entirely.
I’m not sure where all my data is. Where is data most commonly stored?
In the sock drawer. Seriously, in an ideal world all data is stored in a data warehouse. More commonly, it comes from spreadmarts – an Excel spreadsheet from Bob in finance; another one from Jill in media planning; a CSV from some legacy mainframe application; and three external SaaS applications that two different guys in sales bought because they saw them in an airline magazine.
What are the most overlooked and under-appreciated aspects of data mining?
Data mining doesn’t require big data. That is, you don’t need millions of customer records to take advantage of the power of machine learning techniques. Monthly marketing spending and sales data over at least three years can produce very useful models to improve the effectiveness of your marketing dollars.
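To illustrate how far 36 monthly observations can go – this is a hypothetical sketch on synthetic numbers, not Optimization Group’s actual modeling pipeline – even a simple regression of sales on spend can be fit and sanity-checked on that much data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# 36 monthly observations: two spend channels (illustrative, in $000s) and the resulting sales.
spend = rng.uniform(10, 100, size=(36, 2))
sales = 50 + 3.0 * spend[:, 0] + 1.5 * spend[:, 1] + rng.normal(0, 20, size=36)

model = LinearRegression().fit(spend, sales)
print("Estimated sales lift per $1k of spend, by channel:", model.coef_)

# Cross-validation gives a rough sense of how well the model generalizes beyond the 36 months.
print("Cross-validated R^2 scores:", cross_val_score(LinearRegression(), spend, sales, cv=5))
```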
How are outside variables (weather, MCSI, etc.) incorporated into a data mining project? How is the impact of outside variables measured alongside internal marketing variables?
External variables are included at the same observation frequency (typically monthly) as the data that is provided. The machine learning tools include these features while constructing the candidate models and determine whether any of them are contributing to the predictive accuracy of the model. If they are contributing, they are included in the model. If not, they aren’t. They are not treated differently.
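Conceptually, an external variable such as consumer sentiment just becomes one more column alongside the internal spend features, and the fitted model tells you whether it carries any weight. The sketch below assumes hypothetical file and column names and uses a random forest purely for illustration:

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Internal marketing data and external macro data, both at monthly frequency (illustrative files).
internal = pd.read_csv("monthly_marketing.csv", parse_dates=["month"])
external = pd.read_csv("macro_indicators.csv", parse_dates=["month"])  # e.g. MCSI, average temperature

# Join on the common monthly observation period; external variables become ordinary extra columns.
data = internal.merge(external, on="month")
features = data.drop(columns=["month", "sales"])
target = data["sales"]

model = RandomForestRegressor(random_state=0).fit(features, target)

# Features that contribute little – internal or external alike – are candidates to drop.
for name, score in sorted(zip(features.columns, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")
```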
Does it matter how many variables are in a data mining project? How does this affect time, cost, etc.?
There is a limit based on the number of features mining tools can handle (it varies by tool). The number of features is reduced through “feature selection” – an iterative process that looks for features that tell the same story (for example, temperature reported in both Celsius and Fahrenheit – the second copy doesn’t add information to the model; it tells the same story in different units) or are highly correlated. Only one copy of such a group of features is carried into the modeling phase.
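A minimal sketch of that kind of redundancy check – using a pairwise correlation matrix, with the 0.95 cutoff chosen arbitrarily for illustration – might look like this:

```python
import numpy as np
import pandas as pd

def drop_redundant_features(features: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    """Keep one copy of any group of near-duplicate features (e.g. Celsius vs. Fahrenheit).

    Assumes `features` contains only numeric columns.
    """
    corr = features.corr().abs()
    # Look only at the upper triangle so each pair of features is considered once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return features.drop(columns=to_drop)
```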
Additional features that are not in an analytic-ready format (one record per observation period, with each variable as an additional column in that record) add to extract, transform and load (ETL) time. This can be expensive if the data requires significant work to get it into an analytic-ready format.
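For reference, “analytic-ready” here simply means one row per observation period with each variable as its own column; a long, transaction-style extract can often be reshaped into that form with a pivot, as in this hypothetical sketch:

```python
import pandas as pd

# Long, transaction-style extract: one row per month per channel (illustrative numbers).
long_df = pd.DataFrame({
    "month":   ["2023-01", "2023-01", "2023-02", "2023-02"],
    "channel": ["tv", "digital", "tv", "digital"],
    "spend":   [120.0, 45.0, 130.0, 50.0],
})

# Analytic-ready: one record per month, with each channel's spend as its own column.
wide_df = long_df.pivot_table(index="month", columns="channel", values="spend").reset_index()
print(wide_df)
```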
What’s the first step in every data mining project?
The first step is to understand the business problem being solved. If this step is ignored or given short shrift, one ends up with a very good answer to the wrong question.
What is typically the client’s role in a data mining project? What things fall on the client?
Defining the business problem; providing data only the client can provide; (to the extent possible and/or desired) delivering client data in an analytic-ready format; reviewing data during ETL to help make sure the process didn’t introduce errors; and reviewing candidate model(s) to see if they make sense.
What steps are taken to make sure the data model delivered is the most accurate model?
Interestingly, we tend to ignore “the most accurate model,” as these tend to be precisely wrong rather than generally accurate. That is, they can suffer from “overfitting.” Rather, we look for candidate models that: make sense; are explainable, simple and have good accuracy; and are biased in the way that best suits a client’s needs. For example, it’s better to have a model that includes some false positives when sending direct mail advertising pieces than to have false negatives. In this case, you spend a few extra cents per piece on people who won’t respond rather than not sending to people who would respond and generate revenue.
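One way to build in that kind of bias – a hedged sketch on synthetic data, not the firm’s actual procedure – is to lower the classification threshold so that more prospects get mailed, trading a few extra false positives for fewer missed responders:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic "will this customer respond to the mailing?" data, purely for illustration.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = model.predict_proba(X_test)[:, 1]

# A lower threshold mails more non-responders (false positives) but misses fewer responders.
for threshold in (0.5, 0.2):
    predictions = (probs >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()
    print(f"threshold={threshold}: {fp} false positives, {fn} false negatives")
```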