Reflections in a digital mirror
Editor's note: Rajan Sambandam is president of TRC Insights. He can be reached at rsambandam@trcinsights.com. Oded Netzer is the Arthur J. Samberg professor of business and vice dean for research at Columbia Business School. He can be reached at onetzer@gsb.columbia.edu.
Let’s start at the most basic level with the need for consumer insights. It’s a truism that uncertainty impedes decision-making. Hence, business research is a process for reducing uncertainty enough to help make decisions. Insights research generally proceeds as a series of steps, with the number of steps depending on the amount of information at hand and the degree of uncertainty. But it also depends on practicality, as business decisions are often made with less-than-desired information. In some cases, research is not even attempted because of perceived hurdles, and decisions are simply based on intuition or existing knowledge and perceptions.
For instance, in a domain with high uncertainty, the resolution of the business problem could start with analyzing secondary data (assuming availability and access, which in many cases are absent). This could be followed by a round of qualitative research to investigate the domain, generate ideas and draw boundaries. Next could be an exploratory quantitative study to flesh out concepts and understand their generalizability, followed by a test of features and segments and, finally, perhaps message testing.
But the business decision maker often does not have the time, resources or patience for all these stages. As a result, steps are eliminated or condensed, leading to suboptimal business decisions. In much consumer research, practicality (budget, time) means that a single step (such as a survey-based quantitative study) is often the only one used. With a bit more luxury, a qualitative stage may be included. Experienced researchers know that uncertainty can be reduced through several means and steps; there is often just no way to do so within the available time and budget.
Now, a skilled marketing researcher – particularly one who specializes in a specific domain (such as health care) or a specific marketing research tool (e.g., conjoint analysis) – can leverage their vast experience to reduce effort by providing meaningful priors that may bypass some of the earlier steps in the research process, such as testing for appropriate closed-ended responses to a question. If we think about gen AI as an agent that has read thousands of marketing research reports as part of its training data, it becomes clear that it has the potential to match or even exceed the expertise of an experienced marketing researcher. After all, gen AI has likely reviewed more conjoint analysis studies than even the most seasoned professionals.
Imagine complementing this vast knowledge with primary data collected specifically to further train the gen AI tool to respond in ways that closely mirror human reasoning. This is exactly what synthetic data is used for. Now, just as we would not replace primary data collection with a marketing researcher’s expertise, we would not forgo human data in favor of synthetic data – but that doesn’t mean synthetic data can’t be useful in reducing uncertainty.
The number of pathways that can be explored with synthetic data is significantly higher than in normal practice. For instance, when testing a concept, we could explore reactions of many segments, refine the concept in different ways (pricing, packaging, positioning, etc.) and develop contextual understanding of the concept’s fit into the overall portfolio and brand. The friction of practicality that prevented a researcher from fully executing the mind’s vision can be eliminated when using synthetic data. However, there is no free lunch – the ability of synthetic data to provide meaningful information is something that still needs to be rigorously explored and tested.
In sum, properly developed synthetic data allows us to execute far more steps in the research process, which can lead to significantly improved outcomes with little added time or cost. Synthetic data (done well) has the potential to reorient the entire field and provide much better insights for decision-making.
Only go so far
Now, it’s not that the value of synthetic data was previously unknown, just that execution wasn’t easy. Simple forms of synthetic data have been created and used for many years (initially for statistical model testing and later for training machine-learning algorithms). But creating sophisticated synthetic data that vary across dimensions in human-like ways is difficult. Even sophisticated simulations (as in conjoint analysis) can only go so far.
The recent revolution in gen AI has now provided a solution that is orders of magnitude better. Simply put, a large language model (LLM) is a complex network that has ingested an enormous amount of information, can connect the information pieces logically and, crucially, can communicate in simple human language.
So, the simplest way of generating synthetic data is to ask an LLM to do it by defining sample criteria (“65-year-old female, with moderate income, living in the South”). Though easy, it is just a starting point and quite an inadequate one at that. As you can imagine, profiles of such consumers can vary quite a bit with respect to almost any attitude or behavior. LLMs are more powerful when fine-tuned in some form. That is, when they are provided with contextual information (which could be qualitative or quantitative) their performance tends to improve significantly. So, in this case, further context about the respondent in question can allow the model to better mimic their behavior.
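As a concrete illustration, a request like this can be made in a few lines of code. The sketch below is minimal and illustrative only, written in Python against the OpenAI chat API as one example; any LLM provider would work similarly, and the persona, survey question and model name are merely placeholders.

```python
# Minimal sketch: asking an LLM to answer a survey question as a synthetic
# respondent defined only by sample criteria. Persona, question and model
# name are hypothetical placeholders.
from openai import OpenAI

client = OpenAI()  # assumes an API key is set in the environment

persona = "a 65-year-old female with moderate income, living in the South"
question = ("How likely are you to try a new store-brand laundry detergent? "
            "Answer on a 1-5 scale and add one sentence explaining why.")

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any current chat-capable model would do
    messages=[
        {"role": "system",
         "content": f"You are answering a consumer survey as {persona}. "
                    "Answer in the first person, as that consumer would."},
        {"role": "user", "content": question},
    ],
)

print(response.choices[0].message.content)
```

Out of the box, such an answer reflects broad stereotypes of that demographic more than any particular person’s views – which is exactly why the additional context matters.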
Mimic the behavior
Even with additional context, synthetic respondents of the form described above can generally answer survey questions, but their ability to truly mimic the behavior of specific populations, or even segments, has been found to be limited. A more sophisticated version, the digital twin, attempts this at the individual level, given appropriate input.
The idea of digital twins is also not new. It started in the world of hardware in the early 2000s, with the term itself coined by NASA in 2010. Applications could include, for example, creating a digital version (or twin) of an aircraft engine to test and predict performance problems in the physical engine, or building a simulation of the New York subway system to allow analysis of alternative malfunction scenarios. The term “digital twin” in this context is very appropriate. However, when translated to the insights world we begin to see some problems.
Digital twins of individual customers can indeed be created and used in much the same way as in the hardware case. So, in the above example, it could be the development of a digital twin of a specific 65-year-old female (called Sue) with a specific income, living in a specific location in the South. However, the replica is incomplete. A proper digital twin would be complemented with additional information about Sue – her purchase patterns, attitudes and lifetime experiences that may impact her decision-making – much like the way a digital twin of the New York subway system is fed the system’s entire behavioral history. But human behavior tends to be even more complex, as it is affected by inherent human irrationality and unexpected changes in behavior.
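For illustration only, the kind of contextual record we have in mind might look something like the sketch below. The field names and values are invented; in practice this information could be injected into the prompt or used as fine-tuning data.

```python
# Hypothetical sketch of the contextual record a richer digital twin might draw on.
# All field names and values are invented for illustration.
import json

sue_profile = {
    "demographics": {"age": 65, "gender": "female",
                     "income_band": "moderate", "region": "South"},
    "purchase_history": [
        {"date": "2024-11-02", "category": "laundry detergent",
         "brand": "Brand A", "price": 11.49},
        {"date": "2024-12-07", "category": "laundry detergent",
         "brand": "Brand A", "price": 11.99},
    ],
    "attitudes": {"price_sensitivity": "high",
                  "brand_loyalty": "strong for household staples"},
    "notable_experiences": ["switched brands once after an allergic reaction"],
}

def profile_as_context(profile: dict) -> str:
    """Flatten the profile into text that can be added to a prompt."""
    return "Known facts about this consumer:\n" + json.dumps(profile, indent=2)

print(profile_as_context(sue_profile))
```

Even a record like this captures only a fraction of what shapes a real person’s choices.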
Taken together, these complexities mean that predicting the responses of an individual consumer is very uncertain and highly variable. For example, a specific consumer can be predictably loyal to a brand – until she is inexplicably not. But she could also be wholly unpredictable with regard to other attributes, sometimes even making decisions subconsciously. As research is beginning to show, it is possible to produce digital twins with reasonable levels of predictive accuracy (within tested domains), but it would be impossible to truly replicate an individual consumer.
With marketing research beginning to explore the potential of digital twins, we should not ask whether we should use them but rather when and how. What type of data is needed to train gen AI to respond in ways similar to humans, and what kinds of questions can it realistically answer? To set expectations: at least in their current form, we believe that even well-developed digital twins can be used only within a fairly limited scope of questions.
For example, imagine feeding gen AI a consumer’s supermarket purchase history. Like many statistical models, it should perform well in predicting the consumer’s future purchases of existing products. In such cases, we might be able to use the digital twin almost as if it were primary data, provided the results demonstrate sufficient accuracy. It may even be capable of predicting the likelihood of purchasing a slightly modified new product, such as a new variant of a laundry detergent. But what about an entirely new product, which is often the primary focus of marketing research? While the predictions might not be perfectly accurate, they could offer a valuable starting point for exploration, much like the initial exploratory phase of the research process. An additional question to consider is not only whether digital twins can capture average behavior but whether they can capture the heterogeneity in consumer behavior at the individual or, most likely, at the segment level.
Individual vs. group
The solution lies in what statisticians and insights researchers have known for many years. Modeling and predicting individuals is hard but doing the same for groups is easier. In fact, in most business applications, it is not the individual customer that matters but the segment, as marketing targets the latter. Though we prize individual-level data (as in conjoint utilities) we don’t really care about it beyond data collection and parameter estimation. Analyzing segment-level patterns and making segment-level predictions is often the business priority. (Yes, there are cases, such as Amazon, where individual targeting is used but that type of data and modeling is outside our scope of discussion.)
So, in that sense, the term “digital twins” is not entirely appropriate. While we may want to model Sue, we don’t really care about Sue once the modeling is completed. We care about the attitudes of the segment that she belongs to (however it may be defined). The trick is to start at the level of Sue and then aggregate people like Sue into an opportunity segment that can be useful for a business decision-maker. Apart from being sound modeling practice, this also has the virtue of helping reduce (though not eliminate) some of the privacy considerations that arise with consumer digital twin development.
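To make that aggregation step concrete, here is a minimal sketch. The respondents, segments and scores are invented; in practice each record would come from querying a digital twin.

```python
# Sketch: moving from individual synthetic responses to segment-level estimates.
# Hypothetical data; in practice each record would come from a digital-twin query.
from collections import defaultdict
from statistics import mean

synthetic_responses = [
    {"respondent": "Sue",   "segment": "value-seeking retirees", "purchase_intent": 4},
    {"respondent": "Ann",   "segment": "value-seeking retirees", "purchase_intent": 3},
    {"respondent": "Maria", "segment": "busy young families",    "purchase_intent": 2},
    {"respondent": "Dana",  "segment": "busy young families",    "purchase_intent": 5},
]

by_segment = defaultdict(list)
for record in synthetic_responses:
    by_segment[record["segment"]].append(record["purchase_intent"])

# Individual answers are only inputs; what gets reported is the segment summary.
for segment, scores in by_segment.items():
    print(f"{segment}: mean intent {mean(scores):.1f} (n={len(scores)})")
```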
The value of human data
Let us be clear about what we are not saying here.
We are not advocating for the substitution of human data with synthetic data (as some of the heat generated by online arguments suggests). Use of human data is the primary goal. The use of synthetic data is to help reduce uncertainty, refine concepts and reorient research such that the study with real human data has the best chance of succeeding. Synthetic data is a means to easily leverage existing knowledge through a familiar format, as well as develop ideas thoroughly before using precious and scarce human data. Specialists as varied as surgeons and astronauts practice and refine their technique in simulations of reality. Synthetic data offers a similar luxury for insights professionals.
Let’s address a specific – and, to our minds, inappropriate – use case that some people advocate for synthetic data: employing it to complete a desired quota group, especially for hard-to-reach audiences. So, for example, let’s say 200 completed surveys were required but lack of sample availability (and other problems) led to only 170 completed and validated human surveys. Could 30 synthetic respondents be added to get to the target quota of 200?
To see why this use case is not appropriate, one should ask why we would do this augmentation and what it actually accomplishes. One straightforward way to think about it is in terms of margin of error. It’s true that the error with 200 respondents is (slightly) lower than with 170. But why stop at adding 30? Why not add 300 or 3,000 or more? Clearly cost and timing are not issues.
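To make the arithmetic concrete, here is a quick back-of-the-envelope calculation using the standard margin-of-error formula at 95% confidence and the conservative p = 0.5 (the sample size beyond 200 is, of course, hypothetical):

```python
# Back-of-the-envelope margin of error at 95% confidence, p = 0.5 (most conservative case).
import math

def margin_of_error(n: int, p: float = 0.5, z: float = 1.96) -> float:
    return z * math.sqrt(p * (1 - p) / n)

for n in (170, 200, 2000):
    print(f"n = {n}: +/- {100 * margin_of_error(n):.1f} points")
# n = 170: +/- 7.5 points; n = 200: +/- 6.9 points; n = 2000: +/- 2.2 points
```

Nominally, the margin keeps shrinking as synthetic cases are added – which is precisely the illusion.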
This leads to the fundamental flaw. Margin of error (or sampling error) reflects the gap between sample and population. That’s why a census requires no statistical testing: the error drops to zero when the sample is the entire population. But the sampling precision offered by synthetic data is an illusion. Augmenting human data with synthetic data masks uncertainty rather than reducing it and risks distorting insights rather than enhancing them. Without a firm grasp of these principles, researchers can make mistakes that could undermine the integrity of the research itself.
This objection is statistical and does not even consider that (as discussed earlier) synthetic data are generally inferior to human data. Models know past patterns but not necessarily future intentions, especially with regard to new features that may interest people. Models are also only trained up to a certain point in time, which can limit their horizon with regard to real-world events.
Keeping synthetic and human data separate allows each one to operate in its best zone. In fact, while the term “synthetic data” is commonly used, we agree with the critique that it is not really data in the traditional statistical sense of the word, just as simulations are not really data. But that does not negate their usefulness in any way, as we have discussed in this article.
High barrier for excellence
As aptly explained in a recent article in the journal Science, LLMs can be seen as a convenient way for humans to take advantage of information that other humans have accumulated. With convenience comes a caveat – synthetic data have a low barrier for entry but a high barrier for excellence. What is not always appreciated is that a significant amount of thought and work is required to develop synthetic data that helps rather than hinders the process of obtaining superior consumer insights. The main tenet of marketing research – uncertainty reduction, rather than cost reduction – should not be forgotten.
We should resist the temptation to use synthetic data simply because it is nearly costless. Bad marketing research will eventually turn out to be very expensive. Instead, we should ask ourselves how we create proper synthetic data and for what questions it can meaningfully help reduce uncertainty. Among the variables to consider are the type of model used; prompting protocols; the input information used; and the extent to which outcomes can be extrapolated. There is considerable nuance underlying each of these (and other) factors. We recommend reading the quickly evolving academic literature on the potential of synthetic data (such as Blanchard et al.) for more detail.