The rise of synthetic data
Editor’s note: Christopher Barnes is president of Escalent, a research and advisory firm. He has more than 35 years of experience in market and public opinion research, including cofounding the Center for Survey Research and Analysis at the University of Connecticut. His studies have appeared in The Wall Street Journal, USA Today, The New York Times and Time cover stories. Find Barnes on LinkedIn.
Synthetic data may feel like the newest thing in research, but it has quietly been around for decades. Back in the 1980s, researchers were already experimenting with model-based imputation to statistically fill in missing survey values. By 1993, fully synthetic datasets were being proposed to protect census microdata.
What is new is the hype.
If the current moment feels familiar, it should. Synthetic data today looks a lot like the early internet – full of promise, moving quickly and occasionally getting ahead of itself. Across the industry, market researchers are talking about faster workflows, lower costs and near-limitless scalability. The potential is real. But turning that potential into dependable, real-world applications? That’s still a work in progress.
At the same time, the technology itself has come a long way. Advances in machine learning and large language models have improved both the quality and accessibility of synthetic data, opening the door to more use cases than ever before.
Which brings us to the question that matters: What does responsible, practical use look like?
The answer isn’t complicated, but it does require discipline. It comes down to knowing where synthetic data adds value, where it introduces risk and how to use it without compromising research rigor or decision-grade standards.
Where synthetic data adds value (and where it doesn’t)
At its simplest, synthetic data is artificially generated data designed to mimic the real world. When it works, it works because it’s grounded in high-quality human data and applied to a clearly defined problem.
Used this way, it can be genuinely useful. It helps researchers move faster, test at scale and extend existing datasets. It’s particularly effective for augmenting or diversifying samples, simulating environments and modeling niche or hard-to-reach audiences. In these situations, it can reduce field time, lower costs and improve coverage. And since synthetic respondents don’t get tired, researchers can push further – asking more questions and refining hypotheses in ways that would be impractical (or too expensive) with live samples.
But it has limits.
Synthetic data is not a substitute for primary research. Without a solid grounding in real human data, it quickly loses fidelity and becomes a collection of modeled assumptions. It’s also less reliable in areas with little precedent or where the underlying data is sparse or biased. Add complex surveys with intricate skip logic or tightly interdependent variables, and reliability can degrade further, especially as models become more complex.
In other words, synthetic data is a tool. A useful one, but not a universal solution. The real skill lies in knowing when to use it – and when not to.
Anchoring synthetic outputs in human data
The safest way to think about synthetic data is as a complement, not a replacement. In most cases, it should make up a minority share of the dataset – typically somewhere between 5% and 20%. And like any engine, it needs fuel. Without regular infusions of fresh human data, models risk drifting over time.
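To make the share guideline concrete, here is a minimal Python sketch of the arithmetic. The function name and the integer-percent parameter are illustrative assumptions rather than part of any established tool, and the 5% and 20% figures are simply the range mentioned above.

```python
def max_synthetic_records(n_human: int, max_share_pct: int) -> int:
    """Largest number of synthetic records that keeps their share of the
    combined dataset (human + synthetic) at or below max_share_pct percent.

    Hypothetical helper for illustration only.
    """
    if n_human < 0:
        raise ValueError("n_human must be non-negative")
    if not 0 <= max_share_pct < 100:
        raise ValueError("max_share_pct must be between 0 and 99")
    # share = s / (n_human + s) <= pct / 100
    # which rearranges to s <= pct * n_human / (100 - pct).
    # Integer division keeps the result exact and rounds down.
    return (max_share_pct * n_human) // (100 - max_share_pct)


# Example: how many synthetic records fit alongside 100 human interviews?
for pct in (5, 20):
    print(pct, max_synthetic_records(100, pct))
```

Note that the cap is computed against the combined dataset, not the human sample alone, which is why a 20% share of the total corresponds to 25 synthetic records per 100 human ones.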
Working this way also changes how researchers interact with their data. Large datasets have sometimes made it easy to skip deeper interrogation of structure and quality. Synthetic workflows don’t allow that. They force a closer look at which variables matter, how populations are defined and where representation may be missing.
Consider a simple example. Expanding a dataset from 100 human respondents to 300 total records using synthetic data might look like a stronger sample. But 200 synthetic records are different from 200 new human interviews. They are, by design, look-alikes. That means they don’t reduce sampling error in the same way additional fieldwork would.
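The sampling-error point can be illustrated with the standard margin-of-error formula for a proportion. The Python sketch below is a simplification under stated assumptions: simple random sampling, a 95% confidence level, the worst-case proportion of 50%, and the deliberately conservative view that look-alike records contribute no independent information. How much information synthetic records actually carry depends on the model and the data behind it.

```python
import math

Z_95 = 1.96  # z-score for a 95% confidence level


def margin_of_error(n: int, p: float = 0.5, z: float = Z_95) -> float:
    """Margin of error for a proportion under simple random sampling."""
    if n <= 0:
        raise ValueError("n must be positive")
    return z * math.sqrt(p * (1 - p) / n)


human_n = 100   # real interviews
total_n = 300   # human plus synthetic records

# Naive reading: treat all 300 records as independent interviews.
naive = margin_of_error(total_n)

# Conservative reading (an assumption): only the human interviews
# count toward sampling error.
human_only = margin_of_error(human_n)

print(f"Naive, n={total_n}: +/- {naive * 100:.1f} points")
print(f"Human only, n={human_n}: +/- {human_only * 100:.1f} points")
```

The gap between the two figures is the precision a report would overstate if it quoted the combined record count as though it were the sample size.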
To make those synthetic records useful, researchers first need a clear understanding of the original 100 respondents – who they are, what variation they capture and what might be missing. Without that foundation, synthetic data risks reinforcing existing assumptions rather than adding new insight.
This is why validation is nonnegotiable. Synthetic outputs can’t simply be taken at face value. They need to be tested against benchmarks, human samples or real-world outcomes to determine whether they hold up.
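One simple form such a check can take is comparing how a question's answers are distributed in the synthetic records and in a human benchmark. The Python sketch below is an assumption-laden illustration, not a prescribed method: the response data is made up, the function names are hypothetical, and the five-point tolerance is an arbitrary threshold that a research team would need to set for its own study.

```python
from collections import Counter


def proportions(values):
    """Share of responses falling into each category."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


def largest_gap(human, synthetic):
    """Return the category with the biggest absolute difference in share
    between the human benchmark and the synthetic records, and that gap."""
    p_human = proportions(human)
    p_synth = proportions(synthetic)
    categories = set(p_human) | set(p_synth)
    gaps = {
        c: abs(p_human.get(c, 0.0) - p_synth.get(c, 0.0)) for c in categories
    }
    worst = max(gaps, key=gaps.get)
    return worst, gaps[worst]


# Made-up responses to a single question, for illustration only.
human = ["yes"] * 50 + ["no"] * 30 + ["unsure"] * 20
synthetic = ["yes"] * 65 + ["no"] * 25 + ["unsure"] * 10

TOLERANCE = 0.05  # hypothetical threshold: flag gaps above five points

worst, gap = largest_gap(human, synthetic)
flagged = gap > TOLERANCE
print(f"Largest gap: {worst!r} at {gap * 100:.1f} points, flagged={flagged}")
```

A check like this covers only one variable at a time. Relationships between variables, and agreement with real-world outcomes, would need their own tests.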
Synthetic data as a tool in the research toolkit
Synthetic data earns its place alongside other research methods – it doesn’t replace them. Its value depends entirely on how well it’s matched to the problem at hand.
Our early use cases have shown promise. In consumer research, synthetic data has helped offset the cost of reaching hard-to-access audiences. In telecom, it has enabled teams to extend datasets in time-sensitive studies, quickly expanding samples from specific geographies without sacrificing confidence in the results. In both cases, synthetic data strengthened existing workflows rather than replacing them.
That distinction matters.
Too often, conversations about synthetic data focus on what the technology can do rather than on the problem researchers are trying to solve. A better approach is a familiar one: define the problem, build a hypothesis, choose the right methodology and then test whether the results are strong enough to inform decisions.
Synthetic data can absolutely accelerate workflows and help scale datasets – but only when it’s applied with the same rigor as any other method. It still relies on human expertise to guide its use, validate its outputs and ensure alignment with real-world behavior.
If anything, the rise of synthetic data makes the role of the researcher more important, not less. The responsibility remains the same: to use the right tools, in the right way, to deliver insights that decision makers can trust.