The limits of synthetic data for consumer research.

Editor's note: This article is an automated speech-to-text transcription, edited lightly for clarity. To view the full session recording, click here.

Synthetic data has some great qualities and some concerning qualities. Verasight set out to find out exactly what is great about synthetic data and what researchers need to watch out for.

During Quirk’s Event – Virtual Global 2025, Ben Leff, CEO at Verasight, shared the results of testing the company did around the accuracy of synthetic data and some of the promises around it.

Session transcript

Joe Rydholm

Hello everybody and welcome to our session, “The limits of synthetic data for consumer research.” I'm Quirk’s Editor, Joe Rydholm. Thanks so much for joining us today.

Just a quick reminder that you can use the chat tab if you'd like to interact with other attendees during the discussion, and you can use the Q&A tab to submit questions to the presenter and we'll get to as many as we have time for at the end.

Our session is presented by Verasight. Ben, take it away.

Ben Leff

Great, thank you so much for having me and it's great to be here. I'm Ben Leff. I'm the Co-Founder and CEO of Verasight.

Quick background on Verasight.

We really started the company to expand access to high quality survey data for academic researchers. We have our own verified panel of respondents that we recruit using probability methods and online targeting. Then we verify every respondent.  

And because most of our clients are academic researchers, one question that continues to come up is, is synthetic data effective? Does it work?  

As corporate researchers, you're probably bombarded with requests to try out synthetic data and promises about it. So, what we're going to present today draws on some academic research and some of our own team's research to show our findings on synthetic data, particularly regarding the limitations, which might be a counter to some of the prevailing narratives.

As I already flagged, researchers are inundated today with claims about synthetic data's promise to revolutionize the industry. And if it works, it would be amazing. It would save you money, it would save you time, it would let you reach groups previously out of reach. So, we understand why there's such a need to try to make synthetic data work.

What we're going to present today is how we're evaluating it and some of the initial research we're seeing.  

I want to flag, and I'll flag this again at the end and we have it in the chat as well, that as a thank you for coming to the session, we'd love to include your question in our next synthetic data experiment.

So, what this means is if you scan this QR code or use the link in the chat, you'll be able to add a question to our next national omnibus survey. You'll be able to get data from a thousand real verified respondents from the Verasight panel. And we're also going to generate a thousand synthetic respondents. You can see firsthand the differences between synthetic responses and real responses for your question.

I think this will be a lot of fun. I say that as a nerdy researcher, hope you'll find that fun as well. But please go ahead and add a question and let us know if you have any questions on the portal as well. My team can help out and we can discuss at the end.  

Okay, so a few key priorities during today's presentation and how we approach synthetic research. 

First, what does it mean to generate synthetic research? I think it's important to set the framework. There's so much terminology thrown around. I just want to provide a clear definition.  

Next, I want to look at where synthetic data performs well and where it's truly limited.   

Then lastly, I want to just look at a few emerging research topics. One research area we're interested in is: does providing more information to the model necessarily improve the performance of synthetic data? We'll look at that a little later in the presentation.

So, quickly in terms of how we're going to approach our analysis, this is what we did across several white papers that we'll reference in today's presentation. 

First, we conduct a nationally representative survey of 1,500 Americans using our verified panel and a standard survey research approach. The key is nationally representative, so it's balanced on census demographics, partisanship and 2024 vote.

We then use an LLM to generate fake respondents using the demographics and political variables for the real respondents in step one. 

Lastly, to give us an initial sense of performance, we then compare the top lines and cross tabs for the synthetic sample and the actual sample.
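
For readers who want to run this kind of comparison on their own data, here is a minimal sketch of that last step, assuming a pandas DataFrame with one row per respondent and hypothetical "real" and "synthetic" columns holding each respondent's matched answers to the same question. The column names and helpers are ours, not Verasight's.

```python
import pandas as pd

def topline_error(df: pd.DataFrame) -> float:
    """Mean absolute gap, in percentage points, between the real and
    synthetic answer distributions for one question."""
    real = df["real"].value_counts(normalize=True) * 100
    synth = df["synthetic"].value_counts(normalize=True) * 100
    options = real.index.union(synth.index)
    return (real.reindex(options, fill_value=0)
            - synth.reindex(options, fill_value=0)).abs().mean()

def crosstab_error(df: pd.DataFrame, group_col: str) -> pd.Series:
    """The same top-line error, computed separately within each subgroup
    (age bracket, race, region, etc.) to build a cross tab."""
    return df.groupby(group_col).apply(topline_error)
```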

So, let's begin by just looking at one of the specific prompts we use to generate this synthetic survey data. 

First, we just give a general prompt, “Your job is to substitute as a human respondent.” Then we give the persona that we want the LLM to mimic, and then we give them the survey question and the response options. 

Here's an example on the right-hand side. You're a 61-year-old white woman, here's your education, here's your income, here's where you live in the country. And there's the question.

I want to flag that this up here, the demographics are real. That's from the actual survey respondent. We're then telling the LLM, here's what we know about this person, please mimic them. Then we give them the survey question.  

This is just one persona; we're repeating this 1,500 times. So, we're repeating this for every single person in our nationally representative sample.
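
Purely as an illustration, a persona prompt with the structure described above might be assembled like the sketch below. The exact wording, the field names and the example income and region are our assumptions, not the actual Verasight prompt.

```python
def build_prompt(persona: dict, question: str, options: list[str]) -> str:
    """Assemble a persona-style prompt of the kind described above."""
    return (
        "Your job is to substitute as a human respondent.\n"
        f"You are a {persona['age']}-year-old {persona['race']} {persona['gender']} "
        f"with {persona['education']}, a household income of {persona['income']}, "
        f"living in {persona['region']}.\n"
        f"Question: {question}\n"
        "Answer with exactly one of: " + ", ".join(options)
    )

# Hypothetical persona; in the study, these fields come from a real respondent.
example = {"age": 61, "race": "white", "gender": "woman",
           "education": "a bachelor's degree", "income": "$50,000 to $75,000",
           "region": "the Midwest"}
print(build_prompt(example,
                   "Do you approve or disapprove of the way Donald Trump is handling his job?",
                   ["Approve", "Disapprove", "Don't know"]))
# Repeat once per real respondent (1,500 times here), send each prompt to
# the LLM, and record its answer as that respondent's synthetic response.
```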

At a high level, this is what we mean by generating synthetic samples. This is what we'll do if you add a question to that survey, which I mentioned. We'll talk about variations on the prompt, but this is just the overall approach to generating synthetic data.  

So, first what we see is at the top line level, if it's a frequently asked question and a very polarized topic, the synthetic data tends to do pretty well. 

In this case, we're asking probably the most asked survey question in the country, ‘What is Donald Trump's approval rating?' 

We found that the synthetic responses mimic real responses and have an error of about 4 percentage points, which on the whole is not bad. So, you'll see Trump approval error is 1%, Trump disapprove error is about 5% and don't know error is 3%. So, that's how we average it out.  

Importantly, the LLM does not say, ‘I don't know.’ So, as researchers, that's important to keep in mind whenever you're researching a topic that humans just might be new to or might be unsure about. LLMs are designed to give an answer.  
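
If you have matched data like the earlier sketch, one quick check is how often each sample actually answers "Don't know." The toy values and column names below are hypothetical.

```python
import pandas as pd

# Toy matched data: one row per respondent, hypothetical column names.
df = pd.DataFrame({
    "real":      ["Approve", "Don't know", "Disapprove", "Don't know"],
    "synthetic": ["Approve", "Disapprove", "Disapprove", "Approve"],
})
dk_real = (df["real"] == "Don't know").mean() * 100
dk_synth = (df["synthetic"] == "Don't know").mean() * 100
print(f"Don't know: {dk_real:.0f}% of real answers vs {dk_synth:.0f}% of synthetic answers")
```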

Now we must turn to the words of caution. 

The people who tout synthetic data typically tout findings like this. Look how closely it can approximate high level top lines.  

Let's begin with a few words of caution. First, this is the most asked survey question. So, when you think of the training data that goes into an LLM, it's going to have a lot of preexisting surveys from real responses asking this same question. So, it should do pretty well.  

But now it gets really concerning for us as researchers when we want to look at responses that are more nuanced than the top line, or at cross-tab patterns.

Here's one example of the word of caution: it averages out to about a 4% error. But when you look at the data, about 20% of people who say they disapprove of Trump in real life were predicted by the LLM to approve of him. And among those who approve, the error was about the same in the opposite direction.

What that means is you're averaging out to something that looks sensible, but when you look under the hood, significant errors really start to emerge. 
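
A respondent-level confusion matrix is an easy way to surface the offsetting errors described here. This is a minimal sketch with hypothetical column names and toy values, not the study's data.

```python
import pandas as pd

# Toy matched data: rows are respondents, values are answers.
df = pd.DataFrame({
    "real":      ["Approve", "Approve", "Approve", "Disapprove", "Disapprove", "Disapprove"],
    "synthetic": ["Approve", "Approve", "Disapprove", "Approve", "Disapprove", "Disapprove"],
})

# Row = what the person actually said, column = what the LLM predicted,
# values = share of each real-answer group.
confusion = pd.crosstab(df["real"], df["synthetic"], normalize="index")
print(confusion)
# Misclassified approvers and misclassified disapprovers can offset each
# other, so the top line looks fine while respondent-level agreement is poor.
```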

Let's dive into this. First, performance quickly fails at the cross-tab level, and we as researchers care a lot about understanding the nuances of consumer research.

So, here we look at African-American respondents' Trump approval, where the error rate is 15%, and then young respondents, 18 to 29, where the error rate is 8%.

It's important to understand that by definition, an LLM is designed to be predictive of the average response and that is at odds with what we want to do as researchers if we want to look at cross tabs or if we want to analyze the nuances of individual responses or groups of respondents.  

Immediately, once you get beyond the top line level, significant errors start to emerge.  

Now we look at a variety of additional cross tabs, and you'll see the average error between the true proportion of Americans who disapprove of Trump and what the synthetic sample tells us is about 8 percentage points.

It shows that as the group gets more narrow, as you're trying to get a more nuanced understanding of consumers, the error continues to grow, which is problematic for researchers.  
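
To reproduce this kind of breakdown, you can re-run the same error metric on progressively narrower slices of the matched data. This sketch reuses the hypothetical df and the topline_error() helper from the earlier sketch and assumes equally hypothetical "age" and "race" columns.

```python
# Error recomputed on increasingly narrow subgroups.
subgroups = {
    "All respondents":   df,
    "Age 18-29":         df[df["age"].between(18, 29)],
    "Black respondents": df[df["race"] == "Black"],
    "Black, age 18-29":  df[(df["race"] == "Black") & df["age"].between(18, 29)],
}
for name, group in subgroups.items():
    print(f"{name}: {topline_error(group):.1f} pp error (n={len(group)})")
```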

Next, we want to turn to consumer research because it's very relevant to probably most folks on the call today and a lot of our clients as well. We find that the same pattern holds where synthetic data is really failing to understand the nuanced consumption patterns, regional preferences, and demographic variations of consumers.  

The way I like to think about this is the error is lowest for national brands.  

For example, ‘Heard of Starbucks?’ In reality, about 91% of respondents say they have. In the synthetic samples, it's about a hundred percent Starbucks awareness. It makes sense: national brand, lots of data, so the error rate is smaller.

Where error really emerges is as you get more granular.

So, one example of this is Peet's Coffee, which is a regional coffee chain. I actually have it right here. And you'll see that it overestimates awareness by about 26 percentage points.

Peet's is dominant in California, for instance, but not well known in certain areas of the country. And the LLM is really overestimating what the typical American knows about Peet's Coffee.

This chart is just plotting the error rates for all of these questions. And as researchers, we should be really concerned if we're going to leverage synthetic data when doing our own market research or trying to draw conclusions about consumer behavior.

So, now let's turn to product testing, because one of the other promises of LLMs is that they can be used for product testing, quickly testing names for instance, maybe giving a gut instinct on consumers.

What we did here is we came up with several fall coffee drinks, it being that time of the season. We asked real respondents which they would prefer, and then we asked the LLM which it would prefer.

For some reason the LLM's synthetic data is obsessed with the name Autumn Ember. If anyone has theories on that, we'd love to hear them. But consumers are far more nuanced; it's pretty much a tie between Harvest Velvet Latte, Golden Gore Brew, Jacko Latte and Autumn Ember.

So, a really large miss, and I think it's showing that even for high-level product testing and naming conventions, can we get a gut instinct from synthetic data? The answer from our research is no.

Now let's turn to the question of whether we can improve the model's performance.

One of the areas of pushback we get in our research is, ‘oh, you just have to provide the model with more information or use a more updated model.’ So, we said, "Great, let's test it.” 

We took three steps to try to improve the model's performance.  

First, we continue to leverage the latest model. So now we'll cite some data from ChatGPT-5. Next, we will give the model additional information about people's actual voting behavior.  

So, one thing we do at Verasight is we match respondents to the voter file. So, we know their voter history. Did they turn out to vote in various elections? We'll give that to the LLM without giving it any personal information.

And third, we'll provide the LLM with additional survey responses from those 1,500 people to see whether that improves performance.

What we're seeing here is that the LLM sample does not consistently improve with all these additional variations.  

Here are the four different ways we tested it in this case. 

The first row is just an older model, ChatGPT-4. 

The next approach that we use is chain of thought reasoning, which I'll talk about in a second, plus voter file data, administrative data, plus the old model, ChatGPT-4.

The third variation is chain of thought reasoning, voter file and ChatGPT-5.  

Then the last variation, where the model has the most information: the latest model, voter file, attitudes and chain of thought reasoning.

Chain of thought is just when you give the prompt even more specific instructions: think about this step by step, explain your reasoning to us. And there's been some research to suggest that, in theory, that should improve performance.
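
As a rough sketch of how these variations change the prompt: the wording, the voter-file field and the prior-answers field below are our assumptions for illustration, not the study's actual prompts.

```python
def build_variant_prompt(persona: dict, question: str, options: list[str],
                         chain_of_thought: bool = False,
                         voter_history: str | None = None,
                         prior_answers: dict | None = None) -> str:
    """Assemble one of the prompt variations described above."""
    parts = ["Your job is to substitute as a human respondent.",
             f"Persona: {persona}"]
    if voter_history:            # e.g. "voted in 2020 and 2022, did not vote in 2018"
        parts.append(f"Verified voting history: {voter_history}")
    if prior_answers:            # other survey answers from the same real respondent
        parts.append(f"Previous survey answers: {prior_answers}")
    parts.append(f"Question: {question}")
    parts.append("Answer with exactly one of: " + ", ".join(options))
    if chain_of_thought:
        parts.append("Think about this step by step and explain your reasoning "
                     "before giving your final answer.")
    return "\n".join(parts)
```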

Here's what's scary for researchers. Across three different survey questions, three different variations turned out to be the optimal way of building the model. That's really concerning when we're trying to optimize its use, because there's no universal way of improving performance.

For example, this first question is looking at Trump approval.  

Here the error is the lowest for the least sophisticated of the methods we tried: just the old ChatGPT-4 without additional prompting or additional administrative information. For the generic ballot, that's just, ‘Who do you want to be in Congress next year, Republicans or Democrats?’

The best performing model is the one with the most information, chain of thought, voter file, attitudes, ChatGPT-5.  

Then immigration, just a random question assessing respondents' immigration attitudes. The third variation is the best performing.

This is cited in the literature first by Baumann et al. It's what's called “the forking path,” and it's one of the key problems with synthetic data and with using AI for market research.

What it suggests is how important your choices are as researchers and how they lead to radically different conclusions.  

Now, if there were a clear pattern, if this model always had better performance, or if feeding in this information always led to better performance, that would give us more comfort that we're moving in the right direction and generating more accurate responses.

But the concept of the forking path, which is demonstrated here, is that the choices we're making as researchers, which model to use, what information to give the model, how to prompt it, have really significant consequences for the end result. That should be concerning when we're trying to identify an objective way of assessing performance.

So just a few key conclusions and then I'm really looking forward to diving into people's questions. 

First, at a high level, when you see all the quotes about how amazing synthetic data is and why you should be using it, what it's probably doing is approximating real-world population proportions on highly polarized or very frequently asked questions.

So, Trump approval is a great example of that: probably the most frequently asked survey question, with a lot of existing data already there and relevant for researchers to leverage.

Second, what I want you to take away though, is that synthetic data really fails catastrophically at capturing the nuanced consumption patterns, regional preferences and demographic variations that drive real purchasing decisions and real-world behavior.  

So, whenever you're being asked to look at this promise of synthetic data, what I would encourage you to do is ask to look at some of the cross tab level data to see if it really is holding up when you're breaking out the data in a more nuanced way.

Lastly, as I showed on the last slide, utilizing the latest model or providing more information does not guarantee greater accuracy, which should also really concern us as researchers as we're trying to seek the most objective information.

Now I want to highlight a lot of caveats. 

First, this is a really rapidly evolving field. So, we continue to update this research as new models come out.  

And this is just one approach to using synthetic data, where we're giving the model respondent information and asking it to generate responses based on existing personas. There's a whole host of other approaches that we could try as well.

Park et al. did two-hour interviews with people and used those transcripts to train the model. Similar concerns emerged from that paper as well, but I'm just flagging that as a completely different approach that researchers have taken.

Lastly, I really want to emphasize and then we'll turn to your questions, this omnibus survey opportunity and synthetic data experiment.  

I mentioned on the last slide that we're continually testing how synthetic data performs relative to real survey data. And we want to test this for you. 

We're offering this free opportunity. So, scan this QR code and I have the link in the chat as well, and you'll be able to add a multiple-choice question, and in a few weeks, you'll get data from a thousand verified respondents and from a thousand synthetic respondents.  

Why I love this is that you all have pretty diverse and nuanced research preferences, so this is a great way to see for my specific research topic, can synthetic data get close to what I'm interested in?  

We'll give you the top lines and then we'll also give you all the raw data as well, so you can dive in. 

With that, I'll leave this up, and I would love to open it up for questions from the group.