Mitigate survey fraud before exploring synthetic data solutions 

Editor’s note: Steven Snell is SVP and head of research at Rep Data. 

Survey fraud, including responses from click farms, hyperactive respondents and tech-enabled fraudsters, threatens the quality of quantitative research. If synthetic data is trained on these flawed inputs, it won’t just replicate those errors; it may amplify them. Before synthetic data can deliver real value, the industry must first tackle the underlying challenge of ensuring high-quality human responses.

Understanding the source of synthetic data  

In very general terms, synthetic data are generated through statistical imputation. Skilled researchers leverage what they know about respondents – including central tendencies of demographic groups, buyer segments, personas and so forth – to make inferences about how those respondents might react to or rate additional brands, products or stimuli.
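
To make the idea concrete, consider a deliberately simplified sketch of group-mean imputation, in which a missing brand rating is filled in with the average rating given by respondents in the same demographic segment. The data and column names here are invented for illustration; real imputation models are far more sophisticated, but the logic of borrowing from observed responses is the same.

```python
import pandas as pd

# Hypothetical survey data: each row is a respondent, with a demographic
# segment and a 1-10 rating of Brand X. Some ratings are missing because
# not every respondent was shown Brand X.
responses = pd.DataFrame({
    "respondent_id":  [1, 2, 3, 4, 5, 6],
    "segment":        ["18-34", "18-34", "18-34", "35-54", "35-54", "35-54"],
    "brand_x_rating": [8, 7, None, 4, 5, None],
})

# Simple group-mean imputation: fill each missing rating with the average
# rating observed among respondents in the same segment.
responses["brand_x_rating_imputed"] = (
    responses.groupby("segment")["brand_x_rating"]
    .transform(lambda ratings: ratings.fillna(ratings.mean()))
)

print(responses)
# Respondent 3 is imputed at 7.5 (the 18-34 average) and respondent 6 at
# 4.5 (the 35-54 average). If the observed ratings in a segment are
# inflated by fraudulent respondents, the imputed values inherit that
# inflation.
```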

Imputation makes several assumptions about the consistency of respondents’ preferences and the predictive importance of respondents’ characteristics. Among the most important is the assumption that the training data are reliable. That assumption is tenuous in the modern research environment, where low-quality survey data are common and fraudsters routinely misrepresent who and where they are.

For better or worse, imputed responses imitate the data on which they are trained. If “seed” or training data are of high quality, you can more confidently use them to impute missing values and synthesize new responses. On the other hand, if seed data come from low-quality survey responses, synthetic data will imitate and potentially exacerbate the biases in those seed data.

Synthetic data may not be ready for showtime

Our experience scanning and scoring survey traffic raises major concerns about the suitability of many survey data sets for training imputation models. Last year, our fraud detection tool observed nearly 3 billion survey attempts. On most projects it recommended blocking roughly 30% of respondents, including known fraudsters, respondents manipulating their digital fingerprints, duplicate respondents and hyperactive survey respondents. Our internal research has repeatedly found that these respondents contribute bias to core metrics about brands, products, politics and more.
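
For readers who want a sense of what this kind of screening involves, here is a deliberately simplified, hypothetical sketch of a few common checks – duplicate device fingerprints, implausibly fast completions and hyperactive respondents. It is not a description of any vendor’s actual tool; the fields and thresholds are invented for illustration.

```python
import pandas as pd

# Hypothetical respondent-level metadata collected alongside survey answers.
attempts = pd.DataFrame({
    "respondent_id":      [101, 102, 103, 104, 105],
    "device_fingerprint": ["a1f3", "b7c2", "a1f3", "d9e0", "c4b8"],
    "completion_seconds": [412, 95, 388, 47, 365],
    "surveys_last_24h":   [1, 2, 1, 19, 3],
})

MIN_PLAUSIBLE_SECONDS = 120  # hypothetical floor for a ~10-minute survey
MAX_SURVEYS_PER_DAY = 10     # hypothetical hyperactivity threshold

# Flag duplicate fingerprints (possible duplicate respondents), speeders
# and hyperactive respondents. Real tools score far more signals than this.
attempts["duplicate_fingerprint"] = attempts["device_fingerprint"].duplicated(keep=False)
attempts["speeder"] = attempts["completion_seconds"] < MIN_PLAUSIBLE_SECONDS
attempts["hyperactive"] = attempts["surveys_last_24h"] > MAX_SURVEYS_PER_DAY

attempts["flagged"] = attempts[["duplicate_fingerprint", "speeder", "hyperactive"]].any(axis=1)
print(attempts[["respondent_id", "flagged"]])
```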

If the algorithms that generate synthetic data are trained on data sets replete with fraud and bias, synthetic data will perpetuate those errors. In sum, for synthetic data to be useful to market researchers, we need to know how models are trained, and we need to know much more about the quality of the seed data leveraged to generate synthetic responses.

How to build a path forward with smart use of synthetic data

Despite these concerns, synthetic data has potential, if it’s used responsibly. The key lies in ensuring that models are trained on high-quality human responses. This requires a multipronged approach:

  • Start with quality respondents: Researchers must ensure that the data used to train synthetic models is thoroughly vetted, with strict quality controls in place to filter out fraud and disengaged respondents – and solid sampling practices to find the best participants for the research at hand.
  • Combine synthetic and real-time human responses: Synthetic data should supplement, not replace, real-world consumer feedback. Regularly incorporating fresh, high-quality responses ensures that models remain aligned with actual consumer behavior.
  • Increase transparency in data training: The industry must set clear standards for how synthetic models are built, including disclosure of training data sources and validation methodologies. Without visibility into these processes, researchers risk making decisions based on unreliable insights.
  • Monitor for unintended distortions: Synthetic data sets should be continuously tested against live survey results to check for over-smoothing, bias or drift; a simple sketch of this kind of check follows this list. If synthetic responses diverge significantly from real consumer data, adjustments must be made.
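
As a rough sketch of the monitoring step described in the last bullet, the example below compares the distribution of a key metric – here, a hypothetical purchase-intent score – between a fresh human sample and a synthetic data set, using a two-sample Kolmogorov-Smirnov test so that over-smoothing (too little variance) is caught as well as drift in the average. The data, threshold and decision rule are illustrative assumptions, not a prescribed methodology.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical data: 1-10 purchase-intent scores from a fresh, vetted human
# sample and from a synthetic data set built on older seed data.
live_scores = rng.normal(loc=6.2, scale=1.8, size=500).clip(1, 10)
synthetic_scores = rng.normal(loc=6.9, scale=1.1, size=500).clip(1, 10)

# Compare the full distributions, not just the means, so that over-smoothing
# shows up alongside drift in the average.
ks_stat, p_value = stats.ks_2samp(live_scores, synthetic_scores)

print(f"Live mean: {live_scores.mean():.2f}, sd: {live_scores.std():.2f}")
print(f"Synthetic mean: {synthetic_scores.mean():.2f}, sd: {synthetic_scores.std():.2f}")
print(f"KS statistic: {ks_stat:.3f}, p-value: {p_value:.4f}")

# A hypothetical decision rule: flag the synthetic data set for recalibration
# if the two distributions differ significantly.
if p_value < 0.05:
    print("Synthetic responses diverge from live data - recalibrate the model.")
```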

Creating a strong foundation for synthetic data

Synthetic data is not inherently bad, but it needs a strong foundation. Quality inputs are necessary for imputation models to yield quality outputs. Given present concerns about survey fraud and poor-quality responses from disengaged participants, researchers must proceed with caution.

Rather than treating synthetic data as a standalone solution or a replacement for insights from surveys with humans, researchers should view it as a tool. High-quality data from human respondents should be both the basis on which synthetic respondents are built and the ground-truth benchmark to which synthetic respondents are routinely calibrated. As such, smart sampling, rigorous validation of respondents and quality fraud mitigation technology will be critical to building synthetic data without compromising accuracy. There will always be uncertainty in imputation, but we want that uncertainty to come from the ambition of the model rather than from quality concerns with the training data.