Rethinking AI concept testing
Editor’s note: Jodie Shaw is the head of global marketing at Kadence International, where she leads global brand, content and demand-generation initiatives for the company’s market research and insights practice. With more than two decades of experience in B2B marketing, she has built a reputation for turning complex services and emerging categories into clear, compelling market propositions.
In the early days of predictive email, many users found the idea unsettling. The thought that software could finish a sentence before the writer had even reached the final word raised more suspicion than enthusiasm. When researchers asked respondents whether they wanted such a feature, reactions were often lukewarm. Some worried it would guess incorrectly. Others saw little need for help composing a short message.
Yet within a few years, predictive typing had become routine across email and messaging platforms, quietly saving billions of keystrokes each day.
The research signal and the real-world outcome moved in opposite directions.
That gap is not unusual in product research. People often struggle to judge innovations where the value only becomes obvious through use. Concept testing rests on a simple premise: present an idea, ask respondents to imagine it, then measure appeal and intent. In many categories, that premise holds. A new beverage flavor, a redesigned appliance or a faster mobile network can be evaluated immediately.
Artificial intelligence breaks that pattern. Its value often appears only after the system begins assisting with small decisions that people previously handled themselves.
Traditional concept tests struggle with this type of product because they compress an evolving experience into a static description. Respondents can only evaluate what they can see on the page: a tool that predicts, recommends or automates something. What is often being assessed, underneath the stated questions, is whether the respondent feels comfortable letting the system take a share of responsibility.
In conventional product tests, usefulness carries the weight. In AI products, adoption often hinges on whether users are willing to delegate part of a decision to the system. The evaluation shifts from “Would this help?” to “Would I allow this to act for me?”
Most concept tests are not designed to separate those judgments. Automation is presented as a feature rather than as a gradual transfer of control between a person and a system. Without experience, respondents tend to imagine extremes: flawless performance or a costly mistake. The everyday reality is usually quieter. Most AI tools sit somewhere in between, improving with corrections and requiring occasional oversight.
This creates a measurement problem hiding in plain sight. Features that remove cognitive friction can score poorly because respondents underestimate the burden of the small decisions they remove. Yet once deployed, those same features can become routine and, in time, indispensable. In some cases, the research signal may discourage development of the very capabilities users later rely on most.
Why traditional concept tests misread AI
Concept testing works best when a benefit is immediate and easy to picture. The respondent reads a description and can quickly map it onto something familiar. AI concepts are harder because they describe systems that change behavior over time. The promise is not a single feature – it is a new division of labor between person and machine.
Concept tests also encourage literal interpretation. When respondents encounter an unfamiliar capability, they tend to jump to edge cases, imagining either flawless performance or a damaging failure. In practice, most tools land in the middle.
The deeper issue is that concept tests often blend two evaluations into a single score. A respondent might believe the system would be useful and still feel reluctant to delegate. That reluctance then shows up as low appeal or low intent, even though what the respondent is actually signaling is discomfort with autonomy.
That is why incremental product improvements can outperform AI concepts in early testing. A clearer dashboard or faster workflow reads as low risk. AI concepts introduce a new question: How much authority am I being asked to hand over?
Start with the job, not the algorithm
One reliable way to reduce distortion is to change how concepts are framed. AI concepts often lead with the technology. The moment “AI” becomes the subject of the sentence, respondents begin interrogating the system – accuracy, error rates and loss of control.
A more productive approach starts with the job. Most professionals recognize the friction of coordinating meetings across multiple calendars. Emails move back and forth, suggesting times. Someone declines, and someone else proposes another slot. The task is small, but it drains attention, especially when repeated over the course of a week.
When the workflow is made visible, an AI system that removes it becomes easier to evaluate. The respondent is no longer assessing abstract automation; they are assessing a familiar task disappearing.
In practice, strong AI concept framing tends to follow a sequence.
- First, define the job currently being done.
- Second, surface the friction in that job, particularly the repeated micro-decisions that accumulate.
- Third, introduce the system that removes or simplifies the task.
- Finally, show where the user retains oversight through approvals, corrections and clear ways to override.
The point is not to reassure respondents with marketing language. It is to reflect how successful AI products actually behave – assistance first, then gradual delegation.
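To make that sequence concrete, here is a minimal sketch of how a job-first stimulus might be structured before it is fielded. The field names and example wording are illustrative assumptions, not a standard instrument.

```python
# A minimal sketch of a job-first concept stimulus.
# Field names and example text are hypothetical, not a standard instrument.
from dataclasses import dataclass


@dataclass
class ConceptStimulus:
    job: str        # the task the respondent already does
    friction: str   # the repeated micro-decisions it involves
    system: str     # what the AI removes or simplifies
    oversight: str  # where the user keeps control

    def render(self) -> str:
        # Present the pieces in job-first order, so the respondent
        # evaluates a familiar task disappearing rather than "AI" itself.
        return "\n".join([self.job, self.friction, self.system, self.oversight])


scheduling_concept = ConceptStimulus(
    job="Coordinating meetings across several calendars.",
    friction="Emails go back and forth proposing and declining time slots.",
    system="An assistant finds a workable slot and drafts the invitation.",
    oversight="Nothing is sent until you approve, and any booking can be undone.",
)

print(scheduling_concept.render())
```

Written this way, the technology only enters the stimulus after the job and its friction are already on the page, which mirrors the framing sequence above.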
Permission to automate is the real metric
Even with better framing, one question determines adoption: How much authority are users willing to grant the system?
That boundary can be measured, but it requires moving beyond a single concept score. Permission usually develops in stages.
At the first stage, the system is observational. It suggests actions, but the user remains responsible for every decision. Recommendation engines operate here.
The second stage involves assisted automation. The system performs tasks but asks for confirmation. Expense categorization, document tagging and scheduling assistants often sit in this zone.
The third stage is full automation. The system executes decisions independently, typically where speed or scale makes constant human approval impractical, such as fraud detection or automated inventory ordering.
Many concept tests flatten these stages into one question: Would you use it? But the boundary users are negotiating is more specific. People may welcome suggestions, accept assisted automation quickly and still reject full autonomy. When researchers separate the stages, resistance often narrows sharply.
Researchers should stop treating “automation” as a single variable and start testing thresholds – what users will delegate, what they will approve and what they will not allow without oversight.
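One way to make that threshold idea operational is sketched below, under the assumption that each respondent rates willingness separately for each stage on a 1-to-5 scale; the stage labels and the cutoff of 4 are hypothetical choices, not a prescribed scoring rule.

```python
# A minimal sketch of threshold-based scoring. Assumes each respondent
# rates willingness (1-5) separately for each level of automation;
# the stage labels and the cutoff of 4 are illustrative assumptions.
from collections import Counter

STAGES = ["suggest_only", "act_with_confirmation", "act_autonomously"]


def delegation_threshold(ratings: dict, cutoff: int = 4):
    """Return the highest stage this respondent is willing to permit."""
    permitted = None
    for stage in STAGES:
        if ratings.get(stage, 0) >= cutoff:
            permitted = stage
        else:
            break  # treat permission as cumulative: stop at the first refusal
    return permitted


respondents = [
    {"suggest_only": 5, "act_with_confirmation": 4, "act_autonomously": 2},
    {"suggest_only": 5, "act_with_confirmation": 2, "act_autonomously": 1},
    {"suggest_only": 4, "act_with_confirmation": 5, "act_autonomously": 4},
]

# Report where the boundary falls across the sample,
# instead of collapsing everything into one appeal score.
print(Counter(delegation_threshold(r) for r in respondents))
```

The readout is a distribution of boundaries across the sample rather than a single average, which is what makes narrowing resistance visible when the stages are separated.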
Simulate interaction, not description
AI value is experiential. It shows up through repetition rather than in a paragraph of text. That is why scenario-based testing tends to produce more reliable readouts than static concept statements.
Instead of reading a description and scoring it, respondents can be shown a short progression. The system observes behavior and offers a recommendation. The user approves the action. After several successful interactions, the system begins performing similar tasks automatically.
Now the respondent is evaluating something closer to real use: how their comfort changes as the system behaves consistently.
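A hedged sketch of how that progression might be administered: respondents see a fixed sequence of interaction steps and give a comfort rating after each one. The step wording and the 1-to-7 scale are assumptions for illustration, not a standard protocol.

```python
# A minimal sketch of a scenario-based progression: comfort is measured
# after each step rather than once against a static description.
# Step text and the 1-7 comfort scale are illustrative assumptions.
SCENARIO_STEPS = [
    "The assistant notices your weekly check-in and suggests a time.",
    "You approve the suggestion; the invitation is sent on your behalf.",
    "After several approvals, similar meetings are booked automatically, "
    "with a one-tap undo.",
]


def run_scenario(ask_comfort) -> list:
    """Show each step in order and collect a 1-7 comfort rating after it."""
    return [ask_comfort(step) for step in SCENARIO_STEPS]


# Canned answers stand in for a live respondent in this example.
canned = iter([6, 5, 4])
trajectory = run_scenario(lambda step: next(canned))

# The shape of the trajectory (does comfort hold as autonomy grows?)
# is the readout, not any single score.
print(trajectory)
```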
Explanation design matters here. Trust is influenced not only by performance, but by whether users understand why a decision was made. Systems that provide reasoning cues, confidence indicators and easy override mechanisms tend to receive delegation earlier than systems that operate as black boxes.
The aim is not to persuade respondents. It is to recreate the conditions under which trust would realistically develop.
AI research requires a measurement shift
AI road maps are everywhere, but many research methods still assume customers can judge a feature based solely on a static description. That worked when benefits were immediate and visible.
Automation changes the equation. When a system begins to share responsibility for decisions, adoption depends less on how useful the feature sounds and more on how much authority users are willing to grant it.
The risk is that concept testing designed to measure usefulness will misread hesitation about delegation as a lack of demand. Features that quietly remove cognitive work can look underwhelming on paper. In the real world, they can become foundational.
Concept testing does not need to be discarded. It needs a different target. The goal is to reveal where users draw the boundary between assistance and autonomy, and what conditions move that boundary.
That is what AI concept validation looks like when the benefit is real, but hard for customers to articulate before they experience it.