Editor's note: Bryan Orme is president of Sawtooth Software, Provo, Utah. The author wishes to thank David Lyon, Keith Chrzan and Tom Eagle for their critiques of earlier drafts of this article.

Researchers are often asked to measure preference for, or the importance of, items such as product features, claims, packaging styles or health risks. Popular approaches include rating scales, constant-sum tasks and best-worst scaling (max-diff).

The standard five- and 10-point rating scales are fast and easy but are plagued by low discrimination among items and by scale-use bias. With rating scales, a much more important item won’t necessarily get a much larger score. Respondents are tempted to yea-say (positivity bias) and to straightline (give every item the same rating). Different cultural backgrounds can influence the way respondents use the scale. When comparing groups of respondents, many true differences may be obscured by the noise and bias in rating-scale data.

With constant-sum tasks, respondents are asked to distribute (say) 100 points across multiple items. Many respondents struggle with this task and the results can be noisy and imprecise. The task also becomes difficult for respondents once there are more than about 10 items.

Best-worst scaling (BWS) shows respondents sets of items, typically four or five at a time (Figure 1), and asks them to pick the best and the worst (or most and least important) item in each set (Louviere 1991, Finn and Louviere 1992, Louviere et al. 2015). It is typical to show each respondent eight to 15 BWS sets, arranged in a balanced design so that each item appears at least once and preferably two or three times per respondent. Often 20 or more items are included in a BWS study.
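To make the arithmetic behind those numbers concrete, here is a minimal Python sketch of one simple way to lay out BWS sets for a single respondent. It is an illustration only, not the design algorithm used by any particular software package: the function name `make_bws_design` and its parameters are hypothetical, and the sketch balances only how often each item appears. Production designs typically also consider which items appear together and in what position, which this sketch ignores. It assumes the number of items is a multiple of the set size.

```python
import random
from collections import Counter


def make_bws_design(n_items, set_size, n_rounds, seed=0):
    """Build BWS sets in which every item appears exactly n_rounds times.

    Hypothetical helper for illustration. Each round shuffles all items
    and cuts the shuffled list into consecutive sets of `set_size`, so
    every item is shown exactly once per round.
    """
    if n_items % set_size != 0:
        raise ValueError("this sketch needs n_items to be a multiple of set_size")
    rng = random.Random(seed)
    design = []
    for _ in range(n_rounds):
        order = list(range(n_items))
        rng.shuffle(order)
        for start in range(0, n_items, set_size):
            design.append(order[start:start + set_size])
    return design


# 20 items shown five at a time, three rounds:
# 3 rounds x (20 / 5) sets per round = 12 sets for the respondent.
design = make_bws_design(n_items=20, set_size=5, n_rounds=3)

# Count how many times each item was shown across all sets.
appearances = Counter(item for bws_set in design for item in bws_set)

print(len(design))                # 12
print(set(appearances.values()))  # {3}
```

With 20 items in sets of five, three rounds yield 12 sets and three exposures per item, which falls inside the eight-to-15-set and two-to-three-appearance ranges described above.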

BWS scores show greater differences among the items and the results are more predictively accurate of held-out information (Cohen and Orme 2004, Chrzan and Golovashkina 2006). You will find a greater number of statistically significan...