Editor’s note: Andrew Jeavons is president of MI Pro USA Inc., a Cincinnati-based software firm.

Market researchers usually use some form of data when they perform studies for clients. What constitutes data is an interesting question, and the nature of data - and how we represent it - offers some insights into how we will need to work in the future.

Data has a long history, one linked to the development of writing. Research into the origins of writing has been going on for some years, and it is an interesting fact that the first writing was motivated by nothing other than the needs of commerce. The motivation wasn’t to record epic poems or laws or even history but the need to keep track of goods and livestock. Without business there would arguably have been little culture!

A book called Archaic Bookkeeping by Nissen, Damerow and Englund provides a fascinating insight into the origins of writing. The cradle of writing was Mesopotamia, the earliest items coming from Uruk in lower Babylonia, in what is now Iraq.

Denise Schmandt-Besserat has written extensively on the earliest forms of data records, dating from around 8000-3000 B.C. Neolithic clay tokens, or bullae, seem to have been used as a commercial record-keeping system. The principles were very simple: six bullae meant you had six chickens (or were owed them or sold them). Different shapes (cones, ovoids, spheres) may have represented different measurements of grain or another commodity. The path of evolution was from simpler tokens to more complex ones differentiated by shape. Schmandt-Besserat says that the appearance of the more complex token system seems to have correlated with urban formation. As people began to live in larger and larger settlements, trade became more complex, and the notation system evolved accordingly. She also believes that the more complex token system was driven by the development of an elite within society. (The implication is that they were managing resources for the larger population.) It is thought that from these clay tokens the first writing - cuneiform - developed. Data and the need to record it have had epic consequences.

One of the problems with bullae (and a lot of data sets we have now) is that you need metadata with them. Metadata is the data that describes the data - e.g., that 10 bullae means 10 cows. If you lose either the data or the metadata you have a problem. Matching a dataset to its correct metadata can still be an issue. Because of the way we have to work, we dissociate data from metadata, and that dissociation can be the cause of errors. For the Babylonians the solution was to develop a way of merging the bullae and the metadata into a single document, eventually getting rid of the bullae as distinct physical items.
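The difference between dissociated and merged metadata can be sketched in a few lines of code. This is a purely hypothetical illustration (the field names and the `describe` helper are invented for the example): a bare number needs an external codebook, while a self-describing data point carries its meaning with it.

```python
# Hypothetical sketch: a self-describing data point that carries its metadata
# with it, so data and metadata cannot be separated and mismatched.
bare_value = 10  # 10 of... what? Useless without a separate codebook.

self_describing = {"value": 10, "unit": "cows", "question": "HERD_SIZE"}

def describe(record):
    """Render a self-describing data point in plain language."""
    return f"{record['value']} {record['unit']} ({record['question']})"

print(describe(self_describing))  # -> 10 cows (HERD_SIZE)
```

Lose the dict's keys and you are back to the Babylonians' problem: a pile of tokens with no way to say what they count.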

We are evolving and recovering from our own era of bullae, the cards and columns that have been data recording’s fundamental mechanism for decades. That isn’t to say the concepts involved with cards and columns were not a revolution in their time, but that time has passed. In fact bullae were better than cards and columns - at least they could have a different shape! With cards and columns we again had the problem of dissociation: the metadata and paradata (information about how data was collected) were not an integral part of the data. We need to bring them all closer together to make data management easier, quicker and less prone to errors.

Freed from the restraints

Developing a link to metadata is only one way that new data representations can help with market research. Being freed from the monodimensional restraints of cards and columns has other advantages.

Data is more than numeric information. One example is the concept of global codes across a questionnaire. A global code is a way of denoting that any question has a value such as “don’t know/no answer,” “not asked” or even “missing,” with the same value representing that condition across all questions. If you use the card-and-column method of holding your data, a question with 27 possible responses will usually have a number such as 98 or 99 allocated to “don’t know/no answer.” Having this discontinuous code range can cause problems. In many systems it means you have to allocate space to hold 98 or 99 possible codes, even though there is no chance that the codes above 27 and below 98 or 99 will ever be used. This preallocation of space to hold possible values leads to what is termed the “sparse matrix” problem: you can end up with a potentially large data file which is mostly empty space and very little real data.

If you have a question with 122 possible answers you will have to pick a value such as 123 or 199 or 999 for “don’t know/no answer.” The point is that the value changes according to the code list or type of the question. We can’t have the same “don’t know/no answer” code for questions with 27 potential responses and questions with 122 responses and still be efficient in our data storage.

It’s far easier to have some non-numeric character to represent “don’t know/no answer” across EVERY question. You can even start to represent really useful things like “not asked in this wave” or “not asked in this country” as some software does. Missing values can be represented as missing consistently across data sets. You can only do this if you don’t use a strictly numeric way of representing your data.
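What non-numeric global codes buy you can be shown in a short sketch. The sentinel strings and the `dk_rate` helper here are hypothetical, not taken from any particular package: the point is that one code works for every question, whatever its code range.

```python
# Hypothetical global codes: one non-numeric sentinel per condition, reused
# across every question regardless of how many response codes it has.
DK = "DK"    # don't know / no answer
NW = "NW"    # not asked in this wave

q_small = [1, 5, 27, DK]    # responses to a question with 27 possible codes
q_large = [122, 7, DK, NW]  # responses to a question with 122 possible codes

def dk_rate(responses):
    """Same test works for any question - no per-question magic numbers."""
    return sum(r == DK for r in responses) / len(responses)

print(dk_rate(q_small), dk_rate(q_large))  # 0.25 0.25
```

With numeric codes, the same calculation would need a different "don't know" value per question, and a lookup table to remember which was which.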

Not flexible enough

The whole punched card thing is actually quite complex (see www.maxmon.com/punch1.htm for some history). But today we have vestigial cards and columns which are just not flexible enough. Some software uses global codes, some doesn’t. It’s usually the programs that are still tied in some way to the card-and-column concept that don’t.

It’s not just global codes that are useful. Hierarchical data can also benefit from non-numeric data items. Data hierarchies are structures such as credit cards owned by people within a household, or items bought on a shopping trip. They can get quite complicated, with several “levels” to the hierarchy. Again it becomes a question of knowing more about your data. When you have a variable X it would often be very useful if, along with the value of X, the data itself encoded where that value sits in the hierarchy it was derived from. One reason is the problem of performing a mismatched analysis. With many software systems you can crosstabulate variables that are on different levels of the hierarchy and produce garbage - embarrassing garbage in at least one case that I know of. If the hierarchical structure is contained within the data this risk is reduced considerably; the analysis software can check that the variables are at the right level. Card-and-column data simply does not have the richness to encode all the dimensions of data along with metadata about the structure of the data. We have database technologies which can organize data to a degree and provide this structural component, but encoding the data with information such as “question not asked” or “question skipped over due to program control” is a more complex problem.
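That safety check is easy to sketch once each variable knows its own level. This is a hypothetical structure (the `Variable` class and level names are invented for illustration), not how any particular analysis package works:

```python
# Sketch (hypothetical structure): each variable carries its hierarchy level,
# so the analysis layer can refuse a crosstab across mismatched levels
# instead of silently producing garbage.
from dataclasses import dataclass, field

@dataclass
class Variable:
    name: str
    level: str            # e.g. "household", "person", "credit_card"
    values: list = field(default_factory=list)

def check_crosstab(x, y):
    """Raise on a level mismatch, before any table is produced."""
    if x.level != y.level:
        raise ValueError(f"cannot crosstab {x.name} ({x.level}) "
                         f"against {y.name} ({y.level})")

income = Variable("income", "household")
card_type = Variable("card_type", "credit_card")

try:
    check_crosstab(income, card_type)
except ValueError as e:
    print(e)  # the mismatch is caught up front
```

Without the `level` attribute the mismatch is invisible to the software, and the garbage table looks just as tidy as a correct one.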

Optimal way

Even when we have our dataset with non-numeric information there is still the question of organizing the data in an optimal way for analysis. Crosstabulation is still the dominant way of analyzing quantitative data in market research. Crosstabulation involves a “linear” pass through the data: each record is examined to see if it should be included in the crosstabulation. A relational database is a very efficient way to organize data while it is being collected and edited, but retrieving data from one for crosstabulation is very inefficient. So it is now common to invert data for analysis: instead of storing the data respondent by respondent, it is inverted so that all the responses to a question are stored together. You end up with a much more efficient data structure for analysis. This is an important factor in the current data climate. We increasingly have access to huge amounts of data, and we need ways of representing it optimally for analysis. While there are many new, and very interesting, analytic tools coming into the marketplace, crosstabulation will be around for quite some time to come.
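The inversion step itself is just a regrouping. A minimal sketch, with invented question names and an invented `invert` helper:

```python
# Sketch of data inversion: respondent-ordered records are regrouped so that
# all responses to one question sit together, ready for a linear crosstab pass.
records = [                   # one record per respondent
    {"q1": 1, "q2": "DK"},
    {"q1": 2, "q2": 5},
    {"q1": 1, "q2": 5},
]

def invert(rows):
    """Turn a list of respondent records into one column per question."""
    columns = {}
    for row in rows:
        for question, answer in row.items():
            columns.setdefault(question, []).append(answer)
    return columns

inverted = invert(records)
print(inverted["q1"])  # [1, 2, 1] - every answer to q1, stored contiguously
```

A crosstab of q1 now reads one contiguous list instead of hopping across every respondent record - the same trade-off that column-oriented storage makes in general.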

Change the nature

Because of tools such as XML we now have the potential to change the nature of the data we collect and allow it to become richer and more “dimensional.” But we aren’t yet at the ideal point. We need to make sure that the potential for dissociation of data and metadata is minimized; each data point should be intimately encoded with its metadata. Increasing the dimensions of the data, with global codes for “no answer,” “not present in this wave,” or “not asked,” will make the task of analysis simpler and more efficient. We do have to face up to the fact that cards and columns can no longer provide us with the forms of data representation we need. It’s time to evolve!