Unstructured data? Categorization using text analytics

Editor’s note: Brion Scheidel is director, text analytics at marketing research firm MaritzCX, Detroit. This is an edited version of a post that originally appeared here under the title, “Making sense of unstructured data with text analytics.”

The amount of data in the world is increasing at an exponential rate. Every minute there are nearly 4.2 million posts uploaded to Facebook, nearly 3 million tweets and thousands of responses to open-ended survey questions.

I’m often asked about all this unstructured data and how companies can make sense of it. The short answer: categorization using text analytics. Inevitably, the follow-up question is, “What’s the best approach to implementing categorization using text analytics?” I’m going to dive deeper into answering these two questions. My hope is that you’ll be able to see how text analytics can help you make sense of your company’s data.

Categorization

How do companies make sense of all this unstructured data? One main way is categorization using text analytics. Categorization is the process of defining a set of categories and then setting up a system to assign those categories to comments. Categorization also typically involves deciding what sentiment to associate with each category assignment (e.g., negative, neutral, positive). Once we have assigned categories and associated sentiments, we can start quantifying the results and displaying them in charts and reports alongside satisfaction and recommendation scores.

An airline, for example, might have a category set that would include categories such as:

food quality;
food choice;
food presentation;
food freshness;
food general;
special meals;
drinks;
alcoholic drinks;
coffee; and
tea.

And that’s just the set of food and beverage categories. They would probably also have categories related to their customers’ experience at the airport, experience on the flight and experience with customer service. With categories and sentiment assigned, we can start analyzing charts and querying data to answer specific questions: Which categories have the most assignments? Which categories have the worst sentiment? Which categories have downward-trending sentiment?
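
As a rough illustration of those three questions, here is a minimal Python sketch. The comment IDs, the category assignments and the -1/0/+1 sentiment coding are all made up for the example; a real text analytics tool would supply its own output format.

```python
from collections import defaultdict

# Hypothetical category assignments: (comment_id, category, sentiment),
# with sentiment coded as -1 (negative), 0 (neutral) or +1 (positive).
assignments = [
    (1, "food quality", -1),
    (1, "coffee", 1),
    (2, "food quality", -1),
    (3, "food choice", 0),
    (4, "coffee", 1),
    (5, "food quality", 1),
    (6, "special meals", -1),
]


def summarize(rows):
    """Return {category: (assignment_count, mean_sentiment)}."""
    scores = defaultdict(list)
    for _comment_id, category, sentiment in rows:
        scores[category].append(sentiment)
    return {cat: (len(vals), sum(vals) / len(vals)) for cat, vals in scores.items()}


summary = summarize(assignments)

# Which category has the most assignments?
most_assigned = max(summary, key=lambda cat: summary[cat][0])
# Which category has the worst average sentiment?
worst_sentiment = min(summary, key=lambda cat: summary[cat][1])

print(most_assigned, summary[most_assigned])
print(worst_sentiment, summary[worst_sentiment])
```

Running the same summary on each month's comments and comparing the averages would be one simple way to spot downward-trending sentiment.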

What’s the best approach to implementing categorization using text analytics?

Let’s look at the “how” and “who” when implementing categorization.

There are two main options for the methodology (the “how”) behind categorization: machine learning and rules-based.

Machine learning: Systems like IBM Watson use machine learning to do automated categorization. This means one must first define a set of categories, manually assign those categories to a training set of comments and then feed that information to the machine-learning algorithm. If all goes well, the algorithm learns the correct things from the comments it was trained with. Using the information learned from the training set, the machine-learning-based system will then assign categories to the comments it processes. One benefit of this method is that no special skills are required. Anyone can set it up. One drawback is that it is difficult to get good results with larger category sets (i.e., category sets with more than a handful of categories).
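
To make the train-then-assign workflow concrete, here is a toy sketch in Python. It uses a hand-rolled naive Bayes classifier, which is my choice for illustration only and not a description of how IBM Watson or any commercial product works. The training comments and category labels are invented, and a real training set would be far larger.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Hypothetical hand-coded training set: (comment, category).
training_set = [
    ("the coffee was cold and weak", "coffee"),
    ("great coffee on the morning flight", "coffee"),
    ("my tea was served lukewarm", "tea"),
    ("loved the green tea selection", "tea"),
    ("the chicken tasted bland and dry", "food quality"),
    ("the pasta was delicious", "food quality"),
]


def train(examples):
    """Count words per category and comments per category."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    for text, category in examples:
        category_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, category_counts, vocabulary


def classify(text, model):
    """Return the category with the highest naive Bayes log-probability."""
    word_counts, category_counts, vocabulary = model
    total = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, n in category_counts.items():
        score = math.log(n / total)  # prior from the training set
        denom = sum(word_counts[category].values()) + len(vocabulary)
        for word in tokenize(text):
            if word in vocabulary:  # ignore words never seen in training
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[category][word] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


model = train(training_set)
print(classify("the coffee tasted burnt", model))
print(classify("no herbal tea available", model))
```

Note that this sketch assigns exactly one category per comment, whereas real comments often deserve several. With only a few training comments per category the scores are also close together, which hints at why larger category sets demand much more training data.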

Rules-based: Another way to perform automated categorization is through a rules-based approach. Using a combination of linguistic and logic skills, text analysts define a set of categories and manually create rules for those categories. These rules combine the experience and expertise of the text analysts with the natural language processing functions and processing power provided by a rules-based text analytics engine. Using those rules, the engine will then make category assignments to comments processed. This method allows you to get good results, even with larger category sets.
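
A bare-bones sketch of the idea in Python follows. The categories, patterns and sentiment word lists are hypothetical, and plain regular expressions stand in for the much richer natural language processing functions a commercial rules engine would provide. Sentiment here is scored for the whole comment, a simplification of assigning sentiment per category.

```python
import re

# Hypothetical rules: each category maps to regular-expression patterns.
CATEGORY_RULES = {
    "coffee": [r"\bcoffee\b", r"\bespresso\b", r"\blatte\b"],
    "tea": [r"\btea\b"],
    "food freshness": [r"\b(stale|fresh|soggy)\b"],
    "alcoholic drinks": [r"\b(wine|beer|cocktails?)\b"],
}

# Hypothetical sentiment word lists, far smaller than a real lexicon.
NEGATIVE = {"cold", "stale", "soggy", "bad", "terrible", "weak"}
POSITIVE = {"great", "fresh", "delicious", "excellent", "good"}


def categorize(comment):
    """Return every category with at least one matching pattern."""
    text = comment.lower()
    return [
        category
        for category, patterns in CATEGORY_RULES.items()
        if any(re.search(pattern, text) for pattern in patterns)
    ]


def sentiment(comment):
    """Score the comment by counting positive and negative words."""
    words = set(re.findall(r"[a-z']+", comment.lower()))
    score = len(words & POSITIVE) - len(words & NEGATIVE)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


comment = "The coffee was cold and the bread was stale"
print(categorize(comment), sentiment(comment))
print(categorize("The steak was fine"))
```

The word-boundary markers (`\b`) keep "steak" from triggering the tea category, a small example of the linguistic care that rule writing requires. Unlike the machine-learning approach, a comment can pick up several categories at once, and adding a category means adding a rule rather than coding a new training set.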

Manual categorization: There’s a third option called manual categorization which can be used instead of machine learning or rules, or in addition to them. This is the old-school approach of having people read each comment and decide which categories the comment should be assigned to, and with what sentiment. While admittedly low-tech, this is a reasonable, cost-effective approach for small volumes and less common languages.

Implement and maintain

Once you’ve decided whether to use machine learning, rules-based categorization, manual categorization or some combination, the next question to answer is, “Who is going to implement and maintain this?” Your choices are essentially: pay someone else to do it (service-based) or do it yourself.

Service-based: With a service-based approach, you select which text analytics provider best suits your needs and you pay them to categorize your comments. With this approach, text analysts leverage years of experience on your behalf. While they do the heavy lifting, you can concentrate on analyzing and interpreting the text analytics results. Should you need to add a new category or tweak the logic or training behind a category, however, you need to rely on your text analytics (TA) provider for updates. This can take time if your TA provider isn’t responsive. This is one reason some companies choose to do text analytics themselves.

Do-it-yourself: With a DIY approach, you select which text analytics tool best suits your needs. Often the tool will come with an off-the-shelf set of categories for your sector, but this will typically get you only 70 percent of what you need. To get good results with a DIY approach, you need to be prepared to invest time and effort to configure the category and sentiment algorithms and maintain them over time. Here is a rough idea of what you’ll invest:

Text analytics software: $25,000 to $200,000 annually
Rules-based:
- Text analysts to implement category set (200 to 300 hours per language)
- Text analysts to audit and maintain category set (100 to 200 hours annually per language)
Machine-learning-based:
- Manual coders to create training sets (about $1.00 per comment – typically needing at least 100 comments per category)
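
The figures above can be turned into a quick back-of-the-envelope calculation. This Python sketch uses only the rough numbers listed here; the function names are hypothetical, and it assumes each training comment counts toward a single category and that the first year includes a full year of maintenance.

```python
def training_set_cost(num_categories, comments_per_category=100, cost_per_comment=1.00):
    """Minimum cost in dollars of hand-coding a machine-learning training set."""
    return num_categories * comments_per_category * cost_per_comment


def rules_based_hours(num_languages, years=1):
    """Return (low, high) analyst hours: setup plus `years` of maintenance."""
    setup_low, setup_high = 200 * num_languages, 300 * num_languages
    upkeep_low, upkeep_high = 100 * num_languages * years, 200 * num_languages * years
    return setup_low + upkeep_low, setup_high + upkeep_high


# The ten food and beverage categories from the airline example.
print(training_set_cost(10))

# A rules-based category set in two languages, first year.
print(rules_based_hours(2))
```

Both estimates come on top of the annual software cost, and the training-set figure is a floor because 100 comments per category is the stated minimum.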

If you decide to DIY, you may wind up spending most of your time creating text analytics results rather than using them. While this may be a deal breaker for some, you will have full control of the process and can fine-tune to your heart’s content.

Solution for your company

Having a text analytics solution for your company is crucial. It will increase the speed at which you gain customer insights. Before setting up text analytics, companies first need to decide if they want to use a machine-learning approach or a rules-based approach. They then need to decide on a DIY or service-based implementation.

In my experience, the precision of machine learning falls short of what we obtain with a rules-based approach. By developing and maintaining category sets with input from our customers, we ensure they are relevant and actionable. These category sets are also based on decades of experience in the various business sectors in which we work (automotive, retail, hospitality, banking, insurance, restaurant, telecommunications, etc.).

It’s important to do your own research before deciding. Many companies I work with have opted for a service-based approach to categorization but some have chosen to invest the time and money to develop in-house text analytics solutions and expertise. And some even have a combination of the two. Most importantly, almost all have concluded that categorizing their comments with text analytics is a key way to understand what their customers are saying.

Deciding on the “how” and the “who” is the first step in putting in place a solution that will help you make sense of the mounds of unstructured data flooding into your company.