The basics of NLP

Editor's note: Briana Brownell is founder and CEO of Pure Strategy, a Canada-based marketing research and analytics firm. 

“Hey Siri, what’s natural language processing?”

She hears me perfectly and, after less than a second, she pulls up a Wikipedia article on natural language processing. I skim it. It tells me that natural language processing, usually abbreviated NLP, is about programming computers to process and analyze large amounts of natural language data. Yes, Siri, if only it were that easy.

Without realizing it, most of us are already using technology powered by natural language processing every day. It’s behind the page rank in Google’s search algorithm, when we ask Amazon’s Alexa to add paper towels to our shopping list and when we talk with a chatbot to dispute a fraudulent charge on our credit card or to add data to our cell phone plan. Our encounters with NLP-enabled technology are not always so helpful. Consider Tay, the Twitter bot that became racist and spouted conspiracy theories in just a few hours, or a chatty new contact on Skype that tries to lure you to her fraudulent Web site.

Technology using NLP is moving quickly, making many new applications possible. But this speed means that it’s a challenge for many organizations to know how to take full advantage of it. Social media, voice search and conversational systems have fundamentally changed the way that customers interact with brands. Even further, they have given companies a more holistic view of their customers. We have come a long way from information retrieval systems, the most common early technology that relied on NLP, which had little to no effect on customer-facing businesses. Typically tucked away in the IT department, it performed tasks like document retrieval or database storage but did not directly affect meaningful business metrics.

Within the past few years, the volume of data coming into organizations in the form of natural language directly from customers has skyrocketed and for many organizations has become unmanageable, despite its potential to be a powerhouse of insights. For this reason, NLP is having a major impact behind the scenes: quickly compiling open-ends in market research surveys, prioritizing complaints that come into a help desk, tracking employee morale or even helping sales personnel close an important client. Increased availability of data sets for training natural language processing systems, improvements in the underlying science and faster processors allow a whole new array of possibilities to understand, monitor and ultimately improve customer experience.

However, language remains extremely challenging to analyze and the science behind analyzing it effectively is far from perfect. Humans spend much of their lives immersed in language, so we do a lot of sophisticated things with it without thinking about how complex it really is. The technology to understand language has a noble history: the Turing test, the first test designed to decide whether a machine was truly intelligent, was predicated on communication with the AI in the form of a conversation where the user could ask the AI questions. Technologists have been working on problems in NLP ever since.

Even though the technology is moving quickly, some problems remain much more challenging than others, and a basic understanding of the technology allows businesses to evaluate use cases and apply it successfully in their organizations.

Word tokenization

The idea of tokenization, or splitting a text input into smaller pieces, also had its foundation in information retrieval. A common challenge was to search for a term in many documents and identify where it occurred. It’s not so different from sorting through customer comments to find mentions of a competitor or a specific product.

Word tokenization is the basis of word clouds, which, despite their apparent simplicity, can provide a fair amount of insight into the contents of a large body of text data. The most common individual words give us a broad sense of the contents of the data.

The word tokenization procedure splits each comment, paragraph or document into a list of words, usually by considering anything with a space or punctuation before or after as a “token.” Because it uses characters to split the input text, it is both consistent and effective on most input data streams. We can improve the visualization by keeping only important words, stripping out common words like “the” or “and.” Deciding which words to remove or keep is highly subjective, as a word which has meaning in the context of one dataset might not be relevant in the context of another.
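
To make the idea concrete, here is a minimal sketch in Python of tokenizing a few comments and counting what remains after a tiny, illustrative stop-word list is stripped out. The comments, stop words and splitting rule are invented for illustration; production systems use larger, dataset-specific stop-word lists and more careful token rules.

```python
import re
from collections import Counter

# A deliberately tiny stop-word list, for illustration only;
# real projects tune this list to the dataset at hand.
STOP_WORDS = {"the", "and", "a", "to", "of", "is", "it", "was", "at"}

def tokenize(text):
    """Split a comment into lowercase tokens on anything that is not
    a letter, digit or apostrophe (i.e., on spaces and punctuation)."""
    return re.findall(r"[a-z0-9']+", text.lower())

comments = [
    "The grocery store was clean and the staff was friendly.",
    "Great selection at the bookstore, and the staff was helpful.",
]

counts = Counter(
    token
    for comment in comments
    for token in tokenize(comment)
    if token not in STOP_WORDS
)

print(counts.most_common(3))  # [('staff', 2), ('grocery', 1), ('store', 1)]
```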

Since word tokenization uses spaces to decide what elements to split up, compound words, hyphenated words and phrases are challenging to handle. In a survey about shopping habits, we may wish to know the relative frequency of “store” in both “grocery store” and “bookstore.” But since bookstore is a single compound word, most word tokenizers will treat it as a single token. On the flipside, phrases made up of multiple words sometimes should be treated as a single entity, such as “soap opera,” which does not have the same meaning when we split it into two tokens. These inconsistencies in language make even the most basic processing a challenge to handle consistently.

Algorithms that automatically link phrases, or that identify compound and hyphenated words that should be split, introduce errors of their own by sometimes separating or linking tokens when they should not. Most key-phrase algorithms work under the assumption that the grammar of the underlying sentence is important in deciding whether the words should be linked. When there is no underlying sentence, the algorithm has no way of knowing whether a series of words is a phrase or not. For this reason, poor grammar and incomplete sentences have an outsized effect on whether key phrases can be accurately linked, so open-ends where respondents give point-form answers usually don’t work very well. Full-text reviews or text from e-mails tend to be much more successful.
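
One crude but illustrative way to surface candidate phrases is simply to count adjacent word pairs across many comments and keep the pairs that recur, as in the sketch below; real key-phrase tools layer grammar and part-of-speech information on top of this kind of statistic. The token lists and threshold here are invented examples.

```python
from collections import Counter

def bigram_candidates(token_lists, min_count=2):
    """Count adjacent word pairs across all comments and return the
    pairs that occur often enough to treat as candidate phrases."""
    pairs = Counter()
    for tokens in token_lists:
        pairs.update(zip(tokens, tokens[1:]))
    return [pair for pair, n in pairs.most_common() if n >= min_count]

token_lists = [
    ["love", "that", "soap", "opera"],
    ["the", "soap", "opera", "was", "great"],
    ["cheap", "soap"],
]

print(bigram_candidates(token_lists))  # [('soap', 'opera')]
```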

Sentence patterns

Analyzing the open-ended responses in a customer survey gives a level of understanding that is impossible to get from strictly quantitative measures. It allows us to capture the “why” behind satisfaction ratings and provides more insight on how purchase decisions are made. But aggregating this data while still attempting to capture these nuances is challenging. Simply counting words is not enough. For this reason, some rules-based NLP looks for sentence patterns to extract useful information.

In the first “chatterbots” developed in the 1960s, the algorithms looked for specific keywords in the text in a certain order and then responded based on rigid rules. One of the most famous examples was Eliza, which was designed to emulate a Rogerian psychotherapist. It carried on a conversation with a user – assumed to be a patient – by matching the pattern of a sentence to one in its code and producing a response or asking a new question. Despite its simplicity, Eliza was surprisingly popular and many users claimed their conversations with her were helpful.

An algorithm built using very rigid, logic-based rules without any room for subjectivity or soft meaning must work a lot harder to decode the information contained in a customer comment, especially in aggregate. Customer comments frequently are incomplete sentences and may be full of spelling mistakes. For example, if a customer said, “I don’t like the color selection,” we could automatically pick up all the text after “I don’t like” and store it in a variable field. But we also want to capture “i dont like the color selection” and any variety of other errors. A human reviewer would immediately understand this despite the spelling mistakes. To take errors like these into account, we must keep expanding our set of rules to make it comprehensive. All this additional complexity makes for a system that is time-consuming and difficult to update as markets change and new products or concepts start to appear in the data streams. It also means that manual work is still often more accurate.
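
As a rough sketch of the kind of rule described above (the patterns and labels here are made up, and a real system would need far more rules plus broader tolerance for misspellings), a rigid pattern-matcher might look something like this:

```python
import re

# A few hand-written patterns in the spirit of an Eliza-style rule set.
# The rules and labels are invented; real systems need many more rules
# and some tolerance for spelling errors.
RULES = [
    (re.compile(r"\bi (?:don'?t|do not) like (?P<issue>.+)", re.IGNORECASE), "dislike"),
    (re.compile(r"\bi (?:love|really like) (?P<issue>.+)", re.IGNORECASE), "like"),
]

def extract(comment):
    """Return (label, captured text) for the first matching rule, else None."""
    for pattern, label in RULES:
        match = pattern.search(comment)
        if match:
            return label, match.group("issue").strip(" .!")
    return None

print(extract("I don't like the color selection"))  # ('dislike', 'the color selection')
print(extract("i dont like the color selection"))   # the missing apostrophe still matches
print(extract("Colors = bad"))                       # None - no rule covers this phrasing
```

Even this small example hints at the maintenance problem: every new phrasing, misspelling or product name means another rule.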

Synonyms and multiple word sense

Usually when we have a large amount of text data we are looking for similar themes among many comments – for instance, common suggestions for improvement or reasons for satisfaction – because this allows us to prioritize the changes that will most impact customer experience. When we simply look at individual words without an outside mechanism, we have no way of knowing whether two words are synonyms.

Coding frames created by researchers solve this problem, but it means that much of the work is done manually. As new comments come in, the coding frame needs to be expanded. External frameworks for synonyms can be used but they are frequently too general to be effective. Most words have many different senses, depending on their context and use. Human coders easily understand this but for a computer it is much more difficult. Consider “interest rate.” Would we want to combine the word “interest” with its common synonyms “amuse,” “entertain” and “intrigue”?
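
In practice, a manually maintained coding frame often amounts to a lookup from surface words to codes, along the lines of the sketch below. The codes and word lists are invented for illustration, and note that nothing in this approach resolves word sense on its own: an entry for “interest” would fire for every sense of the word.

```python
# A hand-built coding frame: each code lists the surface words that
# should be grouped under it. The codes and terms are invented examples.
CODING_FRAME = {
    "price": {"expensive", "overpriced", "pricey", "cheap", "cost"},
    "service": {"service", "server", "staff", "waiter", "friendly"},
    "selection": {"selection", "variety", "choice", "choices"},
}

def code_tokens(tokens):
    """Map a tokenized comment to the set of codes it mentions."""
    return sorted({
        code
        for code, words in CODING_FRAME.items()
        for token in tokens
        if token in words
    })

print(code_tokens(["too", "expensive", "but", "friendly", "staff"]))  # ['price', 'service']
print(code_tokens(["loved", "the", "colour", "options"]))             # [] - the frame needs expanding
```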

Sentiment analysis

Sentiment analysis is one of the most common basic procedures applied to text data but, unfortunately, it’s probably also the most complained-about. Usually, sentiment analysis classifies text as positive, negative or neutral and sometimes assigns a numerical score to the comment so that responses can be ranked. This may be important when prioritizing many comments, looking for detractors or advocates or simply understanding the public perception of a company overall.

There are many sentiment analysis classifiers, and they work in one of two main ways: phrase matching or learned sentiment classification. Phrase matching uses a list of very positive, positive, negative and very negative words; it looks for these words in the text and creates a score accordingly. For example, a sentence containing “hate” is likely to be scored as very negative, regardless of its context. The second method, learned sentiment classification, is trained on data that has been tagged by human specialists and decides how positive or negative each word or phrase is based on what it learned from this initial tagging. A learning classifier might pick up on a word which just happens to appear frequently in the training dataset’s negative comments but has no real bearing on the sentiment. This results in overfitting and biases the output whenever that word appears.
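
A minimal phrase-matching scorer might look like the sketch below, using an invented, toy lexicon; real lexicons run to thousands of weighted words and phrases and are tuned to the domain.

```python
# A toy sentiment lexicon; the words and weights are invented and far
# smaller than the lexicons used in real phrase-matching systems.
LEXICON = {
    "love": 2, "great": 2, "good": 1, "helpful": 1,
    "slow": -1, "disappointed": -1, "terrible": -2, "hate": -2,
}

def score(tokens):
    """Sum word-level sentiment weights: positive totals read as positive
    comments, negative totals as negative ones, zero as neutral."""
    return sum(LEXICON.get(token, 0) for token in tokens)

print(score(["i", "hate", "the", "new", "checkout"]))                  # -2 -> negative
print(score(["great", "food", "but", "slow", "service"]))              #  1 -> mildly positive
print(score(["mortgage", "interest", "rates", "are", "too", "high"]))  #  0 -> missed entirely
```

The last example shows how a general-purpose lexicon can miss domain-specific negativity entirely, which is exactly the problem described next.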

This is usually the challenge with off-the-shelf sentiment analysis, which may or may not be applicable to the type of data you’re dealing with. A comment like “Mortgage interest rates are too high” might be a negative comment in the context of a survey about banking but it’s unlikely to be flagged as such by a general sentiment analysis system.

Word-embedding methods

Word-embedding methods, which attempt to solve some of the challenges in natural language processing, have been around since the 1960s but only recently have these techniques started to be widely used in commercial NLP applications.

Working off the assumption that a word is characterized by the company it keeps, word-embedding assigns each word a numerical vector based on the words around it, either in a linguistic context or simply the proximity to other words. In this way, it uses contextualization to closely group words that are found in similar contexts. This allows us to develop a more nuanced understanding of the way in which a given word is normally used. 

These systems have the potential to be very effective since they can be used to understand the meaning of a word in context. From the resulting vectors, we can figure out how closely synonymous two words are and what their closest synonyms are in the dataset. We can extend the approach to take into account that words may have multiple meanings.
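
A back-of-the-envelope way to see the “company it keeps” idea is to build count-based context vectors from a handful of comments and compare them with cosine similarity, as in the sketch below. Real word-embedding models such as Word2Vec or GloVe learn dense vectors from millions of examples rather than raw counts; the comments and window size here are invented for illustration.

```python
import math
from collections import Counter, defaultdict

def context_vectors(token_lists, window=2):
    """Build a count-based context vector for each word: how often every
    other word appears within `window` positions of it."""
    vectors = defaultdict(Counter)
    for tokens in token_lists:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if i != j:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

comments = [
    ["great", "food"], ["good", "food"],
    ["great", "service"], ["good", "service"],
    ["expensive", "pizza"], ["overpriced", "pizza"],
]

vectors = context_vectors(comments)
print(cosine(vectors["expensive"], vectors["overpriced"]))  # 1.0 - identical contexts
print(cosine(vectors["expensive"], vectors["food"]))        # 0.0 - no shared context
print(cosine(vectors["food"], vectors["service"]))          # 1.0 - different topics, same contexts
```

The last line already hints at the drawback discussed below: “food” and “service” look identical to this method because they appear in identical contexts.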

Word-embedding has allowed machines to build up a much more refined idea of what words mean, so classifying and grouping comments is easier, but even this process has drawbacks. Since word-embedding works with the idea of replacement rather than how words may complement one another, words which can be interchanged but have very different meanings are frequently found embedded close to one another. For instance, in the phrase “expensive pizza,” the adjective “expensive” would embed closer to “overpriced” than to “delicious,” but both would still be much closer than a completely different type of word, like “popsicle.” The word “expensive” is not likely to ever find itself in a sentence where it could be a replacement for “popsicle.”

This creates an interesting problem, where words which suggest the same topic may not actually embed closer together than words that mean something quite different. In data about restaurant satisfaction, “food” and “service” end up close together because a great many of the comments left by customers are simple ones like “great food” and “good service.” Meanwhile, other words related to service, like “server,” embed farther away because “service” and “server” aren’t interchangeable in many sentences, even though we know that they relate to the same topic.

Although word-embedding works well, the volume of data needed to create accurate results is daunting – typically in the millions of rows. The more nuance we want to capture about individual words, such as distinguishing various word senses, the more data we need. Most businesses simply don’t have enough data to use these methods effectively. Efforts are even being made to crowdsource NLP development, with companies in the space using services like Amazon’s Mechanical Turk to build large, collaborative datasets so that people can help machines learn, which could push the technology to new horizons.

Capture a richness

There’s no doubt that the advances in NLP are fundamentally changing how customers interact with brands and expanding the depth of insights that companies can draw from their customers. Understanding the technology behind it, what it can and can’t do, makes insight-generation tractable for those in the organization who are tasked with using it. Whether it is a straightforward word cloud or a neural network which embeds tens of thousands of words to get a deeper overview of attitudes about a topic, working with language data allows companies to capture a richness that augments quantitative measures. The science driving these tools continues to improve as the underlying methodology is developed and tested in real applications. New, targeted datasets are becoming available and some of the challenges in using more sophisticated approaches to language will disappear.

Quickly finding insights in large volumes of customer comments is empowering companies to provide better customer experience, mitigate risks and enter new markets confidently. It allows companies to hear directly from the customer in their own words. Sometimes the most surprising insights come from this data. NLP technology means that game-changing insight in a mountain of data is finally possible to find.