There is much to unpack in the world of Natural Language Processing. We sat down with Senior Data Scientist Alex Tennant for his perspective on the opportunities and complexities of NLP and how Mosaic is paving the way for the consistent evolution of this powerful technology.
1. What is your official title, role and responsibilities at Mosaic Data Science?
As a Senior Data Scientist at Mosaic, Tennant acts as a consultant for businesses large and small, using machine learning to optimize processes and solve their most challenging data science problems. He manages several projects at once and helps move each one through its stages, from scoping and inception to launching the solution in production.
In a nutshell, Tennant applies statistical or machine learning models and mathematical optimization to improve how something operates or to help understand why it may not work as it should. Using a combination of these techniques, he helps deliver solutions that enrich Mosaic’s customers’ understanding of their own processes and data while also finding effective solutions to their analytics problems.
2. How did you first come across NLP in your data science journey? What was your first project?
Tennant first dabbled in Natural Language Processing during his first job out of graduate school in Canada. He worked on a project that analyzed formal written responses to gauge public and corporate opinion on an upcoming policy.
He was challenged with looking at these responses to see how people, businesses, and non-profits felt about the proposed change, which meant sifting through a few thousand documents. Using NLP techniques, Tennant could efficiently analyze these documents, categorizing and summarizing them to compare and contrast opinions of the proposed change between different groups.
3. What is natural language processing (NLP)? What are the different techniques and most promising use cases associated with this technology?
“I think it’s a vague term that really depends on the application,” Tennant explains. “A word cloud can be considered NLP, where you use count statistics to get a bird’s eye view of what people are talking about. But you can also go a lot deeper into it, trying to understand what is going on within the word cloud by analyzing documents, clustering everything into different groups to determine if they represent something important.”
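To make the "count statistics" behind a word cloud concrete, here is a minimal sketch in Python. The sample responses, the stop-word list, and the function name `word_counts` are all made up for illustration; they are not data or code from Tennant's project.

```python
import re
from collections import Counter

# Hypothetical sample responses, invented for this sketch.
responses = [
    "The new policy will raise costs for small businesses.",
    "Small businesses support the policy if costs stay low.",
    "Non-profits worry the policy ignores local communities.",
]

# A tiny hand-picked stop-word list (an assumption, not a standard list).
STOP_WORDS = {"the", "will", "for", "if", "a", "an", "of", "to"}

def word_counts(docs):
    """Count how often each non-stop word appears across all documents."""
    counts = Counter()
    for doc in docs:
        # Lowercase the text and keep alphabetic tokens, allowing hyphens.
        tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", doc.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

counts = word_counts(responses)
# The most frequent words are what a word cloud would draw largest.
print(counts.most_common(5))
```

A word cloud is essentially a picture of this table, with each word sized by its count. The deeper analysis Tennant describes, such as clustering documents into groups, would start from richer representations than raw counts.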
He also points to NLP’s latest cutting-edge technologies, such as transformer models. Data scientists can train a machine learning model to read and understand a document. This approach works very well for document summarization, which can help us understand the key points of many documents without needing to read each piece of text while keeping meticulous notes.
“But where NLP gets really cool is with things like named entity recognition, specifically applied in the medical field,” Tennant adds. “You can train a neural network to recognize certain words or concepts in context, categorize them, and automatically extract different symptoms or diagnoses used to describe the identified words or concepts. This extracted information can be leveraged to produce statistical models, which can be used to help diagnose a patient. Essentially, you’re teaching the computer to read medical notes.”
4. How do you think Natural Language Processing differs from other deep learning/ML applications? How do they tie together?
Tennant explains that NLP techniques can benefit other deep learning applications, particularly time-series analysis, because they make predictions from existing data by identifying patterns.
“If you take a step back, what you’re trying to understand with NLP is sequences or the meaning behind the order of a list of things. NLP takes that ordered list of things and makes models and predictions to understand it,” he says. “If you look at this as more of an arbitrary collection of ordered things, then that sequence could be considered a time series analysis problem. This refers to events that happen in order, where the NLP model makes its next prediction based on what has already been observed. Advances in NLP can be applied to any sequential data, which can be tied into other deep learning applications outside of NLP.”
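The idea that a sequence model does not care whether its items are words or events can be shown with a toy example. The sketch below is a simple bigram counter, far simpler than the deep learning models Tennant has in mind, and the function names and sample sequences are assumptions made for illustration.

```python
from collections import Counter, defaultdict

def fit_bigram(sequence):
    """Record, for each item, how often every other item follows it."""
    followers = defaultdict(Counter)
    for current, nxt in zip(sequence, sequence[1:]):
        followers[current][nxt] += 1
    return followers

def predict_next(followers, current):
    """Return the most frequently observed follower, or None if unseen."""
    if current not in followers:
        return None
    return followers[current].most_common(1)[0][0]

# The same code handles a sentence...
words = "the cat sat on the mat and the cat slept".split()
word_model = fit_bigram(words)

# ...and a hypothetical log of machine states ordered in time.
events = ["idle", "start", "run", "stop",
          "idle", "start", "run", "fault",
          "idle", "start", "run", "stop"]
event_model = fit_bigram(events)

print(predict_next(word_model, "the"))
print(predict_next(event_model, "run"))
```

Nothing in `fit_bigram` knows it is looking at language. It only sees an ordered list and predicts the next item from what it has already observed, which is why sequence techniques developed for text can carry over to time-ordered data.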
When looking at the differences between NLP and other technologies, Tennant points to the fact that such sequences are very understandable to humans. We can read them and make sense of them; therefore, we can figure out how to structure them. However, what is simple for a human can often be difficult and complex for a computer.
To a computer, all text data is unstructured, meaning it is not easily understood by the machine. Information is not stored in a consistent format like rows and columns; it is buried within a maze of words and phrases. To a computer, this is messy data. This is where NLP differs from other machine learning applications, which rely on more structured data that is made for the computer to digest and analyze.
5. Why do you think NLP problems are so important to solve? What do you believe are the benefits and limitations of NLP?
To answer the first question, Tennant points back to the fact that Natural Language Processing deals with unstructured data.
“Most data is unstructured,” he explains. “Using NLP techniques allows us to find new and exciting ways to take unstructured data and do some modeling with it without having to do days to weeks to years of work to bring some structure to what we’ve seen.”
When it comes to the benefits of NLP, Tennant believes they lie in the human aspect of data science and machine learning.
“Computers don’t communicate through text or context,” he says. “Humans don’t log everything they want to communicate in a spreadsheet. We can communicate more information with text than we would in a spreadsheet. NLP helps tap into the information that lies in text data to pull more insights than ever before.”
To support this, Tennant uses the example of going through medical notes. A patient with a long history of treatment likely has hundreds of notes that a new doctor or someone doing a study must go through manually. NLP can help people digest these notes more quickly by summarizing the text into different categories or ideas, moving research and treatment along more efficiently.
“NLP can also be used as a supplementary piece to traditional modeling,” Tennant adds. “Without NLP, a classical approach would be for data scientists to look for words that appear together frequently in a document and hope those counting statistics can be informative for creating a predictive model. However, one of the main problems with this approach is that all context in these documents is washed away. Using NLP, we can keep and model important context, delivering a richer background that brings actual meaning to the data.”
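A small Python sketch shows how counting statistics wash context away. The two sentences and the helper names are invented for illustration, and keeping adjacent word pairs is only one simple way to retain some word order; it is not the contextual modeling Tennant describes.

```python
from collections import Counter

def bag_of_words(text):
    """Count words while ignoring the order in which they appear."""
    return Counter(text.lower().split())

def bigrams(text):
    """List adjacent word pairs, which preserves local word order."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))

# Two made-up sentences with opposite meanings but identical words.
a = "the treatment helped not the placebo"
b = "the placebo helped not the treatment"

print(bag_of_words(a) == bag_of_words(b))  # the word counts are identical
print(bigrams(a) == bigrams(b))            # the ordered pairs are not
```

A model fed only the word counts cannot tell these two sentences apart, even though they say opposite things. Any representation that keeps order, from simple word pairs to modern context-aware models, retains the information needed to separate them.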
The biggest limitation Tennant has seen with Natural Language Processing is its tendency to magnify biases contained within any text used for training. In addition, as language changes over time, NLP models may not be up-to-date on the meaning of a word such as “gay” today compared to how it was used in the past. This sort of context, which isn’t necessarily written down in the document, can become an obstacle for NLP.
6. Why is Mosaic so well-positioned to help with NLP use cases? What sets the team or company apart?
“Mosaic has ample experience in NLP, and our team consists of individuals who have worked in this space for years,” Tennant says. “The biggest benefit Mosaic brings to the table is our background in having seen and used these techniques, such as sequence analysis, outside of NLP for other use cases.”
The techniques used for NLP are not limited to NLP, and Mosaic Data Science has a keen eye for identifying new applications for advances in NLP in other domains. Mosaic’s team has consistently been able to take NLP techniques out of their original context, modify them, and apply them to solve other classes of problem – a skill that is invaluable for problems that cannot be solved with out-of-the-box solutions.
7. Where do you see the future of NLP going? How does Mosaic play into this?
“The cutting-edge NLP models being used today can improve search results online by understanding the context of questions that are being asked. This means that your Google searches could get more accurate,” Tennant says.
He also hopes for a future where NLP acts as an adversarial model to get people out of their “tailored search results traps.” Tennant notes it’s very easy to get trapped in an online search, where the user isn’t coming across anything new or different from what they’ve been searching for.
“The results you’re getting on Google are too tailored,” he says. “NLP can analyze how your search bar questions are being asked. Rather than reinforcing your biases, an adversarial NLP model can get you out of that trap and give you access to a wealth of fresh information to aid your search.”
And how does Mosaic play into this? The answer, from Tennant’s perspective, is two-fold. On the one hand, the Mosaic team can offer a fresh opinion and brainstorm a new solution for a client seeking an outside perspective. On the other hand, Mosaic can bring the manpower and technical expertise needed to take a problem from scoping to production. Either way, the team’s strong background in Natural Language Processing and breadth of experience across various industries make it a great partner for even the most complex data science project.