Solving Unstructured Data: NLP and Language Models as Part of the Enterprise AI Strategy 

In this special guest feature, Prabhod Sunkara, Co-founder and COO of nRoad, Inc., discusses how enterprises are increasingly relying on unstructured data for analytic, regulatory, and corporate decision-making purposes. nRoad is a purpose-built natural-language processing (NLP) platform for unstructured data in the financial services sector and the first company to declare a “War on Documents.” Prior to nRoad, Prabhod held various leadership roles in product development, operations, and solution architecture. His passion for building and delivering outcome-driven AI solutions has successfully improved processes at large global financial firms such as Bank of America, Merrill Lynch, Morgan Stanley, and UBS.

Unstructured data, the deep, dark data that is prevalent across the enterprise but not always transparent or usable, continues to be a top business challenge. Data that lacks a predefined data model is typically considered unstructured, including everything from text-heavy documents and websites to images, video files, chatbot transcripts, audio streams, and social media posts. Collectively, by most estimates, these types of data account for 80 to 90 percent or more of the overall digital data universe.

Growth and Challenges of Unstructured Data

The volume of unstructured data is set to grow from 33 zettabytes in 2018 to 175 zettabytes, or 175 billion terabytes, by 2025, according to the latest figures from research firm IDC. Thankfully, there is an increased awareness of this explosion of unstructured data in enterprises. For example, a recent study showed that nearly 80 percent of financial services organizations are experiencing an influx of unstructured data. Furthermore, most of the participants in the same study indicated that 50 to 90 percent of their current data is unstructured.

Until recently, it hasn’t been possible for computers to understand this data. Now, enterprises are increasingly relying on unstructured data for analytic, regulatory, and corporate decision-making purposes. As unstructured data becomes more valuable to the enterprise, technology and data teams are racing to upgrade their infrastructure to keep pace with growing cloud-based services and the sheer explosion of data, both internal and external.

At the same time, these teams are having active conversations around leveraging insights buried in unstructured data sources. The spectrum of use cases ranges from improving operational efficiency to proactively servicing the end customer. To that effect, CIOs and CDOs are actively evaluating or implementing solutions ranging from basic OCR Plus offerings to complex large language models coupled with machine or deep learning techniques.

Incorporating NLP and Language Models into Your Data Strategy

A considerable portion of the enterprise’s unstructured data is textual. This ranges from legal contracts and research documents to customer complaints submitted through chatbots, and everything in between. So naturally, organizations are adopting Natural Language Processing (NLP) as part of their AI and digitization strategy.

Over the past decade, there has been considerable research and advancement in NLP. Most notably, the emergence of transformer models is allowing enterprises to move beyond simple keyword-based text analytics to more advanced sentiment and semantic analysis. While NLP enables machines to quantify and understand text, resolving ambiguity remains a significant challenge. One way to tackle ambiguity resolution is to incorporate domain knowledge and context into the respective language model(s). Leveraging fine-tuned models such as LegalBERT, SciBERT, FinBERT, etc., provides a more streamlined starting point for specific use cases.
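To make the idea of incorporating domain context concrete, here is a deliberately minimal, stdlib-only sketch of Lesk-style word-sense disambiguation: an ambiguous term is resolved by measuring overlap between the surrounding text and hand-built domain glossaries. The sense names and glossary terms below are illustrative assumptions, not part of any named product; production systems would instead rely on fine-tuned transformer models such as FinBERT.

```python
# Toy word-sense disambiguation via glossary overlap (a simplified
# Lesk-style heuristic). Sense labels and glossary terms are invented
# for illustration only.

SENSES = {
    "bank/finance": {"deposit", "loan", "interest", "account", "credit"},
    "bank/river": {"water", "shore", "erosion", "flood", "stream"},
}

def disambiguate(word_senses, context_tokens):
    """Pick the sense whose glossary overlaps most with the context."""
    context = set(context_tokens)
    return max(word_senses, key=lambda sense: len(word_senses[sense] & context))

sentence = "the bank raised its interest rate on every deposit account".split()
print(disambiguate(sentence and SENSES, sentence))  # → bank/finance
```

In a financial-services context, the overlapping tokens ("interest", "deposit", "account") push the financial sense to the top; a transformer fine-tuned on domain text performs an analogous, far richer form of contextual resolution.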

At the outset, fine-tuned models establish a strong base. However, like the larger models such as BERT and GPT-3, they still fall short of meeting most companies’ business outcome needs. As a result, enterprises operating in multiple markets, regions, and languages should consider incorporating cross-domain language models, multilingual models, and/or transfer learning techniques to accommodate broader challenges.

While research and development of larger and better language model architectures continues, there is no one-size-fits-all solution today. As a result, enterprises trying to build their own language models can also fall short of the organization’s objectives. Other factors impacting an organization’s unstructured data strategy include a lack of annotated data, unavailability of training data, limited organizational understanding of how to adopt such models, and the simple need to quickly develop and deploy a production-grade solution at an affordable computational cost with realizable ROI.

How Enterprises Can Tackle Their Growing Unstructured Data Problem

Data and technology strategies play a key role in a typical enterprise AI roadmap. Most organizations are able to plan and manage structured data effectively. However, unstructured data is where the real context and insights are buried, and organizations drown in this data. It behooves the CDO organization of an enterprise to take this data into account and intelligently plan to utilize this information.

The biggest challenge is often a lack of organizational alignment around the enterprise’s AI strategy. While this isn’t directly related to ML and DL models, leadership alignment, a sound understanding of the data and outcomes, and a diverse team composition are critical for any AI strategy in an enterprise. A quantifiable, outcome-driven approach allows teams to focus on the end goal rather than hype-driven AI models. For example, GPT-3 is a heavy language prediction model that is often not highly accurate. There have been instances where GPT-3-based models have propagated misinformation, leading to public embarrassment of an organization’s brand.

Training and building deep learning solutions is often computationally expensive, and applications that apply NLP-driven techniques require both computational resources and domain expertise. Hence, when starting an in-house AI team, organizations need to emphasize problem definition and measurable outcomes. In addition to problem definition, product teams must focus on data variability, complexity, and availability. These steps will help strategize an approach, identify suitable models as a foundational layer, and establish a sound data governance and training function.

An alternative and cost-effective approach is choosing a third-party partner or vendor to help jump-start your strategy. Vendor-based technology allows enterprises to take advantage of the vendor’s best practices, implementation expertise with larger language models, and the vast experience they bring to the table from other problem statements they have tackled.

Incorporating a strategy to manage the enterprise unstructured data problem and leveraging NLP techniques are becoming critical components of an organization’s data and technology strategy. While RPA, OCR Plus, or basic statistical ML models alone will not solve the complete problem, incorporating deep learning methods offers a path forward.
