Five Steps to Tackling Big Data with Natural Language Processing


In this special guest feature, Paul Nelson, Chief Architect at Search Technologies, discusses his top five essential steps for tackling a Big Data project using Natural Language Processing (NLP), and how NLP tools and techniques help businesses process, analyze, and understand all of this data in order to operate effectively and proactively. Paul was an early pioneer in the field of text retrieval and has worked on search engines for over 25 years. As the Chief Architect at Search Technologies, Paul provides architectural oversight for clients’ projects and conducts design, technology research, and training. He was the architect and inventor of RetrievalWare, a ground-breaking natural-language-based statistical text search engine which he started in 1989 and grew to $50 million in annual sales worldwide. RetrievalWare is now owned by Microsoft Corporation. During his many years in the industry, Paul has been involved in hundreds of text search installations of all shapes and sizes. This includes enterprise search for dozens of Fortune 500 corporations as well as large government installations for the National Archives and Records Administration (NARA) and the Government Publishing Office (GPO).

Natural Language Processing (NLP) is fast becoming essential to many new business functions, from chatbots and digital assistants like Alexa, Siri, and Google Home, to compliance monitoring, BI, and analytics. Consider all the unstructured and semi-structured content that can bring significant insights – queries, email communications, social media, videos, customer reviews, support requests, etc. NLP tools and techniques help businesses process, analyze, and understand all of this data in order to operate effectively and proactively.

But how do you get started, and what steps should you follow?

I’ve had the opportunity to be a part of some exciting projects recently, helping enterprises do more with big data by using emerging technologies such as NLP. I know it can be overwhelming if you are just getting started with NLP, so here are five essential steps, drawn from my own best practices, that you can follow to ensure a successful project.

STEP 1:  Cover basic processing

Before you get started, it’s important to understand that in most cases the content holding your most important information is written in a natural language such as English or Spanish, and it is not conveniently tagged. To extract information from this content, you will therefore need to do some level of text mining, text extraction, or full natural language processing.

The input to natural language processing will be a simple stream of Unicode characters (typically UTF‑8), and basic processing will be required to convert this character stream into words, phrases, and syntactic markers which can then be used to better understand the content. Basic processing includes language identification, sentence detection, lemmatization, decompounding, structure extraction, tokenization, and entity and phrase extraction. There is a wide range of open source and commercial text analytics and NLP tools that can help you do these tasks.
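As a minimal sketch of two of these tasks (sentence detection and tokenization), the snippet below uses only the Python standard library. The function name `basic_process` and the two regular expressions are illustrative assumptions, not part of any particular NLP toolkit; real projects will usually lean on a dedicated library that also handles abbreviations, lemmatization, and language identification.

```python
import re
import unicodedata

# Assumption: a sentence ends at ., ! or ? followed by whitespace.
# This is deliberately naive and will mis-split abbreviations such as "Dr. Smith".
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# Assumption: a token is a run of word characters, optionally joined by - or '.
TOKEN = re.compile(r"\w+(?:[-']\w+)*")

def basic_process(raw_bytes):
    """Decode a UTF-8 byte stream and return a list of token lists, one per sentence."""
    text = unicodedata.normalize("NFC", raw_bytes.decode("utf-8"))
    sentences = [s for s in SENTENCE_END.split(text.strip()) if s]
    return [[t.lower() for t in TOKEN.findall(s)] for s in sentences]

sentences = basic_process("Paul builds search engines. NLP helps!".encode("utf-8"))
print(sentences)
```

Each inner list holds the lowercased tokens of one sentence, which is roughly the shape later stages such as entity extraction would consume.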

STEP 2:  Identify level of understanding and evaluate feasibility

Next, you should decide what level of content understanding is required: macro or micro. While micro understanding (extracting meaning from individual phrases or sentences) generally contributes to macro understanding (a general understanding of the document as a whole), the two can be entirely different. For example, a résumé may identify a person, overall, as a Biologist [Macro Understanding] but it can also identify them as being fluent in German [Micro Understanding].

While deciding on the level of understanding, you should also evaluate the project’s feasibility, as not all NLP understanding projects are possible within a reasonable cost and time. Ask questions like: What are the accuracy requirements? Can you afford the time and effort? Is the text short or long? And is a human involved?

If you decide it’s feasible to move forward then it’s time to extract the content.

STEP 3:  Extract content for macro and/or micro understanding

Once you have decided to embark on your NLP project, the next question is which kind of understanding to extract. If you require a more holistic understanding of the document, this is when “macro understanding” comes into play. It is useful for doing things like:

  • Classifying / categorizing / organizing records
  • Clustering records
  • Extracting topics
  • Keyword / key phrase extraction
  • Duplicate and near-duplicate detection
  • Semantic search
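To make one item from this list concrete, here is a small sketch of near-duplicate detection using Jaccard similarity over word shingles. This is one common technique among several, chosen here for illustration; the function names and the shingle size of three words are assumptions.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts become a single shingle so they can still be compared.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b, k=3):
    """Jaccard similarity of two documents' shingle sets (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    return len(sa & sb) / len(sa | sb)

score = jaccard("the quick brown fox jumps", "the quick brown fox leaps")
print(score)  # 0.5: two of the four distinct shingles are shared
```

Documents whose score exceeds a threshold you choose would be flagged as near-duplicates. Note that this works at the document level and ignores what any individual phrase means, which is typical of macro understanding.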

If you need to understand individual words and phrases, then you’ll turn to micro understanding to extract individual entities, facts, or relationships from the text. This is useful for doing things like:

  • Extracting acronyms and their definitions
  • Extracting key entities like people, companies, products, locations, dates, etc.

Remember that micro understanding must be done with syntactic analysis of the text – this means that order and word usage are important.
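Acronym extraction is a simple case that shows why word order matters. The sketch below assumes the common "Long Form (ACRONYM)" pattern and walks backwards from the parenthesis, matching the initials of capitalized words while letting lowercase connecting words such as "and" pass through. The function name and the look-back window of twice the acronym length are illustrative assumptions, and the approach will miss acronyms that do not follow this pattern.

```python
import re

# Assumption: an acronym is two or more capital letters in parentheses.
ACRONYM = re.compile(r"\(([A-Z]{2,})\)")

def extract_acronyms(text):
    """Map each parenthesized acronym to the long form that precedes it."""
    found = {}
    for match in ACRONYM.finditer(text):
        acronym = match.group(1)
        # Look back a limited number of words before the opening parenthesis.
        window = text[:match.start()].split()[-2 * len(acronym):]
        remaining = list(acronym)
        span = []
        for word in reversed(window):
            if word[0].isupper():
                if word[0] != remaining[-1]:
                    break  # capitalized word that does not fit the acronym
                remaining.pop()
            span.insert(0, word)
            if not remaining:
                found[acronym] = " ".join(span)
                break
    return found

sample = ("installations for the National Archives and Records Administration (NARA) "
          "and the Government Publishing Office (GPO)")
print(extract_acronyms(sample))
```

Shuffling the words of the sample sentence would break the match entirely, which is the practical meaning of "order and word usage are important."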

STEP 4:  Maintain traceability

Acquiring content from multiple sources and then extracting information from that content will likely involve many steps and a large number of computational stages. This is why it’s vital to provide traceability for all outputs. You can then trace back through the system to identify exactly how a piece of information came to be, which supports quality analysis and validation.

You will want to note things like:

  • The original web pages which provided the content
  • The start and end character positions of all blocks of extracted text
  • The start and end character positions for all entities, plus the entity IDs
  • Cleansing or normalization functions applied to the content
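The items above can be sketched as a single record attached to every extracted entity. The `TracedEntity` class, its field names, the `ORG-1` entity ID, and the example.com URL below are all hypothetical; the point is only that each output carries its source, character offsets, entity ID, and the processing applied.

```python
import re
from dataclasses import dataclass, field

@dataclass
class TracedEntity:
    source_url: str   # the original page that provided the content
    entity_id: str    # ID from a (hypothetical) list of known entities
    start: int        # inclusive character offset in the original text
    end: int          # exclusive character offset in the original text
    surface: str      # the exact text found at [start:end]
    processing: list = field(default_factory=list)  # functions applied

def find_entities(source_url, text, known_entities):
    """Find known entity names in text, keeping offsets into the original text."""
    traced = []
    for name, entity_id in known_entities.items():
        # Match case-insensitively against the untouched text so offsets stay valid.
        for m in re.finditer(re.escape(name), text, re.IGNORECASE):
            traced.append(TracedEntity(source_url, entity_id, m.start(), m.end(),
                                       m.group(0), ["case-insensitive match"]))
    return sorted(traced, key=lambda e: e.start)

text = "Paul Nelson works at Search Technologies."
hits = find_entities("https://example.com/bio", text, {"search technologies": "ORG-1"})
print(hits[0].start, hits[0].end, hits[0].surface)
```

Because the offsets refer to the original text rather than a normalized copy, a reviewer can always jump back to the exact characters that produced an entity.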

STEP 5:  Incorporate human feedback

Content understanding can never be complete without some human intervention. You need humans to discover new patterns and to create, cleanse, or choose lists of known entities, to name a few tasks.

For example, you may want to leverage crowd-sourcing to scale out human-aided processes and also find ways to incorporate human review as part of your standard business process (e.g., form fills).

Many of these processes can be mind-numbingly repetitive. In a large-scale system, you will need to consider the human element and build that into your NLP system architecture.
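One way to build the human element into the architecture is to route only low-confidence results to reviewers and feed their decisions back into the list of known entities. The sketch below assumes each prediction carries a confidence score; the dictionary keys, the 0.8 threshold, and the function names are illustrative assumptions rather than a prescribed design.

```python
def route(predictions, threshold=0.8):
    """Split predictions into auto-accepted items and items needing human review."""
    auto, review = [], []
    for p in predictions:
        (auto if p["confidence"] >= threshold else review).append(p)
    return auto, review

def apply_feedback(known_entities, reviewed):
    """Add every human-approved item to the set of known entities."""
    for item in reviewed:
        if item["approved"]:
            known_entities.add(item["text"])
    return known_entities

predictions = [
    {"text": "Search Technologies", "confidence": 0.95},
    {"text": "RetrievalWare", "confidence": 0.55},
]
auto, review = route(predictions)

# A reviewer (simulated here) approves the low-confidence item.
reviewed = [dict(item, approved=True) for item in review]
known = apply_feedback({p["text"] for p in auto}, reviewed)
print(sorted(known))
```

Keeping the review queue small in this way limits the repetitive work reviewers face while still capturing their corrections.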

Just be aware that continuously doing quality analysis of the data during each step of the process is key to getting the best understanding of natural language content. The whole process may seem daunting, but using these steps and techniques as a guide can help you create a working, robust system for acquiring, harvesting, and turning unstructured big data into practical, insightful knowledge that advances your use case.

