Pulling Insights from Unstructured Data – Nine Key Steps

In this special guest feature, Salil Godika of Happiest Minds Technologies discusses the value of unstructured data and identifies nine key steps companies can follow to work with it effectively. Salil Godika is Co-Founder, Chief Strategy & Marketing Officer at Happiest Minds Technologies. Happiest Minds was ranked the second-fastest-growing technology company on the Deloitte Technology Fast 50 India 2014.

In the era of “Big Data”, companies are flooded with information from a variety of sources. Much of this information is structured, meaning it can be easily categorized, sorted, and filtered. However, significant insights can also be found in what’s known as “unstructured data”, such as the content of social media posts, data from mobile devices, and even customer phone calls.

Collecting and analyzing this data can be tedious and it does require IT resources, but the payoff can be immense. Businesses that want to stay agile and compete can use unstructured data to see hidden correlations between different departments or customer groups.

How should companies go about working with unstructured data? Here are nine core steps that cover each stage, from collection to the end result.

  1. Narrow down the data. Before getting into the “how” of unstructured data analysis, you have to square away the “what”. Review the different sources to be sure you cover a wide and relevant range of information. If a source of data is obviously not related to your goal, remove it, but don’t cut so much that you lose the chance of finding hidden context.
  1. Consider the intended result. Once you have a broad data set, consider what types of insights you hope to gain. This doesn’t need to be as narrow as “customers in Des Moines tend to purchase produce on Wednesdays”, but it should give you some focus.
  1. Pick the stack. The analysis results need to be moved to a cloud-connected information store or technology stack before they can be put to proper use. Consider the scale, variety, and volume of the data in question before picking a data storage solution. You need real-time access and powerful analytics engines, which can be found in solutions such as Lambda, Flume, and Storm.
  1. Throw it in a lake. Companies are often used to slicing up their data and storing it in data warehouses in a neat and sanitized form. The new approach is to store information in a “data lake”, a place to keep it in its native form where you preserve metadata and other information that might prove useful under deeper analysis.
  1. Do some cleaning. Using a data lake does not mean you can’t pull some of the data and refine it. If a file contains stray symbols or extra whitespace, clean it so it’s more manageable. You also want to remove duplicates and handle missing values, as long as the data in the lake itself remains untouched.
  1. Pull the useful stuff. Techniques such as semantic analysis and natural language processing are invaluable for extracting data from speech. Focus on words that identify a location, product, or name in order to develop relevant and sortable information. You can use term-frequency matrices to spot phrasing patterns and trends.
  1. Build the ontology. Correlations and hidden insights are often discovered in this stage, where you establish relationships between sources with the intent of building a more structured database.
  1. Modeling and execution. Techniques such as logistic regression, Naïve Bayes, and support vector machines can establish similarities in how customers are acting or aid in classifying various documents. Mixing in sentiment analysis gives a company data on how customers are feeling, which can be compared with purchasing trends.
  1. Take action based on the results. Naturally, the final step in the process should be a review of the end results. Massive amounts of unstructured data should be boiled down to simple insights which managers can view on their tablet or phone. This helps them to make decisions in real time, and ensures the core result is easily viewed and understood. As always, the best insights that come from unstructured data sources will be actionable in some way.
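
The cleaning and term-frequency steps above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a method prescribed by the article: the sample posts, the `clean`, `prepare`, and `term_frequency_matrix` helpers, and the choice to strip every non-alphanumeric symbol are all assumptions made for the example.

```python
import re
from collections import Counter

# Hypothetical raw records pulled from a data lake (illustrative, not real data).
RAW_POSTS = [
    "  Loved the NEW espresso machine!!!  ",
    "loved the new espresso machine",
    "Delivery to Des Moines was late :(",
    "",
    "Espresso machine arrived late, but support was great.",
]


def clean(text):
    """Lowercase, replace symbols with spaces, and collapse extra whitespace."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def prepare(records):
    """Return cleaned records, skipping empty values and duplicates.

    The input list is only read, so the raw records stay untouched.
    """
    seen = set()
    cleaned = []
    for record in records:
        text = clean(record)
        if not text or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned


def term_frequency_matrix(docs):
    """Build a document-by-term count matrix plus its sorted vocabulary."""
    counts = [Counter(doc.split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for term in vocab] for c in counts]
    return vocab, matrix


docs = prepare(RAW_POSTS)
vocab, matrix = term_frequency_matrix(docs)

# Column sums give the total count of each term across all documents.
totals = Counter(dict(zip(vocab, (sum(col) for col in zip(*matrix)))))
print(totals.most_common(4))
```

Each row of `matrix` is one cleaned document and each column is one term from `vocab`, so terms with high column totals are candidates for recurring phrasing. In practice a stop-word list would likely be needed so that words like “the” and “was” do not dominate the counts.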

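As a rough illustration of the modeling step, the sketch below classifies short text snippets with a hand-rolled Naïve Bayes classifier, one of the techniques named in the list. It is a toy example under stated assumptions: the labeled training snippets, the “positive” and “negative” labels, and the `train` and `predict` helpers are all invented for illustration, and a production system would more likely rely on an established machine learning library than on this standard-library version.

```python
import math
from collections import Counter, defaultdict

# Hypothetical labeled training snippets (illustrative, not real customer data).
TRAINING = [
    ("great service and fast delivery", "positive"),
    ("love the product great quality", "positive"),
    ("late delivery and poor support", "negative"),
    ("product broke poor quality", "negative"),
]


def train(examples):
    """Count words per label and documents per label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, label_counts, vocab


def predict(text, word_counts, label_counts, vocab):
    """Return the label with the highest log posterior, using add-one smoothing."""
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Start from the log prior for this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            count = word_counts[label][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
prediction = predict("poor support and late delivery", *model)
print(prediction)
```

Because the labels here describe customer sentiment, the same predictions could be aggregated per product or region and set against purchasing data, which is the kind of comparison the modeling step describes.
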
Social media posts and machine logs are so integrated into a company’s day-to-day operations that they must now be considered data sources. Data scientists working on structured data analysis will need to develop some new skills to properly blend structured and unstructured data together to give firms the intelligence needed to adapt and grow.

