Machine Learning – The Engine Behind Big Data Processing


In this special guest feature, Sanjay Sarathy of Sumo Logic discusses the three waves in the evolution of big data, how machine learning plays a big role in this metamorphosis, and what's in store next. Sarathy is CMO at next-gen machine data intelligence company Sumo Logic.

These days, you can’t talk much about technology without talking about data. Though “Big Data” is still making its way into enterprise processes, there is already much to learn from its evolution to date. Thus far, I’ve observed three waves in the evolution of Big Data. Given the astounding rate at which technology progresses, I imagine there will be many more. But I’ll first reflect on the transformation to date, and then take the liberty of identifying what is next. Given the significant business benefits that data offers, looking back on how the technology has developed can help us get ahead of the next wave. As businesses work to leverage data to optimize operations and drive new and existing revenue streams, an understanding of machine learning will prove to be a key piece of this puzzle.

Wave One 

In the first wave of Big Data, the focus was on how to access and collect data from different sources. IT administrators were aware of the potential benefit of the petabytes of data generated second by second, but it was unclear how to collect that data or what to do with it once collected. Machine data is just one example of an enormous source of information that could offer hints about user behavior, underperforming infrastructure or even security threats. But storing this data was also a major barrier, especially as the volume of data generated rose astronomically. Log data alone is expected to grow 15 times by 2020, according to a report by the research firm IDC.

Wave Two

As technology advanced, the second wave of Big Data emerged. During this period, the primary focus became management of the data, including how to organize and analyze it. Companies were looking at terabytes (or even petabytes) of data and wondering how on earth they could comb through it to learn anything in a timely fashion. They needed ways to correlate, monitor and alert on this data so that IT and business groups alike could take advantage of hidden and interesting nuggets of information. This is what led to the growth of first-generation log management, SIEM and even infrastructure monitoring tools.

Wave Three

Now we find ourselves in wave three, where the challenge is not only to analyze the data, but to do so quickly and to deliver as much tailored information as possible without additional personnel. Machine learning has finally hit its stride by helping to solve the challenges associated with rapidly obtaining relevant insights. Especially in the world of unstructured machine data, machine learning is making CIOs rethink which insights they can gather from their own infrastructure, and when.

Today, organizations generate more data in 10 minutes than they did during the entire year of 2003. This presents a completely different set of challenges from those of waves one and two. There are two major problems with analyzing machine data today. Firstly, the move towards cloud computing means that this data gets generated across a variety of environments, and thus requires tools that are equally adept at collecting from on-premises and cloud environments. Secondly, both the volume and the variety of data have increased exponentially. Unstructured data used to be primarily server and network data. Now organizations generate data from those sources in addition to mobile devices, switches, sensors, custom applications and many others. As a result, the amount of noise in the system overwhelms the relevant information that provides important insights. Furthermore, humans don’t always know what to ask the data in order to extract maximum benefit. This is the problem of the “unknown unknowns.”

Enter: machine learning. Setting up search queries for specific data works if the administrator knows what he or she is looking for, and thus knows what to ask the data. But machine learning can address the “unknown unknowns” by continually benchmarking the data and altering the way the algorithms sift through it. With machine learning built into algorithms or management tools, users can refine and improve results over time and turn massive volumes of often-irrelevant data into a set of critical patterns and events. With personalization, the tools capture user feedback and use it to shape the outputs on an ongoing basis, so users can uncover the insights most important to them. And by nature, machine learning-backed tools become more robust over time, meaning the user can more easily cut through the noise to reach meaningful analytics. When those insights provide context around glitches or outages, the effect on the bottom line can be significant.
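To make the idea of benchmarking against the data itself a little more concrete, here is a minimal Python sketch. It is only an illustration, not how Sumo Logic or any particular product works, and it substitutes simple frequency counting for a real learned model. The masking rules, function names and the `ratio` threshold are all assumptions chosen for the example. Each log line is reduced to a pattern by masking its variable parts, and the patterns in a current window are compared against a baseline window. Anything new or unusually frequent is surfaced without anyone writing a query for it.

```python
import re
from collections import Counter

# Hypothetical masking rules: variable tokens become placeholders so that
# lines differing only in those values collapse into the same pattern.
_MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]


def signature(line):
    """Reduce a raw log line to its pattern by masking variable tokens."""
    for pattern, placeholder in _MASKS:
        line = pattern.sub(placeholder, line)
    return line.strip()


def pattern_counts(lines):
    """Count how often each pattern occurs in a collection of log lines."""
    return Counter(signature(line) for line in lines)


def unusual_patterns(baseline, current, ratio=3.0):
    """Return patterns that are new, or at least `ratio` times more frequent
    in `current` than in `baseline`, mapped to (baseline_count, current_count).
    """
    base = pattern_counts(baseline)
    now = pattern_counts(current)
    flagged = {}
    for sig, count in now.items():
        seen = base.get(sig, 0)
        if seen == 0 or count >= ratio * seen:
            flagged[sig] = (seen, count)
    return flagged


# Made-up sample data, purely for illustration.
baseline = [
    "GET /home from 10.0.0.1 took 12 ms",
    "GET /home from 10.0.0.2 took 15 ms",
    "disk check ok on volume 3",
]
current = [
    "GET /home from 10.0.0.9 took 11 ms",
    "GET /home from 10.0.0.4 took 14 ms",
    "connection refused by 10.0.0.7 port 5432",
]

for sig, (before, after) in unusual_patterns(baseline, current).items():
    print(f"{sig}: baseline={before}, current={after}")
```

In this made-up sample, the routine request lines collapse into a single pattern that appears equally often in both windows, so only the previously unseen "connection refused" pattern is flagged. A real system would go further, for instance by letting user feedback promote or demote patterns over time, as described above.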

The amount of data produced by businesses is only going to grow in the coming years, so the challenges observed in the third wave of Big Data will become more significant, requiring new techniques and methods to aggregate and analyze the data. Machine learning is integral to this continued evolution because it “learns” about the environment and the data automatically and adapts as they change. As a result, users no longer need to know what they are looking for in order to find it. This satisfies both operational and business requirements, even as the amount of data spikes further as we enter the next wave of Big Data’s evolution.





  1. The capability of the upper level ontology in SUMO will help solve many issues associated with scaling OWL.
    Great stuff!