Machine Learning: How to Master IT Operations with Real-Time Big Data


In this special guest feature, Steve Burton of Moogsoft discusses the role of machine learning in IT event management. Steve is VP of Product Marketing at Moogsoft. Prior to joining Moogsoft, Steve was VP of Marketing at Glassdoor, and Director of Product Marketing at AppDynamics where he helped the company achieve market leadership in just three years.

Event management is all about IT teams listening holistically to alerts from their applications and infrastructure so they can detect service failures before users (or the business) complain. Back in the 90s – when applications and infrastructures were monolithic, static and relatively simple – the purpose of event managers was to collect and display events as they occurred in something called an “event console” (dashboard). Anomalies would often be detected using static threshold rules, or some kind of basic filtering defined by IT Operations.
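To make the idea of a static threshold rule concrete, here is a minimal Python sketch. The `Event` type, the metric names and the limits in `THRESHOLDS` are all hypothetical, chosen only for illustration; they do not describe any particular event manager.

```python
from dataclasses import dataclass


@dataclass
class Event:
    source: str   # host or component that emitted the event
    metric: str   # name of the measured quantity
    value: float  # measured value


# Hypothetical fixed limits, written and maintained by hand by IT Operations.
THRESHOLDS = {"cpu_percent": 90.0, "disk_percent": 85.0}


def breaches_threshold(event: Event) -> bool:
    """Return True when the event's value exceeds its fixed limit."""
    limit = THRESHOLDS.get(event.metric)
    # Metrics without a rule are silently ignored, a typical blind spot
    # of hand-written rules.
    return limit is not None and event.value > limit


events = [
    Event("web-01", "cpu_percent", 97.0),
    Event("web-02", "cpu_percent", 42.0),
    Event("db-01", "disk_percent", 91.0),
]
alerts = [e for e in events if breaches_threshold(e)]
```

Every new metric or change in normal behavior means someone has to edit `THRESHOLDS` by hand, which is manageable for a small, static environment and much harder as the number of event sources grows.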

In the past 20 years, however, applications and infrastructure have changed vastly through agile development, virtualization, cloud, big data, mobile and SDN. Simply put, the volume of events and alerts from applications and infrastructure has increased exponentially. Older event managers no longer work well in today’s enterprise environment. Event managers need to be “smarter” so they can provide actionable information to IT Operations rather than “here are ten thousand alerts.” Machine learning is the only answer to this big data problem for IT operations.

Machine learning is defined as software algorithms that give computers the ability to identify meaning in data (like events and alerts) in real-time, without being explicitly programmed to understand what is normal and abnormal. Doing this in real-time is critical, but it also has to provide early warning as an incident starts to unfold, well before customers complain. By ingesting data across all application and infrastructure domains, a machine learning-based event management system can not only detect events/alerts in real-time, but can also correlate and contextualize information so that it is meaningful to IT Operations. For example, during a service slowdown or outage, IT Operations might receive several hundred or even thousands of alerts via their event manager. Without machine learning, IT Operations teams have no way to correlate or contextualize those alerts to isolate where and why the problem is occurring.

Moreover, machine learning is the only approach that makes it possible to correlate, contextualize and cluster related alerts into what are known as “situations.” Managing one or two situations is better than trying to manage thousands of disparate alerts. Once situations are created, all the relevant stakeholders (Dev, Ops, DBA, Sys Admin, etc.) can then be invited into a virtual war room to collaborate and resolve the incident.
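As a rough illustration of grouping alerts into situations, the Python sketch below clusters alerts that arrive close together in time and share wording. It is a deliberately simple similarity heuristic standing in for the learned correlation described above, not the algorithm of any specific product; the `Alert` and `Situation` types, the 300-second window and the 0.3 similarity cutoff are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    source: str
    message: str


@dataclass
class Situation:
    alerts: list = field(default_factory=list)


def tokens(alert: Alert) -> set:
    """Lower-cased words of the alert message."""
    return set(alert.message.lower().split())


def jaccard(a: set, b: set) -> float:
    """Share of words the two sets have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_alerts(alerts, window_s=300.0, min_similarity=0.3):
    """Greedy single pass: attach each alert to the first situation that is
    recent enough and contains a similarly worded alert, else start a new one."""
    situations = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for situation in situations:
            recent = alert.timestamp - situation.alerts[-1].timestamp <= window_s
            similar = any(
                jaccard(tokens(alert), tokens(other)) >= min_similarity
                for other in situation.alerts
            )
            if recent and similar:
                situation.alerts.append(alert)
                break
        else:
            situations.append(Situation([alert]))
    return situations


alerts = [
    Alert(0, "db-01", "connection pool exhausted on orders database"),
    Alert(20, "app-03", "timeout waiting for orders database connection"),
    Alert(45, "app-04", "timeout waiting for orders database connection"),
    Alert(60, "net-07", "interface eth1 link flapping"),
]
situations = cluster_alerts(alerts)
```

With this sample data, the three database-related alerts land in one situation and the unrelated network alert forms its own, so an operator would look at two items instead of four. A production system would learn what "related" means from data such as topology and history rather than rely on fixed word overlap.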

Here are four ways machine learning is being applied to event management to help enterprise IT teams become more effective and responsive to evolving business needs:

  1. Reduced Mean Time-To-Detect (MTTD) – Machine learning can be applied in real-time to events from multiple sources to detect anomalies before they become systemic and are reported by end users. This can lead to a 75 percent reduction in MTTD.
  2. Reduced Mean Time-To-Resolve (MTTR) – Machine learning can be used to reduce noise from alerts and provide actionable data that is correlated and contextualized for IT Operations. This allows IT Operations teams to make faster, smarter decisions so they can pinpoint where in the application infrastructure problems are surfacing. When integrated with virtual war room capabilities, teams can resolve issues faster using social collaboration (chat, discussions, etc.).
  3. Increased IT Operations Productivity – Simply put, machine learning allows IT Operations to spend 90 percent less time performing tedious manual investigations across all alert sources when service slowdowns or outages occur. This allows IT Operations to focus on and fix more real incidents as they occur, without wasting time. IT Operations teams no longer need to create or maintain event manager rules to detect anomalies. This saves significant time and resources.
  4. Improved Customer/User Experience – Reduced MTTD and MTTR mean less service downtime, which translates to a better overall customer/user experience.
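The first and third points above rest on detecting anomalies without hand-maintained rules. One simple way to sketch that idea in Python is a rolling z-score, where "normal" is estimated from recent values instead of a fixed limit. This is an illustrative statistical baseline, not a full machine learning system; the class name, window size and z-score cutoff are assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingAnomalyDetector:
    """Flags values that deviate strongly from the recent history of a metric."""

    def __init__(self, window=30, z_threshold=3.0, min_samples=5):
        self.history = deque(maxlen=window)  # most recent observations only
        self.z_threshold = z_threshold
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Return True if value is anomalous relative to recent history."""
        is_anomaly = False
        if len(self.history) >= self.min_samples:
            mu = mean(self.history)
            sigma = pstdev(self.history)
            # With zero spread there is no scale to compare against,
            # so this sketch does not flag anything in that case.
            if sigma > 0:
                is_anomaly = abs(value - mu) / sigma > self.z_threshold
        self.history.append(value)
        return is_anomaly


detector = RollingAnomalyDetector()
latencies_ms = [100, 102, 98, 101, 99, 103, 100, 250, 101]
flags = [detector.observe(v) for v in latencies_ms]
anomalous_indexes = [i for i, flagged in enumerate(flags) if flagged]
```

Here only the 250 ms reading is flagged, and no one had to decide in advance what latency counts as too high. The baseline adapts as the metric drifts, which is the property that removes the need to create and maintain static rules.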






  1. Huber Diaz Pinzon says

    Hello, thanks for your interesting article about how to master IT operations with machine learning.
    Can you tell me where I can find information or books about this? How can I implement this too?

    Thanks a lot for your collaboration.