Raising the Bar for System Availability

In this special guest feature, Rochna Dhand, Director of Product Management at Nimble Storage, argues that a new standard has emerged that organizations must adopt to stay competitive: six-nines availability, or 99.9999 percent uptime. She examines the new standards, the challenges IT teams face from a reactive strategy to availability, and how a predictive approach can help reach a new level of availability. Rochna Dhand brings over 15 years of experience shaping early-stage groundbreaking technologies in the IT industry, at companies including Nimble Storage, VMware, Loudcloud (Opsware/HP) and SGI. She has served in engineering, product management and IT operations leadership roles. She holds a master’s in computer science from Clemson University and a bachelor’s in computer engineering from University of Mumbai.

In today’s fast-paced and competitive business environment, even a few seconds of downtime for business applications can have tangible consequences for organizations. Such outages not only slow strategic decision-making, upend productivity and cause economic and competitive losses, but can also lead to customer dissatisfaction.

The standard for system and application availability used to be what the IT industry called “five-nines”: applications up and running 99.999 percent of the time. This expectation no longer cuts it for modern IT infrastructures. A new standard has emerged that organizations must adopt to stay competitive: six-nines availability, or 99.9999 percent uptime.
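To put those percentages in concrete terms, the short sketch below converts an availability target into a yearly downtime budget. It is only an illustration: the function name is made up for this example, and it assumes a 365-day year with downtime measured across the whole year.

```python
# Assumption: a 365-day year; leap years would shift the numbers slightly.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_budget_seconds(availability_percent: float) -> float:
    """Return the maximum downtime per year, in seconds, for a given availability."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * SECONDS_PER_YEAR


for label, pct in [("five-nines", 99.999), ("six-nines", 99.9999)]:
    print(f"{label}: {downtime_budget_seconds(pct):.1f} seconds per year")
```

Under these assumptions, five-nines allows roughly 315 seconds of downtime a year (a little over five minutes), while six-nines allows only about 31.5 seconds.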

The uptime of critical business systems is particularly important for high-stakes industries like finance, healthcare and service providers, whose success relies upon quickly fulfilling customer needs.

As the bar for availability rises, IT organizations’ ability to meet it is diminishing. They are expected to do more with less. To make matters worse, application and IT infrastructure are becoming increasingly complex. Even when budget permits, finding skilled personnel is difficult.

The Challenge with Being Reactive

The traditional approach to availability is inadequate for meeting these high standards.

It is reactive and inefficient. Further, once there is a disruption to availability it can take significant time, effort and skills to restore the system – all very precious commodities. It can even lead to a nonproductive blame game between owners of the different tiers of the application stack.

Today, ensuring availability involves designing redundancy across every part of the application stack, typically the hardware and software infrastructure, database tier, middleware tier and the top tier. The theory is that if a piece of software or hardware fails, other parts will take over. Some systems, such as storage, employ RAID and similar approaches. Here the data on the faulty component is reconstructed using information stored on other similar components. Further, each layer of the stack is monitored using raw health and performance data, often using separate monitoring systems. When an issue is detected, an alert is sent with no context of its impact on the application and its availability.
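The reconstruction idea mentioned above can be illustrated with single-parity XOR, the scheme commonly associated with RAID 5. This is a minimal sketch rather than how any particular storage product works: the block contents, the four-drive layout and the function name are all assumptions made for the example.

```python
from functools import reduce
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """XOR equal-length byte blocks together, position by position."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    # zip(*blocks) yields one tuple of byte values per position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Hypothetical stripe: three data blocks on three drives, parity on a fourth.
data_blocks = [b"ABCD", b"EFGH", b"IJKL"]
parity = xor_blocks(data_blocks)

# Simulate losing the second drive, then rebuild it from the survivors plus parity.
survivors = [data_blocks[0], data_blocks[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data_blocks[1])
```

Because XOR is its own inverse, combining the surviving blocks with the parity block yields the missing block. Single parity tolerates the loss of only one block per stripe; other RAID levels use different schemes.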

The IT admin now begins the fire drill: the arduous and time-consuming task of troubleshooting. Typical steps are:

  • Determine if the alert was a false alarm or if there really is a problem.
  • Determine the severity of the issue, e.g. did it disrupt application availability or is it local to a given layer and absorbed by the redundancy design.
  • Determine the root cause of the problem. Often this requires collaboration among experts for each layer of the stack, who may belong to different teams.
  • Often the vendor gets involved and the process of system data collection begins – sending countless log files, outputs of diagnostic commands and even config files to their Tech Support.

All this needs to happen in a very compressed period of time.

A Predictive Approach

To achieve the new standards of availability, the status quo isn’t enough. Instead, the IT admin needs the ability to predict problems before they occur. The system should be able to fix itself, preventing application downtime. If the system cannot fix the problem itself, it should give the admin a precise recommendation so the fix can be implemented proactively. For those issues that are hard to predict, the system should make troubleshooting instant and painless. Further, false alarms have no place in a world that demands unprecedented data availability.

Machine learning and good data science make all of this possible. Modern data center products are instrumented with millions of sensors. Vast amounts of telemetry are collected from each layer of the stack, across a large number of deployments, in real time. The telemetry contains data about performance, health, configuration, events, resource utilization and various system states. Because it is gathered across a large install base, it captures diverse environments and real-world configurations. The data is then processed in powerful analytics engines, which develop a deep understanding of the entire stack. The system learns the complex patterns in each layer of the stack, and how these patterns interact across the layers, over time. Models are created, then refined on a continuous basis using the data from the install base as well as new information fed by the product vendor. A clear and high-confidence understanding of normal versus abnormal behavior is established.
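As a deliberately simple illustration of separating normal from abnormal behavior, the sketch below flags a telemetry reading that falls far outside a learned baseline. It is not the method used by any vendor's analytics engine; real systems would use far richer models. The latency samples, the function name and the three-standard-deviation threshold are all assumptions made for the example.

```python
from statistics import mean, stdev
from typing import Sequence


def is_abnormal(baseline: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """Flag a value lying more than `threshold` standard deviations from the baseline mean."""
    if len(baseline) < 2:
        raise ValueError("baseline needs at least two samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all counts as abnormal.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical I/O latency samples in milliseconds, standing in for "normal" history.
baseline_latency_ms = [1.9, 2.1, 2.0, 2.2, 1.8, 2.0, 2.1, 1.9]

print(is_abnormal(baseline_latency_ms, 2.3))  # modest deviation from the baseline
print(is_abnormal(baseline_latency_ms, 9.5))  # large deviation from the baseline
```

A fixed threshold like this one trades false alarms against missed problems. The approach described above aims to do better by learning patterns across layers and across many deployments rather than from a single metric.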

The result is the ability to predict issues that could potentially cause application downtime and to do so with very high confidence. Further, such a predictive engine determines how to prevent the issue. Systems are designed to take these preventative steps automatically, or defer to the IT admin. There is a new breed of data center products that have this capability built in, from instrumentation to collection of telemetry and analysis to prediction and prevention. There is even a class of products where the sensors were part of the fundamental product design and were created when the first line of code was written.

This predictive approach, combined with machine learning techniques, has made it possible to achieve availability levels once deemed impossible. IT admins receive fewer and more meaningful alerts. They can determine the root cause instantly, across the application stack, all by themselves. There is no need to understand the inner workings of each layer of the stack or to engage with several teams. When the admin does need vendor technical support, the experience is very different from the traditional way. Tech Support already has deep knowledge of the customer environment and can start suggesting fixes within an incredibly short period of time.

This frees up IT to focus on more meaningful activities such as planning and executing on innovative ways of solving business problems.

