5 Misconceptions of ML Observability

In this special guest feature, Aparna Dhinakaran, Chief Product Officer at Arize AI, explains five of the biggest misconceptions surrounding machine learning observability. Arize AI is a startup focused on ML observability. Aparna was previously an ML engineer at Uber, Apple, and TubeMogul (acquired by Adobe). During her time at Uber, she built a number of core ML infrastructure platforms, including Michelangelo. She holds a bachelor’s degree from Berkeley’s Electrical Engineering and Computer Science program, where she published research with Berkeley’s AI Research group. She is on a leave of absence from the Computer Vision Ph.D. program at Cornell University.

Over the last year, we’ve spent countless hours with ML engineers to understand and improve the results of their machine learning initiatives.

From upstarts that are building their entire businesses around the use of ML to some of the world’s largest financial institutions, ML techniques are increasingly powering crucial pieces of technology that people interact with daily.

Despite massive investments in ML, we’re still in a highly experimental phase and rates of success vary wildly from application to application. 

A common question we are hearing from customers is: Once models are out in the world making decisions, how do we make sure that these technologies are actually working? The truth is, delivering high-quality ML models continuously is hard, and making sure these models continue to perform well long into their life in production is even harder.

ML observability is the key to bridging this gap: the fundamental ability to peer into the performance of your ML model, get to the bottom of what’s going wrong, and resolve the underlying issue. As a result, ML observability empowers teams to continually deliver high-quality results from the research lab through production.

Unfortunately, the emergence of ML observability tools has not led to a foolproof playbook for what to do when your model performs erratically in the lab or in production. Due to the complex nature of many ML application scenarios, each with its own set of complications, there just isn’t a one-size-fits-all solution for every team.

This is not surprising: adopting new technologies often means navigating uncharted territory full of wrong turns and missteps.

As tools emerge to facilitate the three stages of the machine learning workflow (data preparation, model building, and production), it’s typical for teams to develop misconceptions as they attempt to make sense of the crowded, confusing, and complex ML infrastructure space. It seems worthwhile to dispel a few of the most common ones in the ML observability space.

Here are five of the biggest misconceptions: 

1) Discovering a problem is half the battle

Although people often say that knowing is half the battle, when it comes to resolving an issue with your model, learning that something is up is merely the battle cry.

Much as roughly 90% of an iceberg lies below the waterline, the vast majority of the challenge in managing complex ML systems lies not in spotting the iceberg on the horizon (monitoring) but in what lurks below the surface (observability).

Teams that rely on a “red light/green light” monitoring paradigm will struggle to keep their critical models working well in production, because a model’s performance can drift off course in many subtle ways.

With ML observability, teams can expedite time-to-resolution by moving beyond knowing a problem exists to understanding why the issue emerged in the first place and how to resolve it.

2) The ML lifecycle is static

ML models are constantly being fed data, and that data is anything but static. On top of that, some models, commonly referred to as online models, are designed to continuously evolve in production.

Furthermore, the task a model is trying to perform may itself change over time. At the end of the day, we are all trying to build models that accurately reflect some phenomenon in the real world. As almost everyone knows, change is inevitable in the real world, and to think otherwise is to set yourself up for failure.

The dynamic nature of ML models and the environments they operate in requires ML teams to keep a close eye on their model’s performance to understand how their models are responding to changing data and a changing task. 

3) ML observability is just about production 

It’s true that many of the problems teams face when employing ML in their products surface in production; however, ML observability principles can also help stamp out some of these problems earlier in the model development process.

ML observability can also be applied in the training and validation steps of model building to better understand how your models make mistakes. For example, observability tools can help predict whether models will meet quality and performance expectations once in production.

On top of that, ML observability can provide early signals of success, or surface issues, by comparing your model’s performance to that of previous model versions. Observability tools can also surface clusters of examples on which your model performs poorly, providing paths for improvement before you deploy.

Observability tools allow ML teams to set a baseline reference against which to compare production performance (is the model working the way I thought it would in training/validation?). If not, observability into your model can help you quickly detect issues and get to the root of why the problem arose in the first place.
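As a rough sketch of such a baseline check (the accuracy metric, numbers, and tolerance here are illustrative assumptions, not a prescribed standard):

```python
# Sketch: flag a regression when production performance falls too far
# below the baseline established during training/validation.
# The 5-point default tolerance is an illustrative assumption.

def performance_regressed(baseline_accuracy, production_accuracy, tolerance=0.05):
    """True if production accuracy dropped more than `tolerance`
    below the validation baseline."""
    return (baseline_accuracy - production_accuracy) > tolerance

# A model that validated at 92% accuracy but serves at 83% would be flagged;
# one serving at 90% would not.
```

In practice the same comparison would be run per metric and per data slice, but the core idea is simply an explicit, monitored gap between validation and production performance.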

4) ML observability only matters when serving models in real time

While many applications enjoy real-time ground truth for their model’s predictions, performance data for some models is not immediately available due to the nature of the application. 

In consumer lending, for example, access to real-time model performance is impossible as there is a delay between when a loan is approved and when the customer either pays the loan off or defaults.

In the credit card industry, ML systems are trained to detect unusual credit card transactions, such as large purchases or geographic anomalies. In many cases, however, fraudulent activity is not flagged until a credit card is reported stolen, which can happen days, weeks, or even months after a transaction clears.

These scenarios, and a number of others, highlight the challenges facing model owners when there is a significant time horizon for receiving the results of their model’s predictions. 

ML observability, however, can overcome the constraints of delayed model performance data through the use of proxy metrics, which can identify where and why problematic slices of predictions arise.

For example, in predicting which consumers are most likely to default on their credit card debt, a potential proxy metric for success might be the percentage of consumers who make a late payment.

These and other proxy metrics serve as alternative signals that can be correlated with the ground truth that you’re trying to approximate and serve as a powerful tool in providing a more up-to-date indicator of how your model is performing.
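A proxy metric like the late-payment rate above can be computed with a few lines of code. The sketch below uses a hypothetical record layout, not a real schema:

```python
# Sketch: late-payment rate among approved accounts as a proxy metric
# for eventual default risk. Field names are hypothetical.

def late_payment_rate(accounts):
    """Fraction of approved accounts with at least one late payment."""
    approved = [a for a in accounts if a["approved"]]
    if not approved:
        return 0.0
    late = sum(1 for a in approved if a["late_payments"] > 0)
    return late / len(approved)

accounts = [
    {"approved": True, "late_payments": 0},
    {"approved": True, "late_payments": 2},
    {"approved": True, "late_payments": 1},
    {"approved": False, "late_payments": 0},
]
rate = late_payment_rate(accounts)  # 2 of the 3 approved accounts paid late
```

Tracked over time, a rising late-payment rate can warn of model degradation long before default labels arrive.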

5) You need production ground truth

For many production ML models, ground truth is surfaced for every prediction, providing real-time visibility into model performance. In digital advertising, for example, model owners can analyze the accuracy of the predictions of an A/B ad test, and the outcomes can be used to optimize campaigns based on user engagement.

In the absence of ground truth, however, teams can still gauge model performance over time using the following proxy methods:

  • Hire human annotators or labelers to provide feedback on the model’s performance. This approach can be expensive and time-consuming; however, the reward of a set of high-quality ground truth data is immense.
  • Leverage delayed or lagging performance metrics. While these are not as good at signaling a sudden model performance regression in a real-time application, they provide meaningful feedback to ensure that the model’s performance is moving in the right direction over time.
  • Measure the shift in the distribution of prediction outputs. Drift can serve as a proxy for performance and alert the team of aberrant model behavior even when no ground truth is present. Some metrics you can use to quantify your prediction drift are distribution distance metrics, such as Kullback-Leibler Divergence, Population Stability Index (PSI), Jensen-Shannon Divergence, and others.
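The distance metrics named above can be sketched directly for binned prediction distributions. This is a minimal pure-Python version using natural logs; the example bin counts and the 0.2 PSI threshold are common heuristics, not a standard:

```python
import math

def kl(p, q, eps=1e-6):
    """Kullback-Leibler divergence KL(p || q) for binned distributions
    (lists of bin proportions). eps guards against log(0)."""
    return sum(max(pi, eps) * math.log(max(pi, eps) / max(qi, eps))
               for pi, qi in zip(p, q))

def js(p, q):
    """Jensen-Shannon divergence: a symmetric, bounded variant of KL."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def psi(expected, actual, eps=1e-6):
    """Population Stability Index between expected (e.g. training) and
    actual (e.g. production) binned prediction distributions."""
    return sum((max(a, eps) - max(e, eps)) * math.log(max(a, eps) / max(e, eps))
               for e, a in zip(expected, actual))

train_dist = [0.25, 0.50, 0.25]  # prediction score bins at training time
prod_dist  = [0.10, 0.45, 0.45]  # the same bins in production
drift = psi(train_dist, prod_dist)
# Common heuristic: PSI above ~0.2 suggests drift worth investigating.
```

Any of these scores can be computed on each batch of production predictions and alerted on, with no ground truth labels required.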

As ML observability emerges as the missing foundational piece of ML infrastructure, its applications and benefits are continuously revealed. As practitioners, we invite you to explore how ML observability can be used to deliver and continuously improve models with confidence and gain a competitive ML advantage.

Sign up for the free insideBIGDATA newsletter.

