
When Bad Things Happen to Good Data Scientists

In this special guest feature, Shayak Sen, Co-founder and CTO, TruEra, conveys how he’s talked with dozens of data scientists who did everything right but still can’t ensure the quality of their models in production. It’s a vexing problem for our industry. When Shayak started building production grade machine learning models for algorithmic trading 10 years ago, he realized the need for putting the ‘science’ back in ‘data science.’ Since then, he has been building systems and leading research to make machine learning and big data systems more explainable, privacy compliant, and fair. Shayak’s research at Carnegie Mellon University introduced a number of pioneering breakthroughs to the field of explainable AI. Shayak obtained his PhD in Computer Science from Carnegie Mellon University and BTech in Computer Science from the Indian Institute of Technology, Delhi.

There has been a lot of recent press coverage about machine learning models not working as intended, leading to significant financial losses or negative public perception. The most recent case in point is Zillow shuttering its iBuying business and its stock dropping 30% partly because a pricing model wasn’t performing well enough. Zillow isn’t alone. This has happened to companies like Amazon and Apple, who have been publicly embarrassed by machine learning model snafus.

People might assume that those models were simply built the wrong way, that the data scientists were not very good at their jobs, or worse, that there was malicious intent. This is not necessarily the case. A model can appear to be effective in the lab, but then not perform as well when it is used in the real world. It can degrade in performance over time. It can be subject to sudden shocks, or to adversarial attacks.

I’ve talked with dozens of excellent data scientists about this issue. Even when you think that you are doing everything right, the black-box nature of machine learning models makes it extremely difficult to ensure the ongoing quality of your models in production. Here are two examples of common challenges where a model can get tripped up.

When the data or the meaning of the data changes

The pandemic has meant a sudden shift in not only data itself, but also the meaning of that data. With Zillow, initial signs indicate that the model didn’t track that the housing market was starting to cool, and thus didn’t adjust for that shift. When crude oil prices famously fell into negative pricing territory early in the pandemic, an energy company had to revisit its trading model, since the existing model had always dealt with pricing data that had never fallen below zero. Another company assessed creditworthiness partly by examining frequency of airline travel, with frequent flyers associated with greater disposable income. When lockdowns suddenly sent air travel into a dive, the meaning of the data changed, and travel, or lack thereof, was no longer a trustworthy data point on which to base a creditworthiness decision.
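
To make the idea of shifted data concrete, here is a minimal sketch of two checks a team might run on a single numeric feature: the share of live values that fall outside anything seen in training (the negative oil price case), and a population stability index (PSI) comparing the two distributions. The function names, the equal-width binning, and the sample numbers are illustrative assumptions, not a description of how any company mentioned here monitors its models.

```python
import math
from typing import Sequence


def out_of_range_fraction(train: Sequence[float], live: Sequence[float]) -> float:
    """Fraction of live values outside the [min, max] range seen in training."""
    lo, hi = min(train), max(train)
    outside = sum(1 for x in live if x < lo or x > hi)
    return outside / len(live)


def population_stability_index(
    train: Sequence[float], live: Sequence[float], bins: int = 10, eps: float = 1e-6
) -> float:
    """PSI between two samples, using equal-width bins over the training range."""
    lo, hi = min(train), max(train)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if training values are constant

    def proportions(values: Sequence[float]) -> list:
        counts = [0] * bins
        for x in values:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp out-of-range values into edge bins
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm below is always defined.
        return [max(c / len(values), eps) for c in counts]

    p, q = proportions(train), proportions(live)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


# Made-up prices: training data never went below zero, live data does.
train_prices = [20.0, 35.0, 50.0, 65.0, 80.0]
live_prices = [60.0, 25.0, -37.0, 10.0]
print(out_of_range_fraction(train_prices, live_prices))
print(population_stability_index(train_prices, live_prices))
```

Neither number says why the data moved or whether the meaning of a feature has changed, as in the air travel example. They only flag that the model is now being asked about inputs unlike those it was trained on, which is the cue for a person to investigate.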

When the training data set lets you down

Sometimes, it’s the training data that misleads you. For example, when commercial facial recognition systems were evaluated for accuracy, researchers discovered that they were less accurate at identifying women and darker-skinned individuals than lighter-skinned men. Further investigation found that this stemmed from a paucity of dark-skinned women in the training data. In this case, the training data itself was not wrong, but exhibited sample or representation bias. When the systems were put into production and tested against the much broader data seen in real-world use, they began to show disparities that would not have been found in the lab.
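
One way to surface this kind of gap before production is to report accuracy per group instead of a single overall number, alongside how many examples each group contributes. The sketch below assumes you have true labels, predictions, and a group label for each example; the function name and the toy data are hypothetical and are not drawn from the facial recognition studies.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence, Tuple


def accuracy_by_group(
    y_true: Sequence, y_pred: Sequence, groups: Sequence[Hashable]
) -> Dict[Hashable, Tuple[float, int]]:
    """Return {group: (accuracy, example_count)} for each group present."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {g: (correct[g] / total[g], total[g]) for g in total}


# Toy data: group "b" is under-represented and scores worse.
y_true = [1, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
groups = ["a", "a", "a", "a", "b", "b"]
for group, (acc, n) in accuracy_by_group(y_true, y_pred, groups).items():
    print(group, acc, n)
```

Reporting the count next to the accuracy matters: a group with very few examples is both a sign of representation bias and a reason to treat its accuracy estimate with caution.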

The dreaded question – Why?

One of the last things that a data scientist wants, the thing that keeps a data scientist up at night, is having the CEO, a critical customer, a major journalist, or a regulator ask, “Can you explain exactly why your model is doing this?” At the moment, this is an extremely challenging question to answer. It might take weeks or months to put together a good response, and the answer might ultimately not be very satisfying, lacking the specificity that would give confidence to all of the relevant stakeholders. It took Apple and Goldman Sachs well over a year to reassure New York State financial regulators that they were not discriminating in their consumer credit offering, and even then the final report had some criticism of their lack of transparency.

It’s time for a new approach

Data scientists are very aware that it’s time for a new approach. Traditionally, model quality has been equated with model accuracy. However, the definition of model quality needs to be expanded to include drift, fairness, and other forms of robustness. Thorough testing and monitoring, common in software development but uncommon in the AI model world, can improve model quality during lab development and minimize risk in production through rapid issue identification and resolution. Data scientists need a comprehensive approach that helps them test, debug, and analyze models to ensure their quality before they are approved to go live, and that eases the path to going live by proving to a growing number of stakeholders that the models are of satisfactory quality.

Once models are running in the real world, they also need to be actively monitored to catch the ways that they can go astray, find the root cause, and fix them, so that they quickly get back to delivering value. Monitoring helps you identify that the training data might not have been representative, that the data has shifted, that your model is coming under attack, or that you are simply experiencing drift that is degrading your model accuracy.
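
As a rough illustration of what active monitoring can mean at its simplest, the sketch below tracks accuracy over a sliding window of recent predictions and flags when it falls a set margin below the accuracy measured in the lab. The class name, window size, and threshold are assumptions chosen for the example. It also presumes ground-truth outcomes eventually arrive, which in many real deployments happens late or not at all.

```python
from collections import deque
from typing import Optional


class RollingAccuracyMonitor:
    """Flags when recent accuracy drops well below a lab baseline."""

    def __init__(self, baseline_accuracy: float, window: int = 100, max_drop: float = 0.05):
        self.baseline = baseline_accuracy
        self.max_drop = max_drop
        self.outcomes = deque(maxlen=window)  # keeps only the most recent results

    def record(self, prediction, actual) -> None:
        """Store whether one production prediction matched the observed outcome."""
        self.outcomes.append(prediction == actual)

    def accuracy(self) -> Optional[float]:
        """Accuracy over the current window, or None if nothing is recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self) -> bool:
        """True once the window is full and accuracy is below baseline - max_drop."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False  # too few observations to judge
        return self.accuracy() < self.baseline - self.max_drop


monitor = RollingAccuracyMonitor(baseline_accuracy=0.9, window=4)
for prediction, actual in [(1, 1), (0, 0), (1, 0), (0, 1)]:
    monitor.record(prediction, actual)
print(monitor.accuracy(), monitor.degraded())
```

An alert like this only tells you that something has gone wrong. Finding the root cause, whether unrepresentative training data, shifted inputs, or an attack, still requires the kind of drift and per-group analysis described earlier.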

All of the best data scientists know that there is uncertainty and risk in model development and deployment. Bad things will inevitably happen, no matter how good the data scientist. Let’s give them the tools that they need to build better quality models in the first place, prove their quality, and ensure that they continue to function effectively and fairly in the real world.
