The Data Disconnect: A Key Challenge for Machine Learning Deployment


With so many machine learning projects failing to launch – never achieving model deployment – the ML team must do everything in its power to anticipate impediments to operationalizing the model, be they technical challenges or a lack of decision-maker buy-in. One major pitfall on the technical side is the data disconnect: engineers cannot integrate a model into existing operational systems because no one has sufficiently planned how to feed the model the right inputs on the fly. In this article, I describe a leadership tactic to overcome this pitfall that has been successfully implemented by a leading data consultancy, Elder Research, Inc.

If the struggle to deploy predictive models is a battle, then the challenge of hooking up a model’s inputs is right at the front lines. Somehow, a deployed model must receive the right set of values each time it is invoked. At the moment a model is to score an individual case, it needs its inputs—the values that characterize that case. Having those inputs at the right place at the right time may be the very trickiest engineering challenge when architecting for deployment.

The problem stems from the data disconnect, an abominable divide between model development and deployment. When preparing the training data, the data scientist is typically focused only on incubating a model and ensuring that it performs well in “the lab.” To that end, they set up the input variables—positioned as columns in the training data—in whatever ad hoc manner is most convenient.

This leaves a formidable challenge for deployment. The system housing the model will need to recreate the variables exactly as the data scientist set them up during development, mimicking the form and format they held within the data scientist’s system or within the ML software, both of which are typically foreign to the engineers.

In that endeavor, every detail matters. For example, consider a model that takes an email address domain as input. Should it be a string of characters like “yahoo” or “gmail”? Or should it also include the “.com”? Must it be all lowercase? Should Boolean variables like “US citizen—yes or no” or “Has opted in for marketing email—yes or no” be represented as 1 and 0, “yes” and “no,” or “Y” and “N”? How do you represent a value that’s simply unknown, a.k.a. a missing value—is it the word “NULL,” an empty string, a negative one (–1), or something else? How do you calculate the SAT verbal-to-math ratio if the math score is unknown, considering that dividing by zero is impossible?
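
To make these details concrete, here is a minimal sketch in Python, using hypothetical field names, of the kind of encoding conventions that must be pinned down once and then reproduced exactly at deployment: lowercase the email domain, map booleans to 1/0, use a documented sentinel for missing values, and guard the ratio against an unknown or zero denominator. This is an illustration of the problem, not anyone’s actual production code.

```python
def prepare_inputs(raw: dict) -> dict:
    """Encode one case's raw fields exactly as the model saw them in training.
    Every convention here (lowercasing, 1/0 booleans, a -1 sentinel for
    missing values) is an arbitrary but documented choice that the deployed
    system must reproduce verbatim."""

    # Email domain: the part after "@", lowercased, without the ".com"/".org" suffix.
    email = raw.get("email") or ""
    domain = email.split("@")[-1].lower().split(".")[0] if "@" in email else "missing"

    # Booleans: this model was trained on 1/0, not "yes"/"no" or "Y"/"N".
    def as_flag(value) -> int:
        return 1 if str(value).strip().lower() in ("1", "y", "yes", "true") else 0

    # Ratio feature: an unknown or zero math score cannot be divided by,
    # so fall back to a sentinel rather than failing at scoring time.
    sat_verbal, sat_math = raw.get("sat_verbal"), raw.get("sat_math")
    verbal_to_math = sat_verbal / sat_math if sat_verbal and sat_math else -1.0

    return {
        "email_domain": domain,
        "us_citizen": as_flag(raw.get("us_citizen")),
        "email_opt_in": as_flag(raw.get("email_opt_in")),
        "sat_verbal_to_math": verbal_to_math,
    }
```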

When it comes to transferring a model from one system to another, it’s like we’re stuck in 1980 typing commands at a DOS prompt with no spell check. Get any detail wrong, and the model doesn’t work as it should.

To make matters worse, model inputs may originate from various siloed sources across the organization. Since the inputs were designed to comprehensively represent much of what’s known about an individual, the databases that hold them could reside across disparate systems. For example, demographics may come from a customer relationship management database, while variables such as “Already seen this ad before—yes or no” may only be available by scanning an operational log to check. Pulling these together on the fly at scale during deployment presents an engineering challenge that data scientists often fail to anticipate.
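
As an illustration only, the scoring-time assembly might look something like the sketch below. Here `crm_db` and `ad_log` stand in for whatever interfaces the organization actually exposes, and the method names are hypothetical; the assembled record would then flow through the same encoding step sketched above.

```python
def assemble_inputs(customer_id: str, ad_id: str, crm_db, ad_log) -> dict:
    """Pull one customer's model inputs together from siloed systems at the
    moment of scoring. `crm_db` and `ad_log` are placeholders for real
    interfaces (a CRM database, an operational log store)."""

    # Demographics live in the customer relationship management system.
    profile = crm_db.get_customer(customer_id) or {}

    # "Already seen this ad before" exists only as events in an operational log.
    seen_before = 1 if ad_log.count_impressions(customer_id, ad_id) > 0 else 0

    # Return the raw record; the caller passes it through the shared encoding
    # step (prepare_inputs above) so values reach the model in the exact
    # form it was trained on.
    return {
        "email": profile.get("email"),
        "us_citizen": profile.get("us_citizen"),
        "email_opt_in": profile.get("email_opt_in"),
        "sat_verbal": profile.get("sat_verbal"),
        "sat_math": profile.get("sat_math"),
        "seen_ad_before": seen_before,
    }
```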

It’s a tough job. According to a 2021 survey of data engineers, 97 percent feel “burned out” and 78 percent wish their job came with a therapist. Although that’s not a joke, the report, by DataKitchen and data.world, couldn’t resist asking, “Tell me about your motherboard.” 

One Firm’s Firm Approach to the Data Disconnect

Getting the data right and having it in the right place at the right time is 80–90 percent of the problem.

—Scott Zoldi, chief analytics officer, FICO

The antidote to the data disconnect? A new connection. Model development and deployment must be bound and inseparable. The two have traditionally been handled discretely, as isolated steps—conceptually linked yet decoupled in practice—but successful leaders seek to unify them so that preparing the data for modeling and engineering the inputs for deployment are one and the same.

But this means asking data scientists to change their habits and to accept some new responsibility. Many have grown accustomed to thinking up and implementing input variables at will during the model training step—without paying heed to how they’ll be made available during deployment. With a focus on developing and evaluating models offline, they view engineering as a distinct job, department, and mindset. Data scientists often see themselves in the business of prototyping, not production.

Nothing breaks techie habits like executive authority. Enter Gerhard Pilcher, the president and CEO of Elder Research, a widely experienced data consulting firm with which I’ve collaborated many times. Gerhard has instilled best practices across the firm’s client projects that have data scientists collaborating in detail with data engineers from the beginning of each modeling effort.

I asked Gerhard if he had implemented this change with a rule prohibiting data scientists from cobbling together their training data in a vacuum. He shied away from “rule,” but he put it this way: “We discourage ad hoc data aggregation. That change took a little while to take root.” His firm but friendly leadership ushered the team through a culture shift and into a new paradigm.

Under the guidance of this improved practice, data scientists request from the data engineers the model inputs they will want available for model deployment, rather than hacking them together on their own for the training data alone. It’s a bit less impulsive and a bit more team-spirited. With this process in place, the data infrastructure to support deployment—called the data pipeline—is already being constructed during the model training step. Come deployment time, the process to deliver inputs on the fly is repeatable and reliable, because the pertinent data sources have been pre-connected during model development. This way, “once you’ve tuned and validated the model,” Gerhard says, “you can deliver the result much more easily.”
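
One simple way to realize this idea, sketched here under the assumption of a Python/pandas stack rather than as a description of Elder Research’s actual tooling, is to keep the feature definitions in a single module, built jointly with the data engineers, that both the training job and the scoring service import, so the two cannot drift apart.

```python
import pandas as pd

def build_features(frame: pd.DataFrame) -> pd.DataFrame:
    """Shared feature definitions, maintained jointly with data engineering.
    The training job and the scoring service both call this same function,
    so the model's production inputs match its training inputs by construction."""
    out = pd.DataFrame(index=frame.index)
    out["email_domain"] = (
        frame["email"].fillna("").str.split("@").str[-1]
        .str.lower().str.split(".").str[0]
    )
    out["email_opt_in"] = (
        frame["email_opt_in"].astype(str).str.strip().str.lower()
        .isin(["1", "y", "yes", "true"])
    ).astype(int)
    return out

# Training job:    X = build_features(historical_frame); model.fit(X, y)
# Scoring service: x = build_features(pd.DataFrame([incoming_case]))
#                  score = model.predict_proba(x)
```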

By designing the data pipeline early, you not only proactively prepare for deployment—you also win by recognizing infeasibilities early, moving up project decision points and even failing fast when needed. Since some data sources can be costly to integrate, “the client will experience sticker shock,” warns Gerhard. “We can preempt that shock and ease the blow, or cancel if necessary. The sooner you kill an effort that’s not deployable, the better.”

This makes deploying ML projects a scalable endeavor. My early projects would have benefited—without such a process, I had to brute-force my way to deployment by painfully detailing the inputs’ calculations within a “Scoring Module Requirements” document and hoping the engineers would get all of it right.

Beyond the data disconnect, Elder Research has also learned other hard lessons about the change-management challenges of deployment, the struggle to gain acceptance from those on the ground. ML “often dictates a major change in how people act,” says founder John Elder. “Many people revert to the old way of doing things instead of trusting the model. We studied this and found several ways to improve the environment of trust—both technical and interpersonal. People (often rationally) fear change. They don’t want to abandon the way they make decisions. The most important way to address that is to work side-by-side with potential allies from the very beginning and earn their trust.”

These process improvements worked. By implementing them, Elder Research boosted its deployment track record. During the first decade after the company was founded in the mid-1990s, only 65 percent of the models they developed for clients were deployed, even though 90 percent met predictive performance requirements. This success rate was about three times higher than that of the industry as a whole, but the firm was determined to do better. By implementing these new practices, over the following ten-year period, the firm’s model-deployment rate soared from 65 to 92 percent, and its model performance success rate rose from 90 to 98 percent.

The proactive tactic of establishing a tight connection between model development and deployment is a perfect example of the proper strategic, end-to-end ML practice needed to achieve model deployment.

This article is excerpted from the book, The AI Playbook: Mastering the Rare Art of Machine Learning Deployment, with permission from the publisher, MIT Press. It is a product of the author’s work while he held a one-year position as the Bodily Bicentennial Professor in Analytics at the UVA Darden School of Business. 

About the Author

Eric Siegel, Ph.D., is a leading consultant and former Columbia University professor who helps companies deploy machine learning. He is the founder of the long-running Machine Learning Week conference series and its new sister, Generative AI Applications Summit, the instructor of the acclaimed online course “Machine Learning Leadership and Practice – End-to-End Mastery,” executive editor of The Machine Learning Times, and a frequent keynote speaker. He wrote the bestselling Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die, which has been used in courses at hundreds of universities, as well as The AI Playbook: Mastering the Rare Art of Machine Learning Deployment. Eric and his books have been featured in The New York Times, The Wall Street Journal, The Washington Post, Bloomberg, Harvard Business Review, and many more. 
