
Help! My Data Scientists Can’t Write (Production) Code!

Businesses across the world are hiring data scientists to improve their efficiency and competitiveness via artificial intelligence (AI). Startup companies (dubbed AI-first companies) are disrupting traditional industries like banking, insurance, real estate and healthcare using AI technologies.

The demand for data scientists far exceeds supply. The problem is exacerbated by the fact that the data scientist profession is itself splitting into multiple sub-disciplines. Some examples of this divide include:

  • Decision scientists have domain expertise and specialize in linking the domain knowledge and the algorithm to solve a problem.
  • Data scientists have expertise with machine learning (ML) and related algorithmic fields at the application level, i.e., they know how to apply algorithms to data sets to generate successful experimental insights.
  • ML research scientists can create new algorithms to solve more custom problems or adapt/exploit recent research advances.

Whichever of these skill-sets is needed, businesses face a common problem when trying to monetize successful AI experiments: none of them necessarily includes strong competency in production software development. That is not the key value the data scientist brings to the problem. It is, however, critical to getting an ML application (which is itself a program, frequently in Python, R or Java) running for a business purpose. Moreover, programming interest and capability vary widely from role to role and individual to individual.

Example industry horror stories:

  • A company hired an expensive data scientist with vast ML knowledge who did not specialize in production-grade code. This led to constant battles between the data scientist and the software engineering team.
  • Promising experiments sit around as R code. No one can translate them into production-grade code (in whatever language) because the data scientist left, and no one else understands how to port the code to production while maintaining ML integrity.
  • A small AI-first team finally hired a data scientist whose mathematical knowledge was great and who also enjoyed writing code, testing and pushing to production. They lost that person to Facebook/Google/Amazon/etc.
  • A large company needs to fill many different data science roles, and the people in them all code differently. While some really enjoy coding, others prefer to focus on the domain problem and the algorithm. There is no standardization of production quality, and Ops teams will not accept the code.

To convert a promising ML experiment into a production-grade set of pipelines, the following things need to occur:

  • The inference pipelines, and frequently the training pipelines, need to be converted to production-grade code that can execute at scale with performance SLAs, handle errors gracefully, and be hardened for long-running execution without bugs, resource leaks, etc.
  • This code needs to be part of a software development lifecycle (SDLC) to enable versioning, quality control, rollbacks and continuous integration/continuous deployment (CI/CD) into production.
  • Pipelines need to be tested at scale to ensure they can meet performance and stability goals, and also to make sure the models tested in small-scale notebooks or development environments perform accurately at scale.
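
As one illustration of the first point, here is a minimal Python sketch of an inference call wrapped for production use. It is not taken from any particular product: the names `safe_predict`, `Prediction` and `toy_model`, and the 50 ms latency budget, are hypothetical and chosen only for this example. The idea is that bad input or a model failure produces a logged, structured result instead of crashing a long-running service.

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

logger = logging.getLogger("inference")


@dataclass
class Prediction:
    """Structured result so callers never have to catch model exceptions."""
    value: Optional[float]  # None when the request could not be scored
    ok: bool
    latency_ms: float
    error: Optional[str] = None


def safe_predict(
    model_fn: Callable[[Sequence[float]], float],
    features: Sequence[float],
    expected_len: int,
    latency_budget_ms: float = 50.0,  # hypothetical SLA, for illustration only
) -> Prediction:
    """Score one request, validating input and containing any failure."""
    start = time.perf_counter()
    value: Optional[float] = None
    error: Optional[str] = None
    try:
        if len(features) != expected_len:
            raise ValueError(
                f"expected {expected_len} features, got {len(features)}"
            )
        value = float(model_fn(features))
    except Exception as exc:  # keep the service alive; report the failure
        logger.warning("prediction failed: %s", exc)
        error = str(exc)
    latency_ms = (time.perf_counter() - start) * 1000.0
    if latency_ms > latency_budget_ms:
        logger.warning("latency %.1f ms exceeded budget", latency_ms)
    return Prediction(value=value, ok=error is None,
                      latency_ms=latency_ms, error=error)


def toy_model(features: Sequence[float]) -> float:
    """Stand-in for a trained model: returns the mean of the features."""
    return sum(features) / len(features)


if __name__ == "__main__":
    print(safe_predict(toy_model, [1.0, 2.0, 3.0], expected_len=3))
    print(safe_predict(toy_model, [1.0], expected_len=3))
```

A real pipeline would add more than this (batching, retries, metrics export), but even this small wrapper shows the kind of work that separates experimental notebook code from code an Ops team will accept.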

So how should a business reconcile the skill-sets of hard-to-hire data scientists with the rigorous requirements of production? A number of components can be implemented to help data science and business teams navigate this challenge together. These elements include:

  • Easy reuse of components so once a production-grade module is written and placed into a repository, others can use it with confidence. This reduces development time and increases production code resilience.
  • A streamlined pipeline builder where a data scientist can create simple to complex production pipelines without writing a single line of code.
  • A rich repository of built-in components for doing everything from feature engineering to model training, scoring, etc. for common and advanced ML and deep learning algorithms.
  • Production-grade automatic code generation.
  • Coding best practices and guidelines for ML pipelines, with examples that data scientists who like to write their own code can adopt or adapt.
  • Multiple ways to upload code, either explicitly or via Git.
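
To make the reuse and pipeline-builder ideas more concrete, the following Python sketch shows one possible shape for them. It is an assumption-laden toy, not a description of any vendor's implementation: `register`, `build_pipeline` and the two components are hypothetical names invented here. Vetted components are registered once under a name, and a pipeline is then assembled purely from a list of those names.

```python
from typing import Callable, Dict, List, Optional, Sequence

# A component takes a column of values and returns a transformed column.
Step = Callable[[List[Optional[float]]], List[float]]

# Hypothetical in-memory repository of vetted, production-grade components.
_REGISTRY: Dict[str, Step] = {}


def register(name: str) -> Callable[[Step], Step]:
    """Decorator that publishes a component under a unique name."""
    def decorator(fn: Step) -> Step:
        if name in _REGISTRY:
            raise ValueError(f"component {name!r} is already registered")
        _REGISTRY[name] = fn
        return fn
    return decorator


@register("fill_missing")
def fill_missing(values: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with 0.0."""
    return [0.0 if v is None else v for v in values]


@register("min_max_scale")
def min_max_scale(values: List[float]) -> List[float]:
    """Scale values into [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def build_pipeline(step_names: Sequence[str]) -> Step:
    """Assemble a pipeline from component names, failing fast on typos."""
    missing = [n for n in step_names if n not in _REGISTRY]
    if missing:
        raise KeyError(f"unknown components: {missing}")
    steps = [_REGISTRY[n] for n in step_names]

    def run(values):
        for step in steps:
            values = step(values)
        return values

    return run


if __name__ == "__main__":
    pipeline = build_pipeline(["fill_missing", "min_max_scale"])
    print(pipeline([None, 5.0, 10.0]))
```

Because a pipeline here is just a list of names, the same list could come from a configuration file or a graphical builder, which is how a data scientist could compose production pipelines without writing new code while still running only reviewed components.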

There is no easy answer to the problem of data scientist coding variability. However, implementing the above elements allows different types of data scientists to work in their comfort zone, from no code at all to custom code with control over every aspect of the ML application. It also assures operations teams that the code running in production is production-grade regardless of the data scientist's preference or coding skill level. Finally, integrating SDLC practices with MLOps (production ML) practices ensures that all code, ML or not, is managed, tracked and executed safely.

About the Author

Nisha Talagala is co-founder and CTO/VP of Engineering at ParallelM, where she pioneered MLOps (Production Machine Learning). She has more than 15 years of expertise in software, distributed systems, machine learning, persistent memory, and flash. She earned her PhD at UC Berkeley for research on distributed systems and holds 63 patents in distributed systems, algorithms, networking, storage, and performance. She is a frequent speaker at both industry and academic conferences and serves on multiple technical conference program committees.

Sign up for the free insideBIGDATA newsletter.
