Apache Spark MLlib 2.0 Preview: Data Science and Production


From the recent Spark Summit 2016 in San Francisco, the video presentation below by Joseph K. Bradley of Databricks covers “Apache Spark MLlib 2.0 Preview: Data Science and Production.” This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science, for both casual and power users. Joseph discusses three key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.

  1. MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models—and entire Pipelines including feature transformations—can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.
  2. Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.
  3. For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.
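For readers less familiar with the terms in item 3, the block below restates the standard textbook definition of a GLM: a "family" is the assumed distribution of the response, and a "link function" g connects its mean to a linear predictor. This is general background rather than material from the talk, and the three pairings shown are common examples, not a list of what MLlib supports.

```latex
\begin{aligned}
&\text{Model:} && \mu = \mathbb{E}[y \mid x], \quad g(\mu) = x^{\top}\beta \\
&\text{Gaussian family, identity link:} && g(\mu) = \mu \\
&\text{Binomial family, logit link:} && g(\mu) = \log\frac{\mu}{1-\mu} \\
&\text{Poisson family, log link:} && g(\mu) = \log\mu
\end{aligned}
```

The first pairing is ordinary linear regression and the second is logistic regression, which is why supporting more families and links widens the range of problems a single GLM API can address.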

Finally, the presentation demonstrates these improvements live and shows how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.



For our readers’ convenience, here are the slides for the presentation:

