Spark 101: Spark Streaming and GraphX at Netflix


Netflix recently hosted the Bay Area Spark Meetup, which featured talks by Netflix engineers on their use of Spark Streaming and GraphX, followed by a Q&A session with the Netflix speakers and the lead engineer of Spark Streaming. The presentation is provided here, with the abstracts of the two talks below:

Talk #1: Spark Streaming Resiliency
By Prasanna Padmanabhan and Bharat Venkat, Senior Software Engineers

Netflix is a data-driven organization that places emphasis on data quality, availability, and the agility to capture and process that data. Some of our recommendation algorithms are computed as events happen in real time. Such streaming applications are long-running tasks that need to be resilient. This is especially true in a cloud deployment due to the ephemeral nature of resources. In this talk, we will cover the What, the Why and the How of our resiliency exercise with Spark Streaming in an AWS cloud deployment. A Netflix ChaosMonkey-based approach, which randomly terminated instances or processes, was employed to simulate failures. We hope that this exercise will help build confidence in the resiliency of Spark Streaming in similar contexts.
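To make the idea of a chaos-style resiliency test concrete, here is a minimal sketch in plain Python. It is not the setup used in the talk and does not use Spark or the real ChaosMonkey tool. The function name `run_with_chaos`, the `kill_probability` parameter, and the doubling "processing" step are all hypothetical, chosen only for illustration. The assumption is that the worker commits its output and a checkpoint together, so a restarted worker can resume where the previous one stopped.

```python
import random


def run_with_chaos(events, kill_probability, seed=0):
    """Process `events` in order while a chaos step randomly kills the worker.

    The checkpoint is the index of the next unprocessed event, so a restarted
    worker resumes from it instead of starting over.
    Returns (results, number_of_restarts).
    """
    rng = random.Random(seed)  # seeded so the failure pattern is repeatable
    checkpoint = 0             # durable state: index of the next event
    results = []               # durable output sink
    restarts = 0

    while checkpoint < len(events):
        # Chaos step: with some probability the worker dies before it
        # commits anything for the current event.
        if rng.random() < kill_probability:
            restarts += 1
            continue  # a supervisor restarts the worker from the checkpoint

        # Process the event, then commit output and checkpoint together.
        results.append(events[checkpoint] * 2)
        checkpoint += 1

    return results, restarts


events = list(range(100))
results, restarts = run_with_chaos(events, kill_probability=0.3)
# The resiliency property under test: output is complete and in order,
# no matter how many times the worker was killed.
assert results == [e * 2 for e in events]
```

The point of such an exercise is the final assertion: failures are injected on purpose, and the test passes only if the output is unaffected by them.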

Talk #2: Spark and GraphX in the Netflix Recommender System
By Yves Raimond and Ehtsham Elahi, Senior Research Engineers

We at Netflix strive to deliver maximum enjoyment and entertainment to our millions of members across the world. We do so by having great content and by constantly innovating on our product. A key strategy to optimize both is to follow a data-driven method. Data allows us to find optimal approaches to applications such as content buying or our renowned personalization algorithms. But, in order to learn from this data, we need to be smart about the algorithms we use, how we apply them, and how we can scale them to our volume of data (over 60 million members and 10 billion hours streamed last quarter). In this talk we describe how Spark and GraphX can be leveraged to address some of our scale challenges. In particular, we share insights and lessons learned on how to run large probabilistic clustering and graph diffusion algorithms on top of GraphX, making it possible to apply them at Netflix scale.
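The abstract does not say which graph diffusion algorithm the talk covers. As one common example of the family, the sketch below implements a random walk with restart (personalized PageRank style) in plain Python on a tiny hypothetical graph. The function name `diffuse`, its parameters, and the node labels are assumptions for illustration only. This is not GraphX code and not Netflix's implementation.

```python
def diffuse(edges, seeds, restart=0.15, iterations=50):
    """Spread a unit of score from `seeds` over a directed graph.

    At each step a node passes (1 - restart) of its score to its
    out-neighbours in equal shares, and `restart` of the total mass is
    returned to the seed nodes. Total score stays equal to 1.
    """
    seeds = set(seeds)
    nodes = sorted({n for edge in edges for n in edge} | seeds)
    out = {n: [] for n in nodes}
    for src, dst in edges:
        out[src].append(dst)

    seed_mass = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    score = dict(seed_mass)

    for _ in range(iterations):
        nxt = {n: restart * seed_mass[n] for n in nodes}
        for n in nodes:
            flow = (1.0 - restart) * score[n]
            if out[n]:
                share = flow / len(out[n])
                for target in out[n]:
                    nxt[target] += share
            else:
                # Dangling node: hand its score back to the seeds.
                for s in seeds:
                    nxt[s] += flow / len(seeds)
        score = nxt
    return score


# Hypothetical three-node cycle, with all score starting at node "a".
scores = diffuse([("a", "b"), ("b", "c"), ("c", "a")], seeds=["a"])
```

Nodes closer to the seed end up with more score, which is what makes this kind of diffusion usable as a relatedness signal. Running it at the scale the abstract mentions is the part that calls for a distributed engine such as GraphX.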


