Data Science 101: Mining Big Data with Apache Spark

Mining Big Data can be an incredibly frustrating experience due to its inherent complexity and a lack of tools. Reynold Xin and Aaron Davidson are Committers and PMC Members for Apache Spark and use the framework to mine big data at Databricks. In this presentation and interactive demo, you’ll learn about data mining workflows, the architecture and benefits of Spark, as well as practical use cases for the framework.

Dubbed the leading successor to Hadoop MapReduce, Apache Spark is a cluster computing system that makes data analytics fast, both to run and to write. Programs written in Spark can run up to 100x faster than their MapReduce equivalents while being 10x shorter and easier to understand. Spark also provides efficient support for streaming, query execution, machine learning, and graph computation through rich high-level libraries. Last but not least, the project has one of the most active open source communities in Big Data: 190+ developers from 50+ organizations have contributed code to it.

This talk was given at the SF Data Mining Meetup group in San Francisco. The main speaker is Reynold Xin, a committer on Apache Spark and a co-founder of Databricks. Prior to Databricks, he was pursuing a PhD in the UC Berkeley AMPLab.

