An Overview of Spark SQL

An Insider’s Guide to Apache Spark is a useful new resource directed toward enterprise thought leaders who wish to gain strategic insights into this exciting new computing framework. As one of the most widely adopted open-source projects, Apache Spark’s in-memory clusters are driving new opportunities for application development as well as increased demand for IT infrastructure. This article is the fourth in a series that explores a high-level view of how and why many companies are deploying Apache Spark as a solution for their big data technology requirements. The complete An Insider’s Guide to Apache Spark is available for download from the insideBIGDATA White Paper Library.

Spark SQL

Spark SQL is just the latest addition to the technology stack that provides access to big data. From an analytics perspective, an enterprise has a significant amount of data and needs to turn its data stores into actionable insights. The average user’s experience doesn’t change regardless of the back-end data source. For all the inroads made by NoSQL, a lot of data still resides in relational databases. Spark SQL translates traditional SQL or HiveQL queries into Spark jobs, making Spark accessible to a broader user base. It supports multiple data and storage formats, including HDFS, Hive, HBase, Parquet, JSON and Cassandra. Consequently, Spark SQL is a vital component of Spark. It was released in May 2014 and is now one of the most actively developed components in Spark.
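As a minimal sketch of what this looks like in practice, the Scala snippet below loads a JSON file, registers it as a temporary table, and runs a plain SQL query against it using the Spark 1.x-era SQLContext API. The file path and the column names (name, age) are hypothetical, chosen only for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf().setAppName("SparkSQLOverview").setMaster("local[*]")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// Load semi-structured JSON; Spark SQL infers the schema automatically
val people = sqlContext.read.json("hdfs:///data/people.json")

// Register the DataFrame as a temporary table so plain SQL can reach it
people.registerTempTable("people")

// A traditional SQL query that Spark SQL compiles into a Spark job
val adults = sqlContext.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()

// The same result can be persisted in a columnar format such as Parquet
adults.write.parquet("hdfs:///data/adults.parquet")
```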

Shark, an earlier SQL-on-Spark engine based on Hive, was deprecated, and Databricks built a new query engine based on a new query optimizer, Catalyst, designed to run natively on Spark. It was a controversial decision at the time, both within the Apache Spark developer community and internally within Databricks, because building a brand-new query engine necessitates a significant investment in engineering. A year later, more than 115 open source contributors have joined the project, making it one of the most active open source query engines. Spark SQL now outperforms Shark on almost all benchmarked queries. On TPC-DS, a decision-support benchmark, Spark SQL outperforms Shark often by an order of magnitude, due to better optimizations and code generation.

There are many SQL-on-Hadoop alternatives out there, such as Apache Drill, Impala, Pivotal, Actian and others. Many companies that already have Hadoop and have committed to one of these SQL solutions are now considering Spark SQL. Why? Spark SQL supports a wide range of features tailored to large-scale data analysis, including semi-structured data, query federation, and data types for machine learning. To enable these features, Spark SQL is built on the extensible Catalyst optimizer, which makes it easy to add optimization rules, data sources and data types by embedding them in the Scala programming language. User feedback and benchmarks show that Spark SQL makes it significantly simpler and more efficient to write data pipelines that mix relational and procedural processing, as in the sketch below, while offering substantial speedups over previous SQL-on-Spark engines.
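The following sketch illustrates that mix, again with hypothetical paths and column names: a procedural Scala function is exposed as a UDF, combined with relational filtering and aggregation, and the result is handed back to the RDD API for arbitrary post-processing.

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext

// Procedural logic: an ordinary Scala function registered as a UDF
val normalize = udf((s: String) => s.trim.toLowerCase)

// Relational logic: filtering, projection and aggregation on a DataFrame
val events = sqlContext.read.parquet("hdfs:///data/events.parquet")
val counts = events
  .filter(events("status") === "ok")
  .withColumn("user", normalize(events("userName")))
  .groupBy("user")
  .count()

// Back to the procedural RDD API for arbitrary post-processing
counts.rdd
  .map(row => s"${row.getString(0)},${row.getLong(1)}")
  .saveAsTextFile("hdfs:///out/user_counts")
```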

[Figure: Spark SQL vs. Shark query performance on the TPC-DS benchmark]

In terms of technology, Spark SQL runs as a library on top of Spark. The main abstraction in Spark SQL’s API is the DataFrame. Users can perform relational operations on DataFrames using a domain-specific language, in a manner similar to R data frames. A seminal academic paper on Spark SQL, written by researchers from Databricks and AMPLab, including Matei Zaharia, co-founder of Databricks, describes the motivations for adding this module to Apache Spark: https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
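To make the DataFrame DSL concrete, here is a small hypothetical example (dataset and column names invented for illustration) in which relational operations are chained as method calls, much as one would manipulate an R data frame:

```scala
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{avg, max}

val sqlContext = new SQLContext(sc)  // sc: an existing SparkContext
val employees = sqlContext.read.json("hdfs:///data/employees.json")

// Relational operators expressed in the embedded DSL rather than SQL text
employees
  .select("dept", "name", "salary")
  .filter(employees("salary") > 50000)
  .groupBy("dept")
  .agg(avg("salary"), max("salary"))
  .show()
```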

Performance of Spark SQL is competitive with SQL-only systems on Hadoop for relational queries. As an example of an exceptionally large-scale deployment, Spark SQL has been deployed by a large Internet company to build data pipelines and run queries on an 8,000-node cluster with over 100 PB of data.

Spark SQL is the most used component of Spark. Two thirds of Databricks customers using Databricks Cloud, a hosted service running Spark, use Spark SQL.

The complete An Insider’s Guide to Apache Spark is available for download in PDF from the insideBIGDATA White Paper Library, courtesy of TIBCO. Click HERE to view a webinar recorded on November 17, 2015.
