AtScale, the company providing business users with speed, security and simplicity for BI on Hadoop, released the results of its reference performance study: The Business Intelligence Benchmark for SQL-on-Hadoop engines.
The benchmark is the world’s most comprehensive test of Business Intelligence workloads on Hadoop. The study reveals the strengths and weaknesses of the industry’s most popular analytical engines for Hadoop – Impala, Spark SQL, Hive and, new in this version, Presto.
“As enterprises adopt Hadoop more broadly, business intelligence (BI) and analytical use cases on Hadoop have expanded from strong, but limited, adoption among data scientists,” says John L. Myers, Managing Research Director at Enterprise Management Associates (EMA). “Now, organizations need to make the data within their Hadoop clusters available and ‘business critical’ to a wider business stakeholder audience. BI on Hadoop is a logical use case to help them accomplish that growth in adoption and acceptance.”
Some surprising findings that surfaced include:
- There is rapid innovation in the open source space, as reflected by Spark SQL improvements, even from 1.6 to 2.0: the study shows significant performance improvements between Spark 1.5 and Spark 1.6. Cloudera’s recent decision to donate Impala to the Apache Software Foundation will benefit the community, Cloudera, and any enterprise connecting business users to Hadoop.
- Different engines perform well for different types of queries: For large data sets, Hive, Impala, and Spark SQL were all able to effectively complete a range of queries on over 6 billion rows of data. There is no single “winning engine” for all query types.
- Impala scales with concurrency better than Hive and Spark: production enterprise BI user bases may number in the hundreds or thousands of users. As such, support for concurrent query workloads is critical. Our benchmarks showed that Impala performed best – that is, showed the least query degradation – as concurrent query workload increased.
Since the first edition of this study back in February 2016, AtScale researchers have noticed significant changes to the benchmark results:
“The increasing demand for BI-on-Hadoop workload has truly driven the community to innovate in a short period of time,” says Josh Klahr, VP of Product at AtScale. “The vendors sponsoring the Impala and Spark projects have been working diligently with the community to advance innovation in this field. We’ve aligned our vision with open-source engines since day one, and we are pleased to see that this bet is paying off: by essentially simply supporting the latest versions of Impala, Spark SQL and Hive, AtScale becomes up to 4 times faster on Big Data.”
By contrast, vendors competing with open source will see their competitive advantage dwindle as the community out-innovates them and vendors like AtScale continue to build on top of the open-source innovation.
BI on Hadoop: A Key Workload
As indicated in the latest Hadoop Maturity Survey, Business Intelligence is now a top workload for Hadoop, ahead of Data Science and ETL. The maturation of a number of technologies has enabled Business Intelligence to be deployed broadly, creating a unique opportunity for business users in the enterprise to finally be able to adopt Hadoop.
Until now, the industry has provided little guidance on the performance of Business Intelligence workloads on Hadoop. This has left technology evaluators with a void in measuring each engine against their own needs and workloads. The AtScale Benchmark Study is aimed at helping evaluators understand the differences across the leading SQL-on-Hadoop engines.
- SQL-on-Hadoop engines are well suited for Business Intelligence (BI): All tested engines – Hive, Impala, Presto, and Spark SQL – successfully executed all of the queries in our benchmark suite and are stable enough to support business intelligence workloads.
- There is no single “best engine”: We continue to see the different engines shine in different areas. Depending on raw data size, query complexity, and the target number of end users, enterprises will find that each engine has its own ‘sweet spot’.
- Version-to-version improvements are significant: The open source community continues to drive significant and rapid improvements across the board. All engines tested showed performance gains of between 2x and 4x in the six months between the first and second editions of the benchmarks. This is great news for those enterprises deploying BI workloads to Hadoop.
- Small vs. Big Data: Impala and Spark SQL continue to shine for small data queries (queries against the AtScale Adaptive Cache). New in this edition, the latest release of Hive LLAP (Live Long and Process) shows suitable “small data” query response times. Presto also shows promise on small, interactive queries.
- Few vs. Many Users: While Impala continues to shine in terms of concurrent query performance, Hive and Spark SQL showed improvements in this category. Presto, new to this edition of the benchmarks, showed the best results in our user concurrency testing.
AtScale’s experience with each engine at large enterprises like Comcast, American Express, Aetna, Macy’s, Home Depot, Groupon and many others helped guide the framework and methodology used for the industry’s most comprehensive BI-on-Big-Data Benchmark.