The Hadoop ecosystem is rich with new technology offerings that open up compelling deployment options. Hadoop-as-a-Service (HaaS) is one direction that has recently taken significant steps forward.
One such step was taken recently by Qubole, a managed Hadoop-as-a-Service offering that is now available on Google Compute Engine (GCE). Previously, Qubole was available only on Amazon Web Services (AWS). Community reactions have been largely positive, and decision makers seem to regard big data as a potential killer app for GCE. Qubole’s founders, Ashish Thusoo and Joydeep Sen Sarma, built and ran Facebook’s data service, scaling it to over 25PB with hundreds of users and tens of thousands of queries each day. They also founded and authored Apache Hive.
With HaaS (also known as Hadoop in the cloud) come different deployment options:
- Rolling your own deployment, that is, installing Apache Hadoop or one of the distributions (Cloudera, Hortonworks, MapR) in an IaaS (Infrastructure-as-a-Service) offering, such as GCE or EC2. This allows for fine-grained control over what is running but also comes with deployment and management complexity.
- Pre-packaged services such as Amazon’s Elastic MapReduce (EMR) or Savvis’ Big Data offering, which reduce deployment complexity and offer mid-level control over installed services.
- Managed HaaS such as Qubole or Mortar, promising reduced deployment and management complexity.
The key differences between HaaS and on-premises deployments are elasticity, spot pricing, separation of compute and storage (for example, through eventually consistent object stores such as Amazon’s S3 or Google’s Cloud Storage), and enhanced security standards. Managed HaaS offerings such as Qubole are often used for development, evaluation and testing, short-running analysis jobs, and hybrid cloud setups. They do, however, also come with their own limitations:
- Getting data into the cloud and getting it out again has its own price tag.
- There may be privacy and data protection issues stemming from legal requirements that prevent or limit the use cases.
- The total cost of ownership of a 24/7 operation has to be calculated on a case-by-case basis.
- There is a general mismatch between Hadoop, Hive, etc., which assume the file-system semantics of HDFS, and eventually consistent object stores, which do not provide them.
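The cost-related points above can be made concrete with a rough calculation. The following is only a minimal sketch: the class, the function, and all prices are hypothetical placeholders rather than real provider rates, and a realistic total-cost-of-ownership model would include many more factors (storage, staff, licenses, ingress, and so on).

```python
from dataclasses import dataclass


@dataclass
class CloudCluster:
    """Hypothetical cost model for an on-demand Hadoop cluster."""
    nodes: int
    price_per_node_hour: float   # placeholder rate, not a real price
    egress_price_per_gb: float   # placeholder rate, not a real price

    def monthly_cost(self, hours_running: float, egress_gb: float) -> float:
        # Compute is billed only while the cluster is up (elasticity).
        compute = self.nodes * self.price_per_node_hour * hours_running
        # Getting data out of the cloud has its own price tag.
        transfer = egress_gb * self.egress_price_per_gb
        return compute + transfer


def break_even_hours(cluster: CloudCluster,
                     on_prem_monthly_cost: float,
                     egress_gb: float) -> float:
    """Hours per month at which the cloud cost equals the on-premises cost."""
    transfer = egress_gb * cluster.egress_price_per_gb
    hourly = cluster.nodes * cluster.price_per_node_hour
    return (on_prem_monthly_cost - transfer) / hourly


# Illustrative numbers only.
cluster = CloudCluster(nodes=10, price_per_node_hour=0.5, egress_price_per_gb=0.125)
short_job_cost = cluster.monthly_cost(hours_running=100, egress_gb=1000)
always_on_cost = cluster.monthly_cost(hours_running=720, egress_gb=1000)
threshold = break_even_hours(cluster, on_prem_monthly_cost=2625, egress_gb=1000)
```

With these made-up inputs, a cluster that runs only for short analysis jobs stays well below the assumed on-premises figure, while a 24/7 cluster exceeds it. This is why the comparison has to be redone for each workload.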
For additional analysis on HaaS, Christian Prokopp (Data Scientist at Rangespan) wrote up a detailed comparison of Qubole and EMR.