Hadoop 101: Rethinking SQL for Big Data with Apache Drill

The Development Angle

The RDBMS has been the go-to storage platform for nearly three decades. Relational databases have been well supported by IT organizations, which made them an easy choice for getting software into production. The typical pattern was for the software developer to connect to the database from the language of choice via a standard interface like ODBC or JDBC, then create some type of mapping between the data structures in the code and the tables in the database. Lists of data were persisted by creating tables in the relational database with one-to-one or one-to-many relationships between them. As time went on, some languages evolved frameworks (like Hibernate) to help simplify the task of mapping data structures onto relational table schemas. The added benefit of these frameworks was an abstraction from the underlying database vendor, reducing the effort required if a business ever wanted to switch to another vendor. In theory this is nice, but that capability was seldom used. Even with these frameworks, the testing phase of the software was not simplified: how the data was persisted and loaded still had to be tested extensively to ensure that all the relationships were handled properly. The other problem was that these frameworks only offered portability from one RDBMS to another RDBMS; they didn't offer any other type of abstraction.

A New Persistence Model

When dealing with most data structures, it would make a lot more sense to persist the entire data structure as is. That would make the data truly portable, as well as easier to develop against and test. An added benefit would be a considerably lower barrier for anyone in the organization to analyze the data without having to understand a complex relational database schema: no need to figure out which tables join to which other tables, or on which fields. The more complex the data structure, the more complex the queries.

A typical data structure looks something like this: Structure A contains fields of data, lists of data, and lists of other structures. If this model were persisted as JSON, it would take only two lines of code to persist / serialize the data, and one line of code to deserialize it back into the data structures, as shown in this example Java code:

import com.google.gson.Gson;

// 1 – Serialize a Java POJO using the Google GSON library
Gson gson = new Gson();
String json = gson.toJson(yourObject);

// 2 – Deserialize a Java POJO using the Google GSON library
YourObject yourObject2 = gson.fromJson(json, YourObject.class);
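For concreteness, here is a minimal sketch of what such a nested POJO might look like; the Customer and Order class names and their fields are illustrative assumptions, not taken from the article.

import java.util.List;

// Illustrative only: plain fields, a list of values, and a list of
// other structures, matching the "Structure A" description above.
public class Customer {
    public String name;
    public String email;
    public List<String> phoneNumbers;  // list of data
    public List<Order> orders;         // list of other structures

    public static class Order {
        public String orderId;
        public double total;
    }
}

Gson walks this object graph automatically, so the nested lists come out as JSON arrays without any mapping code.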

Why Now?

So why would we consider this model now, as opposed to 10 years ago? Many things have changed to support it. First is the NoSQL revolution. Databases like HBase, BigQuery, MapR-DB, MongoDB and even Cassandra make it possible to store complex data structures with ease. Those databases, while very functional for many use cases, don't solve all the problems around how the data in these structures gets used. There are two use cases to consider: transactional and analytical processing. A simple transactional use case would be to create or retrieve a record, make changes, and put the changed data structure back into the data store. A simple analytical use case would be querying that data structure to gain business insights.
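To make the transactional use case concrete, here is a minimal sketch of a retrieve-change-put cycle written against the standard HBase client API, reusing Gson and the Customer class sketched above; the table name, column family, and row key are assumptions for illustration, and any of the document stores mentioned above could play the same role.

import java.nio.charset.StandardCharsets;

import com.google.gson.Gson;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class UpdateCustomer {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        Gson gson = new Gson();
        // "customers" table, "d" column family, "customer-42" row key are illustrative
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("customers"))) {

            // Retrieve: read the stored JSON document for one row key
            Result result = table.get(new Get(Bytes.toBytes("customer-42")));
            byte[] raw = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("json"));
            Customer customer = gson.fromJson(
                    new String(raw, StandardCharsets.UTF_8), Customer.class);

            // Change: modify the deserialized data structure in memory
            customer.email = "new.address@example.com";

            // Put back: serialize the whole structure and overwrite the cell
            Put put = new Put(Bytes.toBytes("customer-42"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("json"),
                    Bytes.toBytes(gson.toJson(customer)));
            table.put(put);
        }
    }
}

Because the whole document travels as one value, there is no join logic to write or test on the write path.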

Until Apache Drill came along, there weren't any tools that could query complex data types like JSON without first transforming them into a simpler data format. The problem with transformations is that they create additional work that must be performed before the data can be analyzed. When there is a variety of data types or data sources, multiple transformations must occur, which complicates the management of such a system.

Drill’s Capabilities

Apache Drill is ANSI SQL:2003 compatible. This means that everyone who already knows how to write SQL queries can use Drill with no new training. Additionally, Drill ships with standard ODBC and JDBC drivers, so the standard tools a business already uses to solve its problems can keep being used; no additional new tools are required. Those business intelligence tools can plug right into Drill and query the data in its native JSON format without it having to be transformed.
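As a small sketch of the analytical side, the following uses Drill's JDBC driver in embedded mode to run plain SQL directly against a JSON file; the file path, field names, and query are illustrative assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillQueryExample {
    public static void main(String[] args) throws Exception {
        // Requires the Drill JDBC driver jar on the classpath;
        // "zk=local" runs Drill in embedded mode on the local machine.
        Class.forName("org.apache.drill.jdbc.Driver");
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement();
             // Query a raw JSON file in place; no transformation or schema definition first
             ResultSet rs = stmt.executeQuery(
                     "SELECT t.name, t.address.city AS city " +
                     "FROM dfs.`/data/customers.json` t " +
                     "WHERE t.address.state = 'IL'")) {
            while (rs.next()) {
                System.out.println(rs.getString("name") + " - " + rs.getString("city"));
            }
        }
    }
}

A BI tool connected through the same driver would issue essentially the same SQL; the driver and Drill's storage plugins do the rest.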

The other great thing is that Drill has zero barrier to entry. It can be installed on an individual's laptop to query data sets that fit on the laptop. This makes it easy for people to get started with Drill and to start building new queries without being connected to a live database.

Conclusion

Apache Drill, coupled with the proper NoSQL data store, opens up the opportunity for a single data source to be used for both transactional and analytical processing. There is no need to export the data or transform it into an application-specific or even a star schema format in order to load it into a data warehouse.

Standard tools can continue to be leveraged by the business without further costs, thanks to support for standards like ANSI SQL:2003 and ODBC / JDBC drivers.

The time saved by not having to create software to serialize and deserialize the data into the data structures for any given language, coupled with simplified software testing and less code to maintain, adds up to a BIG savings for any business.

Contributed by Jim Scott, Director of Enterprise Strategy and Architecture at MapR. Jim has held positions running Operations, Engineering, Architecture and QA teams. Jim is the co-founder of the Chicago Hadoop Users Group (CHUG), where he has coordinated the Chicago Hadoop community for the past 5 years. Jim has worked in the Consumer Packaged Goods, Digital Advertising, Digital Mapping, Chemical and Pharmaceutical industries. Jim has built systems that handle more than 50 billion transactions per day. Jim's work with high-throughput computing at Dow Chemical was a precursor to more standardized big data concepts like Hadoop.

 

