Why You're Hearing More About "Fast" and Less About "Big"

In this special guest feature, Justin Langseth of Zoomdata gives his thoughts about three important technology trends driving the “Fast Data” conversation. Justin is Zoomdata’s CEO and founder. Zoomdata is Justin’s fifth startup; he previously founded Strategy.com, Claraview, Clarabridge, and Augaroo. Justin is an expert in big data, business intelligence, text analytics, sentiment analytics, and real-time data processing. He graduated from MIT with a degree in Management of Information Technology, and holds 14 patents. He is eagerly awaiting the singularity to increase his personal I/O rate, which is currently frustratingly pegged at about 300 baud.

It’s always been an oversimplification to treat the size of the data as the dominant focus of the “big data” opportunity. The first wave of big data adoption solved the general problem of processing large amounts of data in batch, but it didn’t solve the far more interesting throughput, speed and latency problems that enterprises are wrestling with.

And that’s where we are today: the focus is moving towards speed, and you’re starting to hear the concept of “time-to-analytics” (the time from ingesting data to actually getting value out of it through human interaction and synthesis with analytics) driving CIO-level discussions around big data architecture decisions. Here are the hottest technologies and concepts that I see right in the middle of this fast data discussion.

3 Technology Trends Driving the “Fast Data” Conversation

Apache Spark
MapReduce solved the general problem of batch processing at scale, but in 2010 a new framework emerged that provides the right abstractions and design goals to put the emphasis on speed. Apache Spark is an open source cluster computing system for data analytics that fits into the Hadoop open source community, can run on top of HDFS, and delivers performance up to 100x faster than MapReduce for certain applications (largely because it executes in memory). UC Berkeley AMPLab (where Spark was born) recently published benchmarking statistics that showed new records for how fast systems can sort 100 terabytes (1 trillion records) of data. The core of Spark allows queries to keep working data in memory rather than going back to disk between steps, which is where most of that speedup comes from; workloads that make a single pass over the data, or that do not fit in memory, tend to see less benefit.
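To make the in-memory point concrete, here is a minimal sketch in plain Python. It is not Spark code; `DiskBackedDataset`, `cache()` and `run_queries` are hypothetical names invented for this illustration. It only shows why keeping parsed data in memory helps when several queries touch the same data: the uncached dataset goes back to disk for every query, while the cached one reads the file once.

```python
import os
import tempfile


class DiskBackedDataset:
    """Toy dataset that counts how often it has to go back to disk."""

    def __init__(self, path):
        self.path = path
        self.disk_reads = 0
        self._cached = None  # parsed rows, populated only by cache()

    def cache(self):
        # Read once and keep the parsed rows in memory for later queries.
        self._cached = self._read_from_disk()
        return self

    def _read_from_disk(self):
        self.disk_reads += 1
        with open(self.path) as f:
            return [int(line) for line in f]

    def rows(self):
        if self._cached is not None:
            return self._cached
        return self._read_from_disk()


def run_queries(dataset):
    # Three independent "queries" over the same data.
    total = sum(dataset.rows())
    largest = max(dataset.rows())
    evens = sum(1 for r in dataset.rows() if r % 2 == 0)
    return total, largest, evens


def demo():
    # Write the numbers 1..10 to a temporary file that stands in for disk storage.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(str(i) for i in range(1, 11)))
        path = f.name
    try:
        uncached = DiskBackedDataset(path)
        cached = DiskBackedDataset(path).cache()
        return (run_queries(uncached), run_queries(cached),
                uncached.disk_reads, cached.disk_reads)
    finally:
        os.remove(path)


print(demo())  # ((55, 10, 5), (55, 10, 5), 3, 1)
```

Both datasets give the same answers; the difference is three disk reads versus one. In an iterative or interactive workload that touches the same data many times, that gap is what grows.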

Apache Mesos
Big data has stretched enterprise operations to their limits. From peak volumes of data ingestion to throughput requirements, the old world model of small applications deployed to virtual machines on single servers no longer fits today’s world of large applications built for multicore, elastic scalability, and processing huge amounts of data. The new unit of developer abstraction in the big data world is the datacenter as a “single pool of resources” (a concept popularized by Mesosphere), and Apache Mesos is the distributed systems kernel that developers are leveraging to run big data frameworks across clusters of resources in the enterprise datacenter. When you start to cross into the territory of multi-tenancy, data locality, and other key big data performance concepts that support improved “time-to-analytics”, you enter the world of Mesos, where the type of “common services” approach that powers a personal computer operating system starts to make a lot of sense for the big data stack. I see Apache Mesos as a core technology in the fast data conversation, because at a certain point having human beings manage compute resources by hand cannot deliver the flexibility and automation that big data operations need. After all, big data developers don’t want to be in the business of writing infrastructure plumbing.
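The “single pool of resources” idea can be sketched in a few lines of Python. This is a toy model and not the Mesos API: `Node`, `Framework`, `consider_offer` and `schedule` are hypothetical names, and the only resource tracked is CPU count. It loosely mirrors the offer-based approach Mesos is known for, where the cluster offers free resources to each framework and the framework decides how much to take.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpus: int


@dataclass
class Framework:
    name: str
    cpus_per_task: int
    pending_tasks: int
    placements: list = field(default_factory=list)

    def consider_offer(self, node):
        """Place as many pending tasks as fit on the offered node; return CPUs used."""
        used = 0
        while self.pending_tasks > 0 and node.free_cpus - used >= self.cpus_per_task:
            used += self.cpus_per_task
            self.pending_tasks -= 1
            self.placements.append(node.name)
        return used


def schedule(nodes, frameworks):
    # Offer each node's free CPUs to every framework in turn (assumed fixed order).
    for node in nodes:
        for fw in frameworks:
            node.free_cpus -= fw.consider_offer(node)
    return {fw.name: fw.placements for fw in frameworks}


nodes = [Node("node-a", 4), Node("node-b", 4)]
frameworks = [Framework("spark", cpus_per_task=2, pending_tasks=3),
              Framework("stream", cpus_per_task=1, pending_tasks=2)]
print(schedule(nodes, frameworks))
# {'spark': ['node-a', 'node-a', 'node-b'], 'stream': ['node-b', 'node-b']}
```

Neither framework knows or cares which machine it lands on, and the leftover capacity on node-b is picked up by the second framework without anyone assigning it by hand. A real scheduler would also weigh fairness, memory, and data locality, which this sketch leaves out.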

Data Fusion
Customers increasingly want to integrate batch and streaming data processing within a common framework, from runtime to analytics, usually to correlate real-time and historical information. The rise of streaming analytics means enterprises now have to join all that streaming data with data at rest, and that has traditionally been a very hard thing to do. Getting data from disparate data stores and running analytics on it in real time (or even right time) is a huge technological challenge. Some of the data may be on-premises, some in private clouds, and some in cloud-based SaaS services. At a technical level, the power of fusion comes from its ability to make multiple data sources appear as one source without moving data, while still allowing for highly performant analytics and visualization. At a business level, the power of fusion is allowing business users to join data sets and query them instantly, without having to wait for a data architect to set it all up.
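As a rough illustration of the fusion idea, the Python sketch below joins a simulated stream of orders with a historical customer table at query time, without first copying either into a shared store. It is a simplified stand-in and not how any particular product implements fusion: the table layout, `live_orders`, `fused_orders` and `revenue_by_region` are all hypothetical, and an in-memory SQLite database plays the role of the historical source.

```python
import sqlite3


def make_historical_store():
    # Stand-in for a warehouse or other historical source.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)",
                     [(1, "east"), (2, "west")])
    return conn


def live_orders():
    # Stand-in for a streaming source; a real one would never "finish".
    yield {"customer_id": 1, "amount": 30}
    yield {"customer_id": 2, "amount": 20}
    yield {"customer_id": 1, "amount": 50}


def fused_orders(conn, stream):
    """Present both sources as one: each live order enriched with its region."""
    for event in stream:
        row = conn.execute("SELECT region FROM customers WHERE id = ?",
                           (event["customer_id"],)).fetchone()
        yield {**event, "region": row[0] if row else None}


def revenue_by_region(conn, stream):
    totals = {}
    for order in fused_orders(conn, stream):
        totals[order["region"]] = totals.get(order["region"], 0) + order["amount"]
    return totals


print(revenue_by_region(make_historical_store(), live_orders()))
# {'east': 80, 'west': 20}
```

The caller of `revenue_by_region` sees one logical data set. The per-event lookup used here is the simplest possible join strategy; at scale you would expect something smarter, such as pushing aggregation down to each source.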

Conclusions: Getting Ready for Fast Data

Remember when we used to FTP files over the Internet? When email would barf on a file greater than 2 megs, so you had to FTP a large file, and you might have even had to log into a VPN prior to uploading and sending that file? That was quite a bit of fuss for a problem that is so irrelevant to today’s Internet machinery that it would be absurd to even attempt to explain to your teenager those early days of sending files on the Internet.

I believe within a few years we’ll look back at the fuss of batch approaches to big data and it will all seem very silly and inefficient in hindsight as well. The typical data environment today still handles data as if none of this streaming technology existed. Transactions happen in one place, data is sent in batches, and the analysis eventually happens against those batches in a data warehouse or lake. By the time the insight is gleaned, it’s against old data.
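The staleness problem is easy to see with a small Python sketch. The functions and numbers here are made up for illustration: `batch_view` records the running total an analyst would see if data only lands every `batch_size` events, and `streaming_view` records what they would see if every event were visible as it arrives.

```python
def batch_view(events, batch_size):
    """Total visible after each event when data only lands in full batches."""
    visible, pending, seen = 0, 0, []
    for i, amount in enumerate(events, start=1):
        pending += amount
        if i % batch_size == 0:
            # The batch is shipped; only now does the analyst see these events.
            visible += pending
            pending = 0
        seen.append(visible)
    return seen


def streaming_view(events):
    """Total visible after each event when every event is processed on arrival."""
    total, seen = 0, []
    for amount in events:
        total += amount
        seen.append(total)
    return seen


events = [10, 20, 30, 40]
print(batch_view(events, batch_size=2))  # [0, 30, 30, 100]
print(streaming_view(events))            # [10, 30, 60, 100]
```

Both views agree once a batch boundary is reached, but in between the batch view is answering questions against data that is already out of date.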

The first wave of big data analytics allowed people to analyze the past, and streaming reflects a movement towards looking at the present and predicting the future, to make your business smarter and faster. I believe Mesos, Spark and Fusion are at the center of this evolution, and are the key technologies for any CIO or developer to focus on to get there, faster.
