Hadoop 101: The Future of Analytics – Real-time streaming

Print Friendly, PDF & Email

hadoop-101Unlocking the Potential of Big Data

It’s no secret that companies today are overwhelmed by the influx of data flowing through the enterprise from various sources at all times of the day. Increasingly, devices generate data and the volume can be massive. It’s known that these streams of data are valuable and that big data has the potential to change the way enterprises tackle business goals and day-to-day challenges, but most organizations don’t have the right tools to do so. Big data has big potential, but it can only be unlocked with the right tools and fully utilized with the help of real-time streaming capabilities.

Real-time streaming is essential for big data analytics and provides valuable insights from data being processed while current and most relevant, rather than hours or days later. With the big data environment evolving every day and inundating companies with large amounts of data, stream processing capabilities are more important than ever. Streaming allows users to take action in real time and make impactful decisions backed by data.

A real-time streaming platform must meet the needs of data scientists, developers and data center operations teams without requiring extensive patchwork of custom code or brittle integration of many third-party components that often fails. DataTorrent RTS is a fully Hadoop native streaming analytics solution. DataTorrent RTS provides an enterprise-grade platform, delivers tools and pre-built analytics modules and “lights out” data center operational capabilities.

Hadoop101_DataTorrent1

Visual Tools

One of DataTorrent’s main differentiators are its visual tools. The streaming application builder and visual data dashboard enables quick experimentation over real-time data. Data scientists and developers can create real-time streaming applications in an easy and intuitive manner, without programming skills. Historically, development of these types of applications has been exclusively in the hands of knowledgeable technical developers, but with the application builder and data dashboard, these types of applications become much more accessible to a wider audience of users, leading to greater adoption within organizations.

For the operation of streaming applications, DataTorrent offers a console, a visual interface to monitor and manage applications. Along with the graphical user interface comes a REST API that can be used for integration into the existing data center infrastructure.

Competing Platforms

Several other offerings on the market seek to establish themselves as similar big data processing platforms for large-scale data processing, including the popular Apache Spark Streaming and Apache Storm projects. Both focus mainly on the programming aspect and platform, and prove difficult to integrate and often leave companies seeking additional tools. Despite the hype around Spark and Storm, enterprises struggle with both offerings’ narrow focus and rarely get beyond the proof-of-concept stage.

If so many enterprises are struggling to incorporate Spark and Storm, what’s the draw? Enterprises typically lean toward Spark and Storm in the preliminary stages for several reasons. Both projects are well known in the open source ecosystem around Hadoop. They are recognized as alternative processing engines and therefore MapReduce alternatives (Spark’s origin is fast in-memory batch processing).

Spark and Storm’s well-known names are making companies rethink their data processing use cases and look to stream processing to reduce time-to-action, reinforcing the need for an enterprise grade streaming platform in the market.

Comprehensive Approach

Enterprises need an enterprise-grade streaming analytics solution, based on a platform that is optimally suited for stream processing. To this effect Spark Streaming and Storm are fundamentally limited by their core architecture and not sufficient to deploy mission critical applications in an enterprise data center.

In contrast, Spark and Storm are neither data scientist nor application developer friendly, as both require hand coding since there is no library of pre-built code. Data scientists and developers should be able to use intuitive visual tools to create streaming apps and iterate over their hypotheses. Programming is tedious, and with Spark and Storm, developers are forced to manually account for scalability, handle input data skews, hand-code fault tolerance for the application and attempt to force event ordering/re-ordering.

Spark and Storm solutions do not provide full visibility into metrics of application and infrastructure as required for a production system. DataTorrent covers these needs with the above mentioned console and ability to integrate with existing infrastructure. Beyond that, dynamic application updates are supported without taking down the system and keep up with the speed of business. Enterprises will demand a solution that, beyond development, fully covers the operational aspects including fault tolerance, monitoring, updating and diagnosis.

DataTorrent is running on top of YARN and HDFS on every commercial Hadoop distribution (CDH, HDP, MapR, Amazon EMR etc.) and can be used seamlessly in public or private cloud environments. DataTorrent empowers rapid time-to-market and time-to-value with its modular analytics capabilities that are easily combined using the visual application development interface. DataTorrent provides a library of over 450 pre-built Java operators that allow for a wide variety of analytical capabilities.

Architecture

Figure 2 shows the architecture of DataTorrent RTS. The entire system functions as a Hadoop 2.x-native application, running on top of YARN and HDFS. DataTorrent RTS fully leverages the underlying platform (scalability, multi-tennancy, security etc.) and coexists with MapReduce and other Hadoop jobs running on the cluster.

As a YARN native application, DataTorrent implements an application master (Streaming Application Master— StrAM), which serves as the brain and the bookkeeper of the streaming application. StrAM is responsible for managing the DataTorrent RTS application within the Hadoop cluster, as well as the DataTorrent real-time processing functionality.

 

Figure 2. DataTorrent RTS as Hadoop 2.x native application.

Figure 2. DataTorrent RTS as Hadoop 2.x native application.

 

Operators are the fundamental building blocks that streaming analytics systems use. DataTorrent RTS applications can be built from over 400 operators in the open source Malhar library (https://github.com/DataTorrent/Malhar). Custom operators can also be easily written, using the Malhar operators as starting point.

DataTorrent RTS applications are created by adding operators and streams to a DAG. To do this, developers work in their choice of Java IDE or in the graphical application design tool. Figure 3 illustrates the development process.

Figure 3. Application’s logical plan, consisting of operators connected by streams.

Figure 3. Application’s logical plan, consisting of operators connected by streams.

 

The typical DAG consists of one or more input sources connected through streams to other operators and finally to one or more output destinations. The Malhar operator library provides adapters for a wide variety of systems.

The developer can also set attributes that affect the physical plan and execution of the application, such as partitioning and affinity. There are a variety of attributes to control platform behavior and tune applications without changing the operators or application themselves. This is another key differentiator of DataTorrent RTS.

Streaming Window

Streaming applications have special needs in terms of both operability and functionality. For operability, applications need bookkeeping, fault tolerance and state management. From a functional standpoint, applications need to be able to accumulate data in a manner that fits specific business needs. In DataTorrent RTS, both operability and functionality constraints are accommodated with streaming window based event processing. Figure 4 illustrates how the event stream is processed in windows.

Figure 4. Window based event processing.

Figure 4. Window based event processing.

 

Windowing is a fundamental native capability of the DataTorrent RTS streaming platform. As implemented in DataTorrent RTS, streaming windows are finite time slices that constitute a large set of consecutive events. The system is non-blocking and no acknowledgements are necessary. Windowing results in sophisticated abilities to enable and boost linear scalability, as well as to provide for inherent fault tolerance as a fundamental system capability.

Summary

Enterprises are seeing big data and real-time streaming as an opportunity to better serve their customers, increase revenues and reduce costs through operational efficiencies.

As big data begins to outpace the capabilities of traditional databases, more companies will seek out real-time streaming platforms to help make sense of their data. Real-time streaming is helping enterprises unlock the potential of their data and make sense of data-in-motion. With the help of the DataTorrent RTS platform, users will be able to make better business decisions faster than ever.

Contributed by Thomas Weise. Thomas is principal architect at DataTorrent and has developed and architected distributed systems, middleware and web applications since 1997. Thomas joined DataTorrent at its inception. Prior to DataTorrent he was in the Hadoop Team at Yahoo! and contributed to projects like Pig and Hive and porting of the MapReduce based infrastructure to the next generation Hadoop 2.x.

 

Sign up for the free insideBIGDATA newsletter.

 

Speak Your Mind

*