In this special guest feature, Kostas Tzoumas, Co-founder and CEO of data Artisans, explores what data streaming is, why it is important, what companies are doing with data streams, and how Flink fits into the bigger picture. Kostas Tzoumas is co-founder and CEO of data Artisans, the company founded by the original creators of Apache Flink. Kostas is a PMC member of Apache Flink and earned a Ph.D. in Computer Science from Aalborg University, with postdoctoral experience at TU Berlin. He is the author of a number of technical papers and blog articles on stream processing and other data science topics.
Unless you have been living under a rock for the past year, you’ve probably noticed increasing hype around the accelerated growth of real-time data and the idea of managing data as streams. As one of the pioneering members of the Apache Flink™ project and a co-founder of data Artisans, the company formed by the initial Flink creators, I have been very close to this trend. In this article, I’d like to explore what data streaming is, why it is important, what companies are doing with data streams, and how Flink fits into the bigger picture.
At first glance, streaming data is enabling the obvious: continuous processing on data that is continuously produced. What does this mean? The real world produces information on a continuous basis in the form of events. Think, for example, of continuous user activity on a website, a sensor emitting readings continuously, or cars reporting their position frequently. A majority of interesting real-world data sets are not static or finite. They represent continuous change or new information, and they do not have a beginning and an end. Much like the limitations of the era of early commercial mainframe computers, until now we’ve been confined to a static view of data simply because the tools to process and analyze continuous data were not mature enough. This reality has changed with the invention of open source technologies like Apache Flink, Apache Kafka, Apache Beam (incubating), and others. Having your data available as event streams, and having the ability to do analytics on these event streams directly, reduces the time from data production to decision and increases the value of these events. The end result is liberating: instead of having to wait for a long cycle of batch processing until data is available, you can perform analytics immediately.
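As a concrete, if simplified, illustration: analytics on an event stream often means continuously aggregating events into time windows rather than scanning a finished data set. The sketch below is plain Python, not the API of Flink or any specific engine, and all names in it are illustrative; it counts page-view events per fixed one-minute window as they arrive.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

WINDOW_MS = 60_000  # tumbling window size: one minute, in milliseconds

def windowed_counts(events: Iterable[Tuple[int, str]]) -> Dict[int, int]:
    """Count events per tumbling window, keyed by the window's start timestamp."""
    counts: Dict[int, int] = defaultdict(int)
    for timestamp_ms, _payload in events:
        # Assign each event to the window containing its timestamp.
        window_start = (timestamp_ms // WINDOW_MS) * WINDOW_MS
        counts[window_start] += 1
    return dict(counts)

# Three page-view events: two in the first minute, one in the second.
stream = [(5_000, "view"), (42_000, "view"), (61_000, "view")]
print(windowed_counts(stream))  # {0: 2, 60000: 1}
```

The key point is that results are available per window as data flows in, instead of after a full batch run; a production engine adds the hard parts this sketch omits, such as out-of-order events and fault tolerance.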
From a business perspective, there is an ever-growing need to monitor the business in real-time and to provide push services and on-demand products to customers, as well as to adapt to customer schedules, locations, and habits on a continuous basis. For example, in retail, it makes sense to provide product recommendations based on shoppers’ current activity rather than their historic activity. Even brick-and-mortar stores can benefit by offering deals the moment a customer walks into the shop. In financial services, being able to detect or even prevent fraud in real-time is of paramount importance. Last but not least, in the manufacturing and telecommunications industries, it is important to monitor equipment continuously and to create alerts if anything goes wrong.
From a technology perspective, there’s been a fundamental shift in how services and products are developed, having evolved from a monolithic architecture to a decentralized one based on microservices. The data processing backend for such an architecture looks a lot more like event streams flowing through numerous decentralized streaming applications than like a single data store that serves all requests. In many companies, streaming analytics go well beyond real-time. Streaming is a robust way to implement continuous applications: applications that need to process incoming data 24/7 with strong consistency guarantees. In fact, applications built using Apache Flink are already replacing home-grown applications and periodic Hadoop jobs in some production use cases.
Earlier approaches to data streaming lacked at least one of three key characteristics: the ability to handle high-volume streams, the ability to guarantee correct results in the case of cluster failures or variations in the order in which data is ingested into the platform, and the ability to report results with low latency. Because of these limitations, users were stuck with hybrid approaches like the lambda architecture, or with trying to work around the limitations of micro-batch systems, which hindered large-scale adoption of streaming technology. The novelty of Apache Flink was that it handles all three key attributes of robust streaming: processing high-volume streams with low latency and strict accuracy guarantees. And the system does not neglect batch processing but, instead, treats it as a special case of stream processing. After all, if you think about it, a file is merely a continuous stream that happened to end.
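The "a file is a stream that happened to end" idea can be made concrete with a minimal sketch. This is plain Python rather than Flink code, and the function names are purely illustrative: one pipeline consumes any iterable of events, so a bounded input (a file, a batch) and an unbounded live feed are handled by exactly the same logic.

```python
from typing import Iterable, Iterator

def parse_events(lines: Iterable[str]) -> Iterator[int]:
    """Turn raw text lines into integer readings, one event at a time."""
    for line in lines:
        yield int(line.strip())

def running_sum(events: Iterable[int]) -> Iterator[int]:
    """Emit an updated running total after each event -- a continuous result."""
    total = 0
    for event in events:
        total += event
        yield total

# A bounded input (the "batch" case) flows through the same pipeline that
# would consume an endless generator; the stream here simply ends.
batch_input = ["1", "2", "3"]
print(list(running_sum(parse_events(batch_input))))  # [1, 3, 6]
```

Because both stages are lazy generators, nothing in the pipeline assumes the input is finite; treating batch as the bounded special case falls out naturally.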
To learn more about Apache Flink and real-time (or not) data stream processing, I invite you to check out the upcoming Flink Forward 2016 conference website, the project’s website, as well as the data Artisans technical blog. To get a feeling of how Flink is used in the real world, visit the Powered by Flink page.