I recently caught up with Gwen Shapira, System Architect at Confluent, to talk about the market dynamics of new “fast data” technologies and what is driving their rapid adoption across large companies. Gwen’s experience includes Linux system administration, web development, technical leadership and many years of Oracle database administration. She studied computer science, statistics and operations research at Tel Aviv University, and then went on to spend the next 15 years in different technical positions in the IT industry.
insideBIGDATA: Please give us a short overview of Confluent and the role Kafka plays in the “fast data” and real-time arena.
Gwen Shapira: Confluent was founded by the creators of Apache Kafka™. We provide a streaming platform for large-scale distributed environments, enabling enterprises to maximize the value of real-time data. Confluent allows leaders in industries such as retail, logistics, manufacturing, financial services, technology and media to integrate isolated systems and create real-time data pipelines where they can act on new information immediately.
To improve strategic decision making as well as product offerings and customer experiences, organizations must learn how to process data in real time and capitalize on streaming technologies. There is a huge difference in outcomes between organizations that can adjust offerings and prices in real time, reacting immediately to changes in demand, and organizations that can only react and update once a month.
With Kafka solidifying itself as the leading platform for handling real-time data, an increasing number of companies are leveraging fast and streaming data, including Netflix, Yelp, Uber and Dropbox. In fact, more than a third of Fortune 500 companies are using streaming platforms in production.
insideBIGDATA: Why do you think the “fast data” movement is taking off right now?
Gwen Shapira: Businesses are more digital than they’ve ever been. However, many companies still have disparate systems and are largely siloed. In a world where many processes still involve dumping data out of disconnected systems, or where business processes have manual steps performed by humans, periodic batch processing of data is natural. As a business becomes fully digital, the daily cycle of batch processing makes less and less sense. For most modern businesses, their core data is a continuous stream of events. Customers look at advertisements, click on links and buy products at any moment, 24×7. Shipments arrive at all hours of the day, not just at 11pm. It is natural that the processing and analysis of this data would be continuous and real-time as well.
Also, the technology for supporting stream processing has gotten much better. We’ve learned a great deal about how to build stream processing systems that not only scale horizontally to company-wide usage, but whose capabilities are truly a superset, rather than a subset, of what was possible with batch processing. I think Apache Kafka deserves a lot of the credit here – there were attempts to do stream processing in the past (the traditional name is CEP: complex event processing), but they never took off in a big way. I think this is because without Kafka as a reliable streaming platform, there were no practical sources of event streams to process. This has all changed in the last few years. This area is still evolving rapidly, but forward-thinking companies around the globe are already implementing production stream processing applications in critical domains.
insideBIGDATA: What do you think is required to get it right?
Gwen Shapira: For streaming data to be successful, there needs to be a platform that can handle critical updates and deliver the data in order and without loss. It needs to be reliable enough for enterprise companies to entrust financial information to the platform. In addition, the streaming platform needs to support throughput high enough to handle large-volume log or event data streams. It needs to provide data with latency low enough for real-time applications. It also must be able to operate as a secured central system that can scale to carry the full load of the organization and operate with hundreds of applications built by disparate teams all plugged into the same central nervous system.
insideBIGDATA: What challenges do you see that must be overcome?
Gwen Shapira: The first challenge is collecting these streams of data – you want applications to start reporting real-time events into Kafka and you want to integrate information from databases, logs, sensors and even social media. Getting all the data is always a challenge, but Kafka has a wide collection of connectors and clients that make it easier.
The second challenge is that as the organization becomes more real-time, there are more applications integrated into the streaming platform, sending and receiving streams of events. This can become fragile – one bad record that enters the stream, or one application making an incompatible modification to its data, can trigger a chain of failures across apps and services. This requires managing the relationship between topics and the schemas of their events, and ensuring that only compatible changes and compliant data enter the data streams.
The third challenge is a good one: it has to do with growth! If all goes well and the platform is growing, you need to consider things like optimized resource utilization, maintaining balanced distribution of data and load, migrating applications to the cloud and managing use cases that span data centers and geographies. Those are good challenges because we mostly see them in forward-looking organizations that have already adopted fast data at large scale and are realizing the benefits.
insideBIGDATA: What opportunities do “fast data” and real-time data bring for companies of all sizes?
Gwen Shapira: Adopting real-time data allows organizations to be more customer-centric, to spot new market opportunities, and to optimize operations and reduce costs. Real-time data also gives any company greater visibility and predictability in enterprise operations. Fraud detection and customer-360 views that update in real time are the two use cases we see most frequently. I believe that ultimately streaming data will be the only way companies consume and process data, and will radically change the way we interact with data moving forward.