The Lambda Dilemma: Uncovering the Layers of Complexity to Achieve your Fast Data Goals


Developers building on HDFS are absolved from the most difficult problems of distributed systems. HDFS manages data movement and persistence, and it encourages users to treat it like an immutable store. These features enable new batch-oriented tools to be developed quickly and adopted aggressively, as they can be run in addition to an already-working system without introducing new risk.

Of course there is a catch. HDFS-based tools accept that the latency between data capture and consumable processing is measured in minutes or hours, not seconds or milliseconds. Despite this, the pattern of cobbling together many special-purpose systems to build a larger system, using glue code and luck, has been brought over to streaming problems with low latency requirements.

Let’s examine Storm, a framework for processing tuples in a stream. The user supplies the per-tuple processing code and the framework manages where that code is run, much like a MapReduce Hadoop job. Storm also manages the optional re-processing of tuples when software or hardware fails. While Storm has a ‘MapReduce for streams’ feel to it, there is a crucial difference: Storm is responsible for the data it is processing, while MapReduce processes data kept safe by HDFS.
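To make that difference concrete, here is a minimal sketch of a stream runner that replays a tuple when the user's per-tuple code fails. It is not Storm's API: `run_stream`, `count_word` and the simulated failure are hypothetical names invented for illustration, and the whole thing runs in a single process. What it shows is that once the framework owns the data, a replay can apply the same side effect twice unless the state update is idempotent.

```python
from collections import deque


def run_stream(tuples, process, max_attempts=3):
    """Toy at-least-once runner: re-queues a tuple whenever `process` raises."""
    pending = deque((item, 0) for item in tuples)
    failed = []
    while pending:
        item, attempts = pending.popleft()
        try:
            process(item)
        except Exception:
            if attempts + 1 < max_attempts:
                pending.append((item, attempts + 1))  # replay later
            else:
                failed.append(item)  # give up after max_attempts tries
    return failed


counts = {}
already_crashed = set()


def count_word(word):
    """User-supplied per-tuple code: a non-idempotent counter update."""
    counts[word] = counts.get(word, 0) + 1
    # Simulate a worker dying after the state update but before the ack.
    if word == "b" and word not in already_crashed:
        already_crashed.add(word)
        raise RuntimeError("simulated worker failure")


failed = run_stream(["a", "b", "a"], count_word)
print(counts)  # "b" arrived once but is counted twice after the replay
```

In a batch job the input would still be sitting safely in HDFS and the whole job could simply be rerun; here the runner and the state store have to cooperate to get the count right.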

Keeping with the Unix and Hadoop philosophy of specialized tools, Storm focuses on distributing processing while leaving other key functions to other systems. It’s typically deployed with ZooKeeper to manage agreement, Kafka to manage ingestion, and something like Redis or Cassandra to manage state. Now we’re operating and monitoring four systems, probably at least twelve nodes, and the glue code that connects them. Each of these systems has different failure semantics, and can cause different symptoms or cascading failures in other parts of the stack.

While four systems wouldn’t be odd in an HDFS-based batch system, the crucial difference here is that user data ownership is passed between systems, rather than being wholly the responsibility of HDFS. The complexity of developing, testing and operating systems increases exponentially as the number of data transfers between systems increases.

The end result is a system that is difficult to run and even more difficult to verify. Many users accept that these streaming systems will occasionally be inaccurate or unavailable. So what to do?

Enter what people are calling the Lambda Architecture. Rather than address the above-mentioned flaws directly, Lambda simply runs both the “Batch Layer” and “Speed Layer” (stream processing) in parallel. The Speed Layer can serve responses in seconds or milliseconds. The Batch Layer can be a long-term record of historical data as well as a backup and consistency check for the Speed Layer.

But there’s no getting around the complexity of a Lambda solution. Running both layers in parallel, doing the same work, adds redundancy, more software, more hardware and more places where two systems need to be glued together. This lack of natural integration makes Lambda systems difficult to run – and rely on – at enterprise scale. It also doesn’t address temporary correctness problems with the Speed Layer; it merely ensures that they will be corrected when the batch job finishes.
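The two layers described above can be sketched in a few lines. This is a toy, single-process model under stated assumptions, not a reference implementation: `LambdaCounter`, its methods and the `speed_layer_ok` flag are hypothetical names, an in-memory list stands in for the immutable record kept by the Batch Layer, and a dropped event stands in for any Speed Layer fault.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda-style page-view counter (all names are illustrative)."""

    def __init__(self):
        self.master_log = []         # append-only record read by the batch job
        self.batch_view = Counter()  # rebuilt from scratch on every batch run
        self.speed_view = Counter()  # incremental counts since the last batch run

    def ingest(self, page, speed_layer_ok=True):
        # Every event is recorded for the Batch Layer.
        self.master_log.append(page)
        # The Speed Layer sees it too, unless something goes wrong.
        if speed_layer_ok:
            self.speed_view[page] += 1

    def run_batch(self):
        # Recompute the full view from the log, then discard the
        # speed-layer counts that the new batch view now covers.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, page):
        # Answers merge the (stale but correct) batch view with the
        # (fresh but possibly wrong) speed view.
        return self.batch_view[page] + self.speed_view[page]


lam = LambdaCounter()
lam.ingest("/home")
lam.ingest("/home", speed_layer_ok=False)  # Speed Layer loses this event
before = lam.query("/home")  # undercounts until the batch job runs
lam.run_batch()
after = lam.query("/home")   # corrected by the batch recomputation
print(before, after)
```

Even in this toy, the same counting logic exists twice, and the query path has to know how to stitch the two views together. Between the dropped event and the next `run_batch` call, every reader gets the wrong answer.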

So why is the Lambda Architecture gaining acceptance? It’s important to understand that there are few good options when dealing with low-latency requirements at this scale. And Lambda Architecture systems have been deployed with some success, largely in web-scale or web-services companies. It’s easy to see the circumstances and history that led some to choose Lambda, but from an engineering point of view, the Lambda Architecture will likely prove a detour.

The single largest issue with these systems is lack of integration. This leads to complex development, complex testing, and complex operational support that may not have been anticipated. In contrast, the best solutions to Fast Data problems have been conceived with the big picture in mind. How will this system tackle ingestion, agreement, processing and state? Using discrete components to handle these four jobs is flexible, but at tremendous cost and risk. For many organizations, the Lambda Architecture will introduce unacceptable risk.

So when should you use Lambda? If your company has a Lambda-based solution in production solving some other problem, and it’s easy to see how that same stack could be used to solve your problem, leveraging current experience may make a lot of sense. Make sure your organization has experience monitoring and supporting these systems, and that the developers and ops for the existing system are happy with it.

Finally, make sure the new problem can be solved with the existing stack. If you find yourself adding another system to add “one more thing”, or find that you have to make extensive changes to existing glue code, you should consider alternatives.

If you’re starting from scratch, downloading a list of tarballs from the ASF and trying to figure out how to connect them to make a Lambda App, please stop. I know the feeling of staring at the recently un-tarred directories, the sense that anything is possible and the world is your ninja rockstar oyster. It doesn’t last, but bad decisions do.

Contributed by John Hugg. John is the Founding Software Engineer for VoltDB. VoltDB provides a fully durable, in-memory relational database that combines high-velocity data ingestion and real-time data analytics and decisioning to enable organizations to unleash a new generation of big data applications that deliver unprecedented business value.

