Distributed Data Caching For Big Data


The emergence of Big Data brought with it the promise of highly scalable databases that could handle terabytes of data at a time. As anyone dealing with colossal datasets will attest, however, it rapidly becomes difficult to manipulate, store and retrieve information in large quantities, and to determine whether a cache is necessary.

If we’re to deliver solutions worthy of the digital age, we need to explore new approaches that overcome the limitations of today’s database technologies. Congestion and scalability obstacles, at both the network and hardware levels, are major impediments to the management and manipulation of big data sets.

Why Use a Cache in Big Data Applications?

The importance of a cache is self-evident: it reduces the strain on a database by positioning itself as an intermediary layer between the database and the end users – broadly speaking, it will transfer data from a low-performance location to a higher-performance location (consider the difference in accessing data stored on a disk vs accessing the same data in RAM). When a request is made, the returned data can be stored in the cache in such a way that it can be more easily (and more rapidly) accessed in the future. A query will initially try the cache, but if it misses, will fall back on the database.
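The lookup flow described above (try the cache first, fall back on the database on a miss, then keep the result for next time) can be sketched roughly as follows. This is a minimal illustration rather than a production design: the `CacheAside` class, the `slow_lookup` function and the `FAKE_DB` dictionary are hypothetical names standing in for a real cache and a real database.

```python
from typing import Callable, Dict, List, Optional


class CacheAside:
    """Check an in-memory cache first; on a miss, fall back to the database."""

    def __init__(self, load_from_db: Callable[[str], Optional[str]]) -> None:
        self._load_from_db = load_from_db  # hypothetical slow backend lookup
        self._cache: Dict[str, str] = {}   # fast in-memory store (RAM)
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> Optional[str]:
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        # Cache miss: query the database and remember the answer.
        self.misses += 1
        value = self._load_from_db(key)
        if value is not None:
            self._cache[key] = value
        return value


# Stand-in "database": a plain dict, plus a log of every query it receives.
FAKE_DB = {"user:1": "Ada", "user:2": "Grace"}
db_calls: List[str] = []


def slow_lookup(key: str) -> Optional[str]:
    db_calls.append(key)
    return FAKE_DB.get(key)


cache = CacheAside(slow_lookup)
first = cache.get("user:1")   # miss: goes to the database
second = cache.get("user:1")  # hit: served from memory
```

After the two calls, the database has been queried only once; the second request never leaves the cache.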

A cache makes the most sense for applications that reuse the same information over and over: think game or message data, software rendering or scientific modelling. To take a simplified use case, consider a three-tier app made up of a presentation layer (the user interface), an application layer (handling the logic for the app) and a data layer (the backend hosting the data).

These three layers can be geographically separated, though latency would be a limiting factor as the three must constantly ‘speak’ to each other. Let’s now assume that each individual user in our app has a static data set that needs to be relayed to them each time they navigate to a new page – starting at the data layer and ending at the presentation layer.

If the data layer is constantly queried, it leads to high strain and poor user experience caused by latency. By introducing a cache, however, the data that is frequently accessed can be kept close by in temporary memory, allowing it to be rapidly served to the presentation layer.

Due to cost and speed considerations, a cache is somewhat limited in the size it can grow to. Nonetheless, where efficiency is concerned, it is a necessary addition to any high-performance database service.
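Because a cache can only grow so large, it needs some rule for deciding what to discard when it fills up. The article doesn’t name a policy, so the sketch below assumes one common choice, least-recently-used (LRU) eviction; the `LRUCache` class and its capacity of two entries are purely illustrative.

```python
from collections import OrderedDict
from typing import Any, Hashable


class LRUCache:
    """A size-limited cache that evicts the least recently used entry."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items: "OrderedDict[Hashable, Any]" = OrderedDict()

    def get(self, key: Hashable, default: Any = None) -> Any:
        if key not in self._items:
            return default
        self._items.move_to_end(key)  # mark as most recently used
        return self._items[key]

    def put(self, key: Hashable, value: Any) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the oldest entry

    def __contains__(self, key: Hashable) -> bool:
        return key in self._items

    def __len__(self) -> int:
        return len(self._items)


cache = LRUCache(capacity=2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")     # "a" is now the most recently used entry
cache.put("c", 3)  # cache is full, so "b" (least recently used) is evicted
```

Frequently requested data stays in memory, while entries nobody has asked for recently make room for new ones.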

From In-Process Caching to Distributed Caching

Many applications cache locally using the model described above: a single cache instance running alongside the application. This approach has a number of downsides, the most notable being that it doesn’t scale well for bigger applications. On top of this, if the instance fails, the cached state will likely be irretrievable.

Distributed caching offers some improvements on this. As the name indicates, the cache is spread out across a network of nodes so that it doesn’t rely on any single one to maintain its state. This provides redundancy in the case of hardware failure or power cuts, and avoids the need to dedicate local memory to storing information. Because the cache now relies on a network of offsite nodes, though, each lookup pays a price in latency.
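To make the redundancy argument concrete, here is a toy, in-process simulation of a cache whose entries are copied to more than one node. Everything in it is an assumption made for illustration: the `Node` and `ReplicatedCache` classes, the replication factor of two, and the hash-then-next-neighbour placement rule. A real distributed cache would talk to its nodes over the network and handle far more failure cases.

```python
import hashlib
from typing import Dict, List, Optional


class Node:
    """A stand-in for one cache server."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.store: Dict[str, str] = {}
        self.alive = True


class ReplicatedCache:
    """Writes each key to `replicas` nodes so one failure loses no data."""

    def __init__(self, nodes: List[Node], replicas: int = 2) -> None:
        if not 1 <= replicas <= len(nodes):
            raise ValueError("replicas must be between 1 and len(nodes)")
        self.nodes = nodes
        self.replicas = replicas

    def owners(self, key: str) -> List[Node]:
        # Stable hash picks a starting node; the next ones hold the copies.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        start = int.from_bytes(digest[:8], "big") % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def put(self, key: str, value: str) -> None:
        for node in self.owners(key):
            if node.alive:
                node.store[key] = value

    def get(self, key: str) -> Optional[str]:
        for node in self.owners(key):
            if node.alive and key in node.store:
                return node.store[key]
        return None  # miss on every live replica


nodes = [Node("node-a"), Node("node-b"), Node("node-c")]
cache = ReplicatedCache(nodes, replicas=2)
cache.put("session:42", "payload")

copies = sum("session:42" in n.store for n in nodes)
cache.owners("session:42")[0].alive = False  # simulate a hardware failure
survivor_value = cache.get("session:42")     # still served by the replica
```

Even with the first owner switched off, the value is still returned by the second copy, which is the property a single local cache cannot offer.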

Distributed caching is superior in terms of scalability, and is often the model employed by enterprise-grade products. With many of these products, however, licensing fees and other costs stand in the way of true scalability. Moreover, there are often trade-offs to be made: it’s difficult to implement solutions that are both feature-rich and high-performing.

It’s perhaps important to note, at this stage, that vertical scaling (upgrading the processing power of machines housing a large database) is inferior to horizontal scaling (where the same database is split up and distributed across instances) in the case of Big Data tasks, as parallelization and rapid access to data are required.

Building Better Distributed Caches

In the digital age, it seems logical that distributed caching would be better suited to serve the needs of customers seeking both security and redundancy. Latency is currently an issue, but techniques such as sharding and swarming reduce it considerably for well-connected nodes.
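As a rough illustration of sharding, the sketch below spreads cache keys across nodes with consistent hashing. This is one common way of assigning shards, chosen here for the example and not something the article prescribes; the `HashRing` class, the node names and the figure of 100 virtual points per node are all assumptions.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(value: str) -> int:
    """Map a string to a stable position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent hashing: each key belongs to the next node point clockwise."""

    def __init__(self, nodes: Iterable[str], vnodes: int = 100) -> None:
        self.vnodes = vnodes
        self._ring: List[Tuple[int, str]] = []  # sorted (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        # Several virtual points per node even out the key distribution.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise LookupError("the ring has no nodes")
        idx = bisect.bisect_left(self._ring, (_hash(key), ""))
        if idx == len(self._ring):
            idx = 0  # wrap around to the start of the ring
        return self._ring[idx][1]


keys = [f"item:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out horizontally by one node
after = {k: ring.node_for(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

The point of the design is that adding a node only relocates the keys the new node takes over; every other key stays on the shard it was already on, so most of the cache remains warm while the cluster grows.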

Above all, we need to be able to deliver flexible middleware solutions that allow commercial entities to connect their databases to always-online networks of nodes, easing the burden placed on their backends and enabling them to better serve end-users with data. Scalability is perhaps the most important consideration in building Big Data applications, and it’s time to begin providing solutions that ensure it from the get-go.

About the Author

Neeraj Murarka is an engineer and computer systems architect with over 20 years’ experience. He is the CTO and co-founder of Bluzelle, which is ushering in a decentralized internet through its decentralized database protocol. He has worked for Google, IBM, Hewlett Packard, Lufthansa, Zynga, and Thales Avionics.


