It’s Time for Reinventing Data Services

Over the past few decades, the IT industry has used and refined the same storage and data management stack. The problem is that everything around that stack has changed from the ground up, including new storage media, distributed computing, NoSQL, and the cloud.

Combined, those changes make today’s stack exceedingly inefficient: it slows application performance, creates huge operational overhead, and hogs resources. Today’s stack also produces multiple data silos, each optimized for a single application usage model, and forces data to be duplicated whenever more than one access model is needed.

With the application stack now adopting a cloud-native and containerized approach, what we also need are highly efficient cloud-native data services.

The Current Data Stack

The figure below shows, in red, the same functionality being repeated in various layers of the stack. This needless repetition leads to inefficiencies. Breaking the stack into many standard layers is not the solution, however, because the lower layers have no insight into what the application is trying to accomplish, which can lead to even worse performance. The APIs are usually serialized and force the application to call multiple functions to update a single data item, leading to high overhead and data consistency problems. A functional approach to updating data elements, in which the application sends the operation to the data service instead of reading and rewriting the value itself, is being adopted in cloud applications and can eliminate much of this chatter.
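To make the contrast concrete, here is a minimal sketch of the two update styles against a toy in-memory service. The `DataService` class and its `get`, `put`, and `update` methods are hypothetical names invented for this illustration, not the API of any particular product.

```python
import threading


class DataService:
    """Toy in-memory stand-in for a shared data service (hypothetical)."""

    def __init__(self):
        self._items = {}
        self._lock = threading.Lock()
        self.calls = 0  # counts API calls made by the application

    def get(self, key):
        self.calls += 1
        with self._lock:
            return self._items.get(key)

    def put(self, key, value):
        self.calls += 1
        with self._lock:
            self._items[key] = value

    def update(self, key, fn, default=None):
        """Apply fn to the stored value atomically, in a single call."""
        self.calls += 1
        with self._lock:
            new_value = fn(self._items.get(key, default))
            self._items[key] = new_value
            return new_value


svc = DataService()
svc.put("visits", 0)

# Chatty style: two calls, and another writer could slip in between them.
current = svc.get("visits")
svc.put("visits", current + 1)

# Functional style: one call, applied atomically inside the service.
svc.update("visits", lambda v: v + 1)
```

In the chatty style the application owns the read-modify-write cycle, so consistency depends on nothing else touching the item between the two calls. In the functional style the service owns it, which is what removes both the extra round trip and the race.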

In the current model, the application state is stored with the application (known as stateful or persistent services). This runs contrary to the cloud-native architecture and leads to deployment and maintenance complexity.

When developing cloud-native apps in the public cloud, data is stored in shared services such as Amazon S3, Kinesis, or DynamoDB. Meanwhile, the application container is stateless, leading to lower cost and easier-to-manage application environments. To save the extra overhead, new data services use direct-attached storage and skip the lowest layer (external storage array or hyper-converged storage), but that approach eliminates only a small part of the problem.

In the last few years, new types of media have emerged, including SSDs, non-volatile memory (NV-memory) solutions, key/value disks, and shingled drives. Bolting a sector-based disk access API or even a traditional file system on top of those new media types may not be the correct approach; there are better ways to store a data structure on a specific medium, or the media may offload portions of the upper stack. The media API needs to operate at a higher level than emulating disk sectors and tracks on devices that have no spinning disks or moving heads. A preferred approach, functional yet generic, would be to store structures of variable size with an ID (key) that is used to retrieve the data (also called key/value storage).
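The difference between the two media APIs can be sketched in a few lines. Everything here is a toy model under stated assumptions: `SectorDisk`, `KeyValueMedia`, the helper functions, and the 512-byte sector size are all made up for illustration and do not describe a real device interface.

```python
SECTOR_SIZE = 512  # bytes per sector in this toy model (assumed)


class SectorDisk:
    """Hypothetical block device: fixed-size sectors addressed by number."""

    def __init__(self, sector_count):
        self.sectors = [bytes(SECTOR_SIZE)] * sector_count

    def write(self, lba, data):
        if len(data) != SECTOR_SIZE:
            raise ValueError("sector writes must be exactly SECTOR_SIZE bytes")
        self.sectors[lba] = data

    def read(self, lba):
        return self.sectors[lba]


def save_to_sectors(disk, start, payload):
    """The caller pads and splits the payload, and must remember its length."""
    count = -(-len(payload) // SECTOR_SIZE)  # ceiling division
    padded = payload.ljust(count * SECTOR_SIZE, b"\0")
    for i in range(count):
        disk.write(start + i, padded[i * SECTOR_SIZE:(i + 1) * SECTOR_SIZE])
    return count


def load_from_sectors(disk, start, length):
    """Reassemble the payload and trim the padding added on save."""
    count = -(-length // SECTOR_SIZE)
    raw = b"".join(disk.read(start + i) for i in range(count))
    return raw[:length]


class KeyValueMedia:
    """Hypothetical key/value device: variable-size values addressed by key."""

    def __init__(self):
        self._values = {}

    def put(self, key, value):
        self._values[key] = bytes(value)

    def get(self, key):
        return self._values[key]


record = b"x" * 1300  # a variable-size structure, not a multiple of 512

disk = SectorDisk(sector_count=16)
sectors_used = save_to_sectors(disk, start=4, payload=record)
from_disk = load_from_sectors(disk, start=4, length=len(record))

media = KeyValueMedia()
media.put("user:42", record)
from_kv = media.get("user:42")
```

With the sector API, the layer above has to track the starting sector and the payload length and handle padding itself. With the key/value API, placement is left to the media, which is free to lay the data out however suits it.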

Time For a New Stack

The requirements are simple:

Don’t implement the same functionality in multiple layers
Enable stateless application containers and a cloud-native approach
Avoid data silos; store the data in one place that efficiently supports a variety of access models
Provide secure and shared access to the data from multiple applications and users
Enable media-specific implementations and hardware offloads
Simplify deployment and management

An optimal stack has just three layers, as illustrated in the figure below: elastic applications, elastic data services, and a media layer.

Elastic applications want to persist or retrieve data structures in the form of objects, files, records, data streams, and messages. All of this can be done with existing APIs or protocols mapped to a common data model, and in a stateless way that always commits updates to the backend data services. This guarantees that apps can easily recover from failures and that multiple apps can share the same data in a consistent fashion.

The elastic data services expose “data containers,” which store and organize data objects serving one or more applications and users. The applications can read, update, search, or manipulate data objects, with the data service providing guaranteed data consistency, durability, and availability. Security can be enforced at the data layer as well, ensuring that only designated users or applications can access the right data elements. Data services should scale out efficiently and can potentially be distributed on a global scale.
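As a rough illustration of a data container that enforces security at the data layer, consider the sketch below. The `DataContainer` class, its method names, and the reader/writer permission model are assumptions made for this example; a real service would also have to provide the durability and availability guarantees described above, which an in-memory toy does not.

```python
class AccessDenied(Exception):
    """Raised when a caller lacks permission on a container."""


class DataContainer:
    """Hypothetical container holding data objects shared by several apps."""

    def __init__(self, name, readers, writers):
        self.name = name
        self._readers = set(readers) | set(writers)  # writers may also read
        self._writers = set(writers)
        self._objects = {}

    def _check(self, user, allowed, action):
        if user not in allowed:
            raise AccessDenied(f"{user} may not {action} in {self.name}")

    def put(self, user, key, obj):
        self._check(user, self._writers, "write")
        self._objects[key] = dict(obj)

    def get(self, user, key):
        self._check(user, self._readers, "read")
        return dict(self._objects[key])

    def search(self, user, predicate):
        """Return the sorted keys of all objects matching the predicate."""
        self._check(user, self._readers, "read")
        return sorted(k for k, o in self._objects.items() if predicate(o))


# Two applications share one container with different permissions.
orders = DataContainer("orders", readers={"reporting"}, writers={"checkout"})
orders.put("checkout", "o1", {"total": 30})
orders.put("checkout", "o2", {"total": 120})

large = orders.search("reporting", lambda o: o["total"] > 100)

try:
    orders.put("reporting", "o3", {"total": 5})
    denied = False
except AccessDenied:
    denied = True
```

Because the check lives in the container and not in each application, the reporting app cannot write even if its own code is buggy or compromised.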

The media layer should store data chunks in the most optimal way for the media, which might include a remote storage system or cloud. Data elements of variable sizes can be assigned unique keys to retrieve the data rather than accessing fixed-size disk sectors. By adding a key/value abstraction, we can implement a media-specific way to store the chunks. For example, when using non-volatile memory one would use pages; with hard drives, one would use disk sectors; and with flash one would use blocks and eliminate redundant flash logical-to-physical mappings and garbage collection. The media or remote storage may support certain higher-level features such as RAID, compression, deduplication, and more, in which case the data service can skip those features in software and offload them to the hardware.
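One way to picture the offload decision is a data service that asks the media what it can do and only runs a feature in software when the media cannot. The sketch below uses compression as the example feature. The `Media` interface, the `compresses_in_hardware` flag, and both backend classes are hypothetical, and `zlib` inside `CompressingMedia` merely simulates what a real device would do internally.

```python
import zlib
from abc import ABC, abstractmethod


class Media(ABC):
    """Key/value interface every media backend implements (hypothetical)."""

    compresses_in_hardware = False  # capability flag read by the data service

    @abstractmethod
    def put(self, key, chunk): ...

    @abstractmethod
    def get(self, key): ...


class PlainMedia(Media):
    """Stores chunks exactly as given and offers no offloads."""

    def __init__(self):
        self.stored = {}

    def put(self, key, chunk):
        self.stored[key] = chunk

    def get(self, key):
        return self.stored[key]


class CompressingMedia(Media):
    """Stands in for a device that compresses internally (simulated with zlib)."""

    compresses_in_hardware = True

    def __init__(self):
        self.stored = {}

    def put(self, key, chunk):
        self.stored[key] = zlib.compress(chunk)

    def get(self, key):
        return zlib.decompress(self.stored[key])


class DataService:
    """Compresses in software only when the media cannot do it."""

    def __init__(self, media):
        self.media = media

    def write(self, key, data):
        if not self.media.compresses_in_hardware:
            data = zlib.compress(data)  # software fallback
        self.media.put(key, data)

    def read(self, key):
        data = self.media.get(key)
        if not self.media.compresses_in_hardware:
            data = zlib.decompress(data)
        return data


payload = b"sensor-reading;" * 200
plain, smart = PlainMedia(), CompressingMedia()

results = []
for media in (plain, smart):
    service = DataService(media)
    service.write("chunk-1", payload)
    results.append(service.read("chunk-1"))
```

Either way the chunk is compressed exactly once. The application and the data service API stay the same; only the place where the work happens changes with the media.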

Contributed by: Yaron Haviv, CTO and founder of iguaz.io, with deep technological experience in the fields of cloud, storage, networking and big data. Prior to iguaz.io, Yaron was the VP of Datacenter and Storage Solutions at Mellanox, where he led technology innovation, software development and solution integrations for the data center market. Yaron was the key driver of open source initiatives and new solutions with leading storage vendors, enterprise, cloud and Web 2.0 customers. Before this, Yaron was the CTO and VP of R&D at Voltaire, a high performance networking, computing, and IO company.

Sign up for the free insideBIGDATA newsletter.
