Why Your AI Workflow Needs Software-based Secondary Storage

In this special guest feature, Geoff Bourgeois, Co-founder and CEO of HubStor, discusses how one of the chief obstacles to executing AI is managing massive volumes of unstructured data. The trick is keeping this data accessible to your AI/ML infrastructure without adding silos and layers of complexity. Geoff is an entrepreneur with an extensive background in enterprise software and cloud technologies. He founded the company in 2015. Now in its fourth year, HubStor is helping some of the largest organizations in the world transform their data management strategies with the cloud.

The most talked-about challenges within AI/ML projects are identifying the best use cases, realistically scoping the effort, sourcing the right talent, and collecting useful data in high volumes.

However, teams often underestimate the storage and data management complexity that comes at the end of the AI workflow.

It is easy to see how this happens. As we move into the new era of intelligence, in our excitement to get rolling, investment in AI-ready infrastructure typically focuses on the high-performance compute and primary storage tier.

But isn’t this enough?

If you’re doing serious, large-scale AI/ML, it is not.

Here’s the rub: Better insight from deep learning requires large data sets. In other words, the more data, the more insightful and accurate your AI will be. The challenge is that, eventually, your powerful servers become overloaded with these large-scale data sets. Moreover, while all of this data is important, not all of it needs to reside on your primary storage tier all of the time.

Ask yourself how much data you’ll capture and generate with your project. Are there legal or regulatory requirements to retain the data? Will specific subsets of data be essential to keep for very long periods?

For example, in the world of life sciences, it is not uncommon for an AI project to generate 50 million files per week, consume 20 terabytes of storage each month, and have data retention requirements of at least ten years.
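To put those figures in perspective, a quick back-of-the-envelope calculation shows how the retention obligation compounds. The rates below are the example numbers from this article; treating them as flat over the full retention window is a simplifying assumption, since real projects usually accelerate.

```python
# Back-of-the-envelope storage growth for the life-sciences example.
# Assumes a flat ingest rate over the whole retention period.

FILES_PER_WEEK = 50_000_000   # new files generated weekly
TB_PER_MONTH = 20             # storage consumed monthly
RETENTION_YEARS = 10          # minimum retention requirement

total_tb = TB_PER_MONTH * 12 * RETENTION_YEARS
total_files = FILES_PER_WEEK * 52 * RETENTION_YEARS

print(f"Data retained after {RETENTION_YEARS} years: {total_tb:,} TB "
      f"(~{total_tb / 1000:.1f} PB)")
print(f"Files retained: {total_files:,}")
# → roughly 2,400 TB (2.4 PB) and 26 billion files
```

Even at these modest example rates, a decade of retention lands in petabyte territory, which is well beyond what most teams will want to keep on their performance tier.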

If you’re going big in your AI project, then your AI-ready infrastructure also needs to support long-term data retention, which probably means scalable secondary storage, data management software, and software-based cloud storage.

Overlooking these areas leads to limitations in later stages of the AI workflow, and shortcuts or hasty decisions can leave your organization exposed in numerous ways while adding complexity to your project.

‘Secondary storage’ you ask?

Ultimately, you won’t have the budget to keep all your data for deep learning within your primary storage tier. The need to keep your performance compute and storage tier fresh for analysis will necessitate intelligently tiering your long-tail data to secondary and even tertiary storage.
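One simple way to think about intelligent tiering is an access-age policy: data that hasn’t been touched recently migrates down a tier. The thresholds and tier names below are illustrative assumptions for the sketch, not a description of any particular product’s behavior.

```python
from datetime import datetime, timedelta

# Illustrative access-age tiering policy. The 30-day and 180-day
# thresholds are assumptions chosen for the example: data untouched
# for 30 days moves to secondary storage, and data untouched for
# 180 days moves to tertiary storage (e.g. a cloud archive tier).
SECONDARY_AFTER = timedelta(days=30)
TERTIARY_AFTER = timedelta(days=180)

def choose_tier(last_accessed: datetime, now: datetime) -> str:
    """Pick a storage tier based on how long ago the data was used."""
    age = now - last_accessed
    if age >= TERTIARY_AFTER:
        return "tertiary"
    if age >= SECONDARY_AFTER:
        return "secondary"
    return "primary"

now = datetime(2019, 6, 1)
print(choose_tier(datetime(2019, 5, 30), now))  # recently used -> primary
print(choose_tier(datetime(2019, 4, 1), now))   # stale -> secondary
print(choose_tier(datetime(2018, 6, 1), now))   # cold -> tertiary
```

The key design point, echoed below, is that the tiers should be managed by one policy layer rather than existing as independent silos, so a data set can be recalled to the performance tier when a model needs to retrain on it.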

Enterprises that tap the real power of AI will have ready access to data no matter how much they have. To be one of these AI-data-management superstars, you know that you have to avoid legacy IT infrastructure which just isn’t going to cut it. And you know that any large-scale data management solution must avoid new complexities. The challenge is that many of the options on the market today are old solutions, rebranded for the cloud era, that fundamentally behave as new infrastructure.

One of the chief obstacles to executing artificial intelligence is managing massive volumes of unstructured data. The trick is keeping this data accessible to your AI/ML infrastructure without adding silos and layers of complexity. Secondary, scale-out, and cloud storage components are essential to large-scale AI projects, but they shouldn’t be independent silos that make it challenging to tier and recall data sets as needed.

 

Sign up for the free insideBIGDATA newsletter.
