Beyond Big Data: Why AI Requires Getting Small Data Right

In this special guest feature, Maciej Gryka, Head of Data Science at Rainforest, explains why discussing big data in the context of AI raises serious questions about its future. Data scientists often wonder whether we need big data as much as some think. In many cases the answer is “no,” and instead of going big, what we really need to do is think smaller. Read on to learn why. Maciej is Lead Scientist at Rainforest, an on-demand QA solution that helps companies rethink QA and bring it into today’s era of continuous delivery. Previously, Maciej worked as a computer vision researcher after earning a doctorate from UCL in London.

The allure of capturing as much data as possible is strong. And, now that more businesses are experimenting with machine learning and AI, it’s growing stronger. When you aren’t sure what you may eventually need, you might as well capture everything, right?

But having more data isn’t always better; just ask Equifax. More data is also harder to manage, harder to mine for valuable insights, and harder to turn into workable data sets that accomplish specific tasks and achieve the desired outcomes. Discussing big data in the context of AI raises some serious questions about the future of big data. As a data scientist, I wonder whether we need big data as much as some think. In my view, in many cases the answer is “no,” and instead of going big, what we really need to do is think smaller. Here’s why.

The Case for Small Data

Just like you can’t build a skyscraper without the proper foundation, you can’t really do big data right until you master the art of harnessing small data first.

What is small data? Think of it as any business data set that can sit on a single machine. Small data is much more manageable and free of the high costs (not to mention the compliance and regulatory risks) of big data, which can require a massive amount of work to manage, maintain and keep clean. Small data, even if it comes in unstructured form, can also be labeled fairly easily. Sure, a company with as many resources as Google or Facebook might be able to perfectly label its unstructured big data too, but the reality is that most companies don’t have that luxury.

That’s not to say big data doesn’t have a place. Rather, if you aren’t able to find ways to manage and leverage your small data, your efforts to “go big” will most likely disappoint. I would argue this is why we are still in the fairly early stages of true enterprise AI: companies are still figuring out what to even use AI for and how to ask questions of it, much less which data they need (and which they don’t) to get the answers they want.

So, in what cases is small data really better than big data? Here’s a personal one: At Rainforest, an on-demand QA solution, we’ve eschewed big data in favor of small data for some of the most critical problems we’re solving through machine learning. One example is our software tester vetting process. As background, Rainforest offers human testers via an API, bringing their talent in at the right time during testing. We wanted to know which testers we could trust and which ones to de-emphasize to our customers. So, we gathered a few thousand samples that signaled whether or not a tester used best practices. Our Rainforest experts (including some engineers and product managers) then labeled those examples. This didn’t take much of anyone’s time, and those couple thousand data points turned out to be enough for us to train our machine learning algorithm in the way we needed.

More importantly, as an organization this forced us to develop and solidify our best practices for using machine learning in production. This has paved the way for us to work with larger datasets more effectively throughout our business; small data gave us a much simpler entry point.

Big Data or Small Data?

Here are some quick questions to ask yourself to determine whether big or small data is the best tool for your next machine learning or AI job:

Do you already have the data you need, and is it labeled? If you have multiple terabytes of data but it’s not labeled, that’s going to be pretty tricky to use. (If your big data set is labeled, you may indeed want to use it for your next project, but this is an idealistic, and rare, scenario.) However, if all you have is an idea, look around before you go out and try to gather big data. Maybe you have a usable small dataset that is better for the job at hand. Even if it’s not labeled yet, you might be able to fix that by investing a little bit of time in exchange for a more agile solution that will get you far enough.
What’s your use case and what is the minimum data needed to address it? A word vector model trained on a massive Google News dataset might work, but some simple linear algebra might give you comparable performance on many real-world tasks. In the technology world, we talk a lot about having a “minimum viable product,” and the same thinking applies when it comes to data. To maximize efficiency and reduce costs, you want to use the minimum amount of data required to get the job done.
How advanced is your organization (really) when it comes to AI/ML? It’s important to build up your organization’s capabilities step by step, rather than going straight to the most difficult problems (even if they are the most exciting). If your organization is new to experimenting with machine learning, solving some basic problems with small data is likely the best place to start. Once you get some wins under your belt, you can scale from there.
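
One way to put the “minimum data” question into practice is a learning curve: train the same simple model on growing slices of your labeled data and watch where held-out accuracy stops improving. The sketch below is only an illustration under stated assumptions. The data is synthetic, the nearest-centroid classifier is a stand-in for whatever model you actually use, and every function name is hypothetical. It is not a description of how Rainforest trains its models.

```python
import random


def make_synthetic(n, rng):
    """Generate n labeled 2-D points from two overlapping Gaussian blobs.

    Purely synthetic, illustrative data (an assumption of this sketch).
    """
    data = []
    for i in range(n):
        label = i % 2                      # alternate classes so they are balanced
        center = 0.0 if label == 0 else 2.0
        point = (rng.gauss(center, 1.0), rng.gauss(center, 1.0))
        data.append((point, label))
    rng.shuffle(data)
    return data


def fit_centroids(train):
    """Fit a nearest-centroid model: the mean point of each class."""
    sums = {}
    for (x, y), label in train:
        sx, sy, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, count + 1)
    return {label: (sx / c, sy / c) for label, (sx, sy, c) in sums.items()}


def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    x, y = point
    return min(
        centroids,
        key=lambda lab: (x - centroids[lab][0]) ** 2 + (y - centroids[lab][1]) ** 2,
    )


def accuracy(centroids, test):
    """Fraction of held-out examples classified correctly."""
    correct = sum(1 for point, label in test if predict(centroids, point) == label)
    return correct / len(test)


def learning_curve(train, test, sizes):
    """Train on the first n examples for each n and score on the same test set."""
    return [(n, accuracy(fit_centroids(train[:n]), test)) for n in sizes]


rng = random.Random(0)                     # fixed seed so runs are repeatable
train = make_synthetic(2000, rng)
test = make_synthetic(1000, rng)
curve = learning_curve(train, test, sizes=[10, 50, 200, 1000, 2000])

for n, acc in curve:
    print(f"{n:>5} labeled examples -> held-out accuracy {acc:.3f}")
```

If the curve flattens early, labeling more examples is unlikely to buy much, and the smaller dataset may already be enough for the job. If it is still climbing at the largest size, that is a signal to gather or label more data before reaching for a bigger model.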

The Case for Using Small Data

Big data isn’t going anywhere, but it isn’t the right path to solving every machine learning problem. As with elegant software, a great AI or machine learning algorithm should be about doing more with less.

As it turns out, there is nothing bad about thinking small.

Comments

Chuck Emary says

January 9, 2018 at 9:32 am

Hello,
This is the most sane article I’ve read recently on Big Data. If you cannot ask the right questions around a small data set, how on earth are you ever going to manage and understand the complexity of big data? I think too often organizations treat technology as a silver bullet to solve various internal/external issues. Software vendors are all too eager to help them along the way with the latest trending stuff. Great Article!

Cheers,
Chuck

Allen Bonde says

February 16, 2018 at 9:12 pm

Great piece – and good to see that more folks are evangelizing the benefits of “thinking small!” In fact the small data movement has been building for a few years now, back to work that Rufus Pollock did at Open Knowledge in the UK, and boosted of course by Martin Lindstrom’s book. I even wrote about the topic (Are “Small” and “Smart” Keys to Your Big Data Success?) in this very publication back in 2014 and readers may also want to check out http://www.smalldatagroup.com for definitions and frameworks for applying small data – with posts going back to 2012.

Weston Gardner says

August 21, 2018 at 8:35 am

Really great stuff, I think this echoes the contrarian opinions set forth in Small Data by Martin Lindstrom. I think we summited big data peak and are headed back to a more rational examination. If interested in continuing the discussion on the importance of marrying small data and big: https://www.martinlindstrom.com/small-data/
