In this special guest feature, Daniel Hardman from Adaptive Computing writes that mining Big Data efficiently often requires a cloud-centric workflow.
Ask a hundred pundits, and you’ll get a hundred definitions of big data. Some suggest a specific size (“anything over 50 TB is big data”); others like to talk about the 3 Vs (volume, velocity, variety) or the 4 Vs (3 Vs + veracity). But I think the simplest definition is best:
Big data is any data too overwhelming to mine for insight with naive methods.
Notice the term “naive methods,” not “easy methods” or “familiar methods” or “old methods.” If you can think of a straightforward and practical way to get what you want out of the data, off the top of your head, it’s not big data, even if your solution is expensive, big, or time-consuming. On the other hand, if using the data requires thoughtful weighing of tradeoffs and expenses, discussions with stakeholders, the creation of custom tools, trial and error, or the resetting of expectations, then you’ve met a big data test.
The other half of the definition is also significant — mine for insight. If all you want to do is dump data onto massive tape libraries and archive it for a decade, it’s not really in the big data sweet spot. You may be wrestling data, and it may be big, but you’re not really pursuing the problem that’s got the whole tech industry buzzing.
Big data’s raison d’être is insight.
Which leads us to cloud. Tackling big data without a cloud-centric worldview is sort of like building a skyscraper without doing a soil study first: you might make some initial progress, but sooner or later you’ll discover that you need to understand and thoroughly adapt an (inadequate) foundation. At a minimum, you’ll experience false starts and thrashing; in many cases, you may never place a capstone.
The reason for this claim goes back to the two bolded assertions above. Cloud is all about dynamic environments, agility, adjusting, experimenting… If you’re going to do some analyzing, you want to do it without massive capital expenditure (CapEx), so you can learn while the price is affordable. That’s cloud.
Cloud is also about flexible applications: scaling out, plumbing connections when they’re needed, renting access to world-class tools you could not otherwise afford. And that’s what you need for insight. Most of us don’t have the deep pockets to build or buy the computational horsepower of Google BigQuery, or of Amazon’s DynamoDB, CloudSearch, or Elastic MapReduce. But with cloud, we can rent it. This makes entire categories of insight accessible to mere mortals.
The CIA didn’t hire Amazon to create an internal cloud just so they could run an intranet and internal wiki. They are building insight factories out of their intelligence, and they need a cloud to make it work.
Not all compelling tech problems live at this nexus, but an amazing number do, and the convergence is intensifying.
In related news, registration is now open for the Adaptive Computing user conference, MoabCon 2014, which takes place March 31-April 3 in Park City, Utah.