In this special guest feature David Fishman, Vice President of Marketing for Arcadia Data, puts a new and interesting twist on the hot topic of big data adoption. David is responsible for overall go-to-market strategy and execution. In 3 years as VP marketing at Mirantis, he built the company’s marketing efforts from the ground up, driving the transition from services to become the dominant independent player in OpenStack cloud. He also anchored the team that took Mercury Interactive public. Other past roles include director of enterprise products for global mobile handhelds at HP, and director of the software update SaaS product line at Sun. David is an experienced Six Sigma Master Blackbelt and holds an MBA from Yale.
To read the headlines, it’s easy to think that the only thing bigger than ‘big data’ is big talk about big data, but in fact, it’s been about 3 years since Gartner’s Svetlana Sicular called the start of Hadoop’s “trough of disillusionment.” So it should come as no surprise that big data skeptics with entrenched interests in the status quo — IT specialists and their suppliers alike — are working overtime in waving flags with question marks.
However, no less an authority than Gartner recognizes that the trough of disillusionment is not the end of the road; rather, it traverses the inflection point leading to the “slope of enlightenment.” And with the accumulation of big data and analytic experience in the last several years, the shift shows no signs of slowing down. In 2016, big data will quite literally be the elephant in the room. Here’s what I predict you’ll hear from those who think they can continue to ignore it.
1. Business Users Can’t Get Access to Big Data
As organizations flatten and modern mobile technologies demolish old functional barriers, those same organizations have taken to wider dissemination of BI and visualization tools across their workforce. The idea is to facilitate more autonomy for those who are closest to living, breathing business processes, customers, and marketplaces. But that’s only half the equation. Until these users can get well-structured access to ever increasing sources of information, those tools are really an exercise in rearranging the slices on the pie chart. Without direct, well-managed, granular big data access, visualization puts the chart cart before the horse.
2. BI Tools Can’t Process at Data-lake Scale
With the large scale of repositories made possible by Hadoop and the broad range of data sources it can cache, the classic data warehouse is beginning to look like modest lake-front property. In some ways, it parallels the shift from spreadsheets with thousands of lines to visualization and BI tools that tackled unlocking insight from a few million records — the kind of leap that defies an approach of “the same, only more of it”. To unlock the value in big data, BI tools need to start with billions of records and up — think of every click on every mobile phone or every reading from a legion of IoT devices. Without scaling by several orders of magnitude, visualization and insights will be groping in the dark for relevant data. Successful adoption of big data just won’t fit technologies that work with mere millions of records.
3. All the Data Scientists are Busy
Looking for a data scientist to build on your big data? The odds are you’ve exchanged one IT labor shortage for another. One recent study cranked through 80 million LinkedIn listings and found only 11,400 people who called themselves Data Scientists. And that’s even though Harvard Business Review called it “the sexiest job of the 21st century.” Considering that the top 50 companies where data scientists work employ 2,100 of them (Uber rounds out that top 50 with 19 data scientists), you probably shouldn’t count on a PhD statistician in a white lab coat to show up and save the day. The more accessible data is to everyone besides the data scientists, the sooner it will become a routine part of the business.
4. Big Data Isn’t Reliable, So We Can’t Rely On It
There’s no question that big data is not done growing up; 2016 marks a decade since Hadoop got started. But in technology, there’s a virtuous feedback loop between necessity and invention; in other words, no data becomes reliable until you rely on it. Much as students come to realize that you can’t pass a class without taking tests, the pressure of direct access, consumption, and discovery will do far more to expose and address the risks in big data than taking a wait-and-see attitude. Putting big data in front of more people more often is likely the fastest way to expose its flaws and drive improvement — in the source systems, the data management platforms, and the data itself.
5. Hadoop’s Not Mature Enough for Business Users
It’s fair enough to suggest that it takes mature infrastructure to run a mature business. But the fact is that risk-reward ratios favor a different investment strategy. Not only is the competition among Hadoop suppliers driving an intense engineering arms race, it’s done with transparent roadmaps and code-bases, thanks to the open source technology at its heart. The innovation dynamic in big data is fundamentally different from the central planning approach that drives roadmaps for proprietary software platforms. (More often than not, those proprietary platforms predated the Blackberry.) Hadoop’s rapidly accelerating capabilities are tackling long-fought database performance and management problems, and on a much faster trajectory than its predecessors. Add the strikingly lower cost of Hadoop’s data storage mechanism, and the risk-reward trajectory will leave mature technologies behind.
6. ETL for that Much Data is Slow and Complex
With the good news of business process automation successes over the last 25 years comes the bad news of data silos. Different business functions optimized their workflows and transactions by each focusing on its respective domain. Writing software has always been a key part of wrangling independently developed formats and schemas to try and create coherence across those silos. Early usage of Hadoop was as an ETL platform; it required many of the same programming skills as earlier data management tools. Tools like Hive, Spark, Pig, and Cascading simplify manipulation of data structures in code, with radical increases in both performance and developer productivity. Is ETL a solved problem? Not yet; but no less an authority than Ralph Kimball argues that Hadoop, given the availability of mature tooling and its native flexibility to store multiple data formats, will rapidly become a leading player in ETL for enterprises.
7. The Answers We Need Aren’t in the Places We’re Looking for Them
The old joke about looking for lost car keys under the streetlamp because it’s dark everywhere else plays out all too often in today’s landscape of organizational data. It’s a logical consequence of tools that can’t span large-scale datasets and multiple data sources that lack a consistent method of access. With an explosion of collaboration technologies and social media practices, it’s become increasingly easy for people to find each other and ask each other questions. Why can’t we ask questions of the data just as easily? Access to data, sadly, lags access to people. The workforce is well aware it’s always-connected, smartphone-armed and Internet-of-Things-tracked; to stay competitive, they’ll need access to the breadth of data all those interactions create.
8. We Need to Move Our Data Out of Hadoop and Into the Data Warehouse/Data Mart/Visualization Server for Our Analytics
Before Hadoop, end users who wanted new data from IT had to first wait for provisioning of data warehouse capacity, followed by a data modeling project that eventually set data up for extraction into a data mart that could be accessed by data visualization tools. It took too long and cost too much, and that was before data volumes expanded to Hadoop scale. Given how widely these tools are used, it’s no surprise that the default response to a surge in data consumption is to go down this well-worn path. But it’s at the very least ironic to use Hadoop as the place you keep big data until you need to use it. Having to maintain a parallel data processing infrastructure to keep up with Hadoop isn’t sustainable. The volume, velocity and variety of big data are swamping these traditional architectures; moving it around only makes the process of getting results even slower. Using Hadoop’s native capabilities to run analytics at scale can be a critical success factor. …
9. We Can’t Set Up the Data Until We Know Just What End Users Will Do With It
The hard-fought gains of two decades of business process automation, be they through ERP systems, customized middleware, or web services, were often constrained by the high cost of database execution. That had two important effects. First, database schemas needed to be defined precisely before applications could become operational. Second, any analytics for business intelligence had to work from a single, well-organized set of data definitions to avoid wasting processing cycles on loosely defined dimensions or measures, costly joins, and DBA-driven changes in data layout to prevent performance degradation. Hadoop’s fundamental shift to schema-on-read via HDFS provides a faster, more flexible mechanism for exposing the structure of the underlying data without tying it down; in effect, it decouples what the data does from how it does it.
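To make the schema-on-read idea concrete, here is a minimal sketch in plain Python rather than on an actual Hadoop cluster. The sample records, field names, and the `read_with_schema` helper are all hypothetical, invented for illustration; the point is only that the stored data is never reshaped, and each reader decides which structure to impose when it reads.

```python
import json

# Raw records are stored exactly as produced; no schema is enforced on write.
# (Sample data and field names are hypothetical.)
RAW_LINES = [
    '{"user": "a1", "event": "click", "ts": "2016-01-05T09:15:00", "device": "ios"}',
    '{"user": "b2", "event": "view", "ts": "2016-01-05T09:16:30"}',
    '{"user": "c3", "event": "click", "ts": "2016-01-05T10:02:10", "campaign": "spring"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto the fields a reader asks for, at read time."""
    for line in lines:
        record = json.loads(line)
        # Fields absent from a record become None instead of failing the load.
        yield {
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        }


# Two readers apply two different schemas to the same stored records.
clicks_schema = {"user": str, "event": str}
device_schema = {"user": str, "device": str}

clicks_view = list(read_with_schema(RAW_LINES, clicks_schema))
device_view = list(read_with_schema(RAW_LINES, device_schema))
```

Neither view required a change to how the records were written, which is the contrast with a schema that must be fixed before the first row is loaded.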
10. Just Putting Data in the Lake Has All the ROI You Need
The dramatic cost savings in data storage that Hadoop enables paved the way for broad interest in big data — so dramatic that it made sense to create large pools of storage just to offload the flow of data from rapidly increasing automated sources from standard data warehouses. That said, the Hadoop platform also provides compelling extensibility, comparable to an operating system: metadata via HDFS; security with authentication, authorization and encryption services native to the platform; and coherent execution management via YARN, to name a few examples. It’s interesting to note that data center infrastructure has gone through a similar transformation: where once it was sufficient to save some hardware dollars through server virtualization, large-scale innovators across sectors are taking their software-defined datacenter to IaaS clouds and containers. A cost-centered big data strategy will, at best, provide a stepping stone to competitive advantage with big data.
11. Cube First, Ask Questions Later
Ask any DBA: joins are expensive. To accelerate queries across tables, data administrators often turn to “data cubes” by pre-computing dimensions with multi-level aggregation, and pre-computing all the dimension relationships. But cubes also reduce granularity: they’re premised on aggregation, and low-cardinality dimensions (i.e., fields with fewer unique values). It’s a reasonable approach for small data, but for big data, it means you’d need to know in advance what dimensional relationships matter. Do you sacrifice time-of-day granularity? Reduce streets to counties? If you need to work with measures or relationships that have not been pre-computed, you either have to give up speed, or give up granularity. It’s analogous to taking a train to reach a destination that isn’t near a station: you can’t easily reach the point you want, or change the railway to add a stop. Precomputing measures prior to analysis can create blind spots in that analysis. In big data, that loss of granularity means you lose valuable insight. Hadoop’s key innovation is to deliver distributed compute horsepower to the data in place, rather than pre-diluting the data, and potential insights, to fit dated data management methods.
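The granularity trade-off can be sketched in a few lines of Python. The events, dimension names, and both helper functions below are hypothetical and purely illustrative: a cube pre-aggregated by county answers county questions quickly, but the street and hour detail is gone, whereas aggregating the raw events on demand can answer any combination of dimensions.

```python
from collections import defaultdict

# Hypothetical fine-grained events: (county, street, hour_of_day, sales)
EVENTS = [
    ("Kings", "Main St", 9, 120),
    ("Kings", "Main St", 17, 80),
    ("Kings", "Oak Ave", 9, 50),
    ("Queens", "Elm St", 17, 200),
]


def build_cube(events):
    """Pre-aggregate sales by county only; street and hour are discarded."""
    cube = defaultdict(int)
    for county, _street, _hour, sales in events:
        cube[county] += sales
    return dict(cube)


def sales_by(events, *dims):
    """Aggregate the raw events on demand by any combination of dimensions."""
    index = {"county": 0, "street": 1, "hour": 2}
    totals = defaultdict(int)
    for event in events:
        key = tuple(event[index[d]] for d in dims)
        totals[key] += event[3]
    return dict(totals)


cube = build_cube(EVENTS)                           # county totals only
by_hour = sales_by(EVENTS, "hour")                  # not recoverable from the cube
by_street_hour = sales_by(EVENTS, "street", "hour")
```

Once only `cube` is kept, a question about sales at 9 a.m. on Main St cannot be answered without going back to the source events, which is the blind spot described above.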
12. I Can’t Safely Grant Access to the Hadoop Cluster to End Users and Ensure Security/Compliance
Almost all of the many BI and analytics technologies in the market take an open-door approach to Hadoop security. Most often, the way a data mart or visualization server accesses Hadoop data is to log in as super-user. That’s good news and bad news: all the data is accessible, but there’s no direct logging or fine-grained control. What’s more, the security service used by the Hadoop platform, such as Kerberos, is often not shared by legacy BI tools. As it turns out, Hadoop supports best-of-class primitives for security with Kerberos, LDAP/AD, and file-level access control. Arcadia augments it with true user passthrough, cell-level security, and control at the level of semantic data models and use-cases. Many DW architectures lack at least one of these, if not all of them. You might even say that holding Hadoop to the same security threshold as your Data Warehouse means you’d be lowering your standards.
No More Excuses
2016 marks a decade since Google’s papers on MapReduce and BigTable inspired Doug Cutting and what became the Hadoop community to rethink what was once called “data processing”. You need only look back to where Oracle was in 1988, when it reached its tenth year — and what the relational model did to the data landscape in the decade that followed — to give you an excuse to rethink the objections to big data you’ll hear this year.