In this special guest feature, Sean Suchter of Pepperdata takes a look at the most important attributes of a successful Hadoop deployment and offers best practices for organizations looking to maximize ROI from a Hadoop investment. Sean is CEO of Pepperdata. He was the founding GM of Microsoft’s Silicon Valley Search Technology Center, where he led the integration of Facebook and Twitter content into Bing search. Prior to Microsoft, Sean managed the Yahoo Search Technology team, the first production user of Hadoop. Sean joined Yahoo through the acquisition of Inktomi, and holds a B.S. in Engineering and Applied Science from Caltech.
Many organizations want to leverage the data they’re collecting to achieve a competitive advantage, deeper customer insights and untapped sources of revenue. Hadoop is one of the few platforms available today that can store and process data at this scale and pave the way for companies to gain meaningful insights, but deploying Hadoop is complex.
What are some tell-tale signs that a Hadoop cluster is well positioned for success? Several years of experience working with technical teams within large enterprises has led me to uncover five attributes that are typically present in successful Hadoop deployments and which can predict long-term success.
1. Cost-effective scalability
Hadoop is a distributed platform for data storage and processing, and as such, is designed to scale. But depending on how an organization plans to access, use and store its data, there are ways to take advantage of scalability in the most effective way.
One way of ensuring your Hadoop deployment is positioned for cost-effective scaling is to avoid arbitrary scaling limits imposed by vendor solutions. For instance, I’ve seen customer costs spike by tens of thousands of dollars because of unknown capacity ceilings. To avoid this, make sure you understand the growth provisions in your vendor’s solution so that the added expense doesn’t suddenly leave you unable to scale quickly when needed.
2. The right toolset is deployed
Hadoop comprises a complex ecosystem of tools and add-ons that provide functionality above and beyond the standard MapReduce, YARN, and HDFS feature set. These include technologies like Spark, Impala, HBase, Pig, Hive and so on. A key characteristic of highly successful Hadoop deployments is that the appropriate set of tools has been chosen.
Choosing the right toolset is not a trivial task. Making the right choice requires that organizations support experimentation and training on new technologies. Having business processes and a culture that supports this type of approach is critical for deployment success. But how can you tell whether the right toolset is deployed for the task at hand? Tools that are well-suited for a given task will do most of what’s required out of the box. On the other hand, the wrong toolset can be easily identified if it requires significant customization or scripting in order to be effective. Of course, the more jobs, applications and tools you want to run simultaneously on a cluster, the more you’ll need to ensure appropriate allocation of resources and the ability to guarantee on-time completion of priority jobs.
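One common way to allocate cluster resources among competing jobs and protect priority work is YARN’s CapacityScheduler, which divides the cluster into queues with guaranteed shares. The sketch below is purely illustrative; the queue names and percentages are assumptions, not a recommended layout for any particular cluster.

```xml
<!-- capacity-scheduler.xml: hypothetical queue layout, for illustration only -->
<configuration>
  <!-- Two queues under root: SLA-bound production jobs and ad-hoc analysis -->
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>production,adhoc</value>
  </property>
  <!-- Production jobs are guaranteed 70% of cluster resources -->
  <property>
    <name>yarn.scheduler.capacity.root.production.capacity</name>
    <value>70</value>
  </property>
  <!-- Ad-hoc work gets the remaining 30%... -->
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.capacity</name>
    <value>30</value>
  </property>
  <!-- ...and may borrow idle capacity, but never more than half the cluster -->
  <property>
    <name>yarn.scheduler.capacity.root.adhoc.maximum-capacity</name>
    <value>50</value>
  </property>
</configuration>
```

With a layout like this, ad-hoc experimentation can soak up idle capacity without starving the production queue whose jobs must finish on time.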
3. Hadoop as a part of a predictable, recurring feedback loop
Certain use cases for Hadoop deliver more value than others. Imagine you’re a firm that uses Hadoop to generate a nightly report on advertising spend. Now, imagine that analysis of this report lets your company adjust spending for the following day’s ads. This is a feedback loop. In this scenario Hadoop is an intrinsic part of a feedback loop that produces a measurable data product, which gets put immediately back into production. Reliable performance and real-time feedback are of the utmost importance when customers expect specific metrics to be met. Meeting imposed SLAs is a business demand that depends heavily on real-time visibility into the factors that significantly impact job completion.
Using Hadoop in this way is more valuable than simply using the cluster to perform many one-off, ad-hoc analyses. This is because of the virtues of a feedback loop: like compound interest, the benefits of using data to automatically tune company performance will accrue steadily.
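The advertising feedback loop above can be sketched in a few lines: read yesterday’s report, compare each campaign’s cost per conversion to a target, and emit adjusted budgets for the next day. Everything here, the field names, the target, and the step size, is a hypothetical illustration, not part of any real Hadoop job.

```python
# Hypothetical nightly feedback loop: report in, next-day budgets out.

TARGET_CPC = 2.00   # assumed target cost per conversion, in dollars
STEP = 0.10         # fraction by which to raise or lower a budget

def next_day_budgets(report):
    """report: list of dicts with campaign, spend, conversions, budget."""
    budgets = {}
    for row in report:
        # Cost per conversion; guard against divide-by-zero
        cpc = row["spend"] / max(row["conversions"], 1)
        if cpc > TARGET_CPC:
            factor = 1 - STEP   # too expensive: cut tomorrow's budget
        else:
            factor = 1 + STEP   # efficient: spend more tomorrow
        budgets[row["campaign"]] = round(row["budget"] * factor, 2)
    return budgets

report = [
    {"campaign": "search",  "spend": 500.0, "conversions": 400, "budget": 500.0},
    {"campaign": "display", "spend": 300.0, "conversions": 60,  "budget": 300.0},
]
print(next_day_budgets(report))  # {'search': 550.0, 'display': 270.0}
```

The point is not the arithmetic but the shape: the output of one night’s batch run becomes an input to the next day’s production decisions, so improvements compound automatically.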
4. A high return on time invested
How much time do the development and operations teams spend fighting fires? How much energy is the organization spending to get value out of Hadoop versus energy spent maintaining a current deployment? These are important questions to ask when evaluating the success of a Hadoop deployment.
Ideally, no more than 10 percent of the technical team’s time should be spent on Hadoop cluster maintenance. If your team is exceeding this threshold, you aren’t getting full value out of your deployment.
5. A generally accessible cluster
One of the primary indicators of a successful Hadoop deployment is that the cluster is generally accessible. This means that everyone in an organization who needs to run jobs or apps on the cluster can do so, without having to resort to lengthy approval processes. This might seem obvious, but hindering cluster access can result in missed deadlines and sluggish market response, unintentionally giving competitors the upper hand.
Deploying Hadoop involves many choices, both technical and organizational. Given your organization’s investment of time and money in Hadoop, it’s essential to evaluate critically whether the deployment is accessible, whether the cluster is being used optimally, whether it can scale quickly, whether the right tools from the ecosystem are in use, and whether you have full control over the deployment. By increasing the efficiency and accessibility of your Hadoop deployment, you open up a wide range of opportunities for more of the organization to reap the benefits, ultimately allowing the business to rely on Hadoop.