From Yawn to YARN: Why You Should be Excited About Hadoop 2.0

By now almost everyone has heard the story of the yellow elephant who never forgets data, consumes whatever data you have from any source (tweets, telemetry, customer sentiment, sensor readings, mobile app activity, and more), and magically produces a big data treasure trove of business insights for you! As great as this sounds, there are some realities around big data and Hadoop. First, the amount of data a typical organization has to process has grown substantially. Second, this information is far more valuable than ever before.

“The ability to capture data quickly,” George Corugedo, the co-founder and CTO of RedPoint Global, explained, “without worrying how it should be keyed, structured, and fed into a data model gives organizations and business units the flexibility, agility, and autonomy they need to respond to changing conditions without being limited by what IT can approve or implement.”

“Part of Hadoop’s appeal,” Richard Fichera, a Vice President and Principal Analyst at Forrester Research, explained, “is that it is not specifically optimized for any specific solution or data type but rather a general framework for parallel processing, so your developers and data scientists can add any relevant data, whatever its format or source.”

However, Hadoop 1 also had its limitations:

  • Required MapReduce (and Java) programming expertise, creating a skills gap
  • Only supported batch processing
  • MapReduce jobs, being the only type of application, left some data management functions outside Hadoop, creating a functionality gap and causing process inefficiencies due to data being moved into and out of the Hadoop cluster
  • Distributed processing model restricted by MapReduce limitations

The good news, as this white paper from RedPoint illustrates, is that Hadoop 2.0 is a whole new elephant! In Hadoop 2, HDFS is still the data storage framework, but a new and separate resource management framework (the “operating system” for distributed computation) called Yet Another Resource Negotiator (YARN) has been added. With YARN, a new ResourceManager coordinates the assignment of the sub-tasks of a submitted application to available nodes within the Hadoop cluster, enhancing the scalability, efficiency, and flexibility of applications. Each application gets its own ApplicationMaster, a dedicated and short-lived version of the old JobTracker, which runs the application on resources governed by a new NodeManager, a more generic and efficient version of the previous TaskTracker. Most important, the ApplicationMaster can run any type of application, not just MapReduce.
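To make that division of labor concrete, here is a toy model in Python. It is only a sketch and does not use the real YARN API: the class names mirror the YARN components, but every method and field (`allocate`, `launch`, `run`, `free_slots`) is a hypothetical simplification, and tasks run one after another in a single process rather than across a cluster.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class NodeManager:
    """Toy stand-in for the per-node agent that runs containers."""
    name: str
    free_slots: int  # hypothetical capacity unit, not a real YARN field

    def launch(self, task: Callable[[], object]) -> object:
        # Run the task in a "container"; release the slot when it finishes.
        try:
            return task()
        finally:
            self.free_slots += 1


@dataclass
class ResourceManager:
    """Toy cluster-wide scheduler: it hands out capacity and nothing else."""
    nodes: List[NodeManager]

    def allocate(self) -> Optional[NodeManager]:
        # Grant one slot on the node with the most free capacity, if any.
        node = max(self.nodes, key=lambda n: n.free_slots, default=None)
        if node is None or node.free_slots == 0:
            return None
        node.free_slots -= 1
        return node


class ApplicationMaster:
    """Toy per-application coordinator; one instance per submitted job."""

    def __init__(self, rm: ResourceManager):
        self.rm = rm

    def run(self, tasks: List[Callable[[], object]]) -> List[object]:
        results = []
        for task in tasks:
            node = self.rm.allocate()
            if node is None:
                raise RuntimeError("no free capacity in the cluster")
            results.append(node.launch(task))
        return results


rm = ResourceManager([NodeManager("node-1", 2), NodeManager("node-2", 1)])

# Two unrelated kinds of work share one cluster: a counting job and a
# data-cleansing job. The scheduler does not know or care what the tasks do.
word_counts = ApplicationMaster(rm).run(
    [lambda: len("a b c".split()), lambda: len("d e".split())]
)
cleaned = ApplicationMaster(rm).run([lambda: " Alice ".strip().lower()])
```

The point of the sketch is the separation: the `ResourceManager` only grants capacity, while each `ApplicationMaster` decides what to run, which is why work other than MapReduce can share the same cluster.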

“The emergence of YARN for the Hadoop 2 platform,” explained John Lilley, the Chief Architect for RedPoint Global, “has opened the door to new tools and applications that promise to allow more companies to reap the benefits of big data in ways never before possible with outcomes possibly never imagined. By separating the problem of cluster resource management from the data processing function, YARN offers a world beyond MapReduce: less-encumbered by complex programming protocols, faster, and at a lower cost.”

Ultimately, solutions like YARN make it possible to stitch data quality into the very fabric of the Hadoop cluster, since much-needed, mature, enterprise-class functionality such as data standardization, data enrichment, identity resolution, and master data management can now be executed natively in Hadoop, directly against data stored in HDFS.

Download this white paper from the insideBIGDATA White Paper Library.
