From the SciPy2013 conference, here is a compelling talk, “Data Agnosticism: Feature Engineering Without Domain Expertise,” by Nicholas Kridler of Accretive Health in Chicago.
In the presentation below, Hadoop luminary Doug Cutting shares his perspectives on the big data industry and gives a high-level overview of the Hadoop technology stack.
“The Hadoop MapReduce framework grew out of an effort to make it easy to express and parallelize simple computations that were routinely performed at Google. It wasn’t long before libraries, like Apache Mahout, were developed to enable matrix factorization, clustering, regression, and other more complex analyses on Hadoop. Now, many of these libraries and their workloads are migrating to Apache Spark because it supports a wider class of applications than MapReduce and is more appropriate for iterative algorithms, interactive processing, and streaming applications.”
“In this talk we summarize the results of the BIG project including analysis of foundational Big Data research technologies, technology and strategy roadmaps to enable business to understand the potential of Big Data technologies across different sectors, together with the necessary collaboration and dissemination infrastructure to link technology suppliers, integrators and leading user organizations.”
“The Hadoop framework has become the most popular open-source solution for Big Data processing. Traditionally, Hadoop communication calls are implemented over sockets and do not deliver best performance on modern clusters with high-performance interconnects. This talk will examine opportunities and challenges in optimizing performance of Hadoop with Remote DMA (RDMA) support, as available with InfiniBand, RoCE (RDMA over Converged Ethernet) and other modern interconnects.”
The panel discussion video below comes from the Los Angeles Spark Users Group. The panelists have a lively discussion about Spark’s initial goals, where it came from, and what the future holds for it. Many leading Big Data vendors are responding by introducing Spark’s capabilities into their architectures. The panel brings together the top Hadoop distribution vendors: Cloudera, MapR, and Pivotal.
“When organizations operate both Lustre and Apache Hadoop within a shared HPC infrastructure, there is a compelling use case for using Lustre as the file system for Hadoop analytics, as well as HPC storage. Intel Enterprise Edition for Lustre includes an Intel-developed adapter which allows users to run MapReduce applications directly on Lustre. This optimizes the performance of MapReduce operations while delivering faster, more scalable, and easier to manage storage.”
Datameer, an end-to-end big data analytics application for the Hadoop ecosystem, today introduced Datameer 5.0 with Smart Execution, a new patent-pending technology that intelligently and dynamically selects the best-of-breed compute framework at each step in the big data analytics process.