Cloudera Enhances Hadoop Usability and Accessibility for Data Scientists

As the amount of data continues to grow exponentially, data scientists increasingly need the ability to perform full-fidelity analysis of that data at massive scale. Cloudera, a leader in enterprise analytic data management powered by Apache Hadoop™, announced a number of new initiatives to enable data scientists to take advantage of big data and Hadoop for data analysis with more complex workflows.

Beginning with the introduction of Ibis, an open source project incubating within Cloudera Labs, the company is enabling advanced data analysis on a 100% Python stack—bringing a native Python experience to Hadoop at scale. Cloudera has also announced that, as a direct contributor and industry leader in education around Hadoop, Cloudera will be hosting and organizing the first-ever Wrangle Conference, an event focused exclusively on real-world applications of data science, from the startup to the enterprise.

“Hadoop has evolved dramatically over the last decade, from a batch processing tool to an entire ecosystem that powers most of today’s information architecture as well as traditional BI workloads,” said Wes McKinney, a software engineer at Cloudera and the creator of Python pandas. “We want to build on this momentum and make Hadoop’s infrastructure more accessible to the data science community. We’re doing that by bringing Python more fully into the ecosystem and focusing on the real-world, practical applications of data science.”

Ibis

Cloudera recognized the importance of the Python language in modern data engineering and data science and how, thanks to its use in more complex workflows, it has become a primary language for data transformation and interactive analysis. Until now, Python development has been confined to local data processing and smaller data sets, requiring data scientists to make many compromises when attempting to work with big data. Using Ibis, a new open source data analysis framework, Python users will finally be able to process data at scale without compromising user experience or performance.

The initial version of Ibis provides an end-to-end Python experience with comprehensive support for the built-in analytic capabilities in Impala for simplified ETL, data wrangling, and analytics. Upcoming versions will allow users to leverage the full range of Python packages as well as express efficient custom logic using Python. By integrating with Impala, the leading MPP database engine for Hadoop, Ibis can achieve the interactive performance and scalability necessary for big data.

“With its usability, extensibility and robust third-party library ecosystem, it’s easy to understand why Python is the open source language of choice for so many data scientists. However, we recognize its limitation – where it’s unable to achieve high performance at Hadoop-scale,” said Wes McKinney. “With Ibis, our vision is to provide a first-class Python experience on large scalable architectures like Hadoop, with full access to the ecosystem of Python tools.”

Ibis is available as a preview in Cloudera Labs, a virtual incubator for new projects that further enrich the Hadoop community and ecosystem. Ibis is an Apache-licensed project and open to contributions from the open source community.

For more details, read about the technical vision for Ibis HERE.

Wrangle Conference

In light of Hadoop’s wide-ranging flexibility and practicality, and as data scientists can now leverage its power to solve some of today’s most pressing problems, Cloudera has announced Wrangle, a single-day, single-track industry event that will dive into the principles, practice, and application of data science from the startup to the enterprise. Presenters include data scientists from Facebook, Salesforce, Uber, and more, who will share the most challenging problems they’ve faced and what they’ve learned. Wrangle will debut this fall, on October 22, in San Francisco.

Registration for Wrangle is currently open by invitation only, with public access available soon.

Big data technologies are critical tools for data scientists. Whatever the use case and however complex the problem, Cloudera is ensuring data scientists can easily leverage the power of Hadoop with the tools they prefer.

 

Sign up for the free insideBIGDATA newsletter.

Comments

  1. “Hadoop has evolved dramatically over the last decade, from a batch processing tool to an entire ecosystem that powers most of today’s information architecture as well as traditional BI workloads,” said Wes McKinney, a software engineer at Cloudera and the creator of Python pandas.

    Come on Mr. McKinney, what planet are you from? You cannot seriously believe as of today that Hadoop powers most of today’s information architecture and traditional BI workloads. You’re a big data guy Mr. McKinney, let’s see some data to back that up. I won’t get a response to this because your data is going to show that the VAST majority of today’s information architecture and traditional BI workloads are on Oracle, DB2, SQL Server, etc. Perhaps not in super fantasy “McKinneyland” where unicorns breathe fire and shoot rainbows out of their behinds, but in the actual real world.