Paxata Achieves High Performance for Data Preparation Based on Apache Spark

Paxata, provider of the first purpose-built Adaptive Data Preparation™ solution, today announced the results of its Data Preparation Performance benchmark study. The evaluation tested the performance of Paxata’s recent Spring ’15 release based on Apache Spark. Proving the power of Adaptive Data Preparation at scale and as demonstrated by continued customer and partner success, the benchmark results highlight the company’s continued focus on performance, efficiency, elasticity, connectivity and scalability.

In the benchmark, Paxata demonstrated an aggregate median response time of less than five seconds for a full spectrum of data preparation operations on datasets with up to 20 million rows and 198 columns, with many operations showing sub-second response times. This represents an overall performance improvement on all operations by over 80 percent for its Adaptive Data Preparation when compared to previous releases. The benchmark was conducted on 27 nodes with eight cores x 60 GB each in Amazon Web Services with the Paxata multi-tenant cloud offering. The full benchmark report is available at

According to the Gartner report* titled Data Preparation Is Not an Afterthought, “The iterative and explorative nature of data preparation results in a time-consuming process that demands considerable effort from data scientists and business analysts. In addition, preparation of data originating from new and diverse data sources can be challenging.”

“We had a short period of time to complete a massive data migration project which required us to extract, organize and clean 30 million records being moved from a legacy environment into an SAP system,” said Matt Heinz, Head of Business Intelligence at Del Monte Foods, Inc. “The work had to be done by a non-technical team who understood the data best, and we wanted flexibility to explore and define our needs as the project evolved. The Paxata cloud solution took no time to deploy, and gave us an easy-to-use tool on a platform that scales as our data volumes and usage increase or decrease.”

The Paxata data preparation solution goes well beyond “wrangling”, “taming” or “munging” data, as it was designed to simplify how the business gathers and uses data, regardless of size or source. Built from the ground up to help business analysts, data scientists, developers, data curators, and IT teams automate, collaborate and dynamically govern the data integration, data quality and enrichment process in a self-service fashion, now with unprecedented scale, the platform allows them to quickly and confidently build the AnswerSets needed for analytics without coding, scripting, data modeling or sampling.

“Paxata makes fast work of the precursor steps of our analytics workflow,” said Mitchell D. Silber, Executive Managing Director at K2 Intelligence, an investigative and integrity consulting firm. “The faster we understand, shape and combine the massive data sets from email logs, transaction records, social network activity, or other sources, the faster we can help our clients.”

While the initial success of the Paxata solution was around self-service data preparation on Excel, CSV, XML and JSON files, the Spring ’15 release now makes it simple to extract and interactively prepare data from Hadoop data lakes without the requirement for customers to write MapReduce jobs to prepare data sets through sampling and executing in batch-mode. This gives Paxata partners and customers an end-to-end method that minimizes the time and effort it takes to surface and combine Hadoop and non-Hadoop data.

