Paxata Announces Enhancements to its Adaptive Data Preparation™ Application Based on Spark


Paxata, provider of an interactive, self-service Adaptive Data Preparation™ solution at scale, announced continued market success of its high-performance platform, thanks to ongoing adoption and innovation around Apache Spark v1.3. Spark-optimized capabilities within the platform have been generally available since October 2014.

“The entire enterprise landscape is dramatically shifting with disruptive technologies which are fundamentally changing the cost-to-computational performance ratio,” said Prakash Nanduri, Co-Founder and CEO of Paxata. “A year and a half ago, we recognized how data preparation enabled by Spark could deliver transformational business value with unprecedented economics, which is why we made the commitment to develop our entire solution from the ground up on Apache Spark while doubling down on being part of the Hadoop ecosystem. For the past six months, all of our customers, whether using our solution on-premise or in the Amazon Web Services cloud, have benefitted from that decision with the ability to prepare data interactively in an elastic scale-up-and-out manner at an unprecedented cost-to-performance ratio.”

Paxata’s platform, which runs on the Cloudera distribution of Hadoop, features a data preparation engine on Spark v1.3, which has been enhanced with the following new capabilities:

  • On-line aggregations: All aggregates (average, count, first, last, max, min, median, sum, variance, standard deviation) are now computed in an on-line fashion, which dramatically reduces the amount of memory required by each individual Spark worker while significantly increasing the responsiveness, performance and scalability of the system.
  • Enhanced Data-Prep Specific RDDs: Enhanced data preparation specific RDDs for join detection, join execution, clustering, and dynamic filtering continue to extend the computational backend for Paxata’s market-defining IntelliFusion™ capabilities.
  • Enhanced Persistent Columnar Caching: On each worker node of a Paxata Spark cluster, proprietary on disk data structures allow for probing data without bringing it all into memory. The columnar format is now optimized for both key-based and sequential access on a per column basis, which significantly improves scan efficiency for operations that traverse all values of a column like aggregations and sorts.
  • Optimizing Compiler: Paxata’s proprietary optimizing compiler has been significantly enhanced to take advantage of the new on-line aggregations, RDDs, and columnar caching to generate highly efficient pipeline transformation plans that minimize the number of columns touched and data that needs to be shuffled across the cluster. The compiler converts scripts into a naïve abstract syntax tree, which is compiled into an optimized logical plan.
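
To illustrate what computing aggregates "in an on-line fashion" generally means, the following is a minimal sketch in plain Python. It is not Paxata's implementation, which is proprietary and not described in the announcement. It uses Welford's well-known single-pass update for mean and variance, plus the standard rule for combining two partial summaries, so that each worker only needs a fixed-size state instead of the full column. The class name `RunningStats` and its fields are hypothetical names chosen for this example.

```python
import math
from dataclasses import dataclass


@dataclass
class RunningStats:
    """Fixed-size summary of a numeric column (hypothetical illustration)."""
    count: int = 0
    total: float = 0.0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations from the mean
    minimum: float = math.inf
    maximum: float = -math.inf

    def update(self, x: float) -> None:
        """Fold one value into the summary (Welford's single-pass update)."""
        self.count += 1
        self.total += x
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)
        self.minimum = min(self.minimum, x)
        self.maximum = max(self.maximum, x)

    def merge(self, other: "RunningStats") -> "RunningStats":
        """Combine two partial summaries, e.g. from two data partitions."""
        n = self.count + other.count
        if n == 0:
            return RunningStats()
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / n
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / n
        return RunningStats(
            count=n,
            total=self.total + other.total,
            mean=mean,
            m2=m2,
            minimum=min(self.minimum, other.minimum),
            maximum=max(self.maximum, other.maximum),
        )

    @property
    def variance(self) -> float:
        """Population variance; NaN when no values have been seen."""
        return self.m2 / self.count if self.count else math.nan

    @property
    def stddev(self) -> float:
        return math.sqrt(self.variance)


if __name__ == "__main__":
    # Two partitions summarized independently, then merged.
    left, right = RunningStats(), RunningStats()
    for value in [2, 4, 4, 4]:
        left.update(value)
    for value in [5, 5, 7, 9]:
        right.update(value)
    combined = left.merge(right)
    print(combined.count, combined.mean, combined.variance, combined.stddev)
```

Count, sum, min, max, mean, variance and standard deviation all fit this pattern exactly. An exact median does not, since it cannot be maintained in constant memory over a single pass; the announcement does not say how the median is handled.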

“Del Monte insists on only adopting technologies which help us achieve the greatest business impact,” said Timothy Weaver, CIO of Del Monte. “There is a big difference between claiming to ‘work on Spark’ and actually delivering a solution designed to fully exploit the processing power available in Spark. We have been impressed with Paxata’s development efforts, as they have delivered a truly optimized self-service data prep solution for our business that scales elastically in cloud environments like AWS, all of which allows us to get greater value from our investment.”

For more details, visit

