
Percipient Launches SparkPLUS to Solve Apache Spark’s Out-of-memory Problems

Percipient, a Singapore-based startup, is launching a revolutionary solution to address the memory issues encountered by users of the open source platform Apache Spark. By unifying data before it is delivered to the Spark platform, Percipient’s SparkPLUS solution is able to multiply the platform’s computing space, thereby greatly enhancing its utility for real time and analytical applications.

Percipient’s US-based CTO, Ravi Shankar Nair, said that he is keen to encourage more companies to deploy Spark, but first wanted to ensure that Spark’s memory problems could be overcome. He said, “Spark is an extremely powerful tool and we are passionate about its potential to open up new business opportunities. With the launch of SparkPLUS, companies can now leverage Spark’s capabilities without worrying about how to save on memory.”

Spark was developed at the University of California, Berkeley, and open sourced in 2010. By 2015, it had become one of Apache’s most popular projects, driven by its ability to offer ultra-fast cluster computing for big data. However, complaints about Spark’s out-of-memory errors have continued to surface, especially when dealing with disparate data sources and multiple join conditions. Under such circumstances, Spark’s memory utilization, which depends on how the data is partitioned across the cluster, rises sharply, and the platform quickly runs out of memory.

SparkPLUS offers an elegant and simple solution to this problem. Using a proprietary in-memory data access layer, disparate data sources can be pre-joined via standard ANSI SQL. This aggregated data is then delivered to the Spark cluster via a JDBC connector. In doing so, the SparkPLUS solution essentially bypasses the technical challenges of partitioning, and instead frees up the Spark platform for the high speed computational functions that it was designed to fulfill. No memory configuration tuning is required.
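The pre-join idea can be illustrated with a minimal sketch. This is not Percipient’s implementation, which is proprietary; it assumes an in-memory SQLite database as a stand-in for the data access layer, and the table names, column names, and helper functions are all hypothetical. It shows only the general pattern: the join and aggregation run in a SQL layer, and the downstream compute engine receives a small, already-joined result.

```python
import sqlite3

# Hypothetical stand-in for a data access layer: an in-memory SQLite database.
# The tables below play the role of two separate data sources.


def build_sources(conn):
    """Create and populate two small tables that mimic disparate sources."""
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, segment TEXT);
        CREATE TABLE transactions (
            txn_id INTEGER PRIMARY KEY,
            customer_id INTEGER,
            amount REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "retail"), (2, "retail"), (3, "corporate")],
    )
    conn.executemany(
        "INSERT INTO transactions VALUES (?, ?, ?)",
        [(10, 1, 25.0), (11, 1, 75.0), (12, 2, 40.0), (13, 3, 500.0)],
    )


# Standard SQL join plus aggregation, executed before any data reaches
# the compute engine.
PRE_JOIN_SQL = """
    SELECT c.segment, COUNT(*) AS txn_count, SUM(t.amount) AS total_amount
    FROM transactions AS t
    JOIN customers AS c ON c.customer_id = t.customer_id
    GROUP BY c.segment
    ORDER BY c.segment
"""


def pre_joined_rows(conn):
    """Return the joined, aggregated rows that would be handed downstream."""
    return conn.execute(PRE_JOIN_SQL).fetchall()


conn = sqlite3.connect(":memory:")
build_sources(conn)
rows = pre_joined_rows(conn)
for segment, txn_count, total_amount in rows:
    print(segment, txn_count, total_amount)
```

In a Spark deployment, a result of this kind would typically be read through Spark’s JDBC data source rather than printed, so the cluster would not need to perform the join itself. How SparkPLUS exposes its layer over JDBC is not described in the announcement.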

This SparkPLUS solution has several other advantages. With more memory available, concurrency across a large number of users is now possible. This was previously a stumbling block to using Spark at an enterprise level. As a result of the direct access to multiple data sources, SparkPLUS also reduces the need to persist the resulting aggregated data in a separate store.

Percipient will be introducing SparkPLUS to clients who are interested in taking advantage of Spark’s freely available library of machine learning algorithms. SparkPLUS also facilitates the integration of streaming data with batch data. This means that businesses can apply analytics to a combination of real time transaction data and customer profile data in order to, for example, personalise marketing offers or detect credit risk. SparkPLUS supports multi-user, large scale data environments.

 
