
Computing to Match Data Processing Needs

In this special guest feature, Ramesh Hariharan, Co-Founder and Head of Innovation & Technology at LatentView, presents a high-level view of the advancing compute resources being used for today’s data processing requirements. At LatentView, Ramesh currently works on skunk-works projects at the intersection of machine learning, visualization, software, devices & things, and data, with the objective of helping the company’s clients improve efficiencies, discover new business opportunities, or interact better with their customers. He’s not an expert in any of these areas. However, he understands them well enough to integrate them and deliver value to clients. Ramesh is comfortable working across a spectrum of subject areas, whether it’s calculus, linear algebra, statistics, systems dynamics, or programming, or discussing the drivers of customer value with CMOs. Prior to LatentView, Ramesh worked for Oracle Consulting and Cognizant, where he acquired deep technology, delivery and client consulting experience across geographies. He holds a Bachelor’s degree in Technology from IIT Madras and an MBA from IIM Calcutta.

Big Data, Big Processing, Big Compute: We are at the cusp of the “IoT era”, where a significant amount of data is expected to be generated by sensors or “things” at the edges of the network. This data flows continuously or intermittently, in a bi-directional manner, in a point-to-point topology. In addition to IoT data, unstructured and semi-structured data, in the form of text, images, videos, etc., flows mostly in a single direction from source to data lakes in a batch or streaming flow.

The traditional approach to processing this big data has been to throw computing at the problem. Thanks to Moore’s Law, the cost of computation and storage is falling dramatically. Given that, enterprises are investing heavily in ingesting, processing, and analyzing all the data in the hope of reducing costs and risks, creating new business models and improving business processes.

However, there are several challenges in making this a reality. In order to understand these challenges better, let’s look at the different types of data processing. Typically, we can classify them as:

  • data management (look-up, join, set operations, sorting, filters, etc.);
  • analytics and reporting (such as summarize, group, roll-up, rank, etc.);
  • extracting rules and building models (building predictive and optimization models);
  • real-time scoring (scoring customers in real time);
  • real-time optimal decision making (making the best pricing, recommendation or bidding decisions); and
  • deep learning (highly compute-intensive tasks such as image classification).

As we move down the list, the complexity of computing increases.

For batch data management and analytics tasks, the industry developed the MapReduce paradigm for computing. In this paradigm, the data, rather than the computing architecture, is divided into a large number of pieces spread across different compute units. Each piece is processed individually and in parallel (the map task), and the results are then combined (the reduce task). This is a classic divide-and-conquer approach that works very well for data management and for some types of analytics tasks.
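The split, map, and reduce steps described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Hadoop or any particular framework implements it: the function names, the sample `SALES` records, and the use of a thread pool to stand in for separate compute units are all assumptions made for the example.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, num_chunks):
    """Divide the data (not the program) into roughly equal pieces."""
    return [records[i::num_chunks] for i in range(num_chunks)]


def map_task(chunk):
    """Map: process one piece on its own, producing partial sums per key."""
    partial = defaultdict(float)
    for region, amount in chunk:
        partial[region] += amount
    return dict(partial)


def reduce_task(partials):
    """Reduce: combine the partial results from every piece."""
    totals = defaultdict(float)
    for partial in partials:
        for region, amount in partial.items():
            totals[region] += amount
    return dict(totals)


def sales_by_region(records, num_chunks=3):
    """Run the map tasks in parallel, then reduce their outputs."""
    chunks = split_into_chunks(records, num_chunks)
    # Threads stand in for separate compute units (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=num_chunks) as pool:
        partials = list(pool.map(map_task, chunks))
    return reduce_task(partials)


# Hypothetical sample data: (region, sale amount) pairs.
SALES = [("north", 10.0), ("south", 5.0), ("north", 2.5),
         ("east", 7.0), ("south", 1.0), ("east", 3.0)]

if __name__ == "__main__":
    print(sales_by_region(SALES))
```

The result does not depend on how many pieces the data is split into, which is what makes a group-and-sum task a good fit for this style of processing.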

However, MapReduce does not work well for tasks that require all the data at once, such as building predictive or optimization models, even in batch mode. It is also a poor fit for processing streaming data and for graph computing.
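One way to read "require all the data" is that some computations cannot be assembled from independent per-piece results. The sketch below uses a median as a simple stand-in for that kind of task. That choice is an assumption made for illustration (the article's own example is model building), and the `PARTITIONS` data and function names are hypothetical.

```python
from statistics import median

# Hypothetical data, already divided across three compute units.
PARTITIONS = [[1, 2, 9], [3, 4, 10], [5, 6, 11]]


def total_via_partitions(partitions):
    """A sum decomposes cleanly: add up the per-partition sums."""
    return sum(sum(part) for part in partitions)


def median_of_partition_medians(partitions):
    """Naive attempt: combine per-partition medians. Not exact in general."""
    return median(median(part) for part in partitions)


def exact_median(partitions):
    """The exact answer needs every value in one place."""
    all_values = [value for part in partitions for value in part]
    return median(all_values)


if __name__ == "__main__":
    print("sum from partial sums:", total_via_partitions(PARTITIONS))
    print("median of partition medians:", median_of_partition_medians(PARTITIONS))
    print("exact median:", exact_median(PARTITIONS))
```

On this data the naive combination and the exact median disagree, while the sum is recovered exactly from its partial results.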

For such problems, the approach so far has been to use specialized software such as SAS rather than open tools such as R. Over the last few years, Microsoft’s Revolution R has provided an enhanced version of R that takes advantage of multiple threads in a processor, and this has mitigated the problem to some extent. In addition, other R packages (such as bigmemory) support working with large data sets for certain types of predictive modeling.

More recently, organizations have rapidly adopted Apache Spark. The platform is adept at handling batch as well as streaming tasks, with growing support for machine learning and graph computing. Spark is evolving into a single platform that covers the full range of capabilities, from batch and stream processing to machine learning and graph processing.
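The difference between batch and streaming work can be shown without Spark itself. The sketch below is plain Python and does not use or imitate Spark's API. It only contrasts computing a mean over a complete data set with keeping the same mean up to date as records arrive one at a time. The class, function names, and `READINGS` values are hypothetical.

```python
class RunningMean:
    """Keeps a mean up to date one record at a time (streaming style)."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental update: no need to store or revisit earlier records.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


def batch_mean(values):
    """Batch style: all records are available before processing starts."""
    return sum(values) / len(values)


def streaming_mean(values):
    """Feed records through RunningMean one by one and return the final mean."""
    tracker = RunningMean()
    for value in values:
        tracker.update(value)
    return tracker.mean


# Hypothetical sensor readings.
READINGS = [20.0, 22.0, 21.0, 25.0]

if __name__ == "__main__":
    print("batch:", batch_mean(READINGS))
    print("streaming:", streaming_mean(READINGS))
```

Both styles arrive at the same answer. The streaming version also has a usable result after every record, which is the property that matters for continuously arriving data.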

In recent years, there has been an explosion of deep learning algorithms. These are highly compute-intensive algorithms, used in tasks such as image classification, crowd density estimation, and language translation. These tasks are very well suited to a GPU-based computing approach.

Unlike a CPU (central processing unit), which consists of a few cores, a GPU (graphics processing unit) consists of thousands of simpler, smaller cores that can process many tasks in parallel. This makes GPU-based computing a very good fit for deep learning tasks. Tasks that take days with a CPU-based approach can be completed in minutes with a GPU-based approach. However, these are still early days, and a lot of work is needed to make applications take advantage of the GPU architecture.
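The reason deep learning maps well onto many small cores is that much of the work consists of the same simple calculation repeated over independent pieces of data. The sketch below shows that decomposition for one neural-network layer, under loud assumptions: it is plain Python running on CPU threads, so it only models the structure of the work and gives no GPU-style speedup. The function names, `WEIGHTS`, and `INPUTS` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def neuron_output(weights_row, inputs):
    """One small, independent unit of work: a dot product followed by ReLU."""
    activation = sum(w * x for w, x in zip(weights_row, inputs))
    return max(0.0, activation)


def dense_layer_serial(weights, inputs):
    """Compute each neuron one after another, as a single core would."""
    return [neuron_output(row, inputs) for row in weights]


def dense_layer_parallel(weights, inputs, workers=4):
    """Hand each neuron to a separate worker; no neuron depends on another."""
    # Threads only illustrate the independence of the tasks (an assumption of
    # this sketch); a real GPU would run them on hardware cores.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: neuron_output(row, inputs), weights))


# Hypothetical layer with three neurons and three inputs.
WEIGHTS = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5], [-2.0, 1.0, 0.0]]
INPUTS = [2.0, 4.0, 6.0]

if __name__ == "__main__":
    print(dense_layer_serial(WEIGHTS, INPUTS))
    print(dense_layer_parallel(WEIGHTS, INPUTS))
```

Because every neuron's output can be computed without waiting for any other, the same layer can be spread across as many cores as are available, which is the property GPU hardware exploits.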

Overall, there is a clear correlation between the types of analytics being developed and the computing paradigms that enable them. Hadoop and MapReduce have enabled the collection and management of big data. Spark, R, and Python, combined with pervasive computing power, have enabled the proliferation of advanced analytics. Spark is enabling the rise of graph computing and the unveiling of complex relationships within data, while making everything else easier. GPUs have enabled the mainstreaming of deep learning methods, which have led to machines making major strides in pattern recognition and other artificial intelligence tasks.

Over the next few years, we expect the emergence of even more advanced paradigms such as quantum computing, which will unlock solutions to different and more complex problems, including problems that are not in the consciousness of mainstream practitioners today.

 
