To help our audience leverage the power of machine learning, the editors of insideBIGDATA have created this weekly article series called “The insideBIGDATA Guide to Machine Learning.” This is our eighth and final installment, “Production Deployment Environments for R.”
R and Hadoop
In order to take advantage of the benefits of distributed computing, Revolution Analytics has created the RevoScaleR package, which provides functions for scalable, high-performance data management, analysis, and visualization. Hadoop provides a distributed file system and a MapReduce framework for distributed computation. The data manipulation and analysis functions in RevoScaleR are appropriate for both small and large datasets, but are particularly useful in three common scenarios:
- to analyze data sets that are too big to fit in memory;
- to perform computations distributed over several cores, processors, or nodes in a cluster; and
- to create scalable data analysis routines that can be developed locally with smaller data sets, then deployed to larger data and/or a cluster of computers.
The RevoScaleR high-performance analysis functions are portable: the same functions work on a variety of computing platforms, including Windows and RHEL workstations and servers, as well as distributed computing platforms such as Hadoop, IBM Platform LSF, and Microsoft HPC Server. So, for example, you can do exploratory analysis on your laptop, then deploy the same analytics code on a Hadoop cluster. The underlying RevoScaleR code handles the distribution of the computations across cores and nodes, so you don’t have to worry about it.
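The develop-locally, deploy-to-the-cluster workflow can be sketched as follows. This is a minimal illustration, not a definitive recipe: the file paths, host names, and connection parameters are placeholders, and RevoScaleR is a proprietary package that must be installed with Revolution R Enterprise.

```r
library(RevoScaleR)  # proprietary Revolution Analytics package (ships with RRE)

# Develop locally against a small sample, using the default local compute context.
# "AirlineSample.csv" and the model formula are illustrative placeholders.
airData <- RxTextData("AirlineSample.csv")
localFit <- rxLinMod(ArrDelay ~ DayOfWeek, data = airData)

# Deploy the same analysis to a Hadoop cluster by switching the compute context;
# all connection parameters below are hypothetical.
hadoopCC <- RxHadoopMR(sshUsername  = "analyst",
                       sshHostname  = "namenode.example.com",
                       hdfsShareDir = "/user/analyst/share",
                       shareDir     = "/var/RevoShare/analyst")
rxSetComputeContext(hadoopCC)

# Same rxLinMod call; RevoScaleR now distributes the work across the cluster.
bigAirData <- RxTextData("/data/airline/full", fileSystem = RxHdfsFileSystem())
clusterFit <- rxLinMod(ArrDelay ~ DayOfWeek, data = bigAirData)
```

The key design point is that the model-fitting call is unchanged; only the compute context object differs between the laptop and the cluster.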
The architecture of the RevoScaleR package provides an under-the-covers example of scalable analytics in Hadoop:
- A master process is initiated to run the main thread of the algorithm.
- The master process initiates a Hadoop MapReduce job to make a pass through the data.
- Parallel and distributed Hadoop “Mapper” tasks produce “intermediate results objects” for each chunk of data. These are merged using a Hadoop “Reducer” task.
- The master process examines the results. For iterative algorithms, it decides if another pass through the data is required. If so, it initiates another MapReduce job and repeats.
- When complete, the final results are computed and returned.
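The steps above can be imitated in plain R, with no Hadoop required, to show what the “intermediate results objects” look like. This sketch computes a mean in one pass: each “mapper” summarizes a chunk, a “reducer” merges the summaries, and the “master” produces the final answer (the chunk split here simply stands in for HDFS blocks).

```r
# Simulated data split into chunks, standing in for HDFS blocks
x <- runif(1e6)
chunks <- split(x, rep(1:4, length.out = length(x)))

# "Mapper" phase: one small intermediate results object per chunk
intermediates <- lapply(chunks, function(chunk) {
  list(sum = sum(chunk), n = length(chunk))
})

# "Reducer" phase: merge the intermediate results objects
merged <- Reduce(function(a, b) list(sum = a$sum + b$sum, n = a$n + b$n),
                 intermediates)

# "Master" step: compute the final result. An iterative algorithm would instead
# inspect `merged` and decide whether to launch another MapReduce pass.
merged$sum / merged$n  # matches mean(x) up to floating-point rounding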
When running “inside” Hadoop, the RevoScaleR analysis functions process data contained in the Hadoop Distributed File System (HDFS). This “inside” architecture provides the greatest scalability. HDFS data can also be accessed directly from RevoScaleR, without performing the computations within the Hadoop framework; this is known as a “beside” architecture. The “beside” architecture is often used when processing small to medium data volumes.
R and Teradata
Another production deployment option for Revolution R Enterprise (RRE) uses massively parallel analytical processing inside the Teradata platform. The highest possible performance is achieved because RRE for Teradata includes a library of PEMAs, or Parallel External Memory Algorithms (included with all versions of RRE and usable on a wide variety of big data platforms in addition to Teradata). PEMAs are pre-built, parallelized, external-memory versions of the most common statistical and predictive analytics algorithms, and they run directly in parallel on the Teradata nodes. Because these PEMAs run in-database, analytical speed is improved and the data is analyzed in place, eliminating data movement delays and latency. As in the Hadoop case, RRE supports both an “inside” and a “beside” Teradata architecture.
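An “inside” Teradata run follows the same compute-context pattern as the Hadoop example. The sketch below is illustrative only: the connection string, table name, share directories, and model formula are placeholders, and it assumes RRE for Teradata is installed on both the client and the Teradata nodes.

```r
library(RevoScaleR)  # RRE for Teradata; connection details below are hypothetical

tdConnString <- "DRIVER=Teradata;DBCNAME=tdserver;UID=analyst;PWD=secret"

# "Inside" architecture: the PEMA executes in-database, in parallel on the nodes
tdCC <- RxInTeradata(connectionString = tdConnString,
                     shareDir         = "/tmp/revoShare",
                     remoteShareDir   = "/tmp/revoShare")
rxSetComputeContext(tdCC)

# The data source points at a table; rows are analyzed in place, so no data
# is moved out of the database before modeling.
claims <- RxTeradata(table = "claims", connectionString = tdConnString)
fit <- rxLogit(fraud ~ amount + region, data = claims)
```

Switching back to a local compute context with the same RxTeradata data source would give the “beside” architecture: the data is read from Teradata, but the computation happens outside it.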
R in the Cloud
As an alternative to an on-premises hardware solution for running a production deployment of an R-based machine learning solution, companies can now deploy to the cloud by using a hosted version of RRE on Amazon Web Services (AWS). RRE is available through the AWS Marketplace and offers scalable predictive analytics designed to analyze data sets of various sizes. The data can be stored in Amazon S3 and the Amazon Relational Database Service (RDS). An attractive aspect of a cloud implementation is that you pay as you go, only for the capacity you need, with no long-term commitments. This alternative is attractive to start-up companies wishing to avoid large up-front costs for data centers and commodity servers. It’s also attractive for companies with spiky or cyclical demand for analytics.