To help our audience leverage the power of machine learning, the editors of insideBIGDATA have created this weekly article series called “The insideBIGDATA Guide to Machine Learning.” This is our sixth installment, “Unsupervised Machine Learning.”
Unsupervised learning is often considered more challenging than supervised learning because there is no response variable for each observation. In this sense we are working blind, and the question becomes: what kind of statistical analysis is possible? Using unsupervised techniques like clustering, we can seek to understand the relationships between the variables or between the observations by determining whether observations fall into relatively distinct groups. For example, in a customer segmentation analysis we might observe multiple variables: gender, age, zip code, income, etc. Our belief may be that the customers fall into different groups, such as frequent shoppers and infrequent shoppers. A supervised learning analysis would be possible if the customer shopping history were available, but this is not the case in unsupervised learning: we don’t have response variables telling us whether a customer is a frequent shopper or not. Instead, we can attempt to cluster the customers on the basis of the feature variables in order to identify distinct customer groups.
Open source R provides the k-means clustering function kmeans(), which uses a partitioning approach to determine clusters. The number of clusters must be specified in advance. K-means clustering groups observations of quantitative data using one of several iterative relocation algorithms: starting from some initial selection of cluster centers, which may be random, points are moved from one cluster to another so as to minimize the within-cluster sum of squares.
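As a minimal sketch of the customer segmentation idea, the following R code runs kmeans() on simulated data. The data set, the feature names (age, income), and the choice of two clusters are assumptions made purely for illustration; they are not taken from a real customer file.

```r
# Simulated (hypothetical) customer data: two numeric features, two groups.
set.seed(42)
customers <- rbind(
  cbind(age = rnorm(50, mean = 30, sd = 3), income = rnorm(50, mean = 40, sd = 5)),
  cbind(age = rnorm(50, mean = 55, sd = 3), income = rnorm(50, mean = 90, sd = 5))
)

# Standardize so that neither feature dominates the Euclidean distances.
scaled <- scale(customers)

# Fit k-means with k = 2 (an assumption). nstart = 20 reruns the algorithm
# from 20 random starts and keeps the solution with the smallest total
# within-cluster sum of squares.
fit <- kmeans(scaled, centers = 2, nstart = 20)

table(fit$cluster)   # number of customers assigned to each cluster
fit$centers          # cluster centers on the standardized scale
fit$tot.withinss     # the quantity the relocation steps try to minimize
```

Standardizing first is a design choice, not a requirement of kmeans(): without it, the feature measured on the larger scale would largely determine the clusters.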
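Another style of unsupervised learning is Principal Component Analysis (PCA), which is best thought of as a dimensionality reduction technique: the original feature variables are replaced by a smaller number of derived variables (the principal components) that retain most of the variation in the data. R offers several ways to compute principal components: prcomp() and princomp() in the base stats package, pca() functions from add-on packages, and svd(), which computes the singular value decomposition on which PCA can be built.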
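To make the dimensionality reduction idea concrete, here is a short sketch using prcomp() on simulated data. The three variables (a, b, c) and the way they are generated are assumptions for illustration: two of them are driven by a shared latent factor, so two components should capture nearly all of the variation.

```r
# Simulated (hypothetical) data: a and b share a latent factor, c is noise.
set.seed(1)
n <- 200
latent <- rnorm(n)
x <- cbind(
  a = latent + rnorm(n, sd = 0.1),
  b = 2 * latent + rnorm(n, sd = 0.1),
  c = rnorm(n)
)

# Center and scale each variable, then compute the principal components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Proportion of total variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cumsum(var_explained)

# Keep the first two components as the reduced representation.
reduced <- pca$x[, 1:2]
dim(reduced)
```

The cumulative proportions are a common way to decide how many components to keep; the cutoff itself is a judgment call that depends on the application.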
Revolution R Enterprise (RRE) has big data implementations of these unsupervised learning algorithms, such as rxKmeans() for k-means clustering. To do PCA on an arbitrary number of rows, RRE offers rxCovCor() and its relatives rxCov() and rxCor(), which compute a covariance or correlation matrix, respectively; that matrix can then be passed to R’s princomp() function, which performs the eigenvalue decomposition. Another technique, density estimation, is often included with unsupervised learning, and RRE provides rxHistogram() as a big data solution.
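The two-step PCA workflow described above can be sketched in open source R. Because the RRE functions are not part of base R, this example uses the base function cov() as a stand-in for the step that rxCov() would perform on large data; the simulated matrix and its column names are likewise assumptions for illustration.

```r
# Simulated (hypothetical) data with one deliberately correlated pair.
set.seed(7)
x <- matrix(rnorm(500 * 4), ncol = 4,
            dimnames = list(NULL, paste0("v", 1:4)))
x[, 2] <- x[, 1] + 0.2 * x[, 2]

# Step 1: compute the covariance matrix.
# (Stand-in for a matrix computed out of memory on a large data set.)
cov_mat <- cov(x)

# Step 2: hand the matrix to princomp(), which does the eigen decomposition.
pc <- princomp(covmat = cov_mat)

pc$sdev^2                 # component variances
eigen(cov_mat)$values     # the same values, computed directly
loadings(pc)              # directions of the principal components
```

Note that when princomp() is given only a covariance matrix, it returns variances and loadings but no component scores, since it never sees the individual rows.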
The next article in this series will focus on Production Deployment. If you prefer, you can download the entire insideBIGDATA Guide to Machine Learning, courtesy of Revolution Analytics, by visiting the insideBIGDATA White Paper Library.