To help our audience leverage the power of machine learning, the editors of insideBIGDATA have created this weekly article series called “*The insideBIGDATA Guide to Machine Learning.*” This is our second installment, “Introduction to Machine Learning.”

Supervised machine learning is typically associated with prediction: each observation of the predictor measurements (also known as feature variables) has an associated response measurement (also known as the label, or the class label in classification problems). A model is fit that relates the response to the predictors, with the aim of accurately predicting the response for future observations. Many classical learning algorithms, such as linear regression and logistic regression, operate in the supervised domain.
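As a minimal sketch of the supervised setting described above, the following Python snippet fits a one-predictor linear regression by ordinary least squares and then predicts the response for a new observation. The guide itself works in R; Python is used here only for illustration, and the function names (`fit_line`, `predict`) and the training values are hypothetical.

```python
# Simple linear regression by ordinary least squares, in pure Python.
# All data values below are made up for illustration only.

def fit_line(xs, ys):
    """Return (intercept, slope) minimizing squared error of y ~ a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x and cross-products of x and y around their means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def predict(model, x):
    """Predict the response for a new predictor value x."""
    intercept, slope = model
    return intercept + slope * x

# Labeled training data: predictor x with its response y.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [3.0, 5.0, 7.0, 9.0, 11.0]  # generated from y = 1 + 2x

model = fit_line(train_x, train_y)
print(predict(model, 6.0))  # → 13.0
```

The response values are what "supervise" the fit: without `train_y` there would be nothing to measure the error against.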

Unsupervised machine learning is a more open-ended style of statistical learning. Rather than working with labeled data sets, it comprises a set of statistical tools for applications where there is only a set of feature variables measured across a number of observations. Prediction is not the goal here because the data set is unlabeled, i.e., there is no associated response variable that can supervise the analysis. Instead, the goal is to discover interesting structure in the measurements on the feature variables. For example, you might find an informative way to visualize the data, or discover subgroups among the variables or the observations.

One commonly used unsupervised learning technique is k-means clustering, which allows for the discovery of “clusters” of data points. Another technique, principal component analysis (PCA), is used for dimensionality reduction, i.e., reducing the number of feature variables while retaining as much of the variation in the data as possible, in order to simplify the data used in other learning algorithms, speed up processing, and reduce the required memory footprint.

There are a number of other steps in the data science pipeline (see figure below) that contribute to the success of a machine learning project: understanding the problem domain, data access, data munging, exploratory data analysis, feature engineering, model selection, model validation, deployment, visualization, and communication of results. We’ll briefly take a look at these steps in this guide.
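Returning to k-means clustering, the idea can be sketched in a few lines of Python. This is a minimal illustration of the standard iterative procedure (assign each point to its nearest centroid, then move each centroid to the mean of its points), not a production implementation. The guide itself works in R; the function names, the sample points, and the choice of starting centroids are all hypothetical.

```python
# Minimal k-means on 2-D points, in pure Python.
# Data and starting centroids are made up for illustration only.

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if not members:
            new_centroids.append(old)  # leave an empty cluster's centroid in place
            continue
        new_centroids.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centroids

def kmeans(points, centroids, max_iter=100):
    """Alternate update and assignment steps until the labels stop changing."""
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Unlabeled observations: two visually separated groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

# Assumption: start from one point in each group; real uses pick starts randomly.
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

No response variable appears anywhere in the code: the cluster labels are discovered from the feature measurements alone, which is what distinguishes this from the supervised case.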

Open source R is the choice of an increasing number of data science practitioners worldwide. However, the open source version of R has significant limitations in capacity and performance for production systems. Big data is largely about volume of data, so R needs a more robust infrastructure to handle data at that scale. This is where a commercial product like Revolution R Enterprise (RRE) comes into play. This guide will illustrate the use of both open source R and commercial R for machine learning applications.

The next article in this series will focus on R – the data scientist’s choice for data access. If you prefer, you can download the entire *insideBIGDATA Guide to Machine Learning*, courtesy of Revolution Analytics, by visiting the insideBIGDATA White Paper Library.