 # Supervised Machine Learning

To help our audience leverage the power of machine learning, the editors of insideBIGDATA have created this weekly article series called “The insideBIGDATA Guide to Machine Learning.” This is our fifth installment, “Supervised Machine Learning.”

Supervised Learning

Supervised machine learning is the type of statistical learning most often associated with data science, since it offers a number of methods for prediction, namely regression and classification.

Regression is the most common form of supervised learning. In regression, the response variable is quantitative, such as the systolic blood pressure reading of a hospital patient, and it is predicted from a set of feature variables such as the patient’s age, gender, BMI, and blood sodium levels. The relationship between systolic blood pressure and the feature variables, as estimated from the training set, provides a predictive model. The model is built from complete observations, which supply the value of the response variable as well as the feature variables.

Open Source R has algorithms to implement regression, such as the linear model lm(), regression trees with tree() from the tree package, and ensemble methods with randomForest() from the randomForest package. In a nutshell, these algorithms implement a statistical process for estimating the relationships among variables and are widely used for prediction and forecasting.
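As a minimal sketch of the regression workflow described above, the following uses base R's lm() and predict() on synthetic data. The data frame `patients`, its column names, and the coefficients used to simulate systolic blood pressure are illustrative assumptions, not values from a real study.

```r
# Synthetic training set: every value below is simulated for illustration only.
set.seed(42)
n <- 200
patients <- data.frame(
  age    = round(runif(n, 25, 80)),
  gender = factor(sample(c("female", "male"), n, replace = TRUE)),
  bmi    = rnorm(n, mean = 27, sd = 4),
  sodium = rnorm(n, mean = 140, sd = 3)
)

# Assumed (hypothetical) relationship used to generate the response variable.
patients$systolic <- 90 + 0.5 * patients$age + 0.8 * patients$bmi +
  rnorm(n, sd = 8)

# Fit a linear model on complete observations (response plus features).
fit <- lm(systolic ~ age + gender + bmi + sodium, data = patients)

# Predict the response for a new patient whose blood pressure is unknown.
new_patient <- data.frame(
  age    = 58,
  gender = factor("female", levels = levels(patients$gender)),
  bmi    = 29,
  sodium = 141
)
predicted <- predict(fit, newdata = new_patient)

print(coef(fit))   # estimated relationship between features and response
print(predicted)   # predicted systolic blood pressure for the new patient
```

Because the data are simulated, the fitted coefficients only recover the assumed relationship; with real patient records the same two calls, lm() followed by predict(), would apply unchanged.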

Revolution R Enterprise (RRE) has big data versions of regression algorithms, including rxLinMod() for fitting linear regression models and rxPredict() for computing fitted values and model residuals. It also offers regression tree support with rxDTree(), which fits tree-based models using a binning-based recursive partitioning algorithm with a numeric response variable, and rxDForest(), which builds an ensemble of decision trees where each tree is fitted to a bootstrap sample of the original data. These algorithms are designed to work with arbitrarily large data sets.

Classification is another popular type of supervised learning. In classification, the response variable is categorical, such as income bracket, which could be partitioned into three classes or categories: high income, middle income, and low income. The classifier examines a data set where each observation contains information on the response variable as well as the predictor (feature) variables. For example, suppose an analyst would like to classify the income brackets of persons not in the data set, based on characteristics associated with each person, such as age, gender, and occupation. This classification task would proceed as follows. First, the algorithm examines the data set containing both the feature variables and the already classified response variable, income bracket. In this way, it learns which combinations of variables are associated with which income brackets. This data set is called the training set. Then the algorithm looks at new observations for which no information about income bracket is available. Based on the classifications in the training set, it assigns classifications to the new observations. For example, a 58-year-old female controller might be classified in the high-income bracket.

Open Source R has a number of classification algorithms, such as logistic regression with glm() using the binomial family, decision trees with tree(), and ensemble methods with randomForest(). In a nutshell, classification is the process of identifying to which of a set of categories a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.
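The train-then-classify workflow can be sketched with base R's glm(). This example simplifies the three income brackets to a binary response (high income versus not), because logistic regression with the binomial family models two classes; a multi-class method would be needed for all three brackets. The data frame `train`, the occupation labels, the rule used to simulate the response, and the 0.5 cutoff are all illustrative assumptions.

```r
# Synthetic training set: simulated for illustration, not real income data.
set.seed(1)
n <- 300
train <- data.frame(
  age        = round(runif(n, 22, 65)),
  occupation = factor(sample(c("clerk", "controller", "engineer"),
                             n, replace = TRUE))
)

# Assumed (hypothetical) rule generating the already classified response.
log_odds <- -6 + 0.1 * train$age + 1.5 * (train$occupation == "controller")
train$high_income <- rbinom(n, size = 1, prob = plogis(log_odds))

# Learn which feature combinations are associated with the high-income class.
clf <- glm(high_income ~ age + occupation, data = train, family = binomial)

# A new observation with no income information available.
new_person <- data.frame(
  age        = 58,
  occupation = factor("controller", levels = levels(train$occupation))
)

# Estimated probability of the high-income class, then a label via a 0.5 cutoff.
prob_high <- predict(clf, newdata = new_person, type = "response")
label <- ifelse(prob_high > 0.5, "high income", "not high income")

print(prob_high)
print(label)
```

The cutoff is a design choice: lowering it assigns more observations to the high-income class at the cost of more false positives.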

RRE has big data versions of classification algorithms including logistic regression using rxGlm() or the optimized rxLogit() for modeling data with a binary response variable, as well as classification tree support with rxDTree(). Also included for classification is RRE’s Decision Forest algorithm rxDForest(). These algorithms are designed to work with arbitrarily large data sets.

The next article in this series will focus on Unsupervised Learning. If you prefer, you can download the entire insideBIGDATA Guide to Machine Learning, courtesy of Revolution Analytics, by visiting the insideBIGDATA White Paper Library.