To help our audience leverage the power of machine learning, the editors of insideBIGDATA have created this weekly article series called “*The insideBIGDATA Guide to Machine Learning.*” This is our fourth installment, “Data Munging, Exploratory Data Analysis, and Feature Engineering.”

**Data Munging**

The next phase of a machine learning project involves a process called “data munging.” It is often the case that the data imported into the R environment is inconvenient or incompatible with machine learning algorithms, so with data munging (also known as data transformation) the data can be massaged into a more hospitable form. Data munging cannot be taken lightly: it can consume up to 80% of the effort in a machine learning project. The amount of time needed for a particular project depends on the health of the data: how clean and complete it is, how many elements are missing, and so on. Open source R has many mechanisms and packages to facilitate data transformation and cleaning, e.g. **dplyr**, **reshape2**, and **lubridate**. The specific tasks and their sequence should be recorded carefully so you can replicate the process; this record becomes part of your data pipeline. Here is a short list of typical data munging tasks, though there are potentially many more depending on the data:

- Data sampling
- Create new variables
- Discretize quantitative variables
- Date handling (e.g. changing dates stored as integers to R date objects)
- Merge, order, reshape data sets
- Other data manipulations such as changing categorical variables to multiple binary variables
- Handling missing data
- Feature scaling
- Dimensionality reduction
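As a rough sketch, several of the tasks above might look like the following in open source R. The built-in `mtcars` data set and the derived variable names (`power_to_weight`, `mpg_band`, and so on) are used purely for illustration:

```r
# Data sampling: draw a 50% random sample of the rows of a data frame
set.seed(42)
sampled <- mtcars[sample(nrow(mtcars), size = nrow(mtcars) * 0.5), ]

# Create a new variable: power-to-weight ratio
mtcars$power_to_weight <- mtcars$hp / mtcars$wt

# Discretize a quantitative variable into three bands
mtcars$mpg_band <- cut(mtcars$mpg, breaks = 3,
                       labels = c("low", "medium", "high"))

# Date handling: dates stored as integers (YYYYMMDD) to R Date objects
ints  <- c(20140101L, 20140215L)
dates <- as.Date(as.character(ints), format = "%Y%m%d")

# Changing a categorical variable to multiple binary (dummy) variables
dummies <- model.matrix(~ factor(cyl) - 1, data = mtcars)

# Handling missing data: replace NAs in a vector with the mean
x <- c(1, NA, 3)
x[is.na(x)] <- mean(x, na.rm = TRUE)

# Feature scaling: center and scale a numeric column
mtcars$hp_scaled <- as.numeric(scale(mtcars$hp))
```

Packages such as **dplyr** (e.g. `mutate()`) and **lubridate** (e.g. `ymd()`) provide more convenient equivalents for several of these steps.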

Open source R has its limitations with respect to performing data munging on large data sets. Sometimes a good option is to perform transformations outside the R environment, possibly at the data warehouse level using ETL steps based on SQL stored procedures. Once the data is in shape, it can be imported into R.

Alternatively, RRE has big data functions for data transformations like **rxSplit()**, which can be used to minimize the number of passes through a large data set. It can efficiently split a large data set into pieces in order to distribute it across the nodes of a cluster. You might also want to split your data into training and test data so that you can fit a model using the training data and validate it using the test data. RRE can also perform sorting with **rxSort()**, merging with **rxMerge()**, and missing value handling with big data features such as the **removeMissing** argument for the **rxDTree()** algorithm.
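The training/test split mentioned above can be sketched in open source R as follows (RRE's **rxSplit()** performs the analogous operation on disk-based big data; the 70/30 ratio and the `mpg ~ wt + hp` model are illustrative choices, not prescriptions):

```r
# A minimal base-R sketch of a random training/test split
set.seed(123)
n         <- nrow(mtcars)
train_idx <- sample(n, size = floor(0.7 * n))  # 70% of rows for training
train     <- mtcars[train_idx, ]
test      <- mtcars[-train_idx, ]

# Fit a model on the training data, then validate on the held-out test data
fit   <- lm(mpg ~ wt + hp, data = train)
preds <- predict(fit, newdata = test)
rmse  <- sqrt(mean((test$mpg - preds)^2))  # out-of-sample error estimate
```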

**Exploratory Data Analysis**

Once you have clean, transformed data inside the R environment, the next step for machine learning projects is to become intimately familiar with the data using exploratory data analysis (EDA). The way to gain this level of familiarity is to utilize the many features of the R statistical environment that support this effort — numeric summaries, plots, aggregations, distributions, densities, reviewing all the levels of factor variables and applying general statistical methods. A clear understanding of the data provides the foundation for model selection, i.e. choosing the correct machine learning algorithm to solve your problem.

Open source R has many mechanisms for EDA including **hist()** for histograms, **boxplot()** for boxplots, **barplot()** for barplots, **plot()** for scatterplots, **heatmap()** for heatmaps, etc. Using these tools allows for a deep understanding of the data being employed for machine learning. This understanding serves the purpose of feature engineering.
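A typical first pass at EDA using these base R tools might look like the sketch below, again using the built-in `mtcars` data set for illustration:

```r
# Quick structural and numeric summaries
str(mtcars)          # variable types and sample values
summary(mtcars$mpg)  # five-number summary plus the mean

# Graphical summaries
hist(mtcars$mpg, main = "Distribution of MPG")                 # histogram
boxplot(mpg ~ cyl, data = mtcars, main = "MPG by cylinders")   # grouped boxplots
plot(mtcars$wt, mtcars$mpg, main = "MPG vs. weight")           # scatterplot
heatmap(cor(mtcars), main = "Correlation heatmap")             # correlation heatmap

# Review the levels of a factor variable and aggregate by group
tbl <- table(factor(mtcars$cyl))
agg <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
```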

Open source R has difficulty performing EDA on big data sets. RRE, however, has a number of big data functions that process data in chunks using the specified compute context: for example, **rxGetVarInfo()** can find potential outliers, **rxSummary()** shows summaries for continuous and categorical variables, **rxHistogram()** performs density estimation, and **rxLinePlot()** draws line plots.

**Feature Engineering**

Feature engineering is the process of determining which predictor variables will contribute the most to the predictive power of a machine learning algorithm. There are two commonly used methods for making this selection. The *Forward Selection Procedure* starts with no variables in the model; you then iteratively add variables, testing the predictive accuracy of the model at each step, until adding more variables no longer has a positive effect. The *Backward Elimination Procedure* begins with all the variables in the model; you proceed by removing variables one at a time, testing the predictive accuracy of the model, until removing any further variable degrades it.
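Both procedures can be sketched in open source R with the base `step()` function. This sketch uses AIC as the selection criterion, a common stand-in for directly testing predictive accuracy; the candidate predictors from `mtcars` are purely illustrative:

```r
# Forward selection: start with no predictors, add one at a time
null_model <- lm(mpg ~ 1, data = mtcars)
forward    <- step(null_model, scope = ~ wt + hp + cyl + disp,
                   direction = "forward", trace = 0)

# Backward elimination: start with all predictors, remove one at a time
full_model <- lm(mpg ~ wt + hp + cyl + disp, data = mtcars)
backward   <- step(full_model, direction = "backward", trace = 0)

# The variables retained in each final model are the selected feature set
names(coef(forward))
names(coef(backward))
```

In practice you would validate the selected feature set against held-out data rather than relying on an in-sample criterion alone.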

The process of feature engineering is as much of an art as a science. Often feature engineering is a give-and-take process with exploratory data analysis to provide much needed intuition about the data. It’s good to have a domain expert around for this process, but it’s also good to use your imagination. Feature engineering is when you use your knowledge about the data to select and create features that make machine learning algorithms work better.

One problem in machine learning is having too much data. With today’s big data technology, we’re in a position where we can generate a very large number of candidate features. In such cases, fine-tuned feature engineering is even more important.

The next article in this series will focus on Supervised Learning. If you prefer you can download the entire *insideBIGDATA Guide to Machine Learning*, courtesy of Revolution Analytics, by visiting the insideBIGDATA White Paper Library.