
Ask a Data Scientist: Handling Missing Data

Welcome back to our series of articles sponsored by Intel – “Ask a Data Scientist.” Once a week you’ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist – sometimes by me and other times by an Intel data scientist. Think of this new insideBIGDATA feature as a valuable resource for you to get up to speed in this flourishing area of technology. If you have a big data question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidebigdata.com. This week’s question is from a reader who seeks a discussion of methods for handling missing data, such as imputation.

Q: How do you handle missing data? What imputation techniques do you recommend?

A: Handling missing data is an important part of the data munging process that is integral to all data science projects. Incomplete observations can adversely affect the operation of machine learning algorithms, so the data scientist must have procedures in place to properly manage this situation. Data imputation is one such procedure – it is the process of filling in missing values based on other data.

Common methods of dealing with unknown or missing values include:

  1. Removing entire observations containing one or more unknown values
  2. Filling in unknown values with the most frequent values
  3. Filling in unknown values by exploring correlations
  4. Filling in unknown values by exploring similarities between cases
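The first two strategies above can be sketched in a few lines of pandas. The tiny `DataFrame` below is a made-up example for illustration; the column names and values are assumptions, not data from the article:

```python
import pandas as pd

# Hypothetical data set with missing entries (None becomes NaN in pandas)
df = pd.DataFrame({
    "age":  [25, None, 30, 22, None],
    "city": ["NY", "LA", None, "NY", "NY"],
})

# 1. Remove entire observations containing one or more unknown values
dropped = df.dropna()

# 2. Fill in unknown values with the most frequent value (the mode);
#    here only the categorical "city" column is filled this way
mode_filled = df.fillna({"city": df["city"].mode()[0]})
```

Strategies 3 and 4 (exploiting correlations and similarities between cases) need model-based imputers and are discussed further below.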

How can we systematically fill in missing values? Some machine learning implementations automatically remove observations containing missing values (which may introduce bias or affect the representativeness of the results), but in many cases you have to impute the data manually before running the function. As outlined above, there are several options; for example, you might delete all incomplete observations, even though this decreases the power of the analysis by reducing the effective sample size.

The simplest imputation methods treat each variable individually, ignoring any interrelationships with other variables. One option is to replace every missing value with the mean or median of that variable computed over the other observations; using the mean has the convenient property of leaving the sample mean of that variable unchanged, while the median is more robust to outliers. As an example, consider the following sequence of values:

1,2,3,1,3,1,,,,3

The sequence has three missing values, all of which could be replaced by 2, the median of the non-missing values.
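In code, the same median fill can be sketched with NumPy, using `np.nan` to stand in for the blank entries in the sequence:

```python
import numpy as np

# The sequence from the example above; np.nan marks the three blank entries
x = np.array([1, 2, 3, 1, 3, 1, np.nan, np.nan, np.nan, 3], dtype=float)

# Median of the non-missing values: sorted they are 1,1,1,2,3,3,3 -> median 2
med = np.nanmedian(x)

# Replace each missing entry with that median
x_imputed = np.where(np.isnan(x), med, x)
```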

Mean imputation, however, diminishes any correlations involving the variable(s) that are imputed. This is because, in observations with imputation, there is guaranteed to be no relationship between the imputed variable and any other measured variables. Thus, mean imputation has some attractive properties for univariate analysis but becomes problematic for multivariate analysis.
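This attenuation is easy to demonstrate numerically. The sketch below uses synthetic data (the correlation structure, missingness rate, and seed are all assumptions for illustration): it mean-imputes 40% of one variable and compares the correlation before and after.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
y = x + rng.normal(scale=0.5, size=n)   # y is strongly correlated with x

# Knock out 40% of y at random, then mean-impute the gaps
y_missing = y.copy()
mask = rng.random(n) < 0.4
y_missing[mask] = np.nan
y_imputed = np.where(np.isnan(y_missing), np.nanmean(y_missing), y_missing)

# Correlation with x shrinks after mean imputation
r_before = np.corrcoef(x, y)[0, 1]
r_after = np.corrcoef(x, y_imputed)[0, 1]
```

The imputed observations all sit at the sample mean regardless of `x`, which is exactly why the measured correlation weakens.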

In machine learning, it is sometimes possible to train a classifier directly over the original data without imputing it first. This has been shown to yield better performance in cases where the missing data is structurally absent, rather than missing due to measurement noise.

There are many more advanced imputation methods designed to address the problem of missing data. These methods exploit interrelationships between variables and can impute multiple values rather than a single value. In general, missing data and its treatment is an important data quality issue for data scientists. A suitable solution depends on the available computational resources, especially for big-data-scale data sets, as well as on the tolerance for errors in approximating missing values.
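One such method, matching strategy 4 in the list above (exploring similarities between cases), is k-nearest-neighbour imputation. A minimal sketch with scikit-learn's `KNNImputer`, on a made-up two-feature data set where the second feature is roughly twice the first:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical data: feature 2 is about double feature 1; one value is missing
X = np.array([
    [1.0, 2.0],
    [2.0, 4.0],
    [3.0, 6.0],
    [2.1, np.nan],
])

# Fill the gap with the mean of the feature over the 2 most similar
# complete rows (similarity measured on the non-missing features)
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)
```

The nearest complete rows to `[2.1, ?]` are `[2.0, 4.0]` and `[3.0, 6.0]`, so the missing entry is filled with their average, 5.0 – a value consistent with the relationship between the two features, which single-variable mean imputation would ignore.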

If you have a question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidebigdata.com.


Daniel D. Gutierrez – Managing Editor & Resident Data Scientist, insideBIGDATA

Comments

  1. Piyapong Khumrin says:

    I read your article about how to handle missing data at http://insidebigdata.com/2014/10/29/ask-data-scientist-handling-missing-data/, which has been very useful in helping me find a solution for my PhD research.

    In my research, I train a machine learning model on clinical cases to predict diagnoses, and use the model to guide medical students as they try to solve unknown cases in a game-based learning tool.

    I had a good predictive model and reasonable predictions in the learning system, until I discovered that some predictions were quite odd.

    I tried to find the reason and discovered that some features contained 100% missing data for one target class.

    I realised that the value is always missing because doctors never investigate it, since it does not give any benefit for that diagnosis.

    However, when I create a scenario, I have to add that feature value because it should be available if a user wants to know the information.

    Therefore, when a user chooses that feature and the unknown instance contains information for that feature, I use the machine learning model to predict the unknown case for diagnosis X. Because that feature is 100% missing in the training data for diagnosis X, the prediction drops to zero when I try to get a prediction for the unknown case with that feature present: the model never learned from existing data for that feature.

    I have tried to read articles on how to handle missing data, but they deal with partially missing data, not 100% missing data.

    Would it be possible to ask your advice on how to handle 100% missing data?
