I recently received the following question on data science methods from an avid reader of insideBIGDATA who hails from Taiwan. I think the topics are very relevant to many folks in our audience so I decided to run it here in our *Data Science 101* channel. If you have more thoughts on the subject, or would like to contribute to the discussion, please leave a comment below.

**Question:** I’ve been reading your articles on insideBIGDATA and have learned so much! I started my own pet project, trying to predict whether the person has diabetes based on a few features. Here are the top 5 rows of the data set.

This is a small data set (only 768 people). I have two questions:

- 374 cases are missing ‘Insulin’. That’s more than 50% missing. I personally believe that ‘Insulin’ could be a great and important predictor, but if this many cases are missing, is it best to discard this feature completely?
- ‘SkinThickness’ is missing from 227 cases. This is a large percentage too, but I found that there’s a strong correlation between ‘SkinThickness’ and ‘BMI’ (see figure below), which makes sense too. Am I introducing data leakage if I fill in missing ‘SkinThickness’ values based on BMI? My plan is to have ranges of BMI, say 20-25, 25-30, 30-35, etc., and get the mean for each range. Depending on which range the person’s BMI falls in, I’ll fill in the missing ‘SkinThickness’ with the mean corresponding to that range. Is this a good approach?

**Answer:** Thank you for your question as I think many data science practitioners can relate to the situation you describe. Let’s take a look at part #1 first. Often you’ll commence work on a new data science project only to find the source data set incomplete, i.e. there are null values for certain predictors that can help predict the response variable value. This is where data science gets nuanced since there is no absolute right or wrong answer here. In fact, you’ll likely try a number of different approaches during the data science process.

First, you might want to delete all incomplete observations, even though this decreases the power of the analysis by shrinking the effective sample size. Removing observations with missing values can also bias the model: the subjects with missing values may differ from the subjects without them (e.g., when values are missing non-randomly), so the sample that remains after deletion is no longer representative.
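To make the cost of deleting incomplete observations concrete, here is a minimal Python sketch. The records are made-up toy values (not the reader's data), and drop_incomplete() is a hypothetical helper name; the only real figures are the 768 rows and 374 missing 'Insulin' values quoted in the question.

```python
# Toy records with made-up values; None marks a missing entry.
rows = [
    {"BMI": 33.6, "SkinThickness": 35, "Insulin": None},
    {"BMI": 26.6, "SkinThickness": 29, "Insulin": None},
    {"BMI": 23.3, "SkinThickness": None, "Insulin": None},
    {"BMI": 28.1, "SkinThickness": 23, "Insulin": 94},
    {"BMI": 43.1, "SkinThickness": 35, "Insulin": 168},
]


def drop_incomplete(records):
    """Listwise deletion: keep only records with no missing values."""
    return [r for r in records if all(v is not None for v in r.values())]


complete = drop_incomplete(rows)
print(f"{len(complete)} of {len(rows)} toy rows survive deletion")

# Figures from the question: 768 people, 374 of them missing 'Insulin'.
# This is an upper bound; rows missing other features reduce it further.
max_complete = 768 - 374
print(max_complete, round(max_complete / 768, 3))
```

Even before counting rows that are missing other features, dropping every row without 'Insulin' would leave at most 394 of the 768 people, roughly half the sample.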

Rather than removing the incomplete observations outright, you can explore the effect of removing data by using a language mechanism to temporarily ignore incomplete observations for the operation being performed. For example, in R some functions (like cor() for computing the correlation coefficient for two variables) include the use = "complete.obs" argument that temporarily throws away the incomplete observations for the calculation.
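The same idea can be sketched in Python. This is not a port of R's cor(); pearson_complete_obs() is a hypothetical helper that skips incomplete pairs only for the duration of the calculation and leaves the underlying lists untouched. The values are made up for illustration.

```python
import math


def pearson_complete_obs(xs, ys):
    """Pearson correlation using only pairs where both values are present."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two complete pairs")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)


# Toy values; None marks a missing 'SkinThickness' measurement.
bmi = [33.6, 26.6, 23.3, 28.1, 43.1, 25.6]
skin = [35, 29, None, 23, 35, None]

r = pearson_complete_obs(bmi, skin)
print(f"correlation on complete pairs: {r:.3f}")
print(f"rows still in the data set: {len(skin)}")
```

Because nothing is deleted, you can compare statistics computed with and without the incomplete rows before committing to either approach.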

Another obvious option is to delete the incomplete observations and then get more “complete” data! Sadly, that’s not always possible. In this case it’s hard to justify removing the ‘Insulin’ predictor altogether, since it is likely to be a statistically significant predictor in whatever model you select.

Now let’s examine part #2. This question frames the process of “imputation” very well. Many times, you can simulate missing continuous predictor values by using a language-based construct like the impute() function found in R’s e1071 package to set missing values to the mean or median of other data values. You can use this approach directly on the ‘SkinThickness’ variable, or you can experiment with your idea to discretize ‘BMI’ and then use the correlation you discovered between the two variables.
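The binned-mean idea from the question can be sketched as follows. This is a Python illustration under stated assumptions, not a port of e1071's impute(): the helper names, the toy values, and the fallback to the overall mean for empty bins are all my own choices. One common precaution against leakage, which the sketch follows, is to compute the bin means from the training rows only and then reuse those same means for any held-out rows.

```python
from statistics import mean

BIN_WIDTH = 5  # BMI ranges 20-25, 25-30, ... as proposed in the question


def bmi_bin(bmi):
    """Lower edge of the BMI range a value falls in, e.g. 27.3 -> 25."""
    return int(bmi // BIN_WIDTH) * BIN_WIDTH


def fit_bin_means(train_rows):
    """Mean SkinThickness per BMI bin, plus the overall mean as a fallback."""
    by_bin = {}
    for row in train_rows:
        if row["SkinThickness"] is not None:
            by_bin.setdefault(bmi_bin(row["BMI"]), []).append(row["SkinThickness"])
    observed = [v for values in by_bin.values() for v in values]
    return {b: mean(values) for b, values in by_bin.items()}, mean(observed)


def impute_skin(rows, bin_means, overall_mean):
    """Return copies of rows with missing SkinThickness filled in."""
    filled = []
    for row in rows:
        row = dict(row)  # copy so the original data is left unchanged
        if row["SkinThickness"] is None:
            # Bins never seen in training fall back to the overall mean.
            row["SkinThickness"] = bin_means.get(bmi_bin(row["BMI"]), overall_mean)
        filled.append(row)
    return filled


# Toy data with made-up values; assumes BMI itself is never missing.
train_rows = [
    {"BMI": 22.0, "SkinThickness": 18},
    {"BMI": 24.5, "SkinThickness": 22},
    {"BMI": 27.0, "SkinThickness": 28},
    {"BMI": 29.9, "SkinThickness": 32},
    {"BMI": 26.0, "SkinThickness": None},
]
new_rows = [
    {"BMI": 23.1, "SkinThickness": None},
    {"BMI": 41.0, "SkinThickness": None},
]

bin_means, overall_mean = fit_bin_means(train_rows)  # training rows only
train_filled = impute_skin(train_rows, bin_means, overall_mean)
new_filled = impute_skin(new_rows, bin_means, overall_mean)
print(bin_means, overall_mean)
print([row["SkinThickness"] for row in new_filled])
```

Comparing a model trained on these imputed values against one trained with a plain overall mean or median is a reasonable way to check whether the extra binning step actually helps.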

Please take a look at a couple of articles I wrote a few years back when we partnered with Intel on our “Ask a Data Scientist” series – “Handling Missing Data” and also “Data Leakage.”

Finally, here is a good flow chart to follow for missing data issues from an excellent article that thoroughly details the subject: “How to Handle Missing Data,” by Alvira Swalin, a Master’s student in data science.

*Contributed by Daniel D. Gutierrez, Managing Editor and Resident Data Scientist of insideBIGDATA. In addition to being a tech journalist, Daniel is also a practicing data scientist, author, and educator, and sits on a number of advisory boards for various start-up companies.*
