
Ask a Data Scientist: Confounding Variables

Welcome back to our series of articles sponsored by Intel – “Ask a Data Scientist.” Once a week you’ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist – sometimes by me and other times by an Intel data scientist. Think of this new insideBIGDATA feature as a valuable resource for you to get up to speed in this flourishing area of technology. If you have a data science question you’d like answered, please just enter a comment below, or send an e-mail to me at: daniel@insidehpc.com. This week’s question is from a reader who asks for an explanation of confounding variables and why they’re important in data science projects.

Q: Can you explain confounding variables and why they’re important?

A: The Merriam-Webster dictionary defines “confound” as “to fail to discern differences between: mix up.” Translating this general definition to data science, a confounder is a variable that is correlated with both the response variable and the predictor variables. Confounders can have a large effect on the accuracy of a machine learning model. For instance, in a common regression problem, an omitted confounder can shift the regression line or even flip the sign of its slope.
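The sign flip mentioned above is known as Simpson's paradox, and it is easy to demonstrate. The sketch below uses made-up numbers: within each level of a confounder z, y falls as x rises, yet pooling the data reverses the apparent relationship.

```python
# Hypothetical toy data showing how an omitted confounder can flip
# the sign of a regression slope (Simpson's paradox).

def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Two groups defined by a confounder z; within each group, y falls as x rises.
group_a = [(1, 5), (2, 4), (3, 3)]   # z = 0
group_b = [(4, 10), (5, 9), (6, 8)]  # z = 1

xs, ys = zip(*(group_a + group_b))
print(ols_slope(xs, ys))           # pooled slope is positive (about 1.06)
print(ols_slope(*zip(*group_a)))   # within-group slope is -1.0
print(ols_slope(*zip(*group_b)))   # within-group slope is -1.0
```

A model fit on the pooled data would conclude that increasing x increases y, when in fact the opposite is true at every level of the confounder.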

Two variables are confounded when their effects cannot be separated from each other. In a data science project, this problem arises when some variable other than the predictor may have caused the effect being studied. The confounding variable reduces the internal validity of the analysis: one can no longer say for sure that the predictor caused the effect. The confounder changes along with the predictor, though it was not intended to, so the effect cannot be attributed to the predictor alone; it may well have been caused by the confounder.

In some situations, confounded variables are difficult to avoid. Consider, for example, an experimenter effect. Most participants who volunteer for user studies wish the researchers well and hope they succeed, so if they know which condition the experimenters favor, they may evaluate it more positively. To avoid confounded variables, it is important to take possible bias into account, to make sure participants are assigned randomly to experimental conditions, and to verify that the predictor variable is the only element that could be causing the effect.
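A quick simulation (with invented "goodwill" scores) illustrates why random assignment helps: when enthusiastic participants self-select into the favored condition, the groups differ systematically before the study even begins, whereas a coin flip leaves the groups balanced on average.

```python
import random

random.seed(0)

# Hypothetical participant pool: each person has a latent "goodwill"
# score that could bias ratings of whichever condition they think
# the experimenters favor.
goodwill = [random.gauss(0, 1) for _ in range(1000)]

# Self-selection: enthusiastic participants gravitate to the favored condition.
favored_sel = [g for g in goodwill if g > 0]
control_sel = [g for g in goodwill if g <= 0]

# Random assignment: a coin flip decides each participant's condition.
favored_rnd, control_rnd = [], []
for g in goodwill:
    (favored_rnd if random.random() < 0.5 else control_rnd).append(g)

def mean(xs):
    return sum(xs) / len(xs)

gap_selected = mean(favored_sel) - mean(control_sel)
gap_random = mean(favored_rnd) - mean(control_rnd)
print(gap_selected)  # large gap: goodwill is confounded with condition
print(gap_random)    # near zero: randomization balances the confounder
```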

Let’s consider an example: estimating the causal effect of advertising on sales. The difficulty is that there are many confounding variables, such as seasonality or weather conditions, that cause both increased ad exposures and increased purchases by consumers. Consider the case of an advertising manager who was asked why she thought her ads were effective. In response, she produced a chart showing that every December she increased her ad spend and, sure enough, purchases went up. In this case, of course, seasonality can simply be included in the model. In general, however, there will be other confounding variables that affect both exposure to ads and the propensity to purchase, which makes causal interpretations of observed relationships problematic.
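One simple way to account for seasonality is stratification: estimate the ad effect separately within each season and combine the estimates. The sketch below uses hypothetical monthly figures in which the true effect is 1 unit of purchases per unit of ad spend, but a December bump inflates the naive pooled estimate several-fold.

```python
# Hypothetical monthly data: ad spend and purchases both jump in December,
# so the pooled ad-spend coefficient overstates the true ad effect.

def ols_slope(xs, ys):
    """Ordinary least-squares slope of y on x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
            / sum((x - mx) ** 2 for x in xs))

# (ad_spend, purchases) pairs; the true ad effect is 1 unit per unit spend.
other_months = [(1, 11), (2, 12), (3, 13)]
december     = [(5, 35), (6, 36), (7, 37)]  # seasonality adds a big bump

xs, ys = zip(*(other_months + december))
pooled = ols_slope(xs, ys)

# Stratifying on season removes the confounding:
within = [ols_slope(*zip(*s)) for s in (other_months, december)]
adjusted = sum(within) / len(within)

print(pooled)    # roughly 5.29, about 5x the true effect
print(adjusted)  # 1.0, the true effect
```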

Several approaches can be taken to avoid losing accuracy to confounded variables. The best approach is to design the project so that confounding variables are avoided. Confounders can sometimes be detected by careful exploration of the data set, e.g., using methods of exploratory data analysis (EDA). When this is not possible, other precautions can be taken that still allow the data scientist to draw valid conclusions. First, one can use demographic data to measure possible confounding. When conducting a consumer study, it is useful to collect additional demographic data so that suspected confounding can be objectively evaluated. Systematic differences in such variables between conditions would indicate confounding; conversely, if the experimental groups do not differ on these measures, the analysis is strengthened. For example, when evaluating a weight loss system, researchers could collect information about education levels and attitudes toward healthy living, and then check whether there are systematic differences between the experimental groups on these variables.
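Such a balance check is often done with a standardized mean difference. A minimal sketch, using hypothetical education data for two groups (the numbers and the |d| > 0.1 rule of thumb are illustrative, not from the original study):

```python
# Hypothetical balance check: compare a demographic variable (years of
# education) across experimental groups. A large standardized difference
# suggests the groups differ systematically, i.e. possible confounding.

def standardized_diff(a, b):
    """Difference in means, scaled by the pooled standard deviation."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    pooled_sd = ((va + vb) / 2) ** 0.5
    return (ma - mb) / pooled_sd

treatment = [12, 14, 16, 13, 15]  # education years, hypothetical
control   = [10, 12, 13, 11, 14]

d = standardized_diff(treatment, control)
print(abs(d))  # a common rule of thumb flags |d| > 0.1 as imbalance
```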

Another way to avoid making predictions based on confounded variables is to include complementary outcome measures in the study. When such complementary measures are used, contradictions in their outcomes may indicate the presence of confounded variables. Finally, some designs and analyses can take potential confounding into account and even exploit it: multivariate statistical analysis, such as a regression model that includes the suspected confounder as an additional predictor, can adjust for the confounder’s effect.
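As a minimal sketch of that adjustment (with made-up data where y = 1·x + 4·z exactly, and x correlated with the confounder z): fitting y on x alone gives a biased slope, while a two-predictor fit recovers both true coefficients.

```python
# Hypothetical sketch of adjusting for a confounder by including it as
# an extra predictor and solving the normal equations X'X b = X'y.

def solve(a, b):
    """Solve the linear system a @ coeffs = b by Gauss-Jordan elimination."""
    n = len(a)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(m[r][i]))  # partial pivoting
        m[i], m[p] = m[p], m[i]
        for r in range(n):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [v - f * w for v, w in zip(m[r], m[i])]
    return [m[i][n] / m[i][i] for i in range(n)]

# (x, z, y) rows; z is the confounder and y = 1*x + 4*z exactly.
data = [(1, 0, 1), (2, 0, 2), (3, 0, 3), (2, 1, 6), (3, 1, 7), (4, 1, 8)]

# Build the normal equations for the design columns [1, x, z].
cols = [[1.0] * len(data), [r[0] for r in data], [r[1] for r in data]]
y = [r[2] for r in data]
xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]

b0, bx, bz = solve(xtx, xty)
print(round(bx, 6), round(bz, 6))  # recovers the true effects 1.0 and 4.0
```

In practice one would use an established library rather than hand-rolled elimination, but the principle is the same: once the confounder enters the model, the predictor’s coefficient reflects its effect with the confounder held fixed.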


Data Scientist: Daniel D. Gutierrez – Managing Editor, insideBIGDATA
