A new book, “Why: A Guide to Finding and Using Causes,” by Stevens Institute of Technology assistant professor of computer science Samantha Kleinberg, is a necessary addition to any data scientist’s bookshelf, as it helps bring focus to the dreaded “correlation does not imply causation” conundrum that affects our understanding of data-centric problems.
The best outcome of Big Data analytics, or of any computational model, is a number of correlations, each with a level of confidence that the correlation holds true in the real world, or at least in the world represented by the data. But to determine whether a correlation is true in the real world, it must be verified empirically. This can be viewed as the First Law of Data Science, and it is a good reason why this book is so important for data scientists. It applies to all data-driven modeling and analysis. It is a fundamental law of science, of the Scientific Method.
Since there is a strong, direct correlation between rain and umbrella use, one might conclude that using an umbrella causes rain. Or does rain cause umbrellas to be used? Can data analysis tell us which is true? Unfortunately, correlations derived by data-driven models do not imply causation. At best, data analysis could suggest a higher probability that rain causes umbrellas to be used. Much depends on the analytical model, on the data, and on data curation: what data were used and how they were prepared for analysis. A more detailed data analysis would reveal that rain typically precedes umbrella use, leading to the obvious conclusion that it is not umbrellas that cause rain.
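The umbrella example can be made concrete with a small sketch. The code below is illustrative only and is not taken from Kleinberg's book: the `rain` and `umbrella` series are synthetic data invented for this example, and the helpers `pearson` and `lagged_corr` are hypothetical names. It shows that a plain correlation is symmetric, so it cannot indicate direction, while comparing lagged correlations can at least show which series tends to come first.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def lagged_corr(lead, follow, lag):
    """Correlation between lead[t] and follow[t + lag], for lag >= 0."""
    return pearson(lead[: len(lead) - lag], follow[lag:])


# Synthetic hourly observations (assumed data): 1 = raining, 0 = dry.
rain = [0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0]
# Assumption for the sketch: umbrellas appear one hour after rain starts.
umbrella = [0] + rain[:-1]

# Same-time correlation is identical in both directions.
same_time = pearson(rain, umbrella)
same_time_reversed = pearson(umbrella, rain)

# Lagged correlations differ depending on which series leads.
rain_leads = lagged_corr(rain, umbrella, 1)
umbrella_leads = lagged_corr(umbrella, rain, 1)

print(f"same-time correlation: {same_time:.3f}")
print(f"rain leads umbrellas:  {rain_leads:.3f}")
print(f"umbrellas lead rain:   {umbrella_leads:.3f}")
```

In this toy data the "rain leads" correlation is the larger of the two, which matches the intuition that rain comes first. Temporal precedence of this kind is only suggestive: a hidden common cause could produce the same pattern, so empirical verification is still required.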
Richard Feynman, in his famous “Cargo Cult Science” speech at the 1974 Caltech commencement, challenges us as scientists to confront our confirmation biases: “The first principle is that you must not fool yourself, and you are the easiest person to fool.”
With causality rising in importance in fields like data science and machine learning, Kleinberg has written a very timely book on causes that goes far beyond the usual cognitive bias and correlation-is-not-causation material. She demonstrates how to tease out causes from masses of data, describes experiments that can help verify or disprove causation, and explains how to move from causality to decision-making. All of the techniques are imperfect, leaving plenty of room for human judgment.
The book is well written and enjoyable. The author acknowledges that the problem of defining causality has not been solved. She advocates for continued experimental and observational studies, as well as methods of overcoming or limiting bias in determining and applying causation. Further, she is careful to point out the limits of causality in determining policy. Big data can lay out many patterns, but it takes domain knowledge coupled with common sense to see which ones can even be considered causal. It also takes a sense for political and ethical calculus to understand what should be done with the patterns that are found. Kleinberg covers all this thoroughly, which makes her book valuable beyond the technical material discussed in it.
The book has 10 chapters that introduce the reader to the field of causality (which involves both experimental philosophy and computer science) and then discuss it in terms of such interesting topics as the psychology of causality, correlation, time, observation, computation, experimentation, explanation, and action. I’ve already recommended the book to my introductory data science students.