I ran across a Tweet recently that pointed me to a discussion over on Professor Andrew Gelman’s blog, “Statistics is the least important part of data science.” Dr. Gelman is a Professor of Statistics and Political Science at Columbia University and the former Ph.D. adviser of Rachel Schutt, author of Doing Data Science, which I reviewed earlier this month.
Having worked in the data science arena long before that name was even used, and having gone through an aborted graduate program in computer science and mathematical statistics some years ago, I think I have a unique perspective on how data science manifests itself today. I see it as a confluence of disciplines: computer science, mathematical statistics, probability theory, machine learning, data analysis, and visualization. Being a theorist specializing in applications of machine learning, I can safely say that I’d be hard pressed to minimize the importance of any of these disciplines, especially statistics, in its overall impact on the field.
Here’s how Dr. Gelman sees it:
There’s so much that goes on with data that is about computing, not statistics. I do think it would be fair to consider statistics (which includes sampling, experimental design, and data collection as well as data analysis (which itself includes model building, visualization, and model checking as well as inference)) as a subset of data science. . . . The tech industry has always had to deal with databases and coding; that stuff is a necessity. The statistical part of data science is more of an option. To put it another way: you can do tech without statistics but you can’t do it without coding and databases … Statistics can do all sorts of things. I love statistics! But it’s not the most important part of data science, or even close.
Coming from a professor of statistics, that is a pretty self-deprecating attitude. I take exception to that view of statistics because I think it focuses on the “experimentalist” part of data science to a greater degree than is appropriate. The “theorist” part that I do is of utmost importance as a precursor to production systems that incorporate my models. So I respectfully disagree: statistics plays just as important a role as computer science in the overall scheme of things for data science. Data science and machine learning are applied mathematics (statistics) at its best!