I recently ran across a thought-provoking post on the USC Annenberg Innovation Lab blog – “Why Do We Need Data Science when We’ve Had Statistics for Centuries.” With all the debate of late surrounding the relatively new term “data science,” I’ve been thinking a lot about this question, so I thought I’d analyze the notion here on insideBIGDATA by picking apart the article. I’d love to hear your take on this, so feel free to leave a note.
Here are some excerpts from the article along with my commentary:
“Use of the term data science is increasingly common, as is big data … but what does it mean? Is there something unique about it? What skills do data scientists need to be productive in a world deluged by data? What are the implications for scientific inquiry?”
This is the big question: how does data science differ from statistics and computer science? I think the answer is related to big data, but not exclusively so. Big data does require a very different technology stack from the one previously used for statistical analysis. Hadoop represents a paradigm shift to address these needs. A statistician from 20 years ago would not be equipped to analyze huge data sets on the time scale that today’s business applications often require.
Does data science involve scientific inquiry? You betcha! I see data science in the same light as, say, the data analysis phase of an astrophysics or genomics project. You’re applying the scientific method to data collected using scientific principles. In a previous life, I applied the scientific method to astrophysical data sets on a routine basis. Now that I’m doing business-oriented data science, I don’t really see a difference.
“… defines data science as being essentially the systematic study of the extraction of knowledge from data. But analyzing data is something people have been doing with statistics and related methods for a while. Why then do we need a new term like data science when we have had statistics for centuries? The fact that we now have huge amounts of data should not in and of itself justify the need for a new term.”
This is the same observation many people are making these days, and the question is quite valid. As I stated above, big data and its 3 V’s (volume, velocity, and variety) definitely contribute to the need for a new science of data. But that’s not the end of it. The use of disparate data sets including social media, the speed of analysis, near real-time deployment requirements, and advances in machine learning and visualization also contribute to the new data science. There really is something new going on!
“In short, it’s all about the difference between explaining and predicting. Data analysis has been generally used as a way of explaining some phenomenon by extracting interesting patterns from individual data sets with well-formulated queries. Data science, on the other hand, aims to discover and extract actionable knowledge from the data, that is, knowledge that can be used to make decisions and predictions, not just to explain what’s going on.”
I really like this differentiation. There is definitely an engineering component of modern data science. The work of data scientists ultimately becomes part of production systems; think Amazon’s or Netflix’s recommender systems. This aspect is relatively new – the actionable part. Many new start-ups are driven by actionable knowledge from machine learning applications. This is light-years beyond yesterday’s data analysis.
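To make the explaining-versus-predicting distinction concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the toy numbers (weekly ad spend versus units sold) are hypothetical, and the helper names `fit_line` and `predict` are my own, not taken from the article. The same fitted line can be read two ways: its slope explains the data already collected, while calling `predict` on an unseen input produces something a production system could act on.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical toy data, invented purely for illustration.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
units_sold = [12.0, 14.0, 16.0, 18.0, 20.0]

model = fit_line(ad_spend, units_sold)

# Explaining: the slope summarizes the relationship in the data we have.
slope, intercept = model

# Predicting: the same model yields an estimate for a spend level we
# have not observed, which is the "actionable" part.
forecast = predict(model, 6.0)
```

A real recommender system is of course far more elaborate, but the division of labor is the same: fit on historical data, then serve predictions on new inputs.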
“The raw materials of data science are not independent data sets, no matter how large they are, but heterogeneous, unstructured data sets of all kinds – e.g., text, images, video. The data scientist will not simply analyze the data, but will look at it from many angles, with the hope of discovering new insights.”
This is another huge reason why today’s data science differs from what was done previously. The variety of big data goes way beyond the data warehouse that was the common denominator a decade ago. Diversity of data has led data science in a number of exciting new directions; think sentiment and credibility analysis algorithms.
“Most of us are trained to believe theory must originate in the human mind based on prior theory, with data then gathered to demonstrate the validity of the theory. Machine learning turns this process around. Given a large trove of data, the computer taunts us by saying, If only you knew what question to ask me, I would give you some very interesting answers based on the data. Such a capability is powerful since we often do not know what question to ask. . .”
So true! Unsupervised statistical learning, coupled with the processing power to yield insightful clusters, allows us to ask new questions. In days gone by, these questions largely remained unanswered.
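As a rough illustration of what unsupervised learning looks like in practice, here is a minimal k-means clustering sketch in plain Python. K-means is my choice of example, not a method named in the article; the sample points are hypothetical, and seeding the centroids with the first `k` points is a simplifying assumption (real libraries use smarter initialization). No labels are supplied: the algorithm discovers the groups on its own.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Cluster points into k groups; returns (centroids, labels)."""
    # Assumption for this sketch: seed centroids with the first k points.
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(
                    tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
                )
            else:
                # Keep the old centroid if no point was assigned to it.
                new_centroids.append(centroids[i])
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = [
        min(range(k), key=lambda i: dist(p, centroids[i])) for p in points
    ]
    return centroids, labels


# Hypothetical 2-D observations forming two loose groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.0), (1.0, 0.5), (9.0, 8.5)]
centroids, labels = kmeans(points, k=2)
```

Once the clusters exist, the interesting work begins: asking what the members of each group have in common is exactly the kind of new question the excerpt describes.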
“Data scientists should also have good computer science skills – including data structures, algorithms, systems and scripting languages – as well as a good understanding of correlation, causation and related concepts which are central to modeling exercises involving data.”
As I mentioned above, the marriage of statistical methods and computer science is really the crux of the new discipline of data science. It is for this reason that I believe data science is justified as a distinct field of study. Further, I see it evolving quickly, especially in the past couple of years. The next 5 years should be an exciting time to be a data scientist.
“Like computing, one of the most exciting parts of data science is that it can be applied to many domains of knowledge. But doing so effectively requires domain expertise to identify the important problems to solve in a given area, the kinds of questions we should be asking and the kinds of answers we should be looking for, as well as how to best present whatever insights are discovered so they can be understood by domain practitioners in their own terms.”
This declaration is very well articulated. This is what I love most about data science – it can be applied to any field. A good data scientist has experience working with domain experts to pick their brains on critical parameters of the business. Sure, a data scientist with specific knowledge of, say, agriculture would be ideal, but that isn’t strictly necessary. We’re generally pretty quick studies!
Daniel – Managing Editor, insideBIGDATA