Defining the Data Science Landscape


At events, in meetings and in general conversation, it’s struck me that many people use data science, machine learning and artificial intelligence interchangeably. While that’s fine in passing, there are distinctions between them that make each one very different. Here, we look at how to define each of these three categories and why they’re different.

Data science is the craft of turning data into action. Data is being generated and, perhaps more importantly, digitally captured at unprecedented levels. However, abundant data only represents potential value; it has to be mined, refined and harvested. Data science is the process of extracting information, understanding and learning from raw data to inform decision making in a proactive, systematic fashion that can be generalized. A key aspect of data science is the use of the scientific method to form and challenge hypotheses in order to validate conclusions about underlying patterns in data.

Practicing data science requires combining a diverse set of skills. Data scientists need to be able to query and manipulate large volumes of data, so a strong computer science background is a must. Additionally, familiarity with mathematics and statistics helps form a strong understanding of the algorithms commonly deployed and tuned. Combining large amounts of computing power with sophisticated algorithms is called data mining. However, a major hazard of this approach is the potential to mistake noise for signal. Domain expertise is a helpful component in verifying the causal and logical relationships behind models and conclusions.
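
As a quick, hypothetical illustration of that hazard (the data below is synthetic and scikit-learn is assumed to be available), the sketch fits a very flexible model to pure noise. It scores almost perfectly on its own training data, yet the “pattern” it found evaporates on fresh data:

```python
# A minimal sketch of mistaking noise for signal, using synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(seed=0)
X = rng.uniform(0, 1, size=(30, 1))  # a feature with no real relationship...
y = rng.normal(size=30)              # ...to a purely random target

# A very flexible model will happily "find" structure in the training data.
model = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
model.fit(X, y)
print("Training R^2:", model.score(X, y))  # close to 1.0 -- looks like signal

# On fresh noise drawn the same way, the fit collapses: it was never real.
X_new = rng.uniform(0, 1, size=(30, 1))
y_new = rng.normal(size=30)
print("Holdout R^2:", model.score(X_new, y_new))  # typically far below zero
```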

In general, a data scientist needs to know more about statistics than an average programmer, more programming than an average statistician, and be able to apply both skills to solve business problems.

The overall objective of data science may seem straightforward, but implementation is a very complex process and involves a number of steps before the value of a data science product can be observed. Here’s what that looks like:

  1. Business Understanding
  2. Data Understanding
  3. Data Preparation
  4. Modeling
  5. Evaluation by the data science team
  6. Evaluation by stakeholders
  7. Deployment

In the modeling stage, a data scientist will look to apply statistical learning techniques (otherwise known as machine learning techniques or algorithms) to tease out details in the underlying raw data. Machine learning involves the use of statistical computing to understand tendencies, patterns, characteristics, attributes and structure in the underlying data, so as to inform future decisions on new observations. Rather than hand-coding software with specific instructions and custom rules, machine learning algorithms are “trained” on large pools of data and “induce” how to perform a specified task. It’s worth noting that machine learning doesn’t always imply “super-human” performance. In fact, it usually doesn’t. Machine learning techniques win out over human alternatives on scalability and cost: a good machine learning task is something that happens so many times that it’s not cost-effective to have a human do it.
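
As a minimal sketch of that contrast (the sensor readings, labels and cutoff below are all made up, and scikit-learn is assumed), the same toy task is solved first with a hand-coded rule and then with a model that induces the rule from labeled examples:

```python
# A hand-coded rule vs. a "trained" model, on a hypothetical sensor task.
from sklearn.tree import DecisionTreeClassifier

# Toy readings from a sensor and whether each one indicated a fault.
readings = [[0.2], [0.4], [0.5], [0.9], [1.1], [1.4], [1.6], [2.0]]
is_fault = [0,     0,     0,     0,     1,     1,     1,     1]

# Hand-coded approach: a human picks the rule and the cutoff explicitly.
def hand_coded_rule(reading):
    return 1 if reading > 1.0 else 0

# Machine learning approach: the algorithm induces the cutoff from examples.
model = DecisionTreeClassifier(max_depth=1)
model.fit(readings, is_fault)

print(hand_coded_rule(1.2))       # the rule a human wrote: 1
print(model.predict([[1.2]])[0])  # the rule the model learned from data: 1
```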

Popular machine learning algorithms include Linear Regression, Logistic Regression, Decision Trees, Support Vector Machines (SVM), Naive Bayes, K-Nearest Neighbors, K-Means, Random Forests, Principal Component Analysis (PCA), Gradient Boosting and AdaBoost, and Neural Networks, including Deep Learning Neural Networks (“Deep Learning”). A data scientist may use these techniques in isolation or in combination.

Machine learning can be generalized into three broad styles, depending on data availability, the nature of the problem and the desired objectives:

  1. Supervised Learning: Involves creating a model from a training set with a series of inputs and labels. For example, if you wanted to develop a model to predict sale prices of homes, you would start with a training set composed of past sold homes, their attributes (location, bedrooms, etc.) and their final sale prices (the labels); see the sketch after this list.
  2. Unsupervised Learning: Creates a model using a series of data inputs but with no labels. Unsupervised learning seeks out underlying structure, distribution or associations in the data. You might use cluster analysis to group plant life into discrete buckets based on similarity of characteristics, for example (also shown in the sketch after this list). Another example might be highlighting anomalous readings from a temperature probe.
  3. Reinforcement Learning: Focuses on developing agents that learn by interacting with an environment, improving over many interactions as they seek to maximize a reward signal. Game-playing agents and autonomous driving systems, for example, can be trained using reinforcement learning.
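
Here is a minimal sketch of the first two styles, using made-up numbers and assuming scikit-learn is available: a supervised regression trained on labeled home sales, followed by an unsupervised clustering of unlabeled plant measurements:

```python
# Supervised vs. unsupervised learning, on hypothetical toy data.
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# --- Supervised: predict home sale prices from labeled past sales ---
# Features: [square footage, bedrooms]; labels: final sale prices.
homes  = [[1400, 3], [1600, 3], [1700, 4], [2100, 4], [2400, 5]]
prices = [240_000, 265_000, 285_000, 330_000, 370_000]

reg = LinearRegression().fit(homes, prices)
print("Predicted price:", reg.predict([[1800, 4]])[0])

# --- Unsupervised: group plants by measurements, with no labels at all ---
# Features: [leaf length, leaf width] (hypothetical measurements).
plants = [[1.4, 0.2], [1.3, 0.2], [1.5, 0.3],   # one natural grouping
          [4.7, 1.4], [4.5, 1.5], [4.9, 1.5]]   # another

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(plants)
print("Cluster assignments:", clusters)  # e.g. [0 0 0 1 1 1]
```

The regression learns from the labeled prices; the clustering receives no labels and has to discover the two groups on its own. Reinforcement learning is harder to show in a few lines, since the agent needs an environment to interact with.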

Artificial intelligence (AI) is a term with a long history and many accumulated associations. It generally serves as a “catch-all” for all things machine learning and robotics. It was first coined in the middle of the 20th century, as technologists became excited about the prospect of computer programs replicating human cognitive abilities. Aside from some notable milestones (IBM’s Deep Blue defeating Kasparov, for example), it is only recently that the promise of artificial intelligence has started to be realized.

Advancements in machine learning techniques such as deep learning neural nets, innovations in computing (GPUs, for example) and the availability of large training data sets have made it possible for computer programs to take on more human-like tasks, such as image processing, speech comprehension, natural language text recognition, playing board games and even driving cars!  Many of these advancements fall under the category of Applied or Narrow AI, and they often combine deep learning neural nets with approaches such as reinforcement learning (see above). The term Cognitive Computing is sometimes used interchangeably with narrow AI.

Essentially, a great deal of domain-specific training, tuning and data is still required for a program to excel at a particular “human” task. For example, the program that can drive a car cannot quickly be repurposed to compose music. Each program has to be laboriously customized for different applications. The holy grail right now is Broad or General AI: systems that can more genuinely replicate human reasoning and learning. Theoretically, we could have one AI system that can be applied to multiple tasks with minimal switching costs and reconfiguration.

In closing, it’s important to note the distinctions in terminology in the data science landscape.  Perhaps most notably, people must be aware of the differences between data science, machine learning and artificial intelligence. The three shouldn’t be used interchangeably due to fundamental differences in their definitions and in what they deliver. I hope this outline of the data science landscape is helpful in defining the various terms at play and how they come together.

About the Author

Manny Bernabe leads and develops strategic relationships for Uptake’s Data Science team. He works with industry partners, university advisors and business leaders to understand the opportunities for aspiring data-driven organizations. Manny brings a background in the financial services sector, specifically asset management, and has deep expertise in the research and deployment of quantitative strategies in exchange-traded funds (ETFs).

