Ask a Data Scientist: Unsupervised Learning

Welcome back to “Ask a Data Scientist,” our series of articles sponsored by Intel. Once a week you’ll see reader-submitted questions of varying levels of technical detail answered by a practicing data scientist, sometimes by me and other times by an Intel data scientist. Think of this insideBIGDATA feature as a resource for getting up to speed in this flourishing area of technology. If you have a data science question you’d like answered, please enter a comment below or send an e-mail to me at: daniel@insidehpc.com. This week’s question is from a reader who asks for an overview of unsupervised machine learning.

Q: Can you give an overview of unsupervised learning?

A: This is an important question because the most common techniques in data science projects are predictive methods such as regression and classification, both of which fall under the umbrella of supervised learning. These algorithms use labeled data sets to build models that predict the labels of new observations. By contrast, “unsupervised” refers to the fact that we’re trying to understand the structure of the underlying data rather than optimize for a specific, pre-labeled criterion. As such, unsupervised learning has great potential for discovering patterns in unlabeled data sets, patterns that can be used, for example, to construct clusters of similar data or to reduce the dimensionality of a data set.

One of the difficulties with unsupervised learning is judging the quality of the results. For example, some or all of the clusters output by a clustering algorithm might have no value for the intended analysis. The “goodness” of the clusters can be evaluated using metrics such as intra-cluster distance (how tightly the points within a cluster are grouped) and inter-cluster distance (how far apart the clusters are from one another), but it is still the responsibility of the data scientist to verify that the results are interpretable. The key point here is that results from unsupervised learning are often best evaluated by their effect on subsequent analysis.
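To make the distance metrics concrete, here is a minimal sketch using only the Python standard library. The function names, the choice of centroid-based averages, and the toy data are all assumptions made for illustration; there are several other ways to define intra- and inter-cluster distance, and in practice a library implementation of an established score would likely be used instead.

```python
import math
from itertools import combinations

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def mean_intra_cluster_distance(clusters):
    """Average distance from each point to its own cluster centroid (lower = tighter)."""
    total, count = 0.0, 0
    for points in clusters:
        c = centroid(points)
        total += sum(math.dist(p, c) for p in points)
        count += len(points)
    return total / count

def mean_inter_cluster_distance(clusters):
    """Average distance between pairs of cluster centroids (higher = better separated)."""
    centroids = [centroid(points) for points in clusters]
    pairs = list(combinations(centroids, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

# Toy data (hypothetical): two well-separated clusters in two dimensions.
clusters = [
    [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)],
    [(10.0, 10.0), (10.0, 12.0), (12.0, 10.0), (12.0, 12.0)],
]

intra = mean_intra_cluster_distance(clusters)
inter = mean_inter_cluster_distance(clusters)
print(f"intra={intra:.3f}  inter={inter:.3f}  ratio={inter / intra:.2f}")
```

A large inter-to-intra ratio suggests compact, well-separated clusters, but as noted above it says nothing about whether those clusters are meaningful for the analysis at hand.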

The value of unsupervised learning continues to grow in response to the need for more robust techniques that can deal with the volume, variety, and velocity of big data. One example is a telecommunications company using a k-means clustering algorithm to segment its customer population into demographic groups. These groups can then be used when training a supervised classification algorithm to predict customer churn, which can produce more accurate predictions than a model trained without customer segmentation. Another example is found with e-commerce websites that want to identify groups of similar customers based on clickstream patterns and purchase histories. Grouping customers with similar behavior and/or preferences lets a company execute a more effective targeted marketing campaign. The figure below depicts the process of customer segmentation.

[Figure: the customer segmentation process]
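The segmentation step in the telecommunications example can be sketched as follows. This is a bare-bones k-means (Lloyd's algorithm) written with the standard library only; the customer features, the choice of k = 2, and the deterministic "first k points" initialization are assumptions chosen to keep the example small and repeatable. A production system would more likely rely on a library implementation with a better initialization strategy.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm. Assumption: the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the algorithm has converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical customers described by two features that are assumed to be
# already scaled to comparable ranges (e.g. usage level, monthly spend).
customers = [
    (1.0, 1.2), (8.0, 9.0), (1.5, 0.8),
    (9.0, 8.5), (0.5, 1.0), (8.5, 9.5),
]

k = 2
labels, centroids = kmeans(customers, k)

# Append a one-hot encoding of the segment to each customer's features so a
# downstream supervised model (not shown) can use the segment as an input.
augmented = [
    features + tuple(1.0 if j == lab else 0.0 for j in range(k))
    for features, lab in zip(customers, labels)
]

for row in augmented:
    print(row)
```

The augmented rows are what would be handed to the churn classifier described above; whether the extra segment features actually improve accuracy has to be checked on held-out data.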

Many data science applications use a hybrid learning method in which unsupervised algorithms serve as a preprocessing step that feeds a supervised learning algorithm. This pattern is commonly found in deep learning and in ensemble learning systems. Unsupervised techniques such as principal component analysis (PCA) can be used for dimensionality reduction, which reduces the number of feature variables while retaining most of the variance in the data. The reduced data set can then be used with a supervised learning algorithm. In this way, PCA can make the learning process faster and less prone to fitting noise.
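As a rough illustration of dimensionality reduction with PCA, the sketch below finds only the first principal component, using power iteration on the sample covariance matrix. The function name, the toy data, and the use of power iteration (rather than a full eigendecomposition or SVD, which a numerical library would normally provide) are assumptions made to keep the example dependency-free.

```python
import math

def pca_first_component(rows, iterations=200):
    """Return (direction, explained_variance_ratio, projections) for the top component.

    Assumes the data is not constant and that the starting vector is not
    orthogonal to the top eigenvector, which holds for the toy data below.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(col) / n for col in zip(*rows)]
    centered = [[x - m for x, m in zip(row, means)] for row in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the variance captured along v.
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    # Project each centered row onto the component: d features become one.
    projections = [sum(x * c for x, c in zip(row, v)) for row in centered]
    return v, eigenvalue / total_variance, projections

# Hypothetical data: the second feature is roughly twice the first,
# and the third barely varies, so one component carries most of the variance.
rows = [
    (1.0, 2.0, 0.5),
    (2.0, 4.1, 0.4),
    (3.0, 5.9, 0.6),
    (4.0, 8.0, 0.5),
    (5.0, 10.1, 0.5),
]

direction, ratio, reduced = pca_first_component(rows)
print("direction:", [round(x, 3) for x in direction])
print("explained variance ratio:", round(ratio, 4))
print("reduced feature:", [round(x, 3) for x in reduced])
```

Here three feature variables collapse into a single derived feature, and the explained variance ratio indicates how much information that single feature retains. The reduced values are what would be passed to the supervised learner.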

Data Scientist: Dr. Andrew W. Wicker is a Data Scientist with the Graph Analytics Operation team at Intel Corporation. He focuses on researching and developing solutions to problems at the intersection of large-scale machine learning and graph analytics. Prior to joining Intel, Wicker worked as a Senior Computer Scientist at MITRE Corporation, where he employed machine learning techniques to influence the policy of government sponsors.

Wicker earned a Ph.D. in Computer Science from North Carolina State University. He has a strong interest in social network analysis, and enjoys using curiosity and creativity to solve problems in a multidisciplinary field.
