In this special guest feature, Devavrat Shah, professor in MIT’s Department of Electrical Engineering and Computer Science, discusses the type of training data scientists need in order to glean the most value from big data. Shah is co-director of the MIT Professional Education Data Science: Data to Insights course, director of MIT’s Statistics & Data Science Center (SDSC), and a core faculty member at the MIT Institute for Data, Systems, and Society (IDSS). He is also a member of MIT’s Laboratory for Information and Decision Systems (LIDS) and the Operations Research Center (ORC).
All sorts of organizations are interested in using big data to glean insights that help them drive efficiency, increase revenue or otherwise improve their businesses. But these insights don’t simply present themselves; it takes trained data scientists who know how to use the tools of their trade to get the best results.
The operative word there is “trained.” While fields such as statistics, analytics, data mining, pattern recognition and more have been around for some time, the idea behind data science is to employ a number of such disciplines in concert to unearth valuable information. Doing so effectively is not something most engineers will be able to pick up by reading a book or dabbling in it for a while; they need proper training.
To be effective, this training must be multidisciplinary, touching on fields including engineering, social science, mathematics and statistics. Ideally, the training should cover areas including the following:
Unstructured data: Data mining tools have long been able to make sense of structured data, such as that found in databases. But today so much of what constitutes big data is unstructured – Word files, presentations, social media feeds, images and more – and conventional data mining tools are ineffective at mining these sources, as most are specialized to deal with data in a known structure. With effective training in the latest techniques, a data scientist will be able to make sense of this data, using advances in machine learning algorithms to find patterns and structures that previously went unidentified.
Regression and prediction: Making sense of big data requires the ability to find relationships among variables, often many variables. That means the data scientist must be trained in regression techniques, including bivariate (two variables) and multivariate (more than two variables) regression procedures. Terms like regression trees, boosted trees, and random forests should be familiar. Similarly, familiarity with modern prediction methods is imperative, including the ability to assess prediction performance using validation samples and cross-validation.
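As a rough illustration of assessing prediction performance with cross-validation, the sketch below fits a bivariate least-squares line and estimates its held-out error over k folds. The data, the function names (`fit_line`, `cross_val_mse`) and the choice of five folds are assumptions made for this example, not part of any particular course or library.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares with one predictor: returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def cross_val_mse(xs, ys, k=5):
    """Average held-out mean squared error over k folds."""
    n = len(xs)
    fold_errors = []
    for fold in range(k):
        # Hold out every k-th point, starting at index `fold`.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        intercept, slope = fit_line(train_x, train_y)
        errors = [(ys[i] - (intercept + slope * xs[i])) ** 2 for i in test_idx]
        fold_errors.append(mean(errors))
    return mean(fold_errors)

# Hypothetical data: y is roughly 2x + 1 with small deviations.
xs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9, 19.2, 21.0]

intercept, slope = fit_line(xs, ys)
cv_error = cross_val_mse(xs, ys, k=5)
print(f"intercept={intercept:.2f}, slope={slope:.2f}, cv_mse={cv_error:.4f}")
```

The point of the held-out error is that each fold is scored only on data the model did not see while fitting, which gives a more honest estimate than the error on the training points themselves.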
Data classification and hypothesis testing: To analyze data effectively, data scientists must understand the various techniques and approaches to classifying it. What’s more, they need to learn how to test hypotheses and detect statistical anomalies, including fraud and other malicious behaviors. They must also understand the limitations of various methods and the dangers of misusing them.
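One very simple way to flag statistical anomalies is a z-score rule: mark any value that lies more than a chosen number of standard deviations from the mean. The sketch below assumes made-up transaction amounts and a hypothetical threshold of 2.5; real fraud detection would rely on far richer methods.

```python
from statistics import mean, stdev

def flag_anomalies(values, threshold=2.5):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

# Hypothetical transaction amounts with one unusually large payment.
amounts = [52, 48, 50, 47, 53, 49, 51, 50, 46, 54, 50, 400]
flagged = flag_anomalies(amounts)
print(flagged)
```

The example also hints at a limitation of the method: the outlier inflates the very mean and standard deviation it is measured against, so in small samples even an extreme value may fail to cross a strict threshold.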
Recommendation systems: We’re all familiar with recommendation systems by now, as they’re an online staple at companies ranging from Amazon and Netflix to LinkedIn and YouTube. Some of them work remarkably well, anticipating what a visitor may want based on past behavior. But designing and building one that is truly useful takes detailed understanding of the principles and algorithms behind these systems.
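To give a flavor of the principles involved, here is a minimal sketch of user-based collaborative filtering, one common family of recommendation algorithms. It is not how any of the companies named above build their systems; the users, items and ratings are invented, and `cosine` and `recommend` are hypothetical helper names.

```python
from math import sqrt

# Hypothetical ratings on a 1-5 scale, keyed by user and then by item.
ratings = {
    "ana":   {"book_a": 5, "book_b": 4, "book_c": 1},
    "ben":   {"book_a": 4, "book_b": 5, "book_d": 5, "book_e": 2},
    "chloe": {"book_a": 1, "book_c": 5, "book_d": 2},
}

def cosine(u, v):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)

def recommend(user, ratings):
    """Rank items the user has not rated by similarity-weighted average rating."""
    totals, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        for item, score in other_ratings.items():
            if item not in ratings[user]:
                totals[item] = totals.get(item, 0.0) + sim * score
                weights[item] = weights.get(item, 0.0) + sim
    scored = {i: totals[i] / weights[i] for i in totals if weights[i] > 0}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)

recs = recommend("ana", ratings)
print(recs)
```

Here the ratings of users whose past behavior resembles the target user's count for more, which is the basic idea behind anticipating what a visitor may want.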
Graphical models and networks: Graphical models can be a powerful way to understand complex information and facilitate statistical computations. They are an important concept in helping us uncover patterns, function and behavior inherent in networks of information, be they gene regulatory networks or social networks. Data scientists must learn methods for analyzing such networks, which begins with learning how to represent a system as a graph and includes analyses such as centrality measures, influence maximization, and using inference to gain insight into different graphical models. This helps them find the local interactions that are indicators of large-scale network effects – the kind that businesses care about.
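Representing a system as a graph and computing a centrality measure can be sketched in a few lines. The example below assumes a small invented social network and uses degree centrality, the simplest such measure: the share of other nodes to which a node is directly connected. The function names are hypothetical.

```python
# Hypothetical undirected social network, given as a list of edges.
edges = [
    ("ana", "ben"),
    ("ana", "chloe"),
    ("ana", "dev"),
    ("ben", "chloe"),
    ("dev", "eli"),
]

def build_graph(edges):
    """Turn an edge list into an adjacency mapping: node -> set of neighbors."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def degree_centrality(graph):
    """Fraction of the other nodes that each node is directly linked to."""
    n = len(graph)
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}

graph = build_graph(edges)
centrality = degree_centrality(graph)
most_central = max(centrality, key=centrality.get)
print(most_central, centrality[most_central])
```

Degree centrality looks only at local connections; other measures, such as betweenness or eigenvector centrality, also account for a node's position in the wider network.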
For best results, the training should also include case studies that bring home how each of the disciplines is used in practice. The case studies included in this course span all of these areas; examples include implementing different types of regression to visualize the gender wage gap and experimenting with deep neural networks to understand how they make decisions. Such case studies can be invaluable in helping data scientists understand how they can put what they learn to use in their own organizations.
A recent Gartner survey found that only 41% of IT professionals thought their organizations were ready for the demands of digital business over the next two years, meaning the remaining 59% did not. Don’t let your organization be one of the unprepared: get your staff some effective training in the data science disciplines that the big data era requires.