Data science is an interdisciplinary field that uses processes and systems to extract knowledge and insights from data in various forms, both structured and unstructured. The field is proving transformative for a broad range of organizations by delivering real business value from enterprise data assets. At its core, data science involves using automated methods to analyze massive amounts of data and extract knowledge from them. With such automated methods turning up everywhere from retail to genomics, data science is helping to create new branches of knowledge discovery and predictive analytics. The trend is expected to accelerate in the coming years as the volume of data grows from sensors, sophisticated instruments, the web, and more.
Although use of the term “data science” has exploded in business environments, many academics and journalists see no distinction between data science and statistics. Writing in Forbes, analyst Gil Press argues that data science is a buzzword without a clear definition and has simply replaced “business analytics” in various contexts. In the question-and-answer section of his keynote address at the Joint Statistical Meetings of the American Statistical Association, noted applied statistician Nate Silver said, “I think data-scientist is a sexed up term for a statistician…. Statistics is a branch of science. Data scientist is slightly redundant in some way and people shouldn’t berate the term statistician.”
In this article, we’ll make sense of data science for those unacquainted with the field and outline seven easy steps for getting up to speed with the technology. In doing so, we’ll highlight the integral steps in the “data science process,” so you can get a good grasp of how data science works and how it delivers value to enterprises seeking to make the most of their data assets.
Understand the problem – It is important to fully understand the problem domain, enlisting the help of domain experts to explain business considerations, provide data sets, define key variables, and most importantly, state the goals of the project, i.e., what is to be predicted or discovered. A recipe for disaster with any data science project is to start without clear goals. If the patron of the project simply says, “Here’s some data, now go work your magic,” you might as well run for the hills. It will be very difficult to provide results with such unrealistic expectations. Instead, it’s necessary to take time during this integral step of the process and make sure that everyone who is invested in the success of the project weighs in on the goals. Depending on the project, you may need to interact with people in marketing, finance, operations, IT and even human resources; you’ll often get input from more than one department.
The right way to approach data science is to start with a problem that has a bottom-line impact on your business, and then work backward from the problem to the analysis and the data needed to solve it. Insights don’t happen in a vacuum – they come from the hard work of analyzing data and building models to solve real problems. Sometimes the link between the business problem and the application of data science is very clear, as in the case of correctly identifying fraudulent credit card transactions. In other cases, multiple steps may separate the business problem from the data science application. More often than not, the genesis and ultimate driving force for a data science project is a department-level need (e.g., finance, sales, or marketing), not the IT department, although IT may need to provide the data set(s) you’ll use for machine learning.
Locate and secure the appropriate data sets for the problem being solved – The data for analysis may come from a data warehouse, a data mart, or even a data lake. In some cases, the data may be an extract from a production system, e.g., an e-commerce application. More and more these days, the data for a data science project comes from a variety of sources, including unstructured sources such as social media or even e-mail. Networks of Internet-of-Things (IoT) sensors represent another source of data. It is for this reason that data acquisition can require a degree of creativity, again with the help of domain experts (like a controller, who may provide you with sales or commission data). This step may require involvement with the IT department, which may enlist the services of an extract-transform-load (ETL) engineer who can cut a data extract for you.
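As a rough sketch of this step, the snippet below combines two hypothetical extracts – a warehouse extract and a CRM export – into a single pandas DataFrame. The column names, keys, and values are all illustrative assumptions, not from any real system:

```python
# Hypothetical sketch: joining a warehouse extract with a second
# source (e.g., a CRM export) on a shared key to build one data set.
import io
import pandas as pd

# Stand-in for a CSV extract cut by an ETL engineer.
warehouse_extract = io.StringIO(
    "customer_id,region,total_spend\n"
    "101,US,250.0\n"
    "102,EU,120.5\n"
)

# Stand-in for a second source, e.g., a CRM export with labels.
crm_export = io.StringIO(
    "customer_id,churned\n"
    "101,0\n"
    "102,1\n"
)

sales = pd.read_csv(warehouse_extract)
labels = pd.read_csv(crm_export)

# Inner-join the sources on the shared key to form a modeling data set.
dataset = sales.merge(labels, on="customer_id", how="inner")
```

In practice the `read_csv` calls would point at real extracts (or `read_sql` against a warehouse), but the join-on-a-shared-key pattern is the same.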
Data transformation – The task of data transformation is very important early in the project in order to clean and transform the raw data into a form more suitable for machine learning. Given the state of some enterprise data (dirty, inconsistent, missing values, etc.), this step may take considerable time and effort, often up to 75% of the time and cost of a data science project. It is important to document all the data transformations performed, as this process becomes part of the reusable data pipeline.
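A minimal sketch of what such transformations can look like in pandas. The columns and values are illustrative assumptions chosen to show three common problems: missing values, inconsistent categories, and numbers stored as formatted text:

```python
import pandas as pd

# Illustrative "dirty" enterprise data.
raw = pd.DataFrame({
    "age": [34, None, 29, 41],
    "state": ["CA", "ca", "NY", None],
    "income": ["52,000", "61,500", None, "48,250"],
})

def transform(df):
    """A documented, reusable transformation step for the data pipeline."""
    out = df.copy()
    # Impute missing ages with the median of the observed values.
    out["age"] = out["age"].fillna(out["age"].median())
    # Normalize category spelling; flag genuinely missing categories.
    out["state"] = out["state"].str.upper().fillna("UNKNOWN")
    # Strip thousands separators so the column can be numeric,
    # then impute missing incomes with the mean.
    out["income"] = out["income"].str.replace(",", "", regex=False).astype(float)
    out["income"] = out["income"].fillna(out["income"].mean())
    return out

clean = transform(raw)
```

Wrapping the steps in a function like `transform` is one way to keep the logic documented and reusable across pipeline runs.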
Exploratory data analysis (EDA) – Use statistical methods and data visualization to discover interesting characteristics and patterns in the data. Sometimes simple plots of raw data (or samples from raw data) can reveal very important insights that will help dictate a direction for the project, or at least provide critical insight that can be useful when interpreting the results of the data science project. EDA can help determine the optimal feature variables (predictors) to use for the particular machine learning algorithms you intend to employ for the project. This step may require discussions with domain experts about what surfaces during EDA. Certainly, you’ll need to fully understand the extent to which each feature variable may contribute to the prediction accuracy of the algorithm.
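A small EDA sketch using pandas on synthetic, illustrative data: summary statistics to surface oddities (outliers, bad units), plus a correlation check against a hypothetical target column to get a first sense of which feature variables may carry predictive signal:

```python
import pandas as pd

# Illustrative data; "visits" plays the role of the target variable.
df = pd.DataFrame({
    "ad_spend":  [10.0, 20.0, 30.0, 40.0, 50.0],
    "visits":    [110,  205,  330,  390,  520],
    "region_id": [1, 2, 1, 2, 1],
})

# Summary statistics often reveal problems before any modeling starts.
summary = df.describe()

# Correlations with the target hint at promising feature variables,
# keeping in mind that correlation only captures linear association.
correlations = df.corr(numeric_only=True)["visits"].sort_values(ascending=False)
```

Plots of raw data (histograms, scatter plots) would normally accompany these numbers; the tabular checks above are just the quickest starting point.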
Model selection – Choose a machine learning algorithm appropriate for the problem being solved, and split your data into training, cross validation and test sets. At this stage you need to make a commitment to the type of machine learning you’ll use. Are you going to make a quantitative prediction, a qualitative classification, or are you just exploring using a clustering technique? After you gain experience with machine learning, you’ll be able to more readily identify the algorithm most appropriate to use for a particular application. The culmination of this step is to train the model with the training data set.
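A sketch of the split-and-train step with scikit-learn. The synthetic data, the qualitative-classification framing, and the choice of logistic regression are all illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: 200 rows, 3 feature variables, a binary class label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

# Hold out a test set; a validation set (or cross-validation folds)
# would then be carved out of the training portion.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Commit to a model type and train it on the training set --
# the culmination of this step.
model = LogisticRegression()
model.fit(X_train, y_train)
```

Swapping `LogisticRegression` for a regressor or a clustering estimator is how the quantitative-prediction and exploratory cases would look under the same pattern.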
Model validation – There is no single algorithm that bests all others over all possible data sets. On a particular data set, one specific method may work best, while another method may work better on a different data set. Hence, it is an important task to evaluate which method produces the best results for any given data set. Selecting the best approach can be one of the most challenging parts of the data science project in practice. As a result, performance evaluation of a model is critical to the success of the project. We need to measure how well its predictions actually match the observed data. That is, we need to quantify the extent to which the predicted response value for a given observation is close to the true response value for that observation. We also need to determine the degree of overfitting that occurs, i.e., when a given algorithm yields a small training error, but a large test set error. Specifically, we need to evaluate how well the model generalizes.
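One concrete way to quantify generalization is to compare training error with test error: a large gap signals overfitting. A sketch on synthetic data, using a deliberately unconstrained decision tree as the illustrative model because it memorizes training noise:

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: a sine curve plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

# A fully grown tree fits the training set almost perfectly...
deep = DecisionTreeRegressor(random_state=0).fit(X_train, y_train)
train_mse = mean_squared_error(y_train, deep.predict(X_train))
# ...but its test error reveals how poorly it generalizes.
test_mse = mean_squared_error(y_test, deep.predict(X_test))
gap = test_mse - train_mse  # large positive gap => overfitting
```

Repeating this comparison across candidate models (or via cross-validation) is how one evaluates which method produces the best results for a given data set.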
Data storytelling – The final results of a data science project can best be understood with well-crafted visualizations that capture the essence of what the algorithm is telling us about the data. Visualizations that communicate the proper message are not easy to create and may require several tries to be successful. In fact, building effective visualizations requires a certain creative and artistic flair. Fortunately, the Internet is packed with plenty of examples of effective visualizations that you can use to come up with a direction for your own.
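A minimal matplotlib sketch of a results visualization; the customer segments and churn figures are made up purely for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt

# Hypothetical model output: predicted churn rate per customer segment.
segments = ["New", "Returning", "Lapsed"]
predicted_churn = [0.12, 0.05, 0.38]

fig, ax = plt.subplots(figsize=(6, 3))
ax.bar(segments, predicted_churn, color="steelblue")
ax.set_ylabel("Predicted churn rate")
ax.set_title("Churn risk by customer segment")
fig.tight_layout()
fig.savefig("churn_by_segment.png")
```

A simple, clearly labeled chart like this is often more persuasive to decision makers than the model metrics behind it; iterating on the framing and labels is where the craft comes in.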
Once you’ve completed the data science process, you’re ready to communicate the results to the organization’s decision makers. In order to be effective, you need a “data storyteller,” someone who can tell a compelling tale based on what the data says. Realizing that most managers won’t have a background in data analysis or statistics, it is the job of the data scientists to bring it all down to a form that is understandable by a typical business person. In order to take action on the results, the decision maker must truly understand what’s being communicated, not in a data science sense, but rather in an actionable business intelligence sense. It’s not easy, but telling the story of the data is an integral part of data science.
After completing these steps in the data science process, the project doesn’t end there. It’s important to repeat the process with new insights: finding new data sets that contribute to the solution of the problem, retraining the algorithm with more data, looking at the predictive power of different feature variables, evaluating different models, exploring new metrics for assessing predictive accuracy, and so on.
Data science can be very enjoyable if you’re the naturally inquisitive type, but it can also become frustrating, since part of a data scientist’s job is to try to prove your own work wrong time and time again. Data science is about finding new answers to existing problems on a regular basis, and being “right” is equivalent to plateauing. That means constantly taking your work apart, looking for holes and logical fallacies at every step, criticizing it from every angle, and staying open to feedback and even criticism from others.
Once the project is complete, an important “next-step” is to deploy the solution in a production environment. The form this environment takes is dependent on the goals for the project. If you’re working on a recommender system, then deployment might mean adding functionality to an e-commerce website. If you’re working on a churn rate predictor, then deployment might mean an addition to a marketing system. Or if you’re working on a project to help close sales deals, then there might not be any sort of actual deployment, but rather a change in the sales department SOP document (e.g., continuing to call on potential customers after the 7th call is a point of diminishing returns).
Contributed by Daniel D. Gutierrez, Managing Editor of insideBIGDATA. In addition to being a tech journalist, Daniel also is a practicing data scientist, author, educator and sits on a number of advisory boards for various start-up companies.