The Growing Challenge of Data Democratization

In the US, we are enduring a decades-long debate about the present and future of our healthcare system. Current healthcare delivery has many problems, but one of the most fundamental is the bottleneck we experience as consumers: a limited number of experts is available to serve everyone who needs them.

Some solutions to this problem aim to improve the efficiency with which expertise is delivered. Others seek to bring the expertise of healthcare professionals into wider distribution through technology and automation. This problem, and the approaches to solving it, should sound familiar to data science professionals. Even though we are educating an ever-increasing number of future data engineers and scientists, only a limited number of experts is available to create the analyses and applications the market wants.

To overcome this resource constraint, businesses have two main options. One is to streamline and manage the data science process to maximize output through increased efficiency. The other is to implement tools that claim to democratize data science, thereby distributing the means of insight production to many more workers in the company. This second approach seems appealing: empowering more of your workforce to do high-leverage data science work is certainly attractive. However, it has a set of pitfalls that must be navigated before it can deliver on its promise.

Why Should We Democratize Data Science?

It’s human nature to try to improve the throughput of systems that produce value, whether in healthcare, agriculture, software engineering, or data science and machine learning. There are many examples of efforts to encourage untrained individuals to participate in making the products these industries generate. First aid training, home and neighborhood gardens, and natural language programming efforts like HyperTalk are all examples of distributing a key capability to more people.

To do the same for machine learning and data science, business leaders rely on their understanding of what data science teams do. Because data scientists build models for prediction and classification, the thinking goes, tools that automate the selection and creation of models will effectively grant the data science skillset to anyone who can use them.

The problem is that this logic rests on a fundamental misunderstanding of the key skills of a good data scientist. Model building is important, but data scientists spend less and less of their time tuning models. Structuring, gathering, and cleaning data, formulating data science questions, and communicating and delivering data science insights to production all matter more than squeezing a bit more accuracy out of a model.

Continuing with the healthcare analogy: just as access to WebMD does not fix the healthcare bottleneck, automated ML tools that don’t require the user to understand the methods or the reasoning behind them may not solve the analogous AI and data science problem. This is especially true when considering the full range of skills present in good doctors, both technical and soft. Data science democratization holds a lot of promise, but only in environments where the appropriate technical and non-technical guidance is available.

Eliminating Cognitive Bias

This increased access to rapid interpretations of data can make an organization and its employees feel empowered. But empowerment can quickly turn to disillusionment if the system has data quality issues. Trust is lost when the exposed data is perceived as not credible, or when it becomes colored by others’ cognitive biases.

We are all familiar with how terrifying it is to search Google for your health symptoms to understand what is causing them. We lack the perspective to understand the relative frequency of diseases, and we don’t have the training required to evaluate our symptoms dispassionately. Even with training, physicians often struggle to overcome the very human biases that can cause diagnostic or treatment errors. Sadly, data science is no different. The purpose behind all the statistics and experimentation is to counteract cognitive bias. Putting data science tools in the hands of users who aren’t trained in scientific and statistical methods is like that Google search of your health symptoms: you’ll find whatever you’re looking for, even if it isn’t the correct interpretation of your data.
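This "you'll find whatever you're looking for" trap is easy to demonstrate. The sketch below is an illustrative toy, not anything from a real product: it builds a dataset where every "feature" is random noise, unrelated to the outcome. Scan enough candidate predictors without statistical discipline, and at least one will still appear to explain the outcome purely by chance, which is the multiple-comparisons problem that training in statistics exists to guard against.

```python
import random

random.seed(42)

# A small dataset: one binary outcome and many unrelated binary "features".
# Every feature is pure noise, so no genuine predictor exists.
n_rows, n_features = 30, 200
outcome = [random.choice([0, 1]) for _ in range(n_rows)]
features = [[random.choice([0, 1]) for _ in range(n_rows)]
            for _ in range(n_features)]

def agreement(feature, outcome):
    """Fraction of rows where the feature's value matches the outcome."""
    return sum(f == o for f, o in zip(feature, outcome)) / len(outcome)

# An untrained user "exploring" the data keeps the best-looking predictor.
best = max(agreement(f, outcome) for f in features)
print(f"Best chance 'predictor' agrees with the outcome {best:.0%} of the time")
```

With 200 noise features and only 30 rows, the best of them will typically agree with the outcome well over 60% of the time, despite carrying no signal at all; a disciplined analysis would correct for how many hypotheses were tried.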

Improving Data Management

Another popular idea is to let non-experts handle the high volume of trivial cases, freeing the experts to allocate their time to the more challenging edge cases and high-leverage problems. The idea is well intentioned, but the real difficulty simply shifts to classifying the problem. How does one learn which problem classes are suitable for a boilerplate solution and which are not? How do we know when a headache should be treated with ibuprofen and when it warrants a follow-up CAT scan? Both systems and individuals are generally trained to behave risk-aversely in this scenario, which ends up doing little to relieve the bandwidth problem we set out to address.

Where does that leave us? Are we hopelessly out of luck when it comes to automating machine learning? Of course not. The role of the data scientist will inevitably shift and specialize as the field matures, as will the toolsets. What really needs to mature first, however, is our data management. If we can store, document, and manage data in a way that minimizes the time required to make it useful, we greatly reduce pain for data professionals all the way up and down the data stack. Eliminating this headache also increases efficiency: less time locating, cleaning, and joining data, and more time building tools that serve the customers and business units that depend on us.
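One concrete form of the data management the paragraph calls for is documenting data well enough that it can be validated at load time, so problems surface before anyone starts cleaning or joining. The sketch below is a minimal illustration of that practice; the column names and types are hypothetical, chosen only for the example.

```python
# Declare the expected shape of a record up front. This doubles as
# documentation and as a contract that incoming data can be checked against.
SCHEMA = {"customer_id": int, "signup_date": str, "monthly_spend": float}

def validate(row):
    """Return a list of problems found in one record against the schema."""
    problems = []
    for column, expected_type in SCHEMA.items():
        if column not in row:
            problems.append(f"missing column: {column}")
        elif not isinstance(row[column], expected_type):
            problems.append(f"{column}: expected {expected_type.__name__}, "
                            f"got {type(row[column]).__name__}")
    return problems

good = {"customer_id": 17, "signup_date": "2020-01-06", "monthly_spend": 42.5}
bad = {"customer_id": "17", "monthly_spend": 42.5}   # wrong type, missing field
print(validate(good))  # no problems
print(validate(bad))   # two problems reported at load time
```

The design point is that the check is cheap and runs where the data enters the system; the hours it saves are the ones otherwise spent discovering the same problems mid-analysis.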

Looking to the Future

Data science democratization is coming, because the potential is impossible to ignore. The difference between companies that handle it well and those that don’t will come down to how new practitioners are guided and advised. Data scientists are responsible for educating these new users about the science that goes into designing experiments, and about why building an accurate model on a training set isn’t enough. By equipping their fellow data professionals with the analytical and soft skills required to solve data problems successfully, they will open deeper opportunities for themselves and their organizations.
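Why training-set accuracy isn't enough can be shown in a few lines. The sketch below is an illustrative toy: the labels are pure noise, so no model can genuinely beat 50% accuracy, yet a model that simply memorizes its training data reports a perfect training score. Only evaluation on held-out data reveals that nothing was learned.

```python
import random

random.seed(0)

def make_data(n):
    # Each example is a random number with a random 0/1 label: pure noise.
    return [(random.random(), random.choice([0, 1])) for _ in range(n)]

train, test = make_data(100), make_data(100)

def predict(x):
    # "Model" that memorizes training data: return the label of the
    # closest training point (1-nearest-neighbour on one feature).
    return min(train, key=lambda point: abs(point[0] - x))[1]

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
print(f"train accuracy: {train_acc:.0%}, held-out accuracy: {test_acc:.0%}")
```

The training accuracy is 100% by construction, while held-out accuracy hovers around chance. This gap between memorization and generalization is exactly the lesson experienced data scientists owe to new practitioners.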

About the Author

Fred Frost is the Lead Data Scientist at SentryOne, where he and his team apply data science and machine learning techniques to improve the quality of life for Microsoft data professionals and their customers through performance optimization, forecasting, and troubleshooting.  Fred’s diverse training in neuroscience, physics, and genetics fosters an approach to learning from data that focuses on merging theory and epistemology while using the scientific method to solve problems. 
