Being Quantitative in Spite of Ambiguity

In this special guest feature, Dean Malmgren of Datascope Analytics outlines two ways to address the ambiguous challenge of defining the problem to be solved by means of data science. Dean Malmgren is a co-founder and data scientist at Datascope Analytics, a data-driven consulting and design firm in Chicago, where he has helped clients like P&G, Daegis, and Thomson Reuters use data to solve the right problem. Dean received a BS from the University of Michigan and a PhD from Northwestern University, has spoken at conferences like Strata, and has published peer-reviewed research in journals like Nature, Science, and PNAS that has been featured in places like TIME, Wired, and US News & World Report.

Whether we describe ourselves as data scientists, BI experts, analysts, or statisticians, we have extremely good techniques for quantifying and dealing with uncertainty. From estimating confidence intervals to evaluating the quality of a model, we have many ways to address our confidence (or lack thereof) in our analysis. The trickier part is not coming up with an approach to solving a problem, but rather properly defining the problem itself. This is an inherently ambiguous challenge that manifests itself in at least two ways.

First, even the simplest of problem statements, like “quantify our market growth,” can be solved in several different ways. You could use internal data assets like sales figures, third-party data assets like social media activity, or even publicly available data assets like census data to start to quantify these things. You could analyze the data with techniques such as time series analysis, supervised learning, unsupervised learning, regression, or network analysis. And as if you needed another variable to consider, there are many ways to visualize the results of your analysis, from single-number KPIs to full-blown dashboards and visualizations that are intended to identify deep connections between factors. There are thousands and thousands of permutations, each of which might be appropriate for solving the problem at hand, making it difficult to navigate the landscape of approaches at your disposal.

Second, imagine that you are working for your local park district, which asks you to “identify the best locations to plant new trees.” Even with this seemingly straightforward problem, there is inherent ambiguity in the problem statement itself. What do you mean by “new trees”? What kinds of trees can we plant? Can we move old ones? Are they baby trees, transplanted trees, or something else altogether? And what do you mean by “best locations”? Are we trying to optimize foliage in the park, minimize our carbon footprint, optimize for aesthetics, or is something else at play here? Depending on the particular goals of your local park district, its culture, and several other factors, the “correct problem to solve” might vary widely. In spite of this ambiguity, we need to quickly home in on which problems are useful and practical to solve.

To address these inherent sources of ambiguity in data science (or whatever you want to call it), we need to learn from others who have thrived in creating novel, useful, and valuable things in spite of the ambiguity in front of them. The human-centered design process is used by designers to invent next-generation products and services where none existed before. Agile programming is used by software developers to write code for new systems, features, and architectures where there may not be a readily available solution. The Lean Startup philosophy has guided countless entrepreneurs to build new businesses and organizations that meet a real market need. Regardless of what you call it, this process involves (1) coming up with a lot of ideas, (2) turning a few of these ideas into bare-bones prototypes, (3) getting feedback from people, and (4) repeating the first three steps as quickly as possible, ideally in 1-3 week “sprints.” It’s important to learn from this process and adapt it to working with data.